TCMEval-SDT: a benchmark dataset for syndrome differentiation thought of traditional Chinese medicine

In this study, the medical records were processed by TCM experts, who ensured that all records were anonymized. A rigorous quality assurance process was implemented to ensure the privacy, accuracy, and reliability of the collected medical records. Subsequently, 300 medical records were selected through manual screening. These records were annotated using the Baibu Knowledge Engine [14,15], a corpus tool in the field of TCM that supports automatic, human-machine combined, and manual modes for entity and relation annotation, to construct a comprehensive and systematically organised dataset for TCM syndrome diagnosis.
Data collection
The medical records were sourced from a self-built database established by our team and curated by experts from the Institute of Information on Traditional Chinese Medicine-China Academy of Chinese Medical Sciences, the Institute of Basic Theory for Chinese Medicine-China Academy of Chinese Medical Sciences, and senior TCM students. The data were collected from diverse sources, such as the China National Knowledge Infrastructure (CNKI), Wanfang Data, classical Chinese medical texts, and medical records from hospitals.
The data were first screened by TCM experts according to the following standards: (1) the medical record is complete, including clinical data and clinical experience; (2) the case involves a common disease. Cases of rare diseases and duplicate cases were excluded. To evaluate the quality of TCM medical records, we developed a TCM Medical Record Quality Assessment Scale (Table 1) based on the CARE guidelines and TCM expert opinions. This scale comprises ten sub-items, including patient information, clinical findings, timeline, and diagnostic evaluation, to systematically assess the quality of TCM case data. Each sub-item is rated as “clearly described”, “not clearly described”, or “not described”, with corresponding scores of 1, 0.5, and 0, respectively [16,17]. The TCM expert group assessed the quality of the manually screened cases using this scale; cases scoring 6 or higher were included, and cases scoring lower than 6 were excluded.
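The scoring rule described above can be illustrated with a minimal sketch. It assumes that the ten sub-item scores are simply summed, which the text implies but does not state outright. Only four of the ten sub-item names appear in the text, so the remaining keys (`item_5` to `item_10`) are hypothetical placeholders, as are the function names.

```python
from typing import Dict

# Rating-to-score mapping as stated in the text.
RATING_SCORES = {
    "clearly described": 1.0,
    "not clearly described": 0.5,
    "not described": 0.0,
}

# Records scoring lower than 6 are excluded; 6 or higher are included.
INCLUSION_THRESHOLD = 6.0


def total_score(ratings: Dict[str, str]) -> float:
    """Sum the sub-item scores of one medical record (assumed aggregation)."""
    return sum(RATING_SCORES[rating] for rating in ratings.values())


def is_included(ratings: Dict[str, str]) -> bool:
    """Apply the inclusion threshold to the summed score."""
    return total_score(ratings) >= INCLUSION_THRESHOLD


# Four sub-items named in the text plus six hypothetical placeholders.
example_ratings = {
    "patient information": "clearly described",
    "clinical findings": "clearly described",
    "timeline": "clearly described",
    "diagnostic evaluation": "clearly described",
    "item_5": "not clearly described",
    "item_6": "not clearly described",
    "item_7": "not clearly described",
    "item_8": "not clearly described",
    "item_9": "not described",
    "item_10": "not described",
}

print(total_score(example_ratings), is_included(example_ratings))
```

In this example the record scores 4 × 1 + 4 × 0.5 + 2 × 0 = 6 and therefore just meets the inclusion threshold.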
Data pre-processing and anonymization
The preprocessing workflow for the medical records is shown in Fig. 1. The first step involves anonymizing each medical record by permanently removing identifiable information, such as patient ID and name, to protect patient privacy. The second step entails cleaning and organizing the data by removing duplicate or null data and standardizing the medical records. The FAIR principles serve as foundational guidelines for data sharing and reuse. To support these goals, we designed metadata for the medical records in our study that comply with the FAIR principles. We shared the metadata of the TCMEval-SDT dataset on the CDE Portal, a public metadata registration and management platform, to facilitate the design and management of metadata for similar future projects (Table 2). We converted unstructured data, including TXT, PDF, Word, and HTML files, into structured data according to the metadata requirements, and then assigned a unique identifier to each medical record. Finally, we constructed a benchmark database for syndrome diagnosis, named TCMEval-SDT.
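A minimal sketch of the preprocessing steps described above (anonymization, removal of null and duplicate records, and assignment of a unique identifier) might look as follows. The field names (`patient_id`, `name`, `text`, `record_id`), the `REC-0001` identifier format, and the whitespace-based duplicate check are all illustrative assumptions, not the actual metadata schema or procedure used for TCMEval-SDT.

```python
from typing import Dict, List, Optional

# Hypothetical names of fields that identify a patient.
IDENTIFYING_FIELDS = {"patient_id", "name"}

Record = Dict[str, Optional[str]]


def anonymize(record: Record) -> Record:
    """Return a copy of the record without identifying fields."""
    return {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}


def preprocess(records: List[Record]) -> List[Record]:
    """Anonymize, drop null and duplicate records, then assign identifiers."""
    seen = set()
    cleaned: List[Record] = []
    for raw in records:
        record = anonymize(raw)
        # Normalize whitespace so trivially different copies compare equal.
        text = " ".join((record.get("text") or "").split())
        if not text:
            continue  # null or empty record
        if text in seen:
            continue  # duplicate record
        seen.add(text)
        record["text"] = text
        record["record_id"] = f"REC-{len(cleaned) + 1:04d}"
        cleaned.append(record)
    return cleaned


raw_records = [
    {"patient_id": "P1", "name": "A", "text": "belching  and nausea"},
    {"patient_id": "P2", "name": "B", "text": "belching and nausea "},
    {"patient_id": "P3", "name": "C", "text": None},
    {"patient_id": "P4", "name": "D", "text": "depressed state"},
]

for record in preprocess(raw_records):
    print(record)
```

Exact-match deduplication is the simplest possible choice here; real medical records would likely need a more tolerant comparison and expert review.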

Fig. 1. Overview of the data processing workflow and evaluation for the TCMEval-SDT benchmark dataset. The TCM syndrome diagnosis cases were sourced from the internet, classical Chinese medical texts, and hospital medical records. The original medical records underwent data preprocessing, including data cleaning, anonymization, and the removal of duplicates, before being stored in a database. From this database, 300 cases meeting specific criteria, such as non-rare cases, were selected. These cases were then annotated and curated by TCM experts using the Baibu Knowledge Engine. Finally, validation was performed using publicly available LLMs, including GLM-130B, Tongyi Qianwen, ChatGPT, and Gemini 1.5 Pro. Note. TCM = traditional Chinese medicine; LLMs = large language models.
Data selection and annotation
The diagnosis of syndromes in TCM is inherently multidimensional, involving a comprehensive evaluation of the interactions between a patient’s physiological, pathological, and environmental factors. For theoretical analysis and practical guidance, we have summarized the TCM syndrome diagnosis process into four steps, as illustrated in Fig. 2.
(1) Clinical Information Extraction: Emulating TCM clinicians in obtaining clinical information from the patient’s medical data.
(2) Pathogenesis Reasoning: Inferring TCM pathogenesis from relevant clinical information.
(3) Syndrome Reasoning: Inferring TCM syndromes from relevant TCM pathogenesis.
(4) Explanatory Summary: Summarizing clinical experiences and insights from TCM clinicians.
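One way to picture the four steps is as a record with one field per step. The sketch below is illustrative only: the class name, field names, and `steps` helper are hypothetical and do not reflect the actual TCMEval-SDT schema. The example values are taken from the annotation guidelines in this paper, and the summary string is a placeholder.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SyndromeDifferentiationRecord:
    """One annotated case with one field per step (hypothetical names)."""

    clinical_information: List[str] = field(default_factory=list)  # step 1
    pathogenesis: List[str] = field(default_factory=list)          # step 2
    syndromes: List[str] = field(default_factory=list)             # step 3
    explanatory_summary: str = ""                                  # step 4

    def steps(self) -> List[Tuple[str, str]]:
        """Return the four steps in order as (step name, content) pairs."""
        return [
            ("Clinical Information Extraction", "; ".join(self.clinical_information)),
            ("Pathogenesis Reasoning", "; ".join(self.pathogenesis)),
            ("Syndrome Reasoning", "; ".join(self.syndromes)),
            ("Explanatory Summary", self.explanatory_summary),
        ]


record = SyndromeDifferentiationRecord(
    clinical_information=["belching", "depressed state"],
    pathogenesis=["stomach qi upward", "liver-qi stagnation"],
    syndromes=["liver and stomach disharmony"],
    explanatory_summary="(free-text summary written by the TCM clinician)",
)

for number, (name, content) in enumerate(record.steps(), start=1):
    print(f"({number}) {name}: {content}")
```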

Fig. 2. Key steps for syndrome diagnosis of TCM. The figure illustrates the four key steps in TCM syndrome diagnosis. On the left side, the processed clinical data are shown, including the patient’s demographic information, chief complaint, medical history, and physical examination. In the first step, the patient’s clinical information is obtained through recognition and extraction. Based on this clinical information, the corresponding TCM pathogenesis is inferred. Then, the TCM pathogenesis is used to infer the relevant TCM syndromes. Finally, an explanatory summary is provided, emulating the process TCM clinicians follow for syndrome diagnosis. Note. TCM = traditional Chinese medicine.
Entity and relation for medical record
We selected 300 medical records and employed the Baibu Knowledge Engine to annotate them according to the aforementioned steps. The annotated entities and their relations are shown in Tables 3 and 4.
Annotation guidelines
(1) We classified the clinical information into two types: relevant and irrelevant. Relevant information is critical clinical information that significantly influences the diagnostic process, while irrelevant information is clinical information that does not affect the diagnosis. The annotated entities include only the information relevant to the TCM syndrome diagnosis process. For example: belching (clinical information) – stomach qi upward (TCM pathogenesis) – liver and stomach disharmony (syndrome). Irrelevant information, such as “red tongue with white coating”, is excluded from the annotation scope because it does not directly influence this diagnostic process.
(2) Annotated entities must be as complete as possible. For example, in “painful distension behind the sternum and in the epigastric region”, the entire phrase must be annotated, because annotating only “painful” would lose critical information.
(3) Inferential relationships exist between clinical information and TCM pathogenesis, and between TCM pathogenesis and TCM syndromes. For example, extracting clinical information such as “belching” and “depressed state” leads to the inference of TCM pathogenesis, including “stomach qi upward” and “liver-qi stagnation”. Integrating these pathogenic indicators results in the identification of TCM syndromes such as “liver and stomach disharmony”.
(4) For long mentions in which multiple entities are connected, each entity with independent significance is annotated separately. For example, the phrase “painful distension behind the sternum and in the epigastric region, burning sensation behind the sternum, sensation of obstruction when swallowing, accompanied by belching and nausea” was annotated as five entities: “painful distension behind the sternum and in the epigastric region”, “burning sensation behind the sternum”, “sensation of obstruction when swallowing”, “belching”, and “nausea”. This approach ensures that each meaningful entity is annotated according to its individual significance.
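The inferential relations in guidelines (1) and (3) can be pictured as two layers of directed links, from clinical information to pathogenesis and from pathogenesis to syndrome. The sketch below uses only the example entities quoted in the guidelines. The data layout and function names are hypothetical and are not the export format of the Baibu Knowledge Engine. The pairing of “depressed state” with “liver-qi stagnation” is an assumed reading of guideline (3), which lists the entities without pairing them explicitly.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (head, tail) pairs meaning "head leads to the inference of tail".
INFO_TO_PATHOGENESIS = [
    ("belching", "stomach qi upward"),
    ("depressed state", "liver-qi stagnation"),  # assumed pairing
]
PATHOGENESIS_TO_SYNDROME = [
    ("stomach qi upward", "liver and stomach disharmony"),
    ("liver-qi stagnation", "liver and stomach disharmony"),
]


def build_index(pairs: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group tail entities by their head entity."""
    index: Dict[str, List[str]] = defaultdict(list)
    for head, tail in pairs:
        index[head].append(tail)
    return index


def trace_chains(clinical_information: List[str]) -> List[Tuple[str, str, str]]:
    """Follow annotated relations from clinical information to syndromes."""
    info_index = build_index(INFO_TO_PATHOGENESIS)
    pathogenesis_index = build_index(PATHOGENESIS_TO_SYNDROME)
    chains = []
    for info in clinical_information:
        for pathogenesis in info_index.get(info, []):
            for syndrome in pathogenesis_index.get(pathogenesis, []):
                chains.append((info, pathogenesis, syndrome))
    return chains


mentions = ["belching", "depressed state", "red tongue with white coating"]
for chain in trace_chains(mentions):
    print(" - ".join(chain))
```

Because “red tongue with white coating” is treated as irrelevant information in this example, it has no annotated relation and produces no chain.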
Example of clinical records annotation through the Baibu Knowledge Engine
Figure 3 illustrates an example of a TCM record annotated using the Baibu Knowledge Engine. TCM experts annotate the clinical information, TCM pathogenesis, TCM syndromes, and their relations.

Fig. 3. Example of annotation for TCM clinical records. Note. TCM = traditional Chinese medicine.
Example of the thought process design in syndrome differentiation
Figure 4 illustrates the detailed design of the thought process in syndrome differentiation. TCM experts extract clinical information and infer TCM pathogenesis based on the clinical data. The inferred pathogenesis is then used to deduce the corresponding syndromes. This process emulates the specific reasoning steps employed by TCM clinicians during syndrome differentiation, providing AI algorithms and models with detailed steps to emulate this reasoning process.

Fig. 4. Example of the thought process design in syndrome differentiation. The left side of the figure shows the patient’s clinical data. Based on these data, TCM experts annotate and provide specific guided reasoning steps for the algorithms or models, shown on the right side. Algorithms or models can follow these steps one at a time, thereby emulating the detailed procedure followed by TCM clinicians in syndrome diagnosis. Note. TCM = traditional Chinese medicine.
Data evaluation
After the data annotation process was completed, a quality assessment was performed on the 300 medical records used in this study. Each medical record was thoroughly annotated to ensure the completeness and accuracy of the case information. Additionally, to reduce potential biases introduced by incomplete information, all data records were required to contain no missing values. Finally, to maintain the representativeness of the sample, rare medical records were excluded. The final statistics of all TCM medical records, classified according to the ICD-11 for Mortality and Morbidity Statistics, are shown in Table 5. All 300 annotated medical records satisfied the aforementioned selection criteria.
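The requirement that records contain no missing values can be checked mechanically. The following is a minimal sketch under the assumption that each record is a dictionary with one key per annotated component. The field names in `REQUIRED_FIELDS` are hypothetical and do not reproduce the published metadata.

```python
from typing import Dict, List

# Hypothetical required fields, one per annotated component.
REQUIRED_FIELDS = (
    "clinical_information",
    "pathogenesis",
    "syndromes",
    "explanatory_summary",
)


def missing_fields(record: Dict[str, object]) -> List[str]:
    """List required fields that are absent, None, or empty."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


def all_complete(records: List[Dict[str, object]]) -> bool:
    """Return True only if no record has a missing value."""
    return all(not missing_fields(record) for record in records)


complete = {
    "clinical_information": ["belching"],
    "pathogenesis": ["stomach qi upward"],
    "syndromes": ["liver and stomach disharmony"],
    "explanatory_summary": "(free-text summary)",
}
incomplete = dict(complete, pathogenesis=[])

print(missing_fields(incomplete))
print(all_complete([complete, incomplete]))
```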