TCMEval-SDT: a benchmark dataset for syndrome differentiation thought of traditional Chinese medicine

In this study, the medical records were processed by TCM experts, who ensured that every record was anonymized. A rigorous quality assurance process was implemented to ensure the privacy, accuracy, and reliability of the collected medical records. Subsequently, 300 medical records were selected through manual screening. These records were annotated using the Baibu Knowledge Engine^14,15, a corpus tool in the field of TCM that supports automatic, human-machine combined, and manual modes for entity and relation annotation, to construct a comprehensive and systematically organised dataset for TCM syndrome diagnosis.

Data collection

The medical records were sourced from a self-built database established by our team and curated by experts from the Institute of Information on Traditional Chinese Medicine, China Academy of Chinese Medical Sciences, experts from the Institute of Basic Theory for Chinese Medicine, China Academy of Chinese Medical Sciences, and senior TCM students. The data were collected from diverse sources, such as the China National Knowledge Infrastructure (CNKI), Wanfang Data, classical Chinese medical texts, and medical records from hospitals.

The data were first screened by TCM experts according to the following standards: (1) the medical record is complete, including information such as clinical data and clinical experience; (2) the case concerns a common disease. Cases of rare diseases and duplicate cases were excluded. To evaluate the quality of TCM medical records, we developed a TCM Medical Record Quality Assessment Scale (as shown in Table 1) based on the CARE guidelines and TCM expert opinions. This scale comprises ten sub-items, including patient information, clinical findings, timeline, and diagnostic evaluation, to systematically assess the quality of TCM case data. Each sub-item is rated as “clearly described”, “not clearly described”, or “not described”, with corresponding scores of 1, 0.5, and 0, respectively^16,17, so the total score of a record ranges from 0 to 10. The TCM expert group assessed the quality of the manually screened cases using this scale, excluding cases with scores lower than 6 and including those with scores of 6 or higher.

Table 1 Details of TCM Medical Record Quality Assessment Scale.
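The scoring rule described above can be illustrated with a minimal sketch. The three rating labels, their scores, the ten sub-items, and the threshold of 6 come from the text; the function names and the example ratings are hypothetical and are not part of the published scale.

```python
from typing import Iterable

# Score per rating, as stated in the text.
RATING_SCORES = {
    "clearly described": 1.0,
    "not clearly described": 0.5,
    "not described": 0.0,
}
N_SUB_ITEMS = 10           # the scale comprises ten sub-items
INCLUSION_THRESHOLD = 6.0  # records scoring below 6 are excluded


def quality_score(ratings: Iterable[str]) -> float:
    """Sum the per-item scores for the ten sub-item ratings of one record."""
    ratings = list(ratings)
    if len(ratings) != N_SUB_ITEMS:
        raise ValueError(f"expected {N_SUB_ITEMS} ratings, got {len(ratings)}")
    return sum(RATING_SCORES[r] for r in ratings)


def is_included(ratings: Iterable[str]) -> bool:
    """A record is kept when its total score is 6 or higher."""
    return quality_score(ratings) >= INCLUSION_THRESHOLD


# Hypothetical ratings for one record (not taken from the dataset).
example = (
    ["clearly described"] * 5
    + ["not clearly described"] * 3
    + ["not described"] * 2
)
print(quality_score(example), is_included(example))
```

Because the threshold is inclusive, a record with exactly six clearly described sub-items and nothing else still passes.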

Data pre-processing and anonymization

The preprocessing workflow for the medical records is shown in Fig. 1. The first step involves anonymizing each medical record by permanently removing identifiable information, such as patient ID and name, to protect patient privacy. The second step entails cleaning and organizing the data by removing duplicate or null data and standardizing the medical records. The FAIR principles serve as foundational guidelines for data sharing and reuse. To support these goals, we designed metadata for the medical records in our study that comply with the FAIR principles. We shared the metadata of the TCMEval-SDT dataset on the CDE Portal, a public metadata registration and management platform, to facilitate the design and management of metadata for similar future projects (as shown in Table 2). We converted unstructured data, including TXT, PDF, Word, and HTML files, into structured data according to the metadata requirements, and then assigned a unique identifier to each medical record. Finally, we constructed a benchmark database for syndrome diagnosis, named TCMEval-SDT.

Table 2 FAIR-compliant metadata of medical record in TCMEval-SDT dataset.
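As a rough illustration of the anonymization, cleaning, and identifier steps described above, the following sketch operates on records already converted to dictionaries. The field names (`patient_id`, `patient_name`, `chief_complaint`), the `SDT-0001` identifier format, and the choice to drop any record with an empty field are assumptions made for the example; the actual metadata schema is the one given in Table 2.

```python
from __future__ import annotations

# Hypothetical names for identifying fields; the text names patient ID and
# name as examples of information that is removed.
IDENTIFYING_FIELDS = {"patient_id", "patient_name"}


def anonymize(record: dict) -> dict:
    """Return a copy of the record without identifying fields."""
    return {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}


def preprocess(records: list[dict]) -> list[dict]:
    """Anonymize, drop null and duplicate records, then assign unique IDs."""
    seen = set()
    cleaned = []
    for record in records:
        anon = anonymize(record)
        # Assumption: a record with any null or empty field is discarded.
        if not anon or any(v is None or v == "" for v in anon.values()):
            continue
        # Field values are assumed to be hashable (e.g. strings).
        key = tuple(sorted(anon.items()))
        if key in seen:
            continue
        seen.add(key)
        # Hypothetical identifier format, assigned in order of acceptance.
        cleaned.append({"record_id": f"SDT-{len(cleaned) + 1:04d}", **anon})
    return cleaned


raw = [
    {"patient_id": "P1", "patient_name": "A", "chief_complaint": "belching for two weeks"},
    {"patient_id": "P1", "patient_name": "A", "chief_complaint": "belching for two weeks"},
    {"patient_id": "P2", "patient_name": "B", "chief_complaint": None},
    {"patient_id": "P3", "patient_name": "C", "chief_complaint": "epigastric distension"},
]
for rec in preprocess(raw):
    print(rec)
```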

Data selection and annotation

The diagnosis of syndromes in TCM is inherently multidimensional, involving a comprehensive evaluation of the interactions between a patient’s physiological, pathological, and environmental factors. For theoretical analysis and practical guidance, we have summarized the TCM syndrome diagnosis process into four steps, as illustrated in Fig. 2.

(1) Clinical Information Extraction: Emulating TCM clinicians in obtaining clinical information from the patient’s medical data.

(2) Pathogenesis Reasoning: Inferring TCM pathogenesis from relevant clinical information.

(3) Syndrome Reasoning: Inferring TCM syndromes from relevant TCM pathogenesis.

(4) Explanatory Summary: Summarizing clinical experiences and insights from TCM clinicians.

Entity and relation for medical record

We selected 300 medical records and employed the Baibu Knowledge Engine to annotate them according to the aforementioned steps. The annotated entities and their relations are shown in Tables 3 and 4.

Table 3 Annotated entity of medical records.

Table 4 Annotated relation of entity in medical records.
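One possible in-memory representation of the annotated entities and relations is sketched below. The three entity types and the two permitted relation directions (clinical information to TCM pathogenesis, TCM pathogenesis to TCM syndrome) follow the text; the class and attribute names are hypothetical and do not reflect the export format of the Baibu Knowledge Engine.

```python
from dataclasses import dataclass
from enum import Enum


class EntityType(Enum):
    CLINICAL_INFORMATION = "clinical information"
    TCM_PATHOGENESIS = "TCM pathogenesis"
    TCM_SYNDROME = "TCM syndrome"


# Inferential relations run from clinical information to pathogenesis,
# and from pathogenesis to syndrome.
ALLOWED_RELATIONS = {
    (EntityType.CLINICAL_INFORMATION, EntityType.TCM_PATHOGENESIS),
    (EntityType.TCM_PATHOGENESIS, EntityType.TCM_SYNDROME),
}


@dataclass(frozen=True)
class Entity:
    text: str
    type: EntityType


@dataclass(frozen=True)
class Relation:
    head: Entity
    tail: Entity

    def __post_init__(self) -> None:
        # Reject relations that skip or reverse a reasoning step.
        if (self.head.type, self.tail.type) not in ALLOWED_RELATIONS:
            raise ValueError(
                f"relation {self.head.type.value} -> {self.tail.type.value} is not allowed"
            )


belching = Entity("belching", EntityType.CLINICAL_INFORMATION)
qi_upward = Entity("stomach qi upward", EntityType.TCM_PATHOGENESIS)
disharmony = Entity("liver and stomach disharmony", EntityType.TCM_SYNDROME)

relations = [Relation(belching, qi_upward), Relation(qi_upward, disharmony)]
print(len(relations))
```

Validating the type pair at construction time keeps a syndrome from being linked directly to a symptom without an intermediate pathogenesis.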

Annotation guidelines

(1) We classified the clinical information into two types: relevant information and irrelevant information. Relevant information refers to critical clinical information that significantly influences the diagnostic process, while irrelevant information refers to clinical information that does not impact the diagnosis. The annotated entities include only the relevant information in the TCM syndrome diagnosis process. For example, belching (clinical information) – stomach qi upward (TCM pathogenesis) – liver and stomach disharmony (syndrome). Irrelevant information, such as “red tongue with white coating”, is excluded from the annotation scope as it does not directly influence this diagnostic process.

(2) Annotated entities must be as comprehensive as possible. For example, in “painful distension behind the sternum and in the epigastric region”, the entire phrase must be annotated; annotating only “painful” would lose critical information.

(3) Inferential relationships exist between clinical information and TCM pathogenesis, and also between TCM pathogenesis and TCM syndromes. For example, extracting clinical information such as “belching” and “depressed state” leads to the inference of TCM pathogenesis, including “stomach qi upward” and “liver-qi stagnation”. Integrating these pathogenic indicators results in the identification of TCM syndromes like “liver and stomach disharmony”.

(4) For long mentions in which multiple entities are connected, each entity with independent significance is annotated separately. For example, the phrase “painful distension behind the sternum and in the epigastric region, burning sensation behind the sternum, sensation of obstruction when swallowing, accompanied by belching and nausea” was annotated as five entities: “painful distension behind the sternum and in the epigastric region”, “burning sensation behind the sternum”, “sensation of obstruction when swallowing”, “belching”, and “nausea”. This approach ensures that each meaningful entity is annotated based on its individual significance.
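To make the long-mention rule concrete, the sketch below takes the example phrase and the five separately annotated entities and computes their character offsets in the source text. It does not decide where to split (that judgment is made by the annotators); it only checks that each annotated mention occurs verbatim and in order. The function name `locate_mentions` is hypothetical.

```python
from __future__ import annotations


def locate_mentions(text: str, mentions: list[str]) -> list[tuple[int, int, str]]:
    """Return (start, end, mention) for each mention, searched left to right."""
    spans = []
    cursor = 0
    for mention in mentions:
        # Search only after the previous mention so spans never overlap.
        start = text.find(mention, cursor)
        if start == -1:
            raise ValueError(f"mention not found in order: {mention!r}")
        end = start + len(mention)
        spans.append((start, end, mention))
        cursor = end
    return spans


text = (
    "painful distension behind the sternum and in the epigastric region, "
    "burning sensation behind the sternum, "
    "sensation of obstruction when swallowing, "
    "accompanied by belching and nausea"
)
mentions = [
    "painful distension behind the sternum and in the epigastric region",
    "burning sensation behind the sternum",
    "sensation of obstruction when swallowing",
    "belching",
    "nausea",
]
for start, end, mention in locate_mentions(text, mentions):
    print(start, end, mention)
```

Connectives such as "accompanied by" and "and" fall between the spans and are left unannotated, while the "and" inside the first mention stays part of that entity.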

Example of clinical records annotation through the Baibu Knowledge Engine

Figure 3 illustrates an example of a TCM record annotated using the Baibu Knowledge Engine. TCM experts annotate the clinical information, TCM pathogenesis, and TCM syndromes, together with the relations between them.

Example of the thought process design in syndrome differentiation

Figure 4 illustrates the detailed design of the thought process in syndrome differentiation. TCM experts extract clinical information and infer TCM pathogenesis based on the clinical data. The inferred pathogenesis is then used to deduce the corresponding syndromes. This process emulates the specific reasoning steps employed by TCM clinicians during syndrome differentiation, providing AI algorithms and models with detailed steps to emulate this reasoning process.
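The reasoning path described above can be sketched as a traversal over annotated relation pairs. The entity names are taken from the annotation guidelines; pairing “depressed state” with “liver-qi stagnation” is an assumption based on the order in which they are listed there, and the function name is hypothetical.

```python
from collections import defaultdict

# (clinical information, TCM pathogenesis) pairs.
info_to_pathogenesis = [
    ("belching", "stomach qi upward"),
    ("depressed state", "liver-qi stagnation"),  # assumed pairing
]
# (TCM pathogenesis, TCM syndrome) pairs.
pathogenesis_to_syndrome = [
    ("stomach qi upward", "liver and stomach disharmony"),
    ("liver-qi stagnation", "liver and stomach disharmony"),
]


def reasoning_chains(info_rel, patho_rel):
    """Join the two relation sets into (information, pathogenesis, syndrome) chains."""
    syndromes_for = defaultdict(list)
    for pathogenesis, syndrome in patho_rel:
        syndromes_for[pathogenesis].append(syndrome)

    chains = []
    for info, pathogenesis in info_rel:
        for syndrome in syndromes_for.get(pathogenesis, []):
            chains.append((info, pathogenesis, syndrome))
    return chains


chains = reasoning_chains(info_to_pathogenesis, pathogenesis_to_syndrome)
for info, pathogenesis, syndrome in chains:
    print(f"{info} -> {pathogenesis} -> {syndrome}")
```

Chains of this form expose each intermediate step, so a model's output can be compared against the annotation at the pathogenesis level as well as at the final syndrome.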

Data evaluation

After the data annotation process was completed, a quality assessment was performed on the 300 medical records used in this study. Each medical record was thoroughly annotated to ensure the completeness and accuracy of the case information. Additionally, to reduce potential biases introduced by incomplete information, all data records were required to contain no missing values. Finally, to maintain the representativeness of the sample, records of rare diseases were excluded. The final statistics of all TCM medical records, classified according to the ICD-11 for Mortality and Morbidity Statistics, are shown in Table 5. All 300 annotated medical records satisfied the aforementioned selection criteria.

Table 5 Statistics of 300 TCM medical records classified by disease according to the ICD-11.
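The no-missing-values requirement can be checked mechanically. The sketch below assumes each record is a dictionary with one field per step of the syndrome diagnosis process; these field names are hypothetical and are not the dataset's actual schema.

```python
from __future__ import annotations

# Hypothetical field names, one per step of the diagnosis process.
REQUIRED_FIELDS = (
    "clinical_information",
    "tcm_pathogenesis",
    "tcm_syndrome",
    "explanatory_summary",
)


def missing_fields(record: dict) -> list[str]:
    """List required fields that are absent, None, or empty."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


def all_complete(records: list[dict]) -> bool:
    """True when no record has a missing required field."""
    return all(not missing_fields(record) for record in records)


complete = {
    "clinical_information": ["belching"],
    "tcm_pathogenesis": ["stomach qi upward"],
    "tcm_syndrome": ["liver and stomach disharmony"],
    "explanatory_summary": "summary text",
}
incomplete = {**complete, "tcm_pathogenesis": []}

print(missing_fields(complete), missing_fields(incomplete))
```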
