Data set of coronary heart disease in TCM
In this paper, a heart system inquiry diagnosis scale [14] is designed, in which the symptoms are clearly defined and the detailed collection methods are listed. The scale is shown in [Additional file 1].
Inclusion criteria are: 1) patients who meet the diagnostic criteria of CHD; 2) patients who have given informed consent.
The diagnosis criteria comprise both Western medicine and TCM criteria. The Western medicine criteria refer to the "Naming and diagnosis criteria of ischemic heart disease" issued by the International Society of Cardiology and the Joint Subject Team on standardization of clinical naming of the World Health Organization [15]: CHD is defined as a stenosis greater than 50% in at least one of the 16 segments of the 3 major coronary arteries and their branches on coronary angiography. The TCM criteria follow the "Differentiation standards for symptoms and signs of Coronary Heart Disease and Angina Pectoris in Traditional Chinese Medicine" in the "Standards for differentiation of chest pain, chest distress, palpitation, short breath or debilitation for coronary heart disease in Traditional Chinese Medicine" revised by the China Society of Integrated Traditional Chinese and Western Medicine in 1990, the "Guideline for Clinical study of new drugs in Chinese herbs", and the standards in textbooks [16, 17]. The final diagnosis criteria are established after discussion with experts in cardiology.
Exclusion criteria are: 1) patients with mental diseases or other severe diseases; 2) patients who cannot express their feelings clearly; 3) patients who refuse to participate in the study or have not given informed consent.
The patients with coronary heart disease are recruited from the Cardiology Departments of Longhua Hospital Affiliated to Shanghai University of Traditional Chinese Medicine, Shuguang Hospital Affiliated to Shanghai University of Traditional Chinese Medicine, Shanghai Renji Hospital and Shanghai Hospital of Chinese Medicine. Cases with incomplete information or not conforming to the diagnosis criteria of CHD are removed. This work has been approved by the Shanghai Society of Medical Ethics, and all patients have signed the informed consent form. Finally, a total of 555 cases are obtained in the study.
Three senior chief TCM physicians diagnose the 555 cases individually by referring to the diagnosis criteria established in the study. A diagnosis is recorded when at least 2 physicians give consistent results; an inconsistent case is not recorded until a consistent result is reached after discussion with the other experts.
Among the 555 patients, 265 are male (47.7%, mean age 65.15 ± 13.17 years) and 290 are female (52.3%, mean age 65.24 ± 13.82 years). The symptoms collected for inquiry diagnosis cover 8 dimensions (cold or warm; sweating; head, body, chest and abdomen; urine and stool; appetite; sleeping; mood; and gynaecology), a total of 125 symptoms. There are 15 syndromes in differentiation diagnosis, of which 6 commonly used patterns are selected in our study: z1 deficiency of heart qi syndrome; z2 deficiency of heart yang syndrome; z3 deficiency of heart yin syndrome; z4 qi stagnation syndrome; z5 turbid phlegm syndrome; and z6 blood stasis syndrome.
The data set is shown in [Additional file 2].
Computational methods
In our study, we construct models of the relationship between symptoms and syndromes of inquiry diagnosis by means of the multi-label k-nearest neighbour (ML-kNN) algorithm. Then the results are compared with those calculated by the classical k-nearest neighbour (kNN) algorithm.
kNN is an algorithm whose idea is to search for the nearest points in the training data set [18]. It regards an instance as a point in feature space; thus, the label of a test instance is likely to be similar to the labels of its nearest points. Based on this idea, kNN searches for the k training instances nearest to the test instance and then predicts the label of the test instance from their labels. Compared with other mining algorithms, the advantages of kNN lie in its simpler training process, better efficiency and good forecast accuracy.
In this paper, the data set for differentiation diagnosis of CHD is multi-label, whereas kNN only processes single-label data sets, so the collected data set must be split into several single-label groups before calculation. The modified algorithm is shown below.
Step 1: Split the data set with n labels into n single-label data sets.
Step 2: Presuming there are p samples in the test set, for each test instance xi, calculate the distances from xi to all training instances and find the k training instances with the smallest distances.
Step 3: According to the labels of these k instances, forecast the label of each test instance xi (in this paper, the result is judged positive if more than k/2 of the k neighbours carry the label, otherwise negative). The forecast results of the p test instances are thus obtained.
Step 4: Combine the n groups of results to obtain the multi-label forecast result and assess it according to multi-label evaluation criteria.
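The four steps above can be sketched as follows. The study's own code is implemented in MATLAB (Additional file 3); this is an illustrative Python/NumPy re-implementation only, with Euclidean distance and the function and variable names being our assumptions:

```python
import numpy as np

def knn_binary_relevance(X_train, Y_train, X_test, k=3):
    """Steps 1-4: treat an n-label problem as n single-label kNN problems.
    X_train: (m, d) features; Y_train: (m, n) binary labels; X_test: (p, d).
    Returns a (p, n) binary prediction matrix."""
    X_train = np.asarray(X_train, dtype=float)
    Y_train = np.asarray(Y_train, dtype=int)
    X_test = np.asarray(X_test, dtype=float)
    preds = np.zeros((X_test.shape[0], Y_train.shape[1]), dtype=int)
    for i, x in enumerate(X_test):
        # Step 2: Euclidean distance from the test instance to all training instances
        dists = np.linalg.norm(X_train - x, axis=1)
        nn = np.argsort(dists)[:k]          # the k nearest training instances
        # Step 3: per label, positive if more than k/2 of the neighbours carry it
        votes = Y_train[nn].sum(axis=0)
        preds[i] = (votes > k / 2).astype(int)
    # Step 4: the n per-label columns together form the multi-label forecast
    return preds
```

Because each label column is voted on independently, this sketch makes explicit how the split-and-combine scheme ignores any correlation between labels.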
In clinical practice, there are many multi-label problems similar to the modelling of inquiry diagnosis symptoms in our paper. If a single-label algorithm is used, the multiple labels are usually divided and then calculated separately. In multi-label data there are correlations among the labels, so simple splitting inevitably results in information loss. For this reason, multi-label learning algorithms have been developed to better exploit the correlations among labels, of which multi-label kNN (ML-kNN) is a popular technique [19, 20]. ML-kNN is a lazy multi-label learning algorithm developed on the basis of kNN. Like kNN, ML-kNN finds the k nearest instances for each test instance; unlike kNN, it judges the labels of a test instance from probabilities estimated over these nearest instances rather than by direct voting. The algorithm is shown below.
Step 1: From the training instances, calculate the conditional probability distributions associated with each label;
Step 2: Calculate the distances between the test instance xi and the training instances, then find the k nearest instances for xi; repeat for each test instance.
Step 3: According to the labels of the k training instances and the conditional probabilities associated with each label, forecast the probability of each label for instance xi and obtain the forecast results (here a threshold of ≥ 0.5 is taken); repeat for each test instance.
Step 4: Evaluate the forecast results according to multi-label evaluation criteria.
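Steps 1-3 can be sketched as follows. Again, the paper's own implementation is in MATLAB (Additional file 3); this is an illustrative Python version of the standard ML-kNN posterior estimation, where the Laplace smoothing constant s = 1 and the Euclidean distance are our assumptions:

```python
import numpy as np

def mlknn_fit_predict(X_train, Y_train, X_test, k=3, s=1.0):
    """Illustrative ML-kNN: estimate per-label priors and conditional
    probabilities of neighbour label counts on the training set (Step 1),
    then score each test instance by the Bayes posterior (Steps 2-3)."""
    X_train = np.asarray(X_train, dtype=float)
    Y_train = np.asarray(Y_train, dtype=int)
    X_test = np.asarray(X_test, dtype=float)
    m, n_labels = Y_train.shape

    # Step 1: priors P(label present) and count tables c1/c0, where
    # c1[l, j] = number of training instances carrying label l whose own
    # k nearest neighbours include exactly j instances with label l
    prior1 = (s + Y_train.sum(axis=0)) / (2 * s + m)
    prior0 = 1.0 - prior1
    c1 = np.zeros((n_labels, k + 1))
    c0 = np.zeros((n_labels, k + 1))
    for i in range(m):
        d = np.linalg.norm(X_train - X_train[i], axis=1)
        d[i] = np.inf                      # exclude the instance itself
        counts = Y_train[np.argsort(d)[:k]].sum(axis=0)
        for l in range(n_labels):
            if Y_train[i, l]:
                c1[l, counts[l]] += 1
            else:
                c0[l, counts[l]] += 1
    cond1 = (s + c1) / (s * (k + 1) + c1.sum(axis=1, keepdims=True))
    cond0 = (s + c0) / (s * (k + 1) + c0.sum(axis=1, keepdims=True))

    # Steps 2-3: posterior probability of each label for each test instance
    probs = np.zeros((X_test.shape[0], n_labels))
    for i, x in enumerate(X_test):
        nn = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
        counts = Y_train[nn].sum(axis=0)
        for l in range(n_labels):
            p1 = prior1[l] * cond1[l, counts[l]]
            p0 = prior0[l] * cond0[l, counts[l]]
            probs[i, l] = p1 / (p1 + p0)
    return (probs >= 0.5).astype(int), probs   # threshold at 0.5 as in Step 3
```

In contrast to the split-and-vote scheme, the conditional probability tables are shared statistics of the whole training set, which is how ML-kNN incorporates label frequency information into each prediction.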
The codes of both the ML-kNN and kNN algorithms are implemented on the MATLAB platform and are shown in [Additional file 3].
To make the computational results more convincing, the results obtained by ML-kNN and kNN are compared with those of two other multi-label learning algorithms, RankSVM [21] and BPMLL [22]. We use the default and empirically chosen parameter values for RankSVM and BPMLL. For RankSVM, the number of hidden neurons is 8 and the maximum number of training epochs is 6. For BPMLL, the maximum number of iterations is 10. Parameters not mentioned are set to their default values [21, 22].
Experimental design and evaluation
In the CHD data set, 90% of the samples are randomly selected as the training set and the remaining 10% as the test set. The forecast analysis of the models for syndromes of inquiry diagnosis in TCM is performed by repeating this split 50 times and taking the mean value. Different values of k, chosen from {1, 3, 5, 7, 9, 11}, are then evaluated for their influence on kNN and ML-kNN. According to the frequency of the symptoms, symptom features are removed as follows: the symptoms with frequencies of {≤ 10, ≤ 20, ≤ 40, ≤ 70, ≤ 100, ≤ 150, ≤ 200 and ≤ 400} are removed in turn, and symptom subsets with 150, 106, 83, 64, 52, 32 and 21 symptoms are obtained. Models of inquiry diagnosis are constructed on the basis of these symptom subsets, and the influence of symptom selection on the forecast model of inquiry diagnosis is investigated.
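The repeated 90/10 hold-out procedure can be sketched as the following evaluation harness (illustrative Python only; the experiments themselves are run in MATLAB, and `fit_predict` is a hypothetical stand-in for any of the classifiers, while exact-match accuracy is merely an example score, not one of the paper's criteria):

```python
import numpy as np

def repeated_holdout(X, Y, fit_predict, n_repeats=50, test_frac=0.1, seed=0):
    """Randomly hold out test_frac of the samples as the test set,
    repeat n_repeats times, and return the mean score."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=int)
    m = X.shape[0]
    n_test = max(1, int(round(test_frac * m)))
    scores = []
    for _ in range(n_repeats):
        perm = rng.permutation(m)           # fresh random split each repeat
        test, train = perm[:n_test], perm[n_test:]
        pred = fit_predict(X[train], Y[train], X[test])
        # example score: exact-match accuracy over the whole label vector
        scores.append(float((pred == Y[test]).all(axis=1).mean()))
    return float(np.mean(scores))
```

Averaging over 50 random splits, as described above, reduces the variance caused by any single unlucky partition of the 555 cases.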
Let X denote the domain of instances and let Y = {1, 2, ⋯, Q} be the finite set of labels. Given a training set T = {(x1, Y1), (x2, Y2), ⋯, (xm, Ym)} (xi ∈ X, Yi ⊆ Y), the goal of the learning system is to output a multi-label classifier h: X → 2^Y which optimizes some specific evaluation metric. It is supposed that, given an instance xi and its associated label set Yi, a successful learning system will tend to output larger values for labels in Yi than for those not in Yi, i.e. f(xi, y1) > f(xi, y2) for any y1 ∈ Yi and y2 ∉ Yi.
The real-valued function f(·,·) can be transformed to a ranking function rankf(·,·), which maps the output of f(xi, y) for any y ∈ Y to {1,2,⋯,Q} such that if f(xi, y1) > f(xi, y2) then rankf(xi, y1) < rankf(xi, y2).
To evaluate the forecast results of ML-kNN in comparison with traditional kNN, the following criteria [19] are used, where p denotes the number of test instances.

Average precision evaluates the average fraction of labels ranked above a particular label y ∈ Yi which are actually in Yi:

avgprec(f) = (1/p) Σ(i=1..p) (1/|Yi|) Σ(y∈Yi) |{y' ∈ Yi : rankf(xi, y') ≤ rankf(xi, y)}| / rankf(xi, y)

The performance is perfect when it is 1; the bigger the value of average precision, the better the performance of the classifier.

Coverage evaluates how far, on average, one needs to go down the ranked label list to cover all the true labels of an instance:

coverage(f) = (1/p) Σ(i=1..p) max(y∈Yi) rankf(xi, y) − 1

It is loosely related to precision at the level of perfect recall. The smaller the value of coverage, the better the performance.

Ranking loss evaluates the average fraction of label pairs that are ordered reversely:

rloss(f) = (1/p) Σ(i=1..p) |{(y1, y2) ∈ Yi × Ȳi : f(xi, y1) ≤ f(xi, y2)}| / (|Yi| |Ȳi|)

where Ȳi denotes the complementary set of Yi in Y. The performance is perfect when it is 0; the smaller the value of ranking loss, the better the performance.
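The three criteria from [19] (average precision, coverage, ranking loss) can be computed from the real-valued outputs f(xi, y) as follows (an illustrative Python sketch, not the paper's MATLAB code; it assumes every test instance has at least one relevant label):

```python
import numpy as np

def ranking_metrics(scores, Y):
    """Average precision, coverage and ranking loss for real-valued label
    scores f(x, y) against binary ground truth Y (both p x Q arrays)."""
    scores = np.asarray(scores, dtype=float)
    Y = np.asarray(Y, dtype=bool)
    p, Q = Y.shape
    ap = cov = rl = 0.0
    for i in range(p):
        order = np.argsort(-scores[i])          # best-scored label first
        rank = np.empty(Q, dtype=int)
        rank[order] = np.arange(1, Q + 1)       # rankf(xi, y) for every y
        rel = np.flatnonzero(Y[i])              # labels in Yi
        irr = np.flatnonzero(~Y[i])             # labels in the complement of Yi
        # average precision: fraction of labels ranked at or above y that are in Yi
        ap += np.mean([np.sum(rank[rel] <= rank[y]) / rank[y] for y in rel])
        # coverage: depth needed in the ranking to cover all labels in Yi
        cov += rank[rel].max() - 1
        # ranking loss: fraction of (relevant, irrelevant) pairs ordered wrongly
        if len(irr):
            wrong = sum(int(scores[i, y1] <= scores[i, y2])
                        for y1 in rel for y2 in irr)
            rl += wrong / (len(rel) * len(irr))
    return ap / p, cov / p, rl / p
```

For a single instance with scores (0.9, 0.1, 0.8) and true labels {1, 3}, both true labels outrank the false one, giving average precision 1, coverage 1 and ranking loss 0.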