Here we describe the developed system step by step. First, we briefly describe the data sources and the image preprocessing techniques; we then propose an SVM framework for lip image classification in TCM that integrates color, texture and moment features into a pattern classification method. Before training the model, the lip data are first normalized to zero mean and unit variance. A feature selection method is then used to select a small set of effective features for classification, in order to improve the generalization ability and the performance of the classifier. The data and the methodology are described in more detail below.
Data description and preprocessing
We established a compatible face image acquisition system, which consists of an LED light source (color temperature about 5600 K, Ra = 90) and a camera as the photographing medium; its size is 36 cm × 40 cm × 28 cm. The other photographing conditions include a distance of 33 cm between the camera and the patient's face, Tv (1/15 s), Av (5.6), ISO (80), custom white balance mode, and horizontal photography. The size of the photographing window is 220 mm × 170 mm, as shown in Figure 2. The patients were recruited from Longhua Hospital Affiliated to Shanghai University of Traditional Chinese Medicine, Shuguang Hospital Affiliated to Shanghai University of Traditional Chinese Medicine, and Shanghai Renji Hospital. This study was approved by the Shanghai society of medical ethics, and all patients signed an informed consent form. Cases with incomplete information were removed. Three senior chief TCM physicians performed lip diagnosis individually for each patient, and only cases with consistent results between two physicians were recorded. The physicians performed lip diagnosis using the clinical classification scale for facial diagnosis in TCM (see Additional file 1). Finally, a total of 257 cases were obtained for the study. Among the 257 patients, 132 are female (51.4%) and 125 are male (48.6%), with a mean age of 55.39 ± 14.28 years. In this article, we extract color, texture and moment features from the 257 lip images. The data comprise 90 Deep-red, 12 Pale, 62 Purple and 93 Red images.
Lip image segmentation
Accurate lip region segmentation remains a challenging problem because of the weak color contrast between the lip and non-lip regions. To address this, we developed a hybrid approach that automatically segments the lip region with high precision based on the Chan-Vese level set model and the Otsu method, aiming to improve the accuracy of automatic lip diagnosis classification in TCM [10]. First, a mean-shift filter is used to smooth the lip images; unlike other filters, mean-shift smooths only regions of homogeneous color and preserves regions where the color changes. Second, a color space transformation suited to our lip segmentation task is selected from several existing approaches. Finally, a level set segmentation approach is used to obtain an accurate lip region. An example of lip image segmentation is shown in Figure 3. Experimental results on five hundred lip images show that the hybrid approach produces more accurate segmentations [11].
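As an illustration of this pipeline (a minimal sketch, not the original implementation; the file name, smoothing radii and Chan-Vese parameters are assumed values), the steps can be chained with OpenCV and scikit-image as follows:

```python
import cv2
import numpy as np
from skimage.segmentation import chan_vese

# Load the lip image (the file name is a placeholder).
bgr = cv2.imread("lip.jpg")

# 1) Mean-shift filtering smooths homogeneous color regions while
#    preserving color edges (the spatial/color radii are assumed values).
smoothed = cv2.pyrMeanShiftFiltering(bgr, sp=10, sr=20)

# 2) A simple color transformation; the paper selects a transform suited
#    to lip/skin contrast, here we just use a grayscale projection.
gray = cv2.cvtColor(smoothed, cv2.COLOR_BGR2GRAY)

# 3) Otsu thresholding gives a coarse lip/non-lip partition ...
_, coarse = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# ... which initializes a Chan-Vese level set that refines it into a
# smooth region (the mask polarity may need to be inverted per image).
init = np.where(coarse > 0, 1.0, -1.0)
mask = chan_vese(gray.astype(float) / 255.0, mu=0.25, init_level_set=init)

lip_only = cv2.bitwise_and(bgr, bgr, mask=mask.astype(np.uint8) * 255)
cv2.imwrite("lip_segmented.png", lip_only)
```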
Extraction of images features
We chose a large number of features (84) for investigation, comprising color, moment and texture features (see Additional file 2), as explained next.
Color features
A color space is a mathematical representation of color, and colors can be quantified in different ways. In the HSI color space, a color is represented by a three-dimensional vector corresponding to a position: the H component represents hue, the S component saturation, and the I component intensity. In the RGB color space, a color is obtained by mixing the three basic colors Red, Green and Blue in different proportions, where R, G and B each range from 0 to 255. In the YCbCr color space, Y represents luminance, while Cb and Cr carry the color information. In the YIQ color space, Y represents the gray information, while I and Q carry the color information.
The lip color is observed by the TCM practitioner with the naked eye, and is mainly classified into Deep-red, Red, Purple and Pale, as shown in Figure 4.
Intuitively, the lip colors may be classifiable in the RGB, HSI, YCbCr and YIQ color spaces. For instance, the colors "red" and "pale" are separable by their degree of intensity, and the colors "purple" and "deep red" are separable by their degree of hue. We therefore extract features from several different color spaces.
Although the color histogram technique is a very simple, low-level method, it has shown good performance in practice, especially for image indexing, retrieval and classification tasks, where feature extraction has to be as fast as possible.
The histogram features that we consider are statistics-based features. The extraction scheme is as follows: 1) Lip images have been segmented from the face images, and the lip area is determined by removing the background. The background and the lips are easy to distinguish because their colors are very different: the background is white, with a saturation value of 0.0, so in our experiment a pixel is assigned to the lip region if its saturation value is larger than 0.1. 2) The RGB representation of the image is converted to HSI, YCbCr and YIQ representations using standard image processing methods [12]; the Hue value is rescaled from 0-360 degrees to the interval from 0 to 1, so that all components of the HSI color space fall into the range 0 to 1. 3) The mean and variance of the R, G, B, H, S, I, Y, Cb, Cr and (YIQ) Y, I, Q components are calculated. After these steps, we obtain 24 color component quantification values from each lip image.
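A minimal sketch of this color feature extraction, assuming an already segmented RGB lip image on a white background (the file name is a placeholder); the HSI conversion below uses the standard formulas, while the YCbCr and YIQ conversions come from scikit-image:

```python
import numpy as np
from skimage import io
from skimage.color import rgb2ycbcr, rgb2yiq

def rgb_to_hsi(rgb):
    """Standard RGB -> HSI conversion with all components scaled to [0, 1]."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    i = (r + g + b) / 3.0
    s = 1.0 - np.min(rgb, axis=-1) / np.maximum(i, 1e-8)
    num = 0.5 * ((r - g) + (r - b))
    den = np.sqrt((r - g) ** 2 + (r - b) * (g - b)) + 1e-8
    theta = np.arccos(np.clip(num / den, -1.0, 1.0))
    h = np.where(b <= g, theta, 2.0 * np.pi - theta) / (2.0 * np.pi)  # hue in [0, 1]
    return np.stack([h, s, i], axis=-1)

def color_features(path):
    rgb = io.imread(path)[..., :3] / 255.0          # RGB scaled to [0, 1]
    hsi = rgb_to_hsi(rgb)
    ycbcr = rgb2ycbcr(rgb)
    yiq = rgb2yiq(rgb)
    # Keep only lip pixels: saturation > 0.1 (a white background has S = 0).
    mask = hsi[..., 1] > 0.1
    feats = []
    for space in (rgb, hsi, ycbcr, yiq):
        for c in range(3):
            vals = space[..., c][mask]
            feats += [vals.mean(), vals.var()]       # mean and variance
    return np.array(feats)                           # 24 color features

# Example: color_features("lip.png")
```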
Haralick co-occurrence features
As a traditional image feature extraction technique, the Haralick co-occurrence features use the co-occurrence distribution of the gray-level image to generate a texture signature. Roughly speaking, given an offset on the image, the co-occurrence distribution is the distribution of co-occurring pixel intensity values. In our method, the Haralick co-occurrence features were extracted from each of the gray level spatial-dependence matrices [13]. The 13 extracted co-occurrence features are: angular second moment, contrast, correlation, sum of squares, inverse difference moment, sum average, sum variance, sum entropy, entropy, difference variance, difference entropy, information measures of correlation, and maximal correlation coefficient [14].
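As a sketch of this step, the mahotas library provides a gray-level co-occurrence (Haralick) implementation; note that its default set of 13 statistics may differ slightly from the list above, and averaging over the four standard offsets is an assumption here rather than the paper's stated choice:

```python
import mahotas
import numpy as np
from skimage import io
from skimage.color import rgb2gray
from skimage.util import img_as_ubyte

def haralick_features(path):
    gray = img_as_ubyte(rgb2gray(io.imread(path)[..., :3]))   # 8-bit gray image
    # mahotas returns a 4 x 13 matrix (13 Haralick statistics for each of the
    # four co-occurrence directions); averaging over directions gives a
    # rotation-tolerant 13-dimensional texture signature.
    return mahotas.features.haralick(gray).mean(axis=0)

# Example: texture = haralick_features("lip.png")   # 13 values
```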
Zernike moment features
Simply speaking, the Zernike moment features of an image are calculated from particular weighted averages of its intensity values. They are generated with the basis functions of the Zernike polynomials. As classical image features, Zernike moments have wide applications [15]. Here, we give a brief description of how the Zernike moment features are calculated for each lip image. First, calculate the center of mass of each lip image and redefine the lip pixels relative to this center. Second, compute the radius of each lip, and define the average of the radii as R. Third, map each pixel $(x, y)$ of the lip image to the unit circle and obtain the projected pixel $(\rho, \theta)$. Since the Zernike polynomials are defined over a circle of radius 1, only the pixels within the unit circle are used to calculate the Zernike moments. Let $f(x, y)$ denote the intensity of the pixel. The Zernike moment of order $n$ with repetition $m$ for each lip image is defined as:

$$Z_{nm} = \frac{n+1}{\pi} \sum_{x}\sum_{y} f(x, y)\, V_{nm}^{*}(\rho, \theta), \qquad x^{2}+y^{2} \le 1,$$

where $V_{nm}(\rho, \theta) = R_{nm}(\rho)\, e^{jm\theta}$, $n \ge 0$, $|m| \le n$, $n - |m|$ is even, and the radial polynomial $R_{nm}(\rho)$ can be calculated as:

$$R_{nm}(\rho) = \sum_{s=0}^{(n-|m|)/2} \frac{(-1)^{s}\,(n-s)!}{s!\left(\frac{n+|m|}{2}-s\right)!\left(\frac{n-|m|}{2}-s\right)!}\; \rho^{\,n-2s}.$$
Finally, 47 moment features are obtained from each lip image.
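A hedged sketch of this computation using mahotas' Zernike implementation; the polynomial degree and the use of the mean radius are assumptions, and the exact configuration that yields the paper's 47 features is not reproduced here:

```python
import mahotas
import numpy as np
from skimage import io
from skimage.color import rgb2gray

def zernike_features(path, degree=12):
    gray = rgb2gray(io.imread(path)[..., :3])
    # Center of mass of the lip pixels defines the origin of the unit circle.
    ys, xs = np.nonzero(gray > 0)
    cy, cx = ys.mean(), xs.mean()
    # Average distance of the lip pixels from the center is used as the radius R.
    radius = np.sqrt((ys - cy) ** 2 + (xs - cx) ** 2).mean()
    # mahotas maps pixels inside the circle of radius R onto the unit disc and
    # returns the magnitudes of the Zernike moments up to the given degree.
    return mahotas.features.zernike_moments(gray, radius, degree=degree, cm=(cy, cx))

# Example: moments = zernike_features("lip.png")
```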
Feature selection
SVM-RFE
First, the number of features is reduced by eliminating the less relevant features using a forward selection method based on a ranking criterion; then backward feature elimination is applied using a feature subset selection method, as explained next.
Feature subset selection method based on SVM
The support vector machine recursive feature elimination (SVM-RFE) algorithm [16] is applied to find a subset of features that optimizes the performance of the classifier. The algorithm ranks the features using a backward sequential selection method that removes one feature at a time; at each step, the feature removed is the one whose elimination causes the smallest variation of the SVM-based leave-one-out error bound, compared with removing any other feature.
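As an illustration, scikit-learn's RFE with a linear SVM performs the same kind of backward elimination; note that its default criterion ranks features by the magnitude of the SVM weights rather than by the leave-one-out error bound, and the data and feature counts below are placeholders:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler

# Placeholder data: X is a (257, 84) feature matrix, y holds the lip color labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(257, 84))
y = rng.integers(0, 4, size=257)

# Features are normalized to zero mean and unit variance before training.
X = StandardScaler().fit_transform(X)

# Backward elimination: drop one feature at a time, ranked by the weights
# of a linear SVM refitted after each removal.
selector = RFE(SVC(kernel="linear", C=1.0), n_features_to_select=20, step=1)
selector.fit(X, y)

ranking = selector.ranking_          # 1 = retained, larger = eliminated earlier
selected = np.where(selector.support_)[0]
```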
To remove irrelevant features and improve the performance of the learning system, a prediction risk-based feature selection method is employed to choose sub-optimal feature sets [16]. This method uses an embedded feature selection criterion, the prediction risk, which evaluates each feature by calculating the change in the output when the corresponding feature is replaced by its average value. It has several advantages. (1) The embedded feature selection model depends on the learning machine; it can reach higher accuracy than the filter model while keeping lower computational complexity than the wrapper model. (2) The prediction risk criterion has been employed with several different learning machines and outperformed Optimal Brain Damage when tested with a multi-class SVM on more than 10 University of California Irvine (UCI) datasets [17].
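A minimal sketch of the prediction risk idea as described above: each feature column is replaced in turn by its mean value, and the resulting change in the fitted classifier's accuracy is used as that feature's score (using accuracy as the risk measure is an assumption of this sketch):

```python
import numpy as np
from sklearn.svm import SVC

def prediction_risk(clf, X, y):
    """Score each feature by how much replacing it with its mean value
    changes the fitted classifier's accuracy."""
    base_acc = clf.score(X, y)
    risks = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        X_mean = X.copy()
        X_mean[:, j] = X[:, j].mean()       # "remove" feature j
        risks[j] = base_acc - clf.score(X_mean, y)
    return risks                             # larger = more important feature

# Example with the feature matrix X and labels y from the earlier sketch:
# risks = prediction_risk(SVC(kernel="linear").fit(X, y), X, y)
```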
mRMR
Most filter-type methods simply select the top-ranked features without considering the relationships among features. The minimum Redundancy Maximum Relevance (mRMR) algorithm [18] is therefore also applied: it not only maximizes the dependency between the selected features and the class labels, but also eliminates redundant features. Its scoring functions are all based on mutual information.
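A simple greedy sketch of mRMR in its mutual-information-difference form (the scikit-learn MI estimators used here are a substitute for the original implementation, and k = 20 is a placeholder; X and y are the feature matrix and labels from the earlier sketch):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr(X, y, k=20):
    """Greedy mRMR: at each step pick the feature that maximizes
    relevance to the class minus mean redundancy with selected features."""
    relevance = mutual_info_classif(X, y, random_state=0)   # MI(feature; class)
    selected = [int(np.argmax(relevance))]
    remaining = [j for j in range(X.shape[1]) if j not in selected]
    while len(selected) < k:
        scores = []
        for j in remaining:
            # Redundancy: average MI between candidate j and the selected features.
            red = np.mean([mutual_info_regression(X[:, [j]], X[:, s], random_state=0)[0]
                           for s in selected])
            scores.append(relevance[j] - red)
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

# Example: top20 = mrmr(X, y, k=20)
```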
IG
Information gain (IG), a term often used synonymously with the Kullback–Leibler divergence, is frequently employed in machine learning, especially in text categorization [19]. It measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a document. For feature selection with the IG algorithm, the data must first be discretized.
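A short sketch of IG-based ranking under these assumptions: features are discretized into equal-width bins (the bin count is a placeholder), and the information gain of each feature is computed as its mutual information with the class label:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.metrics import mutual_info_score

def information_gain(X, y, n_bins=10):
    """IG per feature: discretize, then compute I(feature; class) in bits,
    which equals H(class) - H(class | feature)."""
    Xd = KBinsDiscretizer(n_bins=n_bins, encode="ordinal",
                          strategy="uniform").fit_transform(X)
    # mutual_info_score uses the natural log; divide by log(2) to get bits.
    return np.array([mutual_info_score(Xd[:, j], y) / np.log(2)
                     for j in range(X.shape[1])])

# Example: ig_scores = information_gain(X, y); ranking = np.argsort(-ig_scores)
```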
Classification is performed using 50 repetitions of 10-fold cross-validation on the training samples. Feature ranking based on leave-one-out is performed on the data, and the feature selection method is applied within each training subset in order to correct for selection bias [20]. It is important that cross-validation is external to the feature selection process so that the prediction error is estimated more accurately. We combine the rankings of all leave-one-out experiments and report the total rank of the features (in Figure 5 and Additional file 3).
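One way to keep the feature selection internal to each training fold (a scikit-learn sketch, not the original MATLAB code) is to wrap the selector and the classifier in a single pipeline, so that the repeated 10-fold cross-validation re-runs the selection on every training subset:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE
from sklearn.svm import SVC
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# X, y: feature matrix and labels as in the earlier sketches.
# The selector is refitted inside every training fold, which avoids the
# selection bias of choosing features on the full data set first.
model = make_pipeline(
    StandardScaler(),
    RFE(SVC(kernel="linear"), n_features_to_select=20),
    SVC(kernel="poly", degree=2, C=1.0),
)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=50, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print(scores.mean(), scores.std())
```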
Classification
In recent years, many novel algorithms have been proposed or improved upon typical classifiers. For example, the literature [21] proposed a fast SVM-based multi-class classifier and showed it to be more effective and faster than typical approaches such as one-versus-all and one-versus-one. To improve the performance of kNN, the k-NS classifier [22] was designed based on the distance from a query sample to the nearest subspace spanned by the k nearest samples of the same class. Although these state-of-the-art classifiers are attractive, we still choose the typical SVM one-versus-one classifier and kNN for the following reasons: (1) our lip image dataset is small, so running time is not a main concern; (2) most TCM frameworks adopt these typical classifiers while extracting different types of features [8], so the algorithms in our framework should be compared on similar classifiers with different features, which supports our argument that the three extracted feature types are the more effective component. Nevertheless, it will be important to consider these state-of-the-art methods in future work, which may evolve into a large-scale problem and require a novel framework more concerned with real-time performance.
SVM
Classification is performed by starting with the most discriminative features and gradually adding less discriminative features, until the classification performance no longer improves. A Support Vector Machine [23, 24] with a polynomial kernel is used as the classifier. The input vectors are mapped into a feature space, either linearly or non-linearly, depending on the choice of the kernel function. Within the feature space, an optimized linear division is then sought; i.e. a hyperplane is constructed that separates the samples into two classes with the fewest errors and the maximal margin. The SVM training process always seeks a globally optimized solution and avoids over-fitting, so it can handle a large number of features.
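A sketch of this forward addition over a precomputed feature ranking (the ranking variable, the polynomial degree and the stopping patience are assumptions):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def grow_feature_set(X, y, ranked_features, patience=5):
    """Add features from most to least discriminative; stop when 10-fold
    accuracy has not improved for `patience` consecutive additions."""
    best_acc, best_k = 0.0, 0
    for k in range(1, len(ranked_features) + 1):
        cols = ranked_features[:k]
        acc = cross_val_score(SVC(kernel="poly", degree=2, C=1.0),
                              X[:, cols], y, cv=10).mean()
        if acc > best_acc:
            best_acc, best_k = acc, k
        elif k - best_k >= patience:
            break
    return ranked_features[:best_k], best_acc

# Example, using the ranking from the RFE sketch above:
# ordered = np.argsort(selector.ranking_)
# feats, acc = grow_feature_set(X, y, ordered)
```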
Weighted-SVM (WSVM)
Since the data are highly unbalanced and the sample size is too small to produce balanced classes by sub-sampling the largest class, we used a weighted SVM [25–27] to apply a larger penalty to the class with the smaller number of samples. If the penalty parameter is not weighted (equal C for both classes), there is an undesirable bias towards the class with the larger training size; we therefore set the ratio of the penalties for the different classes to the inverse ratio of the training class sizes.
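In scikit-learn terms (a sketch rather than the paper's LIBSVM setup), this corresponds to per-class penalty weights inversely proportional to the class sizes, which the class_weight="balanced" option computes automatically:

```python
import numpy as np
from sklearn.svm import SVC

# X, y: feature matrix and labels as in the earlier sketches.
# Per-class penalty C_k = C * w_k with w_k proportional to 1 / n_k.
classes, counts = np.unique(y, return_counts=True)
weights = {c: len(y) / (len(classes) * n) for c, n in zip(classes, counts)}

wsvm = SVC(kernel="poly", degree=2, C=1.0, class_weight=weights)
# Equivalent shortcut: SVC(kernel="poly", degree=2, C=1.0, class_weight="balanced")
wsvm.fit(X, y)
```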
Multi-class SVM
The multi-class problem is solved by constructing several binary SVM classifiers and combining them in a voting scheme: we apply majority voting over the one-versus-one classification problems. The predictive ability of the classification scheme is assessed by 10-fold cross-validation.
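A sketch of this one-versus-one voting scheme and its 10-fold evaluation (scikit-learn's SVC already votes over all pairwise classifiers internally; the explicit wrapper is shown only to make the scheme visible):

```python
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

# X, y: feature matrix and labels as in the earlier sketches.
# One binary SVM per pair of lip color classes; predictions are combined
# by majority voting over the pairwise decisions.
ovo = OneVsOneClassifier(SVC(kernel="poly", degree=2, C=1.0, class_weight="balanced"))

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
acc = cross_val_score(ovo, X, y, cv=cv).mean()
```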
MAPLSC
The Multiple Asymmetric Partial Least Squares Classifier (MAPLSC) [28] is an extension of the APLSC classifier to the multi-class problem; APLSC is an asymmetric PLS classifier designed specifically for imbalanced class distributions. MAPLSC adopts the pairwise coupling strategy, combining the probabilistic outputs of all one-versus-one binary classifiers to obtain estimates of the posterior probabilities for all candidate classes.
As a performance comparison, we also consider the Naïve Bayes classifier and the kNN classifier.
A classifier can provide a criterion to evaluate the discriminative power of the features for the feature subset selection. A kNN classifier is chosen for its simplicity and flexibility. Each sample is represented as a vector in an n-dimensional feature space, and the distance between two samples is defined by the Euclidean distance. A training set is used to determine the class of a previously unseen sample: the classifier computes the distances between that sample and all samples in the training set, selects the K training samples closest to it, and assigns the class label that is most common among these K nearest neighbors. The Naïve Bayes classifier is a simple probabilistic classifier based on Bayes' rule that assumes the feature variables are independent of one another. Despite this conditional independence assumption, Naïve Bayes classifiers achieve good classification performance on many complex real-world datasets.
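For completeness, a short sketch of these two baselines evaluated with the same cross-validation protocol (the value K = 5 is an assumption; X and y are as in the earlier sketches):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, clf in [("kNN", KNeighborsClassifier(n_neighbors=5, metric="euclidean")),
                  ("Naive Bayes", GaussianNB())]:
    acc = cross_val_score(clf, X, y, cv=cv).mean()
    print(f"{name}: {acc:.3f}")
```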
All the above classifiers are implemented with the MATLAB toolboxes SVM-KM [23] and LIBSVM [27]. They can also be implemented with other libraries, such as LIBLINEAR [29], an alternative for SVM classification that is especially appropriate for large-scale classification problems.