Wine classification

The Wine Dataset

    
     The wine recognition dataset is hosted at the UCI Repository of Machine Learning Databases. It was created and donated by Stefan Aeberhard in 1991. S. Aeberhard et al. (1992) used it to compare classifiers in high-dimensional settings, and it has since been used, along with many other datasets, to compare various classifiers.


     
     
     The wine data are derived from a chemical analysis of wines grown in Italy from three different types of grapes [8]. The analysis measured the quantities of 13 chemical constituents found in each of the three types of wine. The dataset contains 178 samples in three classes: 59 instances in class 1, 71 in class 2, and 48 in class 3, so the classes are not evenly balanced. Each sample has 13 continuous attributes: Alcohol, Malic acid, Ash, Alcalinity of ash, Magnesium, Total phenols, Flavanoids, Nonflavanoid phenols, Proanthocyanins, Color intensity, Hue, OD280/OD315 of diluted wines, and Proline. There are no missing attribute values in this dataset [9]. Every sample is labelled as class 1, class 2 or class 3. The table below gives three scaled samples, one per class: each column holds the thirteen scaled attribute values of one sample, so the sample in the first column belongs to class 1. Class 1 is represented by α, class 2 by β, and class 3 by γ [2][3].
Table 1: Scaled attribute values of three samples, one from each class

Attribute                       Class (type of wine)
                                α (class 1)   β (class 2)   γ (class 3)
Alcohol                          0.678948     -0.152631     -0.0368422
Malic acid                       0.284585     -0.754941     -0.758893
Ash                              0.229946     -0.294118      0.0267379
Alcalinity of ash               -0.731959     -0.360825     -0.237113
Magnesium                        0.26087      -0.347826      0.130435
Total phenols                    0.393103     -0.282759     -0.634483
Flavanoids                       0.139241     -0.548523     -0.616034
Nonflavanoid phenols             0.735849      0.509434     -0.698113
Proanthocyanins                  0.0536277    -0.867508     -0.665615
Color intensity                  0.348123     -0.237201     -0.518771
Hue                             -0.333333     -0.186992     -0.544715
OD280/OD315 of diluted wines     0.655678     -0.765568     -0.985348
Proline                         -0.312411     -0.754636     -0.49786


Table 1 above shows the scaled attribute values of three samples, one from each class. To distinguish wine in class 1 from wine in class 2, the feature mask table identifies the features that matter for telling the classes apart.
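The values in Table 1 all lie between -1 and 1, which is consistent with a linear min-max scaling of each attribute (the kind of scaling that LIBSVM's svm-scale tool performs). The post does not say how the scaling was done, so the following is only a minimal sketch under that assumption; the function name `scale_column` and the raw readings are hypothetical.

```python
def scale_column(values, lower=-1.0, upper=1.0):
    """Linearly map a list of numbers onto the interval [lower, upper]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to the midpoint.
        return [(lower + upper) / 2.0 for _ in values]
    span = upper - lower
    return [lower + span * (v - lo) / (hi - lo) for v in values]


# Hypothetical raw readings for one attribute (not taken from the dataset).
raw = [11.0, 12.0, 13.0, 15.0]
scaled = scale_column(raw)
print(scaled)  # [-1.0, -0.5, 0.0, 1.0]
```

When a separate test set is used, the minimum and maximum found on the training set would normally be reused to scale the test set, so that both are mapped in the same way.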
Table 2: Feature mask table

Class   Feature mask (attribute no.)
          1   2   3   4   5   6   7   8   9  10  11  12  13
1         0   1   0   1   1   1   1   0   1   0   1   1   0
2         0   0   1   1   1   1   1   0   0   1   1   1   1
3         0   0   1   1   0   0   0   0   0   0   1   1   1
     
     From the feature mask table, the feature subset {2, 4, 5, 6, 7, 9, 11, 12} is important for distinguishing class 1 from the other classes, the subset {3, 4, 5, 6, 7, 10, 11, 12, 13} distinguishes class 2 from the other two classes, and the subset {2, 3, 11, 12, 13} discriminates class 3 from the other classes. The discriminating power for each class lies in that class's own subset. A different feature subset is selected for each class according to its ability to discriminate that class from the others, which offers a new direction for analysing the relationship between features and classes [1].
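To make the link between a mask row and its feature subset concrete, here is a small sketch that converts a 0/1 mask into 1-based feature numbers and uses it to filter a sample. The helper names `mask_to_subset` and `apply_mask` are hypothetical; the mask rows for class 1 and class 2 and the α sample are copied from Tables 2 and 1.

```python
def mask_to_subset(mask):
    """Return the 1-based feature numbers whose mask bit is 1."""
    return [i for i, bit in enumerate(mask, start=1) if bit == 1]


def apply_mask(sample, mask):
    """Keep only the attribute values whose mask bit is 1."""
    return [v for v, bit in zip(sample, mask) if bit == 1]


# Mask rows for class 1 and class 2, copied from Table 2.
mask_class1 = [0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0]
mask_class2 = [0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1]

# The alpha (class 1) sample from Table 1, in attribute order.
alpha = [0.678948, 0.284585, 0.229946, -0.731959, 0.26087, 0.393103,
         0.139241, 0.735849, 0.0536277, 0.348123, -0.333333, 0.655678,
         -0.312411]

print(mask_to_subset(mask_class1))  # [2, 4, 5, 6, 7, 9, 11, 12]
print(mask_to_subset(mask_class2))  # [3, 4, 5, 6, 7, 10, 11, 12, 13]

# The alpha sample reduced to the eight features selected for class 1.
alpha_selected = apply_mask(alpha, mask_class1)
print(alpha_selected)
```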

LIBSVM version 2.92 was used for training, testing and prediction on the wine dataset. For training, svm-train was run with -s 0, the default SVM type (C-SVC), and with the radial basis function (RBF) kernel. Note that in LIBSVM the RBF kernel is selected with -t 2, which is the default; -t 0 selects the linear kernel.
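LIBSVM's command-line tools read one sample per line in the form `<label> <index>:<value> ...`, with feature indices starting at 1. As a rough illustration of preparing such a line, the sketch below formats the first three scaled attributes of the β sample from Table 1; the function name `to_libsvm_line` is hypothetical.

```python
def to_libsvm_line(label, values):
    """Format one sample as '<label> 1:<v1> 2:<v2> ...' (1-based indices)."""
    pairs = " ".join(f"{i}:{v}" for i, v in enumerate(values, start=1))
    return f"{label} {pairs}"


# First three scaled attributes of the beta (class 2) sample from Table 1.
line = to_libsvm_line(2, [-0.152631, -0.754941, -0.294118])
print(line)  # 2 1:-0.152631 2:-0.754941 3:-0.294118
```

A file of such lines is what svm-train takes as its training set argument, followed by an optional model file name.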
The RBF kernel requires two parameters, gamma and C, which are obtained from the cross-validation table (Table 3) below. Table 4 is a zoomed-in table used to find better values of gamma and C.
Table 3: Cross-validation accuracy table for the wine dataset

Table 4: Zoomed-in cross-validation accuracy table
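The coarse table followed by a zoomed-in table corresponds to a two-pass grid search: first try exponentially spaced values of C and gamma, then search a finer grid around the best coarse point. The sketch below shows that idea only. The exponent ranges, the step sizes and the function names are assumptions rather than the values used in this post, and `toy_score` is a made-up stand-in for a real cross-validation run (for example svm-train with its -v option).

```python
import math


def grid_search(score_fn, log2c_values, log2g_values):
    """Return (best_score, log2C, log2gamma) over the given grid.

    score_fn(C, gamma) must return a cross-validation accuracy.
    """
    best = None
    for log2c in log2c_values:
        for log2g in log2g_values:
            score = score_fn(2.0 ** log2c, 2.0 ** log2g)
            if best is None or score > best[0]:
                best = (score, log2c, log2g)
    return best


def frange(start, stop, step):
    """Inclusive list of evenly spaced numbers from start to stop."""
    n = int(round((stop - start) / step))
    return [start + i * step for i in range(n + 1)]


def toy_score(C, gamma):
    """Stand-in for cross-validation: peaks at C=2**3.5, gamma=2**-5.25."""
    return 100.0 - (math.log2(C) - 3.5) ** 2 - (math.log2(gamma) + 5.25) ** 2


# Coarse pass: exponents in steps of 2.
coarse = grid_search(toy_score, frange(-5, 15, 2), frange(-15, 3, 2))

# Fine ("zoomed-in") pass: steps of 0.25 within +/-2 of the coarse optimum.
_, c0, g0 = coarse
fine = grid_search(toy_score,
                   frange(c0 - 2, c0 + 2, 0.25),
                   frange(g0 - 2, g0 + 2, 0.25))

print("coarse best (score, log2C, log2gamma):", coarse)
print("fine best   (score, log2C, log2gamma):", fine)
```

With the toy score, the coarse pass settles on log2C = 3 and log2gamma = -5, and the fine pass moves to the nearby point that scores higher, which is the role Table 4 plays relative to Table 3.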


After the model was obtained from the training set, testing was performed by applying the model (classifier) to the test set. This produced the classification result file and its accuracy value.
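The accuracy value is simply the percentage of test samples whose predicted class matches the true class. A minimal sketch of that calculation follows; the function name and the label lists are hypothetical and are not results from this experiment.

```python
def accuracy(true_labels, predicted_labels):
    """Percentage of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy of an empty test set")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


# Hypothetical labels for five test samples (not results from this post).
y_true = [1, 1, 2, 3, 3]
y_pred = [1, 2, 2, 3, 3]
print(accuracy(y_true, y_pred))  # 80.0
```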