Kernel Selection of SVM for Commerce Image Classification

Abstract: Content-based image classification refers to associating a given image with a predefined class according only to the visual information contained in the image. In this study, we employ the SVM (Support Vector Machine) and present a few kernels specifically designed to deal with the problem of content-based image classification. Several common kernel functions are compared for commerce image classification with the PHOW (Pyramid Histogram of visual Words) descriptors. The experimental results show that the chi-square kernel and the histogram intersection kernel are more effective with the histogram-based image descriptor for commerce image classification.


INTRODUCTION
The support vector machine is a popular discriminative learning method with the advantages of good out-of-sample generalization and little need for prior knowledge about the problem. In this study we employed the SVM and presented a few kernels specifically designed to deal with the problem of content-based commerce image classification. So-called content-based image classification refers to associating a given image with a predefined class according only to the visual information contained in the image (Kannan et al., 2011; Wang et al., 2011; Perronnin et al., 2010; Boiman et al., 2008). For example, object detection, which aims to find one or more instances of an object in an image, is one kind of image classification problem. A second problem is view-based object recognition: whereas the objects to be detected are instances of a certain class (e.g., apples or cars), the objects to be recognized are instances of the same object viewed under different conditions (e.g., one specially designated apple or car). A third problem is visual categorization, which refers to assigning an image to one of two or more image categories; the two-category case is binary classification and the case of more than two categories is the so-called multiclass classification. Images depend on various parameters controlling format, size, resolution and compression quality, which makes it difficult to compare the visual content of images, or even to recognize the same visual content in an image saved with different parameters. This is in sharp contrast to other application domains such as text categorization and bioinformatics.
The aim of this study is to select kernels well suited to solving commerce image classification problems with the support vector machine. Kernel-based methods map the data from the original input feature space to a high-dimensional kernel feature space and then solve a linear problem (finding the largest-margin hyperplane) in the kernel space. Kernel-based methods allow us to interpret and design learning algorithms geometrically in the kernel space, which is nonlinearly related to the feature space. A kernel function is first required to meet the so-called Mercer's conditions, and a well-suited kernel should also incorporate prior knowledge of the problem being solved.

SUPPORT VECTOR MACHINE
Support vector machines have largely been motivated and analyzed within a theoretical framework known as statistical learning theory (also called computational learning theory). Support vector machines produce nonlinear boundaries by constructing a linear boundary in a large, transformed version of the feature space, and they give good generalization and bounds for the computational cost of learning. In fact, the SVM classifier solves a function-fitting problem using a particular criterion and form of regularization, and it has some edge in coping with the curse of dimensionality.
Hard margin SVM: Let $\{(x_i, y_i),\ i = 1, 2, \ldots, N\}$ be the training set $X$, where $x_i$ denotes a feature vector and $y_i \in \{+1, -1\}$ is its category label, indicating one of two classes, $\omega_i$ and $\omega_j$, which are assumed to be linearly separable, as illustrated in Fig. 1. The designed hyperplane is:

$$g(x) = w^T x + b = 0 \qquad (1)$$

where $w$ is the normal vector of the hyperplane and $b$ is its offset.

Soft-margin:
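To deal with the non-separable case, the problem becomes the following. Minimize:

$$\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} l\big(y_i,\ w^T x_i + b\big) \qquad (7)$$

where $w$ and $b$ are the normal vector and offset of the hyperplane and $l(y, y') = \max(0,\ 1 - y y')$ is the so-called hinge loss.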
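As a small illustration of the hinge loss and of how C weights it, the following Python sketch evaluates a soft-margin objective for a given linear classifier. It assumes the standard form of the objective (half the squared norm of w plus C times the summed hinge losses) and labels in {-1, +1}; the function names are illustrative and are not taken from the original study.

```python
def hinge_loss(y, score):
    """Hinge loss l(y, y') = max(0, 1 - y * y') for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * score)


def soft_margin_objective(w, b, samples, labels, C):
    """Assumed objective: 0.5 * ||w||^2 + C * sum of hinge losses."""
    regulariser = 0.5 * sum(wj * wj for wj in w)
    total_loss = 0.0
    for x, y in zip(samples, labels):
        # Linear score w^T x + b for this sample.
        score = sum(wj * xj for wj, xj in zip(w, x)) + b
        total_loss += hinge_loss(y, score)
    return regulariser + C * total_loss
```

A sample that lies on the correct side of the hyperplane with a margin of at least 1 contributes nothing to the sum, so only margin violations are penalised, and a larger C makes each violation more expensive.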
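C (capacity) is a tuning parameter that weights the in-sample classification errors, and it controls the generalization ability of the SVM. The higher the parameter C, the higher the weight of in-sample misclassifications and the lower the generalization of the machine. C is also linked to the width of the margin: the smaller C is, the larger the margin and the more in-sample classification errors are permitted. Through Lagrange dual optimization, we construct and solve the convex programming problem:

$$\min_{\alpha}\ \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j\, k(x_i, x_j) - \sum_{i=1}^{N} \alpha_i$$

$$\text{s.t.}\quad \sum_{i=1}^{N} \alpha_i y_i = 0,\qquad 0 \le \alpha_i \le C,\quad i = 1, \ldots, N$$

where $k(x_i, x_j) = \phi(x_i)^T \phi(x_j)$. The solution $\alpha^* = (\alpha_1^*, \ldots, \alpha_N^*)^T$ is always sparse, that is, only a small number of its coefficients are non-zero; the training samples corresponding to the non-zero coefficients are called support vectors.
Select a coefficient $\alpha_j^* \in (0, C)$ and calculate the coefficient $b^*$:

$$b^* = y_j - \sum_{i=1}^{N} \alpha_i^* y_i\, k(x_i, x_j)$$

The final classification decision function is:

$$f(x) = \operatorname{sgn}\Big(\sum_{i=1}^{N} \alpha_i^* y_i\, k(x_i, x) + b^*\Big)$$

Figure 2 illustrates the diagrammatic sketch of the soft-margin support vector machine.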
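The last two steps (computing the offset from one support vector and evaluating the decision function) can be sketched in a few lines of Python. This is a minimal illustration that assumes the usual kernel expansion, in which the score of a sample is the sum of alpha_i * y_i * k(x_i, x) over the support vectors plus b; the dual coefficients are taken as given rather than solved for, and all names are hypothetical.

```python
def linear_kernel(x, z):
    """Plain dot product, used here only as an example kernel."""
    return sum(a * b for a, b in zip(x, z))


def bias_from_support_vector(j, support_vectors, labels, alphas, kernel):
    """b = y_j - sum_i alpha_i * y_i * k(x_i, x_j), for some alpha_j in (0, C)."""
    expansion = sum(a * y * kernel(sv, support_vectors[j])
                    for sv, y, a in zip(support_vectors, labels, alphas))
    return labels[j] - expansion


def decision_function(x, support_vectors, labels, alphas, b, kernel):
    """Return +1 or -1 according to the sign of the kernel expansion at x."""
    score = sum(a * y * kernel(sv, x)
                for sv, y, a in zip(support_vectors, labels, alphas)) + b
    return 1 if score >= 0 else -1
```

For the one-dimensional toy problem with training points 1 (label +1) and -1 (label -1), the dual solution is alpha = (0.5, 0.5), which gives an offset of zero and a decision boundary at the origin.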

Multiclass classification:
The support vector machine is fundamentally a two-class classifier; in practice, however, many problems involving K > 2 classes need to be tackled. The support vector machine can be extended to multiclass problems by solving many two-class problems. Several methods have been proposed for combining multiple two-class SVMs to build a multiclass classifier (Platt et al., 2000).

KERNEL FUNCTION FOR IMAGE CLASSIFICATION
If a kernel $k(\cdot,\cdot)$ satisfies (1) and (2), then $k(\cdot,\cdot)$ is a valid kernel (Mercer kernel).
• $K$ is the $N \times N$ Gram matrix, $K_{i,j} = k(x_i, x_j)$, and it is often referred to as the kernel matrix.
The so-called kernel matrix (Gram matrix) contains all the available information for performing the learning step (Shawe-Taylor and Cristianini, 2004; Barla et al., 2002), which is illustrated in Table 1. Through the kernel matrix, the learning machine obtains the information about the feature space selection and the training data itself, as illustrated in Fig. 4 (Shawe-Taylor and Cristianini, 2004).
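To make the role of the kernel matrix concrete, the sketch below builds the Gram matrix for a small set of samples and performs two simple sanity checks that a Mercer kernel must pass: symmetry, and a non-negative quadratic form for a chosen vector z. The helper names are illustrative, and passing the quadratic-form check for a single z is only a necessary condition, not a proof of positive semi-definiteness.

```python
def gram_matrix(samples, kernel):
    """K[i][j] = kernel(samples[i], samples[j]) for every pair of samples."""
    return [[kernel(a, b) for b in samples] for a in samples]


def is_symmetric(K, tol=1e-12):
    """Check K[i][j] == K[j][i] up to a small tolerance."""
    n = len(K)
    return all(abs(K[i][j] - K[j][i]) <= tol
               for i in range(n) for j in range(n))


def quadratic_form(K, z):
    """z^T K z, which is non-negative for every z when K is a Mercer Gram matrix."""
    n = len(K)
    return sum(z[i] * K[i][j] * z[j] for i in range(n) for j in range(n))
```

Once the Gram matrix has been computed, the learning step no longer needs the original feature vectors, which is why the choice of kernel can be studied separately from the SVM solver.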

Example kernels for image classification:
• Linear kernel: $k(x, y) = x^T y$
• Polynomial kernel: $k(x, y) = (x^T y + 1)^d$ for any $d > 0$. The polynomial kernel contains all the polynomial terms up to degree $d$.
• Gaussian kernel: $k(x, y) = \exp\big(-\|x - y\|^2 / (2\sigma^2)\big)$ for $\sigma > 0$. For the Gaussian kernel, the feature space is of infinite dimension; $\sigma$ is the parameter called the 'kernel width'.
• Chi-square kernel: $k(x, y) = \exp\big(-\rho \sum_i (x_i - y_i)^2 / (x_i + y_i)\big)$, where the parameter $\rho$ is often set to the inverse of the average chi-square distance.
• Histogram intersection kernel (Barla et al., 2003): $k(x, y) = \sum_i \min(x_i, y_i)$

It is a challenging task to build kernel functions, as they must not only satisfy certain mathematical requirements but also incorporate the prior knowledge of the application domain. In the image context, this is an extremely difficult problem, as the classification has to face the large variability of the images (Chapelle et al., 2002).
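The five kernels compared in this study can be sketched in plain Python as follows. The exact functional forms are assumptions based on common conventions, since several variants exist in the literature: an offset of 1 in the polynomial kernel, a 2σ² scaling in the Gaussian kernel, and an exponential chi-square kernel with a scale parameter named `rho` here. Inputs to the last two kernels are assumed to be non-negative histograms.

```python
import math


def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2):
    # Assumed inhomogeneous form (x^T y + 1)^d.
    return (linear_kernel(x, y) + 1.0) ** degree


def gaussian_kernel(x, y, sigma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def chi_square_distance(x, y):
    # Bins that are empty in both histograms are skipped to avoid 0/0.
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)


def chi_square_kernel(x, y, rho=1.0):
    return math.exp(-rho * chi_square_distance(x, y))


def histogram_intersection_kernel(x, y):
    return sum(min(a, b) for a, b in zip(x, y))
```

For two normalised histograms, the intersection kernel measures the mass they share bin by bin, while the chi-square kernel down-weights differences in heavily populated bins; both use the histogram structure that the three general-purpose kernels ignore.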

IMAGE DESCRIPTOR
In our experiments, we employed the so-called PHOW (pyramid histogram of words) (Lazebnik et al., 2006) as the image descriptor. The PHOW descriptor is based on the popular BOW (bag of words) model (Bosch and Marti, 2007). The basic idea of the BOW model, which is borrowed from text classification, is to represent an image as a histogram of visual keywords (visual words). The visual words are actually the cluster centers of local features in the images, and the collection of all the visual words is called the bag of words. Figure 5 illustrates the construction of BOW, which is as follows:
• Automatically detect interest points, interest areas or local blocks
• Represent the local areas as local descriptors (such as SIFT (Lowe, 2004; Mikolajczyk and Schmid, 2005))
• Cluster all the image descriptors with a clustering algorithm (e.g., K-means) to form a number of cluster centers (visual words)
• Calculate the distribution of visual keywords in an image and form the visual keyword histogram

The traditional BOW model ignores the spatial positions of the features in the image and employs a sparse sampling mode. Lazebnik et al. (2006) proposed an improved descriptor named PHOW (Pyramid Histogram Of Words). The improvements come from two aspects (Fig. 6):
• Use dense sampling instead of sparse sampling for feature extraction. The sampling interval is set to eight pixels, and each 16 × 16 pixel block forms a 128-dimensional SIFT descriptor.
• Represent an image at multiple resolutions (from low resolution to high resolution), each with a series of visual keywords in the feature space. In this study, the pyramid level is set to 3 (l = 0, 1, 2) and the number of visual words is 300, so the final PHOW dimension is 300 + 300 × 4 + 300 × 16 = 6300.
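The dimension count above, and the way a spatial pyramid concatenates per-cell histograms, can be sketched as follows. This is a simplified illustration that assumes each local feature has already been assigned to a visual word and that level l splits the image into 2^l × 2^l cells; any per-level weighting of the histograms is omitted, and the function names are hypothetical.

```python
def pyramid_dimension(num_levels, num_words):
    """Level l has 4**l cells, each with one histogram of num_words bins."""
    return num_words * sum(4 ** level for level in range(num_levels))


def pyramid_histogram(points, width, height, num_levels, num_words):
    """Concatenate per-cell visual word counts over all pyramid levels.

    points: iterable of (x, y, word_index) with 0 <= x < width, 0 <= y < height.
    """
    points = list(points)
    histogram = []
    for level in range(num_levels):
        cells = 2 ** level  # cells per side at this level
        level_hist = [0] * (cells * cells * num_words)
        for x, y, word in points:
            col = min(int(x * cells / width), cells - 1)
            row = min(int(y * cells / height), cells - 1)
            level_hist[(row * cells + col) * num_words + word] += 1
        histogram.extend(level_hist)
    return histogram
```

With three levels and 300 visual words, `pyramid_dimension(3, 300)` reproduces the 6300-dimensional descriptor used in this study.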

EXPERIMENT
Experiment set: All the experiments were performed on a computer with an Intel Pentium 2.66 GHz CPU and 4 GB of RAM, running Windows XP and MATLAB 2010. The popular SVM toolbox LIBSVM (Chang and Lin, 2001) was employed. For multiclass classification, the large margin DAG strategy (Platt et al., 2000) was adopted. The kernel parameters (C, σ) were obtained through ten-fold cross-validation on each training set.
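The parameter selection step can be sketched as a grid search wrapped around ten-fold cross-validation. The code below is only an outline under stated assumptions: the fold assignment is a simple interleaved split, the candidate grid is supplied by the caller, and `evaluate` is a hypothetical callback that trains an SVM with the given parameters on the training indices and returns the accuracy on the validation indices. It does not reproduce the LIBSVM tooling used in the experiments.

```python
def k_fold_indices(num_samples, num_folds=10):
    """Yield (train_indices, validation_indices) for each fold."""
    indices = list(range(num_samples))
    for fold in range(num_folds):
        validation = indices[fold::num_folds]  # every num_folds-th sample
        held_out = set(validation)
        train = [i for i in indices if i not in held_out]
        yield train, validation


def select_parameters(num_samples, grid, evaluate, num_folds=10):
    """Return the grid entry with the highest mean validation accuracy."""
    best_params, best_score = None, float("-inf")
    for params in grid:
        scores = [evaluate(params, train, validation)
                  for train, validation in k_fold_indices(num_samples, num_folds)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params
```

Every sample is used for validation exactly once, so the averaged accuracy is an estimate of out-of-sample performance for each candidate pair (C, σ).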

RESULTS AND DISCUSSION
Table 3 shows the classification accuracies for the 20 commerce categories with the five kernel functions. From the experimental results, we can conclude as below:
• The chi-square kernel and the histogram intersection kernel perform much better than the three general kernels (linear kernel, Gaussian kernel and polynomial kernel).
• The chi-square kernel is slightly superior to the histogram intersection kernel, while the linear kernel performs worst.
• The performance of all the kernels improves as the number of training samples in each category increases, particularly as the training sets grow from 5 to 15 samples per category. The average accuracies become relatively stable once the number of training samples per category reaches 30.

CONCLUSION
In this study, we compared several common kernel functions for commerce image classification with the PHOW descriptors. The experiments show that the chi-square kernel and the histogram intersection kernel are more effective with the histogram-based image descriptor for commerce image classification. However, it is still a challenging task to construct more appropriate kernel functions for image classification.

Fig. 3: The diagrammatic sketch of DAG for finding the best class out of four classes

• One-versus-the-rest approach: The one-versus-the-rest approach is one commonly used approach. It constructs K separate SVMs, in which the kth model is trained using the data from class Ck as the positive examples and the data from the remaining K − 1 classes as the negative examples. However, the training sets of the one-versus-the-rest approach are imbalanced.
• One-versus-one approach: The one-versus-one approach is to train K(K − 1)/2 different two-class SVMs on all possible pairs of classes and then to classify test points according to which class has the highest number of 'votes'. However, this approach can also lead to ambiguities in the resulting classification.
• DAGSVM approach: This method organizes the pairwise classifiers into a directed acyclic graph. For K classes, the DAGSVM has a total of K(K − 1)/2 classifiers, and to classify a new test sample, only K − 1 pairwise classifiers need to be evaluated. The particular classifiers used depend on which path through the graph is traversed (see the four-class example in Fig. 3) (Platt et al., 2000).
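The DAG evaluation can be sketched as a simple elimination over a list of candidate classes: each pairwise classifier compares the first and last remaining class and removes the loser, so exactly K − 1 classifiers are evaluated. The code below is an illustrative outline only; `pairwise_decide` is a hypothetical callback standing in for the trained two-class SVM of a given pair and is assumed to return the winning class label.

```python
def dag_predict(x, classes, pairwise_decide):
    """Traverse the decision DAG and return the single surviving class."""
    remaining = list(classes)
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        winner = pairwise_decide(first, last, x)
        if winner == first:
            remaining.pop()     # 'last' is eliminated
        else:
            remaining.pop(0)    # 'first' is eliminated
    return remaining[0]
```

Unlike one-versus-one voting, which evaluates all K(K − 1)/2 classifiers for every test sample, this traversal touches only one classifier per eliminated class.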

Table 1: The kernel matrix K