Sunday, April 1, 2012

Practical kernel trick

When using an SVM, a naive use of the software may lead us to pick one of the standard kernels on offer and perform the analysis as well as we can with it.

However, one can choose to build one's own kernel given some prior information about the problem. For example, we could have a dataset with values of instantiated variables for each datum, but we might also have correlations telling us when those data appear together. In credit rating, one may have variables such as income for each of the customers, but also some kind of information linking customers, such as marital status, housing status, professional category and so on. Therefore, apart from the points in $\mathbb{R}^n$ (wages), we also have some kind of coincidence graph, where two individuals score higher with each other the more their categorical variables coincide, so that we build a matrix $S_{i,j}=\#(i,j)$, where $\#(i,j)$ is the number of coinciding values over all categorical variables between individual $i$ and individual $j$. We can then build a custom kernel in this way:
$$
K=K_{\sigma} + \gamma S^*
$$
where $K_{\sigma}$ is the classical Gaussian kernel matrix built from the Euclidean distances between the wages, with parameter $\sigma$, and $S^*$ is the projection of the previous $S$ onto the cone of positive semi-definite matrices; $\gamma$ is a tuning parameter. Now, how can we use it with our SVM software?
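
As a minimal sketch of how such an $S$ and its projection $S^*$ might be built (the customers, categories and variable names below are invented for the illustration; they are not the data used later in this post):

# Invented categorical data for four customers
marital=c("single","married","married","single")
housing=c("rent","own","own","rent")
profession=c("clerk","engineer","clerk","engineer")
cats=rbind(marital,housing,profession)

# S[i,j] = number of categorical variables on which i and j coincide
n=ncol(cats)
S=matrix(0,n,n)
for (i in 1:n)
  for (j in 1:n)
    S[i,j]=sum(cats[,i]==cats[,j])

# Projection onto the cone of positive semi-definite matrices:
# zero out any negative eigenvalues and rebuild the matrix
e=eigen(S)
Sstar=e$vectors%*%diag(pmax(e$values,0))%*%t(e$vectors)

With $K_\sigma$ the Gaussian kernel on the wages, the custom kernel would then be Ksigma + gamma*Sstar for some tuning value of gamma.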

LibSVM can be fed the previous kernel matrix $K$ directly if you work with the C/C++ or the Matlab interface; just make sure to insert an index from one to the number of data points as the first column. If you are working in R with the library e1071, you won't be able to do this.

It is well known to practitioners that the kernel trick can help us in this case. Mercer's kernel expansion is
$$
k(x,y)=\sum_i \lambda_i \phi_i(x) \phi_i(y) = \Phi(x)^T \Phi(y)
$$
where $\lambda_i$ and $\phi_i$ are the eigenvalues and eigenfunctions of the integral operator whose positive definite kernel is $k$; $\Phi(x) = [\sqrt{\lambda_1} \phi_1(x), \sqrt{\lambda_2} \phi_2(x),\ldots]$ is, therefore, the "lifting" map from $\mathbb{R}^n$ to the Hilbert space of square-summable sequences $l^2$. Since we are working with a finite sample, these eigenvalues and eigenfunctions are approximated by the eigenvalues and eigenvectors of the kernel matrix above, which we can always compute. Therefore, we substitute $\Phi(x)^T \Phi(y)$ by its finite approximation built from those eigenvalues and eigenvectors, so that a standard linear kernel applied to the lifted points (the images under the approximated $\Phi$) reproduces the values in $K$.
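
A minimal sketch of this finite approximation, on a tiny Gaussian kernel matrix built from three arbitrary points (the points and the bandwidth are made up just to fix ideas):

# Toy kernel matrix: Gaussian kernel on three arbitrary points
x=c(1,2,4)
K0=exp(-0.5*as.matrix(dist(x))^2)
# Finite analogue of the Mercer expansion: K0 = V L V^T
eig=eigen(K0)
Phi0=eig$vectors%*%diag(sqrt(pmax(eig$values,0)))   # rows are the lifted points
max(abs(Phi0%*%t(Phi0)-K0))                         # numerically zero: the linear kernel on Phi0 recovers K0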

The following example is in R. Assume we want to compute some kind of credit score, and that the individuals' salaries are the ones given in the variable salaries. Our experience with these customers is +1 if they were good customers and -1 if they were bad ones, and is stored in the variable y.
library(e1071)
salaries=c(20000,30000,55000,50000,30000,25000)
y=c(+1,+1,+1,-1,-1,-1)
It is clear that the experience and the salaries are not very correlated. Imagine betting is a huge problem for some segment of the population, and that we have each individual's record of using betting facilities in the following vector:
betting=c(0,0,0,1,1,1)

S=betting%*%t(betting)
# Project S onto the cone of positive semi-definite matrices:
# keep the eigenvectors, zero out any negative eigenvalues
Seig=eigen(S)
U=Seig$vectors
values=Seig$values
values[values<0]=0
D=diag(values)
Ss=U%*%D%*%t(U)
# Gaussian kernel on the salaries, with parameter s
s=0.000000005
Ks=as.matrix(exp(-s*dist(salaries)^2))
diag(Ks)<-1
The resulting Gaussian kernel values are:
            1           2           3           4           5
2 0.606530660
3 0.002187491 0.043936934
4 0.011108997 0.135335283 0.882496903
5 0.606530660 1.000000000 0.043936934 0.135335283
6 0.882496903 0.882496903 0.011108997 0.043936934 0.882496903


We now see that the similarities based on salaries alone do not reflect our experience with these individuals very well. Therefore, supplementing Ks with S is a good idea. With gamma=1, Ks+gamma*S is (printed here with the diagonal of Ks still at zero, i.e. before the diag(Ks)<-1 correction):
            1          2           3          4          5          6
1 0.000000000 0.60653066 0.002187491 0.01110900 0.60653066 0.88249690
2 0.606530660 0.00000000 0.043936934 0.13533528 1.00000000 0.88249690
3 0.002187491 0.04393693 0.000000000 0.88249690 0.04393693 0.01110900
4 0.011108997 0.13533528 0.882496903 1.00000000 1.13533528 1.04393693
5 0.606530660 1.00000000 0.043936934 1.13533528 1.00000000 1.88249690
6 0.882496903 0.88249690 0.011108997 1.04393693 1.88249690 1.00000000


We see that there is now more similarity among the betting customers. Continuing the R program:
gamma=1
K=Ks+gamma*Ss    # Ss coincides with S here, since S is already positive semi-definite
Keig=eigen(K)
# The embedding: rows of phi are the lifted data points
phi=Keig$vectors%*%diag(sqrt(pmax(Keig$values,0)))
Now we train the SVM with this embedding, or lifting map, $\Phi$. Observe that if $K=VEV^T=VE_q E_q V^T=\Phi\Phi^T$ with $\Phi=VE_q$, where $E_q$ is the diagonal matrix of the square roots of the eigenvalues, then the rows of $\Phi$ are the lifted data points and the linear kernel on them reproduces $K$; this is exactly what the variable phi above contains.
model=svm(phi, y, type="C", kernel="linear", cost=1)
Subsequent customers (after retraining) will be given a more fitting credit rating in this fashion than they would with their salaries alone.
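
As a sketch of what that could look like (the seventh customer, with a salary of 28000 and a betting record, is invented for the example, and gamma is kept at 1): append the new individual's data, rebuild the kernel and the embedding, retrain on the six labelled customers and read off the prediction for the new row.

# Hypothetical new customer: modest salary, but a betting record
salaries2=c(salaries,28000)
betting2=c(betting,1)
S2=betting2%*%t(betting2)
Ks2=as.matrix(exp(-s*dist(salaries2)^2))
diag(Ks2)<-1
K2=Ks2+gamma*S2
# Rebuild the embedding for all seven individuals
Keig2=eigen(K2)
phi2=Keig2$vectors%*%diag(sqrt(pmax(Keig2$values,0)))
# Retrain on the six labelled customers and predict the new one
model2=svm(phi2[1:6,],y,type="C",kernel="linear",cost=1)
predict(model2,t(phi2[7,]))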

This trick is also the basis of the celebrated technique of random features (check out the paper).
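
Assuming the paper meant is Rahimi and Recht's "Random Features for Large-Scale Kernel Machines", here is a minimal sketch of the connection for the Gaussian kernel used above (the number of random features D is an arbitrary choice): sample random frequencies from the kernel's Fourier transform and use random cosine features, whose plain inner products approximate the kernel values.

# Random Fourier features approximating k(x,y)=exp(-s*(x-y)^2)
# For this kernel the frequencies are drawn from N(0, 2*s)
set.seed(1)
D=2000                               # number of random features (arbitrary choice)
w=rnorm(D,mean=0,sd=sqrt(2*s))
b=runif(D,0,2*pi)
rff=function(x) sqrt(2/D)*cos(outer(x,w)+matrix(b,length(x),D,byrow=TRUE))
Z=rff(salaries)
max(abs(Z%*%t(Z)-Ks))                # small: the linear kernel on Z approximates Ks up to Monte Carlo error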
