Sunday, September 30, 2012

Kaggle

Reading mathbabe's blog, I learnt about Kaggle. This is a site that provides data so that data analysts can put their techniques to work. The data is provided by institutions interested in having it analyzed. The analysis then takes the form of a contest, and whoever gets the best results by the time the competition ends wins a prize paid by the data owner. This is best summarized by its Wikipedia page:
  1. The competition host prepares the data and a description of the problem. Kaggle offers a consulting service which can help the host do this, as well as frame the competition, anonymize the data, and integrate the winning model into their operations.
  2. Participants experiment with different techniques and compete against each other to produce the best models. For most competitions, submissions are scored immediately (based on their predictive accuracy relative to a hidden solution file) and summarized on a live leaderboard.
  3. After the deadline passes, the competition host pays the prize money in exchange for "a worldwide, perpetual, irrevocable and royalty free license [...] to use the winning Entry", i.e. the algorithm, software and related intellectual property developed, which is "non-exclusive unless otherwise specified".


I have seen interesting designs and topics there. It is, no doubt, a very interesting resource and might even provide a good thesis topic from a practical point of view. I have yet to study it more deeply.

Saturday, September 29, 2012

Improve the performance of your SVM

Though I am not very keen on differential geometry (others aren't either, but they claim to be doing research in the field), I find it amusing to read a little bit of it when it is used along with kernel methods, and especially when you can use it to improve the behavior of an SVM.

Amari and Wu are responsible for the following method: The idea is that, in order to increase class separability, we need to enlarge the spatial resolution around the boundary in the feature space. Take, for instance, the Riemannian distance along the manifold
$$
ds^2 = \sum_{i,j} g_{i,j} dx_i dx_j
$$
We need it to be large along the border $f(\mathbf{x})=0$ and small between points of the same class. In practice, the boundary is not known, so we use the points that we know are closest to the boundary: the support vectors. A conformal transformation of the metric does the job
$$
\tilde{g}_{i,j}(\mathbf{x}) = \Omega (\mathbf{x}) g_{i,j} (\mathbf{x})
$$

This is very difficult to realize in practice, so we consider a quasi-conformal transformation that induces a similar map by directly modifying the kernel
$$
\tilde{K}(\mathbf{x_1},\mathbf{x_2}) = c(\mathbf{x_1}) c(\mathbf{x_2}) K(\mathbf{x_1},\mathbf{x_2})
$$
where $c(\mathbf{x})$ is a positive function, that can be built from the data as
$$
c(\mathbf{x}) = \sum_{i \in SV} h_i e^{-\frac{\| \mathbf{x} - \mathbf{x}_i\|^2}{2\tau^2}}
$$
where $h_i$ is a parameter of the $i$-th support vector.

Thus, if you first train an SVM with a standard kernel, then compute $c(\mathbf{x})$ and build a new kernel with the previous expressions, and finally retrain, your SVM will behave better.
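For the record, here is a minimal sketch of this two-step procedure in Python with scikit-learn. The helper names (conformal_factor, train_conformal_svm), the Gaussian base kernel, and the choice $h_i = 1$ for every support vector are my own simplifications, not something prescribed by the paper.

 import numpy as np
 from sklearn.svm import SVC
 from sklearn.metrics.pairwise import rbf_kernel

 def conformal_factor(X, support_vectors, tau=1.0):
   # c(x) = sum_i h_i exp(-||x - sv_i||^2 / (2 tau^2)), with h_i = 1 here.
   d2 = ((X[:, None, :] - support_vectors[None, :, :]) ** 2).sum(axis=2)
   return np.exp(-d2 / (2.0 * tau ** 2)).sum(axis=1)

 def train_conformal_svm(X, y, gamma=1.0, C=1.0, tau=1.0):
   # Step 1: train a standard RBF SVM to locate the support vectors.
   svm1 = SVC(kernel="rbf", gamma=gamma, C=C).fit(X, y)
   sv = svm1.support_vectors_
   # Step 2: retrain with the quasi-conformally transformed kernel
   # K~(x1, x2) = c(x1) c(x2) K(x1, x2), passed as a precomputed Gram matrix.
   c_train = conformal_factor(X, sv, tau)
   K_train = c_train[:, None] * c_train[None, :] * rbf_kernel(X, X, gamma=gamma)
   svm2 = SVC(kernel="precomputed", C=C).fit(K_train, y)
   def predict(X_new):
     c_new = conformal_factor(X_new, sv, tau)
     K_new = c_new[:, None] * c_train[None, :] * rbf_kernel(X_new, X, gamma=gamma)
     return svm2.predict(K_new)
   return predict

Here $\tau$ and the weights $h_i$ are free parameters of the method.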

The authors report higher classification accuracy and fewer support vectors than with standard kernels.

Check out the paper:
http://www.dcs.warwick.ac.uk/~feng/papers/Scaling%20the%20Kernel%20Function.pdf

Wednesday, September 19, 2012

Python function to build a block matrix


Sometimes it is useful to build a large matrix out of smaller matrices that form its blocks.

As I understand, neither numpy nor scipy implement this functionality directly. Scipy has the scipy.linalg.special_matrices.block_diag function, but it only places blocks along the diagonal. The following function builds a matrix from the elements of the input list a, assuming the list holds a perfect-square number of equally sized blocks given in row-major order.


 import numpy

 def blockmat(a):
   # Assemble a square grid of blocks from the flat list `a`, which must
   # contain a perfect-square number of equally sized blocks, row by row.
   n = len(a)
   sqn = int(numpy.sqrt(n))
   rows = []
   cols = []
   for i in range(n):
     if i > 0 and i % sqn == 0:
       # One row of blocks is complete: join its blocks side by side.
       rows.append(numpy.concatenate(cols, axis=1))
       cols = []
     cols.append(a[i])
   rows.append(numpy.concatenate(cols, axis=1))
   # Stack the rows of blocks vertically.
   return numpy.concatenate(rows, axis=0)
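A quick usage example (the block values are just placeholders):

 A = numpy.ones((2, 2))
 B = numpy.zeros((2, 2))
 # Four 2x2 blocks, given in row-major order, form the 4x4 matrix
 # [[A, B],
 #  [B, A]]
 M = blockmat([A, B, B, A])
 print(M.shape)  # (4, 4)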

Tuesday, September 18, 2012

Python IDE: Spyder

I've been working with the Spyder IDE for Python. It is time I share my impressions.

First of all, you get a very strange feeling when working with this IDE. In fact, it is the same feeling you get with RStudio (when programming in R), only more exaggerated. I feel like there is a force that prevents me from doing my work. In this regard, Spyder resembles RStudio more than it does Matlab, even though resembling Matlab is the developers' motto.


I reckon that it surpasses every other Python IDE that I've tried out so far, but there is still this unavoidable feeling. I think part of it comes from the fact that when you execute part of the code just to see whether it really does what it is intended to do, the window focus stays on the command-line panel, and there is no keyboard shortcut to jump back to the file being edited. It is also possible that the problem is that I am using Ubuntu while this software is better suited to KDE, since it uses the Qt libraries.

On the other hand, you get handy features such as integration with pdb (the Python debugger), a view of the variables in the current scope, and online help that shows the documentation for any loaded name.

A fine IDE, but one that still needs to mature a little.

spyder-ide.blogspot.com

Friday, September 7, 2012

Twitter sentiment trading

From Gekkoquant, I found something interesting. Apparently, a London-based hedge fund is using Twitter sentiment to trade the equity markets.

The idea stemmed from the paper Twitter mood predicts the stock market, by J. Bollen and co-authors.

They identify market sentiment and then trade the markets with a 3-day lag. It is still a tiny market operation, so I take this cautiously (it is a natural law that humans try to deceive other humans so as to take their wealth easily), but I still find the issue interesting.

Quoting Gekkoquant, who also has an interesting series of posts about using the Twitter feed from Python:
Interesting interview with Paul Hawtin from Derwent Capital about their twitter fund and some of the implementation details. Key things to note is that they analyse all tweets (no filtering for just FTSE companies), it’s not a blackbox system the mood signals are only single component of their strategy.
Also, here is a video of co-founder Paul Hawtin explaining what they do.


I will have a look at the paper and update this post.

Check out the paper also here.

Monday, September 3, 2012

Characterization of independence of random variables

Surely you know that two random variables $X$ and $Y$ are independent if and only if $P_{XY}(x,y) = P_{X}(x)P_{Y}(y)$. What you might not know is that if
$$
\max_{(f,g)\in \mathcal{F}\times \mathcal{G}} \operatorname{cor}\left(f(X), g(Y)\right) = 0
$$
for $\mathcal{F}$ and $\mathcal{G}$ sufficiently large families of functions that contain all continuous functions (each on the domain where $X$ and $Y$ take values, say $\mathcal{X}$ and $\mathcal{Y}$), then independence holds.
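A toy illustration of why the whole family of functions is needed (my own example, not from the cited papers): if $X$ is standard normal and $Y = X^2$, the plain correlation is zero even though the variables are clearly dependent; composing with $f(x) = x^2$ exposes the dependence.

 import numpy as np
 np.random.seed(0)
 X = np.random.randn(100000)
 Y = X ** 2
 print(np.corrcoef(X, Y)[0, 1])       # close to 0: X and Y look uncorrelated
 print(np.corrcoef(X ** 2, Y)[0, 1])  # equals 1: f(x) = x^2, g(y) = y reveal the dependence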

This was first discovered by Sarmanov in 1958 and generalized to the multivariate case by Lancaster in 1962. Building on this, Bach and Jordan created kernel Independent Component Analysis (kernel ICA).

Despite not being a mainstream characterization of independence, it is listed in the book Probability Essentials by Jacod and Protter.

Fukumizu, Gretton and Sriperumbudur have also been working on independence measures using RKHS theory. If one uses characteristic kernels, whose features again span all the Fourier features, one again has sufficiently rich families of continuous functions. They compute the mean embedding of a measure as
$$
\mu_X = \int k(x,\cdot) d P_X (x)
$$
Now, given two mean embeddings $\mu_X \in H_X$ and $\mu_Y \in H_Y$, independence holds if and only if the Hilbert-Schmidt norm of the cross-covariance operator $C_{XY} : H_X \rightarrow H_Y$ is zero.
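The squared Hilbert-Schmidt norm of this operator can be estimated from a sample; below is a rough sketch of the (biased) empirical estimator known as HSIC, assuming Gaussian kernels. The function name hsic_biased and the fixed kernel widths are my own choices.

 import numpy as np
 from sklearn.metrics.pairwise import rbf_kernel

 def hsic_biased(X, Y, gamma_x=1.0, gamma_y=1.0):
   # Biased estimate of ||C_XY||_HS^2 from paired samples X, Y ((n x d) arrays).
   n = X.shape[0]
   K = rbf_kernel(X, X, gamma=gamma_x)   # Gram matrix on the X sample
   L = rbf_kernel(Y, Y, gamma=gamma_y)   # Gram matrix on the Y sample
   H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
   return np.trace(K.dot(H).dot(L).dot(H)) / (n - 1) ** 2

Values near zero suggest independence; deciding how near is near enough requires a null distribution, for instance obtained by permuting one of the samples.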

Another characterization of independence was given by Rényi in 1959. Let $\hat{f}_{X}(\omega)$ and $\hat{f}_{Y}(\eta)$ be the Fourier transforms of the two marginal probability density functions, and $\hat{f}_{XY}(\omega,\eta)$ that of the joint probability density function. If
$$
\int \left| \hat{f}_{XY}(\omega,\eta) - \hat{f}_X(\omega)\hat{f}_Y(\eta) \right|^2 d\omega \, d\eta = 0
$$
then independence holds. From there, Székely and Rizzo built a statistic (the distance covariance) and a non-linear correlation estimator (the distance correlation).
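For completeness, here is a rough sketch of the sample version of that statistic, computed from double-centered pairwise distance matrices; the function name dcov is mine.

 import numpy as np
 from scipy.spatial.distance import pdist, squareform

 def dcov(X, Y):
   # Sample distance covariance for paired samples X, Y given as (n x d) arrays.
   a = squareform(pdist(X))   # pairwise Euclidean distances within X
   b = squareform(pdist(Y))   # pairwise Euclidean distances within Y
   # Double-center each distance matrix.
   A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
   B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
   return np.sqrt((A * B).mean())

The distance correlation then normalizes dcov(X, Y) by the square root of dcov(X, X) * dcov(Y, Y).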