Machinomics: 2014

Wednesday, March 26, 2014

Clojure HashMap usage within a Java class

I attended the Hack (Make!) the Bank hackathon that took place on Level 39 (One Canada Square in London). My team decided to make a funds reallocating application that makes automatic payments once an account is topped up. Long story short, we got stuck at interoperability with the Java classes providing Paypal access.

We made the mistake of inserting a Clojure hash map directly to the Java class, which seemed to work but the moment we switched to paper account in the Java class (by modifying the incoming HashMap argument) the system threw an exeption. Ben from Erudine Financial realized that the structure we were passing was inmutable so we just created the proper Java HashMap instance.

(java.util.HashMap. {})

Remember that Clojure's {} is actually a clojure.core.PersistentHashMap.

This situation happens especially when clumsily using proof of concept code in production.

Sunday, March 2, 2014

The Gaussian kernel maps data onto the sphere

It is a fact as surprising as trivial that the Gaussian kernel maps your data onto the infinite-dimensional sphere. No computation regarding the RKHS basis are required, since, given the kernel
$$k(x,y)=\exp(-\gamma \| x-y\|^2)$$
defined on the domain $X$, inducing a map $\Phi: X \rightarrow F$, where $F$ is the associated feature space.
We have that $k(x,x)=1$ for all $x \in X$. Therefore what we have is clearly the sphere, since all $x$ are one unit await from zero in the feature space $\|\Phi(x)\|^2 = \sum_i \lambda_i \phi_i(x) \phi_i(x) = k(x,x)=1$. Is there any possible refinement to this? There is! Remember that the Fourier transform of a Gaussian is a Gaussian (with inverted paramers, etc), so we have that the Fourier coefficient 0 (i.e., the power of the constant function, or $cos(0)$) is positive (and maximum among the coefficients). This means that all data have a positive first entry (the constant function is positive and its coefficient is positive), which means that the map actually is from the domain to the positive halve of the infinite hypersphere. Other basis functions (for coefficients other that zero) are sines and cosines and thus may change points. Further characteristics of the mapping depend on the data probability measure.

If you have been trying to apply Differential Geometry to kernel methods and have worked with the Gaussian without noticing it, please stop your research and do something else. A good review and analysis on the induced manifold is Geodesic Analysis on the Gaussian RKHS hypersphere, where the authors make again the same mistake many people do: Naming the feature space as the RKHS (IT IS ITS DUAAAAALLLLLL). In fact, they show that the maximum angle between a pair of points is $\pi/2$, which makes the largest possible embedding a quadrant.

Wednesday, February 26, 2014

On the computer languages market

There is an interesting series of articles on Dr. Dobb's essentially about what we can call the computer languages market, where they analyze trends and caracteristics, especially in the Editor in chief's article.

They analyze the trends in language usage, which ordinally had not change since the year before (C, Java, Objective-C, C++, C#, PHP, Visual Basic, Python and Javascript), but that show a number of developments. For instance, Perl is leaving room in favor of Python. Perl seems to be dying and with signs of becoming history. They also mentioned a strong resurgence of Javascript but I would like to add that a large number of developments in the web scripting world have been carried out, starting with CoffeeScript, a tiny language that translates 1-to-1 to JavaScript. See their web page for some use of the language. ClojureScript is another compiler, this time to translate Clojure into JavaScript.

I found the following paragraph of particular interest:

By all measures, C++ use declined last year, demonstrating that C++11 was not enough to reanimate the language's fortunes, despite the significant benefits it provides. I have previously opined that Microsoft's contention of a return to native languages being led by C++ was unsupported by evidence. It is now clearly contradicted by the evidence.

I feel encouraged by this statement. I believed C++ is an awful language, and always has been. Back in the late 90's when I started programming and my teenage budget in a remote outpost included a non-disposable 486 DX2 with DOS, we had no choice but to get mainstream technical stuff, and that included C++ as the only advanced systems language. My experience, and I believe everybody's experience, is that programmer's time is more important than running time, and even hardcore C++ programmers recognize they put themselves in pain when facing non-standard tasks with the language. Some would argue that pain is what you pay for system performance, but that is not true (see the paragraph below). C++ will never grow on Big Data (data processing), mobile development or cloud computing, and I would expect that performance computing to also cut C++ usage in favor of more modern, maintainable system languages, such as the D language (or the one I describe below).

An interesting development they had not talked about in the area of systems programming, more than the D language, is Nimrod. It has cool features such as pointers that are traced by a lightweight garbage collector and others that directly translate to C++ pointers. Templates that can be called as operators are included, supporting metaprogramming, it supports inmutable objects, a number of syntactic sugars (such as being able to call len(x), x.len or x.len if there is only one argument, command and function-like calls like echo("hello") and echo "hello"...), first class functions and a very good mixture between Python indentation-defined blocks and Scala function definitions, as opposed to D's C-like syntax. Nimrod compiles to C, C++, Objective C and JavaScript. Definitely a good candidate to learn and to watch.

On the functional but high-performant side of the market we have Erlang, Haskell, Scala and Clojure. The first three are older, especially the first two ones. In my opinion, Clojure is simplest, especially when compared to Scala. This is patent when examining a source coude listing in both languages. Clojure lifts the programmer's productivity to new highs (I was able to code the Denarius' matching engine core in half a day). Definitely the way to go in the future.

Monday, February 10, 2014

Machine learning in a few words

Machine learning is becoming a buzzword, everybody talks aboit it and few seem to be interested in the math underneath (I find statements like "I wanted to know more but all sources were too statistical/mathematical and I wanted more practical stuff").

Let me tell you something:
First: You can't really use Machine Learning if you don't know the statistical/mathematical basis
Second: You can't really use Machine Learning if you don't know the statistical/mathematical basis
Third: You can't really use Machine Learning if you don't know the statistical/mathematical basis

Machine Learning is just a fancy word for the statistical/mathematical tools lying underneath, whose objective is to extract something that we may loosely call knowledge (or something that we understand) from data (or something chaotic that we do not understand), so that computers may take action based on the inferred knowledge. An example of this would be a robot arm/humanoid: without programming actions on direction/velocity/acceleration vectors based on an established model, we may put sensonrs on a subject's articulations, and from these datapoints learn a regression model on the manifold of natural movements. Another example is in Business Intelligente: We may learn groups of customers (market segmentation) so that we may engage several groups with specific policies or offers target at them.

Maching Learning is applied Statistics/Mathematics. Is very little and very unpractical without Optimization/Operations Research, from the algorithmic and practical/scalable point of view.

I've come to the conclusion that there exists two large main approaches to ML, disregarding the specific technique we are dealing with and its target (i.e., supervised or unsupervised), plus one in the middleway:

Functional approach (Mathematical)
Neural Network/Deep Learning approach (Middle way)
Probabilistic approach (Statistical)

In the functional approach, one uses the theory of Hilbert spaces (and therefore of differential equations and linear algebra). The goal is to find a set of transformations of the data so as to best perform in a score for the task (called functional). These transformations come from a pre-defined set which is not related to the data in any way and are a combination of possibly orthogonal basis of a space of functions (transformation) defined on the domain of the original data. Examples of this are: Linear/Ridge/Sparse Regression (linear or identity transformation for regression), SVM (non-linear using the kernel trick), PCA (SVD/eig computation) and KPCA, Matrix factorizations (for signal separation/clusering...), K-means, Projection pursuit... The basic idea is:

I have my data set and a buch of (linear or non-linear) transformations, find a solution applying these transformations to my data so that I can maximize a score functional that I like for my problem. If I want to predict something (classification/regression) then combine these transformations so that the combination fits best my target variable. If I want to examine the nature of the data (dimensionality reduction, matrix factorization, clustering), then use a combination of my transformations to lose as least information as possible (measured with a functional, again).

In the probabilistic approach, the prior knowledge is an assupmtion of the prior probability and the likelyhood, and works towards obtaining a posterior probability (that the outcome is a given choice given the data just seen). Examples are: Logistic Regression (simple non-linear transformation for classification), Naive Bayes (classification), SNE, Gaussian Processes... The general idea is

I want to see independence at the end of the process, therefore I can assume Gaussian multivariate variables so that a linear or non-linear transformation gives me components that are as independent as possible (for unsupervised learning) or assume a probability distribution at the output and a likelyhood and compute a model so that I best fit the likelihood of seeing the target variable given the data.

Neural networks and deep learning are a different story. I consider them to be their own field, drawing tools from the above. They are neither functional because they are not dealing directly with functions (transformations) in the functional analytic setting, and they are not probabilistic for obvious reasons, but use any probability or information theoretic tool as needed. The fact that they connect output of transformations to inputs makes this field related to the first approach indeed, seen as a chain of transformations (spaces defined on the image of their predecessors), but the focus here is obviously the algorithms and that changes things.

Reinforment learning is nowadays just a fancy word for techniques widely known, studied and used in Stochastic Processes, HMM (Hidden Markov Models) being the only exception. Sometimes they call it Sequential Learning, but it is not widely considered Machine Learning, neither scholarly nor popularly.

So, as you can see, there is nothing new (not at least as the discovery of the fundamental theorem of calculus, or of quantum mechanics).

Regarding novelty, I am annoyed each time I read about ML techniques from the big data guys, who are normally programmers starting to do something more that data aggregation and querying. It seems that there is something both new and exotic in what they are saying, and it it mostly well known techniques from statistics, such as this article, or the beginning of Sean Owen's presentation at Big Data Spain.

Now, the practical side of things require that ML scales to Big Data. That limits the applicability of matrices and non-linear transformations to adapt linear methods, so let's see what the next breakthrough is.

Saturday, February 8, 2014

Use Java JARs within Clojure projects

I am creating a clojure wrapper for the excellent library from Pekka Enberg Falcon, which will provide Denarius with FIX protocol communication capabilities. The new project's name is, obviously, falcon-clj.

Although the ideal packaging would be to include Falcon as a Git submodule, I decided to pre-compile Denarius into a JAR stored into the /lib directory to avoid Leiningen complications in the starting phase and be able to focus on the implementation.

I compiled the library and pushed it into Github and then I received my new laptop, so I installed everything from scratch, including the JDK, and cloned everything back into the new laptop. The next stage would be to use the REPL to quickly get used to Falcon.

=> (import [falcon.fix Session Versions])
UnsupportedClassVersionError falcon/fix/Session : Unsupported major.minor version 51.0 java.lang.ClassLoader.defineClass1 (:-2)

So I digged up a little bit and found some interesting information here, which led me to here and then to correctly configure Eclipse (Aptana) to use an external JRE to run things.

Basically, you need to explicitly tell Eclipse to use the external JDK you are using. You can find the setting under Window -> Preferences -> Java -> Installed JREs.

Saturday, February 1, 2014

What lies underneath malloc

malloc is C's dynamic memory allocation function: memory is assigned to the process heap and then the malloc dynamic assignation does its magic, to make boundaries between dynamically assigned variables. Here's how it works (bear in mind that a malloc implementation is largely compiler-dependent):

At the beginning of the process' heap free space, a special region is allocated to point to the next available byte address in the free region and a previous pointer to the pointers corresponding to the previously allocated block. In case there is no block allocated next, the pointer is null. Otherwise, the pointer will point as many bytes ahead in the process memory as the corresponding allocated block's size.

The function free updates the following block's previous pointer after the recently liberated block, which causes mismatch with the preceding block's next pointer pointing to the deleted block. This creates an accounted free block. To speed up, free can have its own list of free block and sizes elsewhere on the heap.

Friday, January 31, 2014

Some sad goodbies :-(

I am sadly taking out several blogs from my read list at the right...

Our teacher (by his book) Larry Wasserman decided to stop writing on his blog. There, he mentioned that sparse regressors, being as useful as they are, have a infinite risk in a region near zero, which could be of no practical consequence for most applications but may be too dangerous for some cutting edge applications (precision robotics, high frequency trading...). He left the door open for a comeback, we certainly hope so!

mathbabe (Cathy O'Neil). She writes a lot and I was finding her blog posts a little bit confusing for and many of them not in line with what I want to read or show in my blog, even though I support some of her views and agree on her reasons to short Facebook, which in turn coincide with a recent Princeton application of an empidemology model to social networks.

Peekaboo (Andreas Müller), seems to have left his blog aside and moved on. He has very interesting material there, though.

Monday, January 27, 2014

Strongly typed external data sources language integration in F#

I have read something that I bot like and consider important: Data sources integration in the language.

I like it because it is a really interesting feature in the sense that data is integrated into the language just like control flow, for example, quoting the paper by Syme et al. (see below), we may use

let data = Samples.DataStore.Freebase.FreebaseData.GetDataContext()
data.Sports.Baseball.Coaching_Positions

Even when specifying the route to te database Coaching_Positions, the IDE itself will show us all options as it finds them, much like code completion with language keywords and loaded types.

On the other hand, this is important because is of high value for the data scientist. Moreover, in F#, they have made it so as to scale with internet data (Big Data) and especially designed it to word from the cloud, making the model especially appealing for modern use. I hope Clojure catches up in this field and adds something of the like soon.

http://research.microsoft.com/pubs/192598/ddfp-information-rich-fp-themes-v5.pdf

Saturday, January 25, 2014

Denarius Exchange organization created on Github

Dear all,

I've transferred the Denarius' repository to an organization focused on developing the exchange. Anybody is invidted to join and contribute.

Te new address is:
https://github.com/denarius-exchange/denarius

On the technical side, a matching is implemented with hash maps now, and we get slightly more speed, but we are introducing vectors and queue operations on them, which will yield about five times more speed.

Next ahead are the connection nodes and protocols implemented on them.

Friday, January 24, 2014

Monty Hall revisited

If I tell you: There are 1K bitcoins 1 wallet that will be yours if you guess which wallet out of three is the right one, the rest containing an amount of zero bitcoins, and ask you to point out an initial selection, then showed you that, effectively, one of the remaining wallets contains zero bitcoins... and finally giving you the opportunity to change wallet. Would you change? The awnser is yes.

This is so because there is new evidence now that supports a higher probability that the remaining unseen wallet is the right choice, whereas there is none about your current choice. The fact that you selected wallet 1, and given that choice, I showed you wallet 2, that leaves wallet 3 with a posterior probability of 2/3. This does not happen for our current wallet 1, since choosing 1 influenced my decision to show you 2. More precisely: You chose wrongly with probability 2/3. With that probability, I show you the only possible door that I can, leaving the 2/3 for the remaining unseen and unchosen door. On the contrary, you choose well with 1/3 probability, but then I can choose among 2 doors to show you, each with a probability of 1/2. This is how we include my decision (or necessity) to show you 2 into the math (this is the best explanation you are gonna get from all over the internet):
Let's call R "right choice" V "visible incorrect wallet" and S "your choice". We need to compute $P(R=3|V=2,S=1)$, the probability of 3 being the right wallet, after you selected 1 and I showed you that 2 was not right (remember that all priors are 1/3).
$$P(R=3|V=2,S=1)=\frac{P(V=2,S=1|R=3)P(R=3)}{P(V=2,S=1|R=3)P(R=3)+P(V=2,S=1|R=1)P(R=1)}\\=\frac{1\times 1/3}{1\times 1/3 + 1/2 \times 1/3}=2/3$$$P(V=2,S=1|R=3)=1$ is the probability that, given R=3, then I was forced to show you the incorrect wallet remaining (you already chose one incorrect wallet). $P(V=2,S=1|R=1)=1/2$ because there are two possible incorrect wallets (since you selected the correct one) that I can choose from to show you.

Let's compute the same posterior for the case I decide not to change wallet:
$$P(R=1|V=2,S=1)=\frac{P(V=2,S=1|R=1)P(R=1)}{P(V=2,S=1|R=3)P(R=3)+P(V=2,S=1|R=1)P(R=1)}\\=\frac{1/2\times 1/3}{1\times 1/3 + 1/2 \times 1/3}=1/3$$.

Therefore if you change you have more chances of winning the 1000 bitcoins.

Needless to say, this works for every possible combination of $R$, $S$ and $V$.
This happens, as I mentioned, because of the way I was influenced (forced) to show you the incorrect remaining wallets. To see it intuitively, imagine 100 wallets, and that you chose one amongst them, and I am forced to show you 98 incorrect wallets, leaving your choice and another one. Is it more likely that this particular wallet is the correct one (that your choice forced me to leave it) or that you chose wisely amongst 100 wallets? If you choose 99 incorrect wallets, the set that I show you is the same, except for the chosen incorrect wallets each time, and will never contain the particular correct wallet.

There is a cool Android app in case you want to check how the law of large numbers works for this problem.

Saturday, January 18, 2014

Things about Python that I find cool

Python is a formidable language with functional capabilities that is extraordinarily easy to learn and follow. It is interpreted so that scripting and prototyping are possible.

Commentaries follow UNIX style so that you can use it for building scripts (and therefore program CGIs).

Despite striking as too odd at first sight, Python's syntax overweights its cons by offering a clear and simple look which noticeably increases productivity. Python relies on indentation to define its blocks, which means that you get rid of parentheses, brackets, black keywords such as begind or end, etc. This simplifies code and makes it more readable

for i in vector:
     if i==1:
        print i<<1
     else:
        print i**2
     print "Loop: %d" % i

IPython notebook (which requires Tornado libraries), is an interesting addon over IPython (which itself gives you nice command line ammenities). It is a Javascript/CSS served over a local HTTP server which allows you for nice online prototyping on your web browser. Results are inlined much like a Maple or Mathematica session.

It allows for powerful expressions, such as performing a transformation on filtered data with a near natural language expression:

[x**2 for i, x in enumerate(col) if i!=0 or x]

Here we transform all elements in the vector col squaring them but only those whose index is not zero.

Decorators are a powerful way to add behavior to a function, adding code that performs some task before and after the target function is executed. The canonical use of decorators is adding logging behavior to functions. In Python we can use them easily and non-intrusively by adding the operator @ and the decorator function to the definition of the function

def decorator(func):
    def inner(x):
       print "Arguments were ", x
       y = func(x)
       print "Result is", y
    return innter

The argument operators * and ** allow for generics construction within the functions. By convention, one uses *args for sequential expansion of the argument list and of and **kwargs for named expansion of an argument list. This means that you can use * in the function definition to represent a list of arguments passed in order, therefore *args becomes a list of values that you can then use. On the other hand, calling a function that has named parameters with a list of values with the operator * will result in expanding the values to these arguments.

def test(a=1, b=2):
print a, b
args=(1,2)
test(*args)
1 2

Similarly, **kwargs cab be used to either specify a list of generic argument list to make the behavior of the function dynamical (for example, your function is a relay and just gets arguments that passes on to other functions that will interpret them), or be used to expand a map of named arguments in this way

def test(a=1, b=2):
print a, bs
kwargs = {"a":1,"b":2}
test(**kwargs)
1 2

Althoug the following does not refer to the language itself, it has to do with the community, since there exists a service called PyCloud that enables Python practitioners to use CPU power for their Python numerical needs. As per the PyCloud site:

The PiCloud Platform gives you the freedom to develop your algorithms and software without sinking time into all of the plumbing that comes with provisioning, managing, and maintaining servers.

You have 20 free computing hours.

Finally, a powerful and comprehensive set of libraries has been built on Python, especially for Data Analysis: Python probably has the largest collection of opensource libraries for data analysis. It includes: NumPy, SciPy, Pandas, Sklearn, NLTK and many others.

Monday, January 6, 2014

Denarius financial exchange announcement

I've been busy on Christmas working on Denarius, a financial exchange. This is the announcement that I've benn posting on several sites (see, for example, on bitcointalk).

Dear all,

I just started out a new open source project to implement a financial exchange project.

The implementation is in Clojure, which has already given the project a kick start with the matching engine concurrency. Interested Clojure programmers are invited to join.

Also, programmers working in finance, parallel systems and databases are kindly invited to join. Contributions to the core system design are highly welcome.

This project stems from unsatisfactory use of other existing projects regarding financial exchanges, and is of interest for the bitcoin community since cryptocoins are among the tradeable assets in the software.

This account is newbie's, so diffusion to other parts of bitcointalk is appreciated.

Subscribe just sending an email to: denarius@librelist.com
Source code: https://github.com/analyticbastard/denarius
Wiki: https://github.com/analyticbastard/denarius/wiki/Introduction

Machinomics