Thursday, July 18, 2013

Orange Machine Learning (Python), the charm of Machine Learning

I asked about a good visualization tool on Kaggle, and D33B pointed me to Orange. Although it was not what I was asking for, I checked the tool out and it turned out to be awesome (M. Horbal felt that way too).

You will never win a Kaggle competition with Orange, but it will certainly help you deal with data and build models very, very quickly and intuitively. In fact, I used it to quickly build a prototype model for a company's data, which was very nasty (loads of missing values, numbers with quotation marks...). I quickly built a linear regression and visualized some scatterplots and conditional distributions, all of it with this nice workflow.

In the image above, the only data source is the File element. For the analysis pipeline, we first select the independent variables that have a reasonable number of values, since these variables have progressively fewer and fewer values. We are interested in keeping most of the values while still being able to visualize the relationships between the attributes. After selecting the attributes, we tell Orange to prune the data before feeding it into the elements that follow. We want to see the conditional distributions in case we can get rid of non-informative attributes, study potential linear relationships, see their correlations via a distance map, and perform a linear regression (also ridge and lasso). On the other hand, we also want to study the regressors in depth, and for that we select only the regressor attributes in the pipeline below.
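For readers who prefer to see the steps as code, here is a rough sketch of the same idea in plain Python. It is not Orange's API and not the actual workflow; it only uses the standard library. The toy data, the `min_values` threshold, and all function names are made up for illustration, and I am assuming that "pruning" means dropping rows that still have missing values. It fits a single regressor only, with an optional ridge penalty on the slope; lasso is left out.

```python
import csv
import io
import math

# Hypothetical toy data: quoted numbers and empty cells stand in for the messy file.
RAW = '''x1,x2,x3,y
"1.0",2.0,,3.1
"2.0",,,5.0
3.0,1.0,,6.9
"4.0",3.0,7.0,9.2
5.0,,,11.0
6.0,2.5,,12.8
'''

def to_float(cell):
    """Convert a cell to float; return None for missing or unparseable values."""
    cell = cell.strip().strip('"')
    try:
        return float(cell)
    except ValueError:
        return None

def load(text):
    """Parse CSV text into {column name: list of float-or-None}."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    return {name: [to_float(r[i]) for r in body] for i, name in enumerate(header)}

def select_columns(table, min_values):
    """Keep only the columns with at least min_values non-missing entries."""
    return {k: v for k, v in table.items()
            if sum(x is not None for x in v) >= min_values}

def drop_incomplete_rows(table):
    """Remove every row that still has a missing value in a kept column."""
    names = list(table)
    rows = [r for r in zip(*table.values()) if None not in r]
    return {name: [r[i] for r in rows] for i, name in enumerate(names)}

def pearson(a, b):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def fit_line(x, y, ridge=0.0):
    """One-variable least squares; ridge > 0 shrinks the slope, not the intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / (sxx + ridge)
    return slope, my - slope * mx

table = load(RAW)
kept = select_columns(table, min_values=4)   # the sparse column x3 is dropped
clean = drop_incomplete_rows(kept)           # rows with gaps in x1, x2 or y go away

corr = pearson(clean["x1"], clean["y"])
slope, intercept = fit_line(clean["x1"], clean["y"])
ridge_slope, ridge_intercept = fit_line(clean["x1"], clean["y"], ridge=13.0)

print("columns kept:", list(clean))
print("rows kept:", len(clean["y"]))
print("corr(x1, y):", round(corr, 3))
print("OLS slope:", round(slope, 3), "ridge slope:", round(ridge_slope, 3))
```

The order matters here just as it does in the workflow: dropping the sparse column first means far fewer rows are lost when the incomplete rows are removed afterwards.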

Definitely a piece of software to have in your toolset.
