Sunday, September 29, 2013

Kaggle survey results

Below are the results of the Kaggle survey I conducted, in which participants were asked several questions about Machine Learning/Data Analysis to gather their personal views on the subject and the tools they use. The online platform that hosted the survey does not offer much analytics beyond copying and pasting aggregated results per question, so here you go:

(Survey link: http://es.surveymonkey.com/s/SYYTCF2)

TOTAL PARTICIPANTS: 40


1. What is your background?

Biochemistry         0,0%    0
Chemistry         0,0%    0
Computer Engineering (Software Development)        30,0%    12
Computer Science (AI/Machine Learning)        12,5%    5
Econometrics         0,0%    0
Economics        5,0%    2
Engineering (Electrical)        5,0%    2
Engineering (Mechanical)         0,0%    0
Engineering (Other)         0,0%    0
Mathematics        15,0%    6
Medicine         0,0%    0
Physics        7,5%    3
Statistics        12,5%    5
Other (Science Applied)        7,5%    3
Other (Science Pure)         0,0%    0
Other        5,0%    2



2. What is your preferred language for data analysis tasks?

Bash/sed/awk/any shell         0,0%    0
C/C++        2,5%    1
Excel         0,0%    0
Java        5,0%    2
Maple         0,0%    0
Mathematica         0,0%    0
Matlab/Octave        5,0%    2
Perl         0,0%    0
Python        37,5%    15
R/S-Plus        35,0%    14
SAS        2,5%    1
SPSS         0,0%    0
Stata         0,0%    0
Weka        2,5%    1
Other        10,0%    4



3. Where do you live? (Select the option of your political mainland country: e.g., Canary Islands - Spain - Europe (South) )

America (North - Canada)        2,5%    1
America (North - US)        42,5%    17
America (North - Mexico)         0,0%    0
America (Central)         0,0%    0
America (South - Brazil)         0,0%    0
America (South - Argentina)         0,0%    0
America (South - Others)         0,0%    0
Africa (East)         0,0%    0
Africa (Equatorial)         0,0%    0
Africa (Mediterranean including Egypt)         0,0%    0
Africa (Sahara)         0,0%    0
Africa (South Africa)        2,5%    1
Africa (West)         0,0%    0
Asia (China)         0,0%    0
Asia (Japan)         0,0%    0
Asia (Korea)        2,5%    1
Asia (India)        5,0%    2
Asia (Middle East)        2,5%    1
Asia (Europe - Russia)        2,5%    1
Asia (Other)        2,5%    1
Europe (Central)        10,0%    4
Europe (East)        2,5%    1
Europe (Islands)         0,0%    0
Europe (North)        10,0%    4
Europe (South)        5,0%    2
Oceania        10,0%    4



4. Where do you originally come from?

America (North - Canada)         0,0%    0
America (North - US)        35,0%    14
America (North - Mexico)         0,0%    0
America (Central)         0,0%    0
America (South - Brazil)         0,0%    0
America (South - Argentina)         0,0%    0
America (South - Others)         0,0%    0
Africa (East)         0,0%    0
Africa (Equatorial)         0,0%    0
Africa (Mediterranean including Egypt)         0,0%    0
Africa (Sahara)         0,0%    0
Africa (South Africa)        2,5%    1
Africa (West)         0,0%    0
Asia (China)        5,0%    2
Asia (Japan)         0,0%    0
Asia (Korea)        2,5%    1
Asia (India)        7,5%    3
Asia (Middle East)         0,0%    0
Asia (Europe - Russia)        2,5%    1
Asia (Other)        2,5%    1
Europe (Central)        10,0%    4
Europe (East)        7,5%    3
Europe (Islands)         0,0%    0
Europe (North)        7,5%    3
Europe (South)        10,0%    4
Oceania        7,5%    3



5. Where did you study?

America (North - Canada)         0,0%    0
America (North - US)        42,5%    17
America (North - Mexico)         0,0%    0
America (Central)         0,0%    0
America (South - Brazil)         0,0%    0
America (South - Argentina)         0,0%    0
America (South - Others)         0,0%    0
Africa (East)         0,0%    0
Africa (Equatorial)         0,0%    0
Africa (Mediterranean including Egypt)         0,0%    0
Africa (Sahara)         0,0%    0
Africa (South Africa)        2,5%    1
Africa (West)         0,0%    0
Asia (China)         0,0%    0
Asia (Japan)         0,0%    0
Asia (Korea)        2,5%    1
Asia (India)        7,5%    3
Asia (Middle East)        2,5%    1
Asia (Europe - Russia)        2,5%    1
Asia (Other)        2,5%    1
Europe (Central)        7,5%    3
Europe (East)        2,5%    1
Europe (Islands)        2,5%    1
Europe (North)        10,0%    4
Europe (South)        7,5%    3
Oceania        7,5%    3



6. What hardware/software configurations do you use? (Mark the hardware you perform your data computations on, not simply the hardware you own, i.e., do not mark GPU if you use it only for gaming and do not perform data analysis on it.)

Apple Macintosh        20,0%    7
Cloud (Amazon)        5,7%    2
Cloud (Other)         0,0%    0
GPU (ATI)         0,0%    0
GPU (Nvidia)        14,3%    5
CPU (AMD/K10)         0,0%    0
CPU (AMD/Bulldozer)        2,9%    1
CPU (AMD/Bobcat)        2,9%    1
CPU (Intel/i3)        5,7%    2
CPU (Intel/i5)        37,1%    13
CPU (Intel/i7)        37,1%    13
CPU (Intel/Ivy Bridge)        8,6%    3
CPU (Intel/Sandy Bridge)        11,4%    4
CPU (Intel/Other)        8,6%    3
CPU (Other)        5,7%    2



7. What OS/browser(s) do you use?

Linux (Chrome)        22,9%    8
Linux (Chromium)        2,9%    1
Linux (Firefox)        17,1%    6
Linux (Opera)         0,0%    0
Linux (Other)         0,0%    0
OSX (Chrome)        20,0%    7
OSX (Chromium)         0,0%    0
OSX (Firefox)         0,0%    0
OSX (Other)         0,0%    0
OSX (Safari)        2,9%    1
Windows (Chrome)        54,3%    19
Windows (Chromium)         0,0%    0
Windows (Firefox)        17,1%    6
Windows (Other)        5,7%    2
Windows (Safari)         0,0%    0
Other OS (Chrome)         0,0%    0
Other OS (Chromium)        2,9%    1
Other OS (Firefox)         0,0%    0
Other OS (Other)         0,0%    0
Other OS (Safari)         0,0%    0



8. Have you used any Hadoop-related tools for any data analysis?

Cassandra         0,0%    0
Lucene         0,0%    0
Hadoop        77,8%    7
Mahout        22,2%    2
Hama         0,0%    0
HBase         0,0%    0
Hive        22,2%    2
Pig        44,4%    4



9. What is the Machine Learning technique that you generally find most useful for classification/regression?

Adaboost        3,2%    1
Bayesian Networks        3,2%    1
kNN         0,0%    0
Linear Regression (Lasso/ElasticNet)        3,2%    1
Linear Regression (OLS/Ridge/other regularized)        3,2%    1
Linear Regression (Other)         0,0%    0
Linear SVC/SVR         0,0%    0
Logistic Regression        6,5%    2
Naive Bayes         0,0%    0
Neural Networks        12,9%    4
Random Forests        67,7%    21
SVM/SVR (Non-linear kernel)         0,0%    0



10. In your opinion, Machine Learning is mostly:

Engineering/Algorithmics        14,3%    5
Engineering/Algorithmics and Optimization        34,3%    12
Mathematics        5,7%    2
Optimization        2,9%    1
Physics         0,0%    0
Programming        5,7%    2
Statistics and Probability Theory        37,1%    13
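
A note on reading the percentages: they appear to be computed over the number of people who answered each question, not over the 40 total participants. For question 9 the counts add up to 31, and 21/31 gives the 67,7% shown for Random Forests. The multi-select questions are consistent with a base of 35 respondents (questions 6 and 7) and 9 respondents (question 8), but those bases are my inference from the numbers, not something the survey platform reported. A minimal Python sketch of the arithmetic, where the `percent` helper is a name made up for this example:

```python
def percent(count, respondents):
    """Format count/respondents as a percentage with a decimal comma, e.g. '67,7%'."""
    return f"{100 * count / respondents:.1f}%".replace(".", ",")


# Question 9 is single choice, so the base is simply the sum of the counts.
q9_counts = {
    "Adaboost": 1,
    "Bayesian Networks": 1,
    "Linear Regression (Lasso/ElasticNet)": 1,
    "Linear Regression (OLS/Ridge/other regularized)": 1,
    "Logistic Regression": 2,
    "Neural Networks": 4,
    "Random Forests": 21,
}
q9_base = sum(q9_counts.values())

for option, count in q9_counts.items():
    print(option, percent(count, q9_base))

# Question 8 is multi-select; a base of 9 respondents is an assumption
# inferred from the published figures (7 of 9 used Hadoop).
print("Hadoop", percent(7, 9))
```

For multi-select questions the counts cannot be summed to get the base, which is why the percentages in questions 6 to 8 add up to more than 100%.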

Wednesday, September 11, 2013

Load Kaggle datasets directly into Amazon EC2

Despite not having access to a suitable environment at home, I decided to enter a new Kaggle competition. The StumbleUpon Evergreen Classification Challenge seems to be easy to tackle since it is a classic binary classification problem with text features and numerical features.

I decided to do it in the cloud. For that purpose, one needs to load the data distributed by Kaggle into the Amazon EC2 instance. Kaggle rejects a plain download request from the instance, since you must be logged in to access the data. No problem: the login is carried by the browser cookies, and we are going to reuse them from the EC2 instance, as they commented here

The first thing we need is a plugin to save the cookies into a text file. Use this for Firefox, and this for Chrome.

Then, we upload the cookies file to the EC2 instance by some means. In my case I use BitTorrent Sync (a post will be coming later on). We tell wget to use the cookies with the option --load-cookies, like this:

wget -x --load-cookies ~/BTSync/cookies.txt http://www.kaggle.com/c/stumbleupon/download/raw_content.zip

We get an output such as this, and we have successfully loaded the data:

ubuntu@ip-172-31-21-138:~/kaggle/evergreen$ wget -x --load-cookies ~/BTSync/cookies.txt http://www.kaggle.com/c/stumbleupon/download/raw_content.zip
--2013-09-09 22:37:17--  http://www.kaggle.com/c/stumbleupon/download/raw_content.zip
Resolving www.kaggle.com (www.kaggle.com)... 168.62.224.124
Connecting to www.kaggle.com (www.kaggle.com)|168.62.224.124|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://kaggle2.blob.core.windows.net/competitions-data/kaggle/3526/raw_content.zip?sv=2012-02-12&se=2013-09-12T22%3A37%3A18Z&sr=b&sp=r&sig=qAJZIFUmRu%2B9XX%2FM%2B7qPorR%2FkWAC7%2B9W6MEWL5xM0fg%3D [following]
--2013-09-09 22:37:18--  https://kaggle2.blob.core.windows.net/competitions-data/kaggle/3526/raw_content.zip?sv=2012-02-12&se=2013-09-12T22%3A37%3A18Z&sr=b&sp=r&sig=qAJZIFUmRu%2B9XX%2FM%2B7qPorR%2FkWAC7%2B9W6MEWL5xM0fg%3D
Resolving kaggle2.blob.core.windows.net (kaggle2.blob.core.windows.net)... 65.52.106.46
Connecting to kaggle2.blob.core.windows.net (kaggle2.blob.core.windows.net)|65.52.106.46|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 164757969 (157M) [application/zip]
Saving to: ‘www.kaggle.com/c/stumbleupon/download/raw_content.zip’

100%[======================================>] 164,757,969 2.29MB/s   in 95s

2013-09-09 22:38:53 (1.65 MB/s) - ‘www.kaggle.com/c/stumbleupon/download/raw_content.zip’ saved [164757969/164757969]

ubuntu@ip-172-31-21-138:~/kaggle/evergreen$
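
If you would rather do the download from a Python script on the instance, the same cookie trick can be sketched with the standard library alone. This assumes the exported cookies.txt is in the Netscape format (the format wget's --load-cookies reads); the function names `opener_with_cookies` and `download` are made up for this example, and I have not run it against Kaggle itself.

```python
import http.cookiejar
import shutil
import urllib.request


def opener_with_cookies(cookie_file):
    """Build a urllib opener that sends the cookies stored in a Netscape-format cookies.txt."""
    jar = http.cookiejar.MozillaCookieJar(cookie_file)
    # ignore_discard=True keeps session cookies, which a login cookie often is.
    jar.load(ignore_discard=True)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def download(url, cookie_file, destination):
    """Fetch url with the saved cookies and stream the response body to destination."""
    opener, _ = opener_with_cookies(cookie_file)
    # urllib follows the 302 redirect to the storage host on its own.
    with opener.open(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)


# Example call (path and URL taken from the wget command above):
# download(
#     "http://www.kaggle.com/c/stumbleupon/download/raw_content.zip",
#     "/home/ubuntu/BTSync/cookies.txt",
#     "raw_content.zip",
# )
```

Streaming with `shutil.copyfileobj` avoids holding the whole 157M archive in memory, which matters on the smaller EC2 instance types.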

Monday, September 9, 2013

Stuff: Cook computers in the oven to resuscitate them and cool Windows account hacking

I wanted to share a couple of experiences with computers that I have had in the period of one week.

A couple of weeks ago a number of friends and I went on vacation and rented a house in the middle of nowhere. We brought a Wi-Fi router and our laptops and played a very old computer game that everybody still loves: Age of Mythology.

At the end of our stay, one of my laptops (the old one) stopped working, and my friends told me the last thing it did was draw strange lines over the graphics, i.e., a graphics card issue. I must say that this comes at an already tumultuous time between me and technology, since my new laptop died some time ago and is in for repair.

When we got home, I looked for information on the possible issue. And I came across this

The funny thing is: IT WORKS!

We removed the NVidia video card, put it in the oven as instructed (8 minutes at 200 °C), put it back in, and voilà! Graphics back!

The next experience: I was given my sister's old laptop. It is old but still decent, and she only replaced it because the disk crashed... I took it and recovered her files later on. Anyway, it must be the two-year-long rest, but the disk is alive now. The only problem: we could not access the laptop because of her password.


Here's YouTube to the rescue again. This, however, seems to work only with Windows 7 and 8, not with Vista, the one on the laptop. Anyway, we took the disk out, put it in my external drive case, and accessed the files from there. The hack works like this:
  • The accessibility button on the Windows login screen (bottom left) calls a program called utilman.exe, located under C:\Windows\System32
  • Accessing the drive, we renamed it to utilman1.exe and saved a copy of cmd.exe as utilman.exe, so that utilman.exe is now a command prompt.
  • When we put the disk back in and started Windows, pressing the accessibility button opened a privileged command prompt. From there, we issued the command net user myuser x, effectively resetting the password to a plain x.
This left us with a ready-to-use laptop (no critical tasks, just online gaming), just what we needed for the afternoon.

Addendum: I think it is worth noting which laptop was reflowed with this technique: it is an Acer 5920g, and the NVidia card was an 8600M GT. Apparently, these NVidia processors (prior to 2009) have a problem with the repeated heating during use and cooling down when idle: this makes internal connections break, much like a stone exposed to the sun. Heating them to over 200 °C makes the connections sufficiently liquid to resolder themselves.

Sunday, September 8, 2013

Video: Wes McKinney about time data handling in Pandas


Interesting stuff. I haven't watched it yet; I came across it while searching for Time Series Analysis in Python (though it has nothing to do with ARIMA models). They've worked out time zones and other time data handling operations pretty well.