Below are the results of the Kaggle survey I conducted, in which participants answered several questions about Machine Learning/Data Analysis, giving their personal views on the subject and the tools they use. The online platform that hosted the survey offers little analytics beyond copying and pasting the aggregated results per question, so here they are:
(survey link: http://es.surveymonkey.com/s/SYYTCF2)
TOTAL PARTICIPANTS: 40
1. What is your background?
Biochemistry 0,0% 0
Chemistry 0,0% 0
Computer Engineering (Software Development) 30,0% 12
Computer Science (AI/Machine Learning) 12,5% 5
Econometrics 0,0% 0
Economics 5,0% 2
Engineering (Electrical) 5,0% 2
Engineering (Mechanical) 0,0% 0
Engineering (Other) 0,0% 0
Mathematics 15,0% 6
Medicine 0,0% 0
Physics 7,5% 3
Statistics 12,5% 5
Other (Science Applied) 7,5% 3
Other (Science Pure) 0,0% 0
Other 5,0% 2
2. What is your preferred language for data analysis tasks?
Bash/sed/awk/any shell 0,0% 0
C/C++ 2,5% 1
Excel 0,0% 0
Java 5,0% 2
Maple 0,0% 0
Mathematica 0,0% 0
Matlab/Octave 5,0% 2
Perl 0,0% 0
Python 37,5% 15
R/S-Plus 35,0% 14
SAS 2,5% 1
SPSS 0,0% 0
Stata 0,0% 0
Weka 2,5% 1
Other 10,0% 4
3. Where do you live? (Select the option of your political mainland country: e.g., Canary Islands - Spain - Europe (South) )
America (North - Canada) 2,5% 1
America (North - US) 42,5% 17
America (North - Mexico) 0,0% 0
America (Central) 0,0% 0
America (South - Brazil) 0,0% 0
America (South - Argentina) 0,0% 0
America (South - Others) 0,0% 0
Africa (East) 0,0% 0
Africa (Equatorial) 0,0% 0
Africa (Mediterranean including Egypt) 0,0% 0
Africa (Sahara) 0,0% 0
Africa (South Africa) 2,5% 1
Africa (West) 0,0% 0
Asia (China) 0,0% 0
Asia (Japan) 0,0% 0
Asia (Korea) 2,5% 1
Asia (India) 5,0% 2
Asia (Middle East) 2,5% 1
Asia (Europe - Russia) 2,5% 1
Asia (Other) 2,5% 1
Europe (Central) 10,0% 4
Europe (East) 2,5% 1
Europe (Islands) 0,0% 0
Europe (North) 10,0% 4
Europe (South) 5,0% 2
Oceania 10,0% 4
4. Where do you originally come from?
America (North - Canada) 0,0% 0
America (North - US) 35,0% 14
America (North - Mexico) 0,0% 0
America (Central) 0,0% 0
America (South - Brazil) 0,0% 0
America (South - Argentina) 0,0% 0
America (South - Others) 0,0% 0
Africa (East) 0,0% 0
Africa (Equatorial) 0,0% 0
Africa (Mediterranean including Egypt) 0,0% 0
Africa (Sahara) 0,0% 0
Africa (South Africa) 2,5% 1
Africa (West) 0,0% 0
Asia (China) 5,0% 2
Asia (Japan) 0,0% 0
Asia (Korea) 2,5% 1
Asia (India) 7,5% 3
Asia (Middle East) 0,0% 0
Asia (Europe - Russia) 2,5% 1
Asia (Other) 2,5% 1
Europe (Central) 10,0% 4
Europe (East) 7,5% 3
Europe (Islands) 0,0% 0
Europe (North) 7,5% 3
Europe (South) 10,0% 4
Oceania 7,5% 3
5. Where did you study?
America (North - Canada) 0,0% 0
America (North - US) 42,5% 17
America (North - Mexico) 0,0% 0
America (Central) 0,0% 0
America (South - Brazil) 0,0% 0
America (South - Argentina) 0,0% 0
America (South - Others) 0,0% 0
Africa (East) 0,0% 0
Africa (Equatorial) 0,0% 0
Africa (Mediterranean including Egypt) 0,0% 0
Africa (Sahara) 0,0% 0
Africa (South Africa) 2,5% 1
Africa (West) 0,0% 0
Asia (China) 0,0% 0
Asia (Japan) 0,0% 0
Asia (Korea) 2,5% 1
Asia (India) 7,5% 3
Asia (Middle East) 2,5% 1
Asia (Europe - Russia) 2,5% 1
Asia (Other) 2,5% 1
Europe (Central) 7,5% 3
Europe (East) 2,5% 1
Europe (Islands) 2,5% 1
Europe (North) 10,0% 4
Europe (South) 7,5% 3
Oceania 7,5% 3
6. What are the hardware/software configurations you use? (Mark the hardware you perform your data computations on, not the hardware you merely own; i.e., do not mark GPU if you use it only for gaming and don't perform data analysis on it.)
Apple Macintosh 20,0% 7
Cloud (Amazon) 5,7% 2
Cloud (Other) 0,0% 0
GPU (ATI) 0,0% 0
GPU (Nvidia) 14,3% 5
CPU (AMD/K10) 0,0% 0
CPU (AMD/Bulldozer) 2,9% 1
CPU (AMD/Bobcat) 2,9% 1
CPU (Intel/i3) 5,7% 2
CPU (Intel/i5) 37,1% 13
CPU (Intel/i7) 37,1% 13
CPU (Intel/Ivy Bridge) 8,6% 3
CPU (Intel/Sandy Bridge) 11,4% 4
CPU (Intel/Other) 8,6% 3
CPU (Other) 5,7% 2
7. What OS/browser(s) do you use?
Linux (Chrome) 22,9% 8
Linux (Chromium) 2,9% 1
Linux (Firefox) 17,1% 6
Linux (Opera) 0,0% 0
Linux (Other) 0,0% 0
OSX (Chrome) 20,0% 7
OSX (Chromium) 0,0% 0
OSX (Firefox) 0,0% 0
OSX (Other) 0,0% 0
OSX (Safari) 2,9% 1
Windows (Chrome) 54,3% 19
Windows (Chromium) 0,0% 0
Windows (Firefox) 17,1% 6
Windows (Other) 5,7% 2
Windows (Safari) 0,0% 0
Other OS (Chrome) 0,0% 0
Other OS (Chromium) 2,9% 1
Other OS (Firefox) 0,0% 0
Other OS (Other) 0,0% 0
Other OS (Safari) 0,0% 0
8. Have you used any Hadoop-related tools for any data analysis?
Cassandra 0,0% 0
Lucene 0,0% 0
Hadoop 77,8% 7
Mahout 22,2% 2
Hama 0,0% 0
HBase 0,0% 0
Hive 22,2% 2
Pig 44,4% 4
9. What is the Machine Learning technique that you generally find most useful for classification/regression?
Adaboost 3,2% 1
Bayesian Networks 3,2% 1
kNN 0,0% 0
Linear Regression (Lasso/ElasticNet) 3,2% 1
Linear Regression (OLS/Ridge/other regularized) 3,2% 1
Linear Regression (Other) 0,0% 0
Linear SVC/SVR 0,0% 0
Logistic Regression 6,5% 2
Naive Bayes 0,0% 0
Neural Networks 12,9% 4
Random Forests 67,7% 21
SVM/SVR (Non-linear kernel) 0,0% 0
10. In your opinion, Machine Learning is mostly:
Engineering/Algorithmics 14,3% 5
Engineering/Algorithmics and Optimization 34,3% 12
Mathematics 5,7% 2
Optimization 2,9% 1
Physics 0,0% 0
Programming 5,7% 2
Statistics and Probability Theory 37,1% 13
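Since the platform only gives pasted aggregates, a few lines of Python are enough to turn them back into numbers. This is a minimal sketch, assuming every result line has the shape "label percentage count" with a decimal comma, as in the listings above. The names parse_results and respondent_base are made up for illustration and are not part of any survey tool.

```python
import re

# One pasted result line looks like "Python 37,5% 15": a label, a percentage
# written with a decimal comma, and the raw count. (Assumed format.)
LINE_RE = re.compile(r"^(?P<label>.+?)\s+(?P<pct>\d+,\d)%\s+(?P<count>\d+)$")


def parse_results(lines):
    """Return (label, percent, count) tuples; lines that do not match are skipped."""
    rows = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            percent = float(match.group("pct").replace(",", "."))
            rows.append((match.group("label"), percent, int(match.group("count"))))
    return rows


def respondent_base(rows):
    """Estimate how many people answered, using the first row with a non-zero count."""
    for _label, percent, count in rows:
        if count > 0:
            return round(count * 100 / percent)
    return 0


# A few lines copied from question 2.
question_2 = [
    "Python 37,5% 15",
    "R/S-Plus 35,0% 14",
    "Matlab/Octave 5,0% 2",
]
rows = parse_results(question_2)
top = max(rows, key=lambda row: row[2])
print(top[0], respondent_base(rows))
```

For question 2 this prints "Python 40", which agrees with the total of 40 participants. Running respondent_base on the other questions is a quick way to see how many people actually answered each one, since the percentages are relative to the respondents of that question.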
Sunday, September 29, 2013
Wednesday, September 11, 2013
Load Kaggle datasets directly into Amazon EC2
Despite not having access to a suitable environment at home, I decided to enter a new Kaggle competition. The StumbleUpon Evergreen Classification Challenge seems easy to tackle, since it is a classic binary classification problem with both text and numerical features.
I decided to do it in the cloud. For that, the data distributed by Kaggle has to be loaded onto the Amazon EC2 instance. Kaggle refuses a plain download from there, since you must be logged in to access the data. No problem: the login is carried by the browser cookies, so we are going to reuse them from the EC2 instance, as they commented here.
The first thing we need is a browser plugin that saves the cookies to a text file. Use this for Firefox, and this for Chrome.
Then we upload the file to the EC2 instance by whatever means. In my case I use BitTorrent Sync (a post will come later on). We tell wget to use the cookies with the --load-cookies option, like this:
wget -x --load-cookies ~/BTSync/cookies.txt http://www.kaggle.com/c/stumbleupon/download/raw_content.zip
We get output like the following, and the data has been downloaded successfully:
ubuntu@ip-172-31-21-138:~/kaggle/evergreen$ wget -x --load-cookies ~/BTSync/cookies.txt http://www.kaggle.com/c/stumbleupon/download/raw_content.zip
--2013-09-09 22:37:17-- http://www.kaggle.com/c/stumbleupon/download/raw_content.zip
Resolving www.kaggle.com (www.kaggle.com)... 168.62.224.124
Connecting to www.kaggle.com (www.kaggle.com)|168.62.224.124|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://kaggle2.blob.core.windows.net/competitions-data/kaggle/3526/raw_content.zip?sv=2012-02-12&se=2013-09-12T22%3A37%3A18Z&sr=b&sp=r&sig=qAJZIFUmRu%2B9XX%2FM%2B7qPorR%2FkWAC7%2B9W6MEWL5xM0fg%3D [following]
--2013-09-09 22:37:18-- https://kaggle2.blob.core.windows.net/competitions-data/kaggle/3526/raw_content.zip?sv=2012-02-12&se=2013-09-12T22%3A37%3A18Z&sr=b&sp=r&sig=qAJZIFUmRu%2B9XX%2FM%2B7qPorR%2FkWAC7%2B9W6MEWL5xM0fg%3D
Resolving kaggle2.blob.core.windows.net (kaggle2.blob.core.windows.net)... 65.52.106.46
Connecting to kaggle2.blob.core.windows.net (kaggle2.blob.core.windows.net)|65.52.106.46|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 164757969 (157M) [application/zip]
Saving to: ‘www.kaggle.com/c/stumbleupon/download/raw_content.zip’
100%[======================================>] 164,757,969 2.29MB/s in 95s
2013-09-09 22:38:53 (1.65 MB/s) - ‘www.kaggle.com/c/stumbleupon/download/raw_content.zip’ saved [164757969/164757969]
ubuntu@ip-172-31-21-138:~/kaggle/evergreen$
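The same cookies file can also be reused from Python instead of wget. The following is a minimal sketch using only the standard library, assuming the exported file is in the Netscape cookies.txt format and starts with the usual "# Netscape HTTP Cookie File" header line. The function names are made up for illustration, and whether Kaggle still accepts downloads this way is not something this sketch can guarantee.

```python
import http.cookiejar
import shutil
import urllib.request


def build_opener_with_cookies(cookie_path):
    """Return (opener, jar): a urllib opener that sends the cookies stored in cookie_path."""
    jar = http.cookiejar.MozillaCookieJar(cookie_path)
    # Also keep session cookies and entries whose expiry date has passed,
    # so nothing exported from the browser is silently dropped.
    jar.load(ignore_discard=True, ignore_expires=True)
    handler = urllib.request.HTTPCookieProcessor(jar)
    return urllib.request.build_opener(handler), jar


def download(opener, url, destination):
    """Stream url to destination through the cookie-aware opener."""
    with opener.open(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)


# Usage (not executed here, since it needs network access and valid cookies):
#   import os
#   opener, _ = build_opener_with_cookies(os.path.expanduser("~/BTSync/cookies.txt"))
#   download(opener,
#            "http://www.kaggle.com/c/stumbleupon/download/raw_content.zip",
#            "raw_content.zip")
```

Unlike wget -x, this writes the archive to the given destination path rather than recreating the www.kaggle.com/c/stumbleupon/download/ directory tree.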
Monday, September 9, 2013
Stuff: Cook computers in the oven to resuscitate them and cool Windows account hacking
I want to share a couple of experiences with computers that I had within a single week.
A couple of weeks ago a number of friends and I went on vacation and rented a house in the middle of nowhere. We brought a Wi-Fi router and our laptops and played a very old computer game that everybody still loves: Age of Mythology.
At the end of our stay, one of my laptops (the old one) stopped working, and my friends told me the last thing it did was paint strange lines over the graphics, i.e., a graphics card issue. I must say that this comes at an already tumultuous time between me and technology, since my new laptop died some time ago and is in repair.
When we got home, I looked for information on the possible issue, and I came across this.
The funny thing is: IT WORKS!
We removed the Nvidia video card, put it in the oven as it says, 8 min at 200 °C, put it back in, and voilà! Graphics back!
The next experience: I was given my sister's old laptop. It is old but still decent, and she only changed laptops because the disk crashed... I took it and recovered her files later on. Anyway, it must be the two-year rest, but the disk is alive again. The only problem: we could not access the laptop because of her password.
Here's YouTube to the rescue again. This, however, seems to work only with Windows 7 and 8, not with Vista, the one on the laptop. Anyway, we got the disk out, put it in my external drive enclosure, and accessed the files from there. This hack works like this:
- The accessibility button on the Windows login screen (bottom left) calls a program called utilman.exe, located under C:\Windows\System32
- With the drive accessible, we rename it to utilman1.exe and copy cmd.exe to utilman.exe, so utilman.exe is now a command prompt.
- When we put the disk back in and started Windows, pressing the accessibility button showed a privileged command prompt. From there, we issued the command net user myuser x, effectively resetting the password to a plain x.
Addendum: I think it is worth noting which laptop was reflowed with this technique: it is an Acer 5920g, and the Nvidia card was an 8600M GT. Apparently, these Nvidia processors (prior to 2009) have a problem with heating up during use and cooling down when not in use: this makes internal circuits break, much like a stone exposed to the sun. Heating them to over 200 °C makes the connections liquid enough to resolder themselves.
Sunday, September 8, 2013
Video: Wes McKinney about time data handling in Pandas
Interesting stuff. I haven't watched it yet; I came across it while searching for Time Series Analysis in Python (though it has nothing to do with ARIMA models). They've worked out time zones and other time data handling operations pretty well.