## Wednesday, September 11, 2013

### Load Kaggle datasets directly into Amazon EC2

Despite not having access to a suitable environment at home, I decided to enter a new Kaggle competition. The StumbleUpon Evergreen Classification Challenge seems to be easy to tackle since it is a classic binary classification problem with text features and numerical features.

I decided to do it in the cloud. For that purpose, one needs to load the data distributed by Kaggle into the Amazon EC2 instance. A plain download from there will fail, since Kaggle requires you to log in to access the data. No problem: the login is carried by cookies, and we are going to reuse them from the EC2 instance, as they commented here.

The first thing we need is a browser plugin that saves the cookies into a text file, in the Netscape cookies.txt format that wget reads. Use this for Firefox, and this for Chrome.

Then, we upload the file to the EC2 instance by some means. In my case I use BitTorrent Sync (a post will be coming later on). We tell wget to use the cookies with the option --load-cookies.

The command and its output look like this, and we have successfully loaded the data:

    ubuntu@ip-172-31-21-138:~/kaggle/evergreen$ wget -x --load-cookies ~/BTSync/cookies.txt http://www.kaggle.com/c/stumbleupon/download/raw_content.zip
    --2013-09-09 22:37:17-- http://www.kaggle.com/c/stumbleupon/download/raw_content.zip
    Resolving www.kaggle.com (www.kaggle.com)... 168.62.224.124
    Connecting to www.kaggle.com (www.kaggle.com)|168.62.224.124|:80... connected.
    HTTP request sent, awaiting response... 302 Found
    Location: https://kaggle2.blob.core.windows.net/competitions-data/kaggle/3526/raw_content.zip?sv=2012-02-12&se=2013-09-12T22%3A37%3A18Z&sr=b&sp=r&sig=qAJZIFUmRu%2B9XX%2FM%2B7qPorR%2FkWAC7%2B9W6MEWL5xM0fg%3D [following]
    --2013-09-09 22:37:18-- https://kaggle2.blob.core.windows.net/competitions-data/kaggle/3526/raw_content.zip?sv=2012-02-12&se=2013-09-12T22%3A37%3A18Z&sr=b&sp=r&sig=qAJZIFUmRu%2B9XX%2FM%2B7qPorR%2FkWAC7%2B9W6MEWL5xM0fg%3D
    Resolving kaggle2.blob.core.windows.net (kaggle2.blob.core.windows.net)... 65.52.106.46
    Connecting to kaggle2.blob.core.windows.net (kaggle2.blob.core.windows.net)|65.52.106.46|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 164757969 (157M) [application/zip]
    Saving to: ‘www.kaggle.com/c/stumbleupon/download/raw_content.zip’
    100%[======================================>] 164,757,969 2.29MB/s in 95s
    2013-09-09 22:38:53 (1.65 MB/s) - ‘www.kaggle.com/c/stumbleupon/download/raw_content.zip’ saved [164757969/164757969]
    ubuntu@ip-172-31-21-138:~/kaggle/evergreen$
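
If you would rather stay inside Python on the instance, the same cookie trick can probably be reproduced with the standard library. What follows is only a sketch and not what I ran above. The function names `build_opener_from_cookies` and `download` are made up for illustration, and it assumes the exported file starts with the usual `# Netscape HTTP Cookie File` header line, which Python's `MozillaCookieJar` insists on even though wget does not.

```python
import http.cookiejar
import shutil
import urllib.request


def build_opener_from_cookies(cookie_file):
    """Build a URL opener that sends the cookies stored in a cookies.txt file.

    Hypothetical helper: assumes a Netscape-format file whose first line is
    the '# Netscape HTTP Cookie File' header.
    """
    jar = http.cookiejar.MozillaCookieJar(str(cookie_file))
    # Also load session cookies (no expiry) and cookies past their expiry
    # date; the server decides whether they are still valid.
    jar.load(ignore_discard=True, ignore_expires=True)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def download(opener, url, target):
    """Stream the response for `url` into the local file `target`.

    Redirects (such as the 302 in the wget log) are followed by urllib.
    """
    with opener.open(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
```

With the cookies file from this post, one would first expand the path with `os.path.expanduser("~/BTSync/cookies.txt")`, pass the result to `build_opener_from_cookies`, and then call `download(opener, "http://www.kaggle.com/c/stumbleupon/download/raw_content.zip", "raw_content.zip")`. Whether Kaggle accepts the request still depends on the cookies being fresh.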
