Tools for data science

When it comes to a large data set, things are rarely easy. Say you have a data set with 150 million records (observations), each of which has attributes of different types (Boolean, double, text, level, etc.). The file can be as large as 5GB, and opening it in a spreadsheet might take a while. I have found that the command line in a terminal can be our best friend in such situations. Here I list some nice tools that data scientists commonly use in their routine jobs.

To deal with data in the terminal...

Basic commands in the terminal

This URL contains a very nice list of commands with concise descriptions and examples.

http://www.ee.surrey.ac.uk/Teaching/Unix/unix2.html

Split and merge files

Use split to break a large file into smaller pieces, and cat to merge the pieces back together.

http://askubuntu.com/questions/54579/how-to-split-larger-files-into-smaller-parts
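Here is a minimal sketch of the round trip. The file names (big.txt, the part_ prefix, merged.txt) and the piece size of 300 lines are just assumptions for illustration; a tiny file made with seq stands in for the real data set.

```shell
# Create a small sample file (a stand-in for a large data set)
seq 1 1000 > big.txt

# Split it into pieces of 300 lines each: part_aa, part_ab, part_ac, part_ad
split -l 300 big.txt part_

# Merge the pieces back; the shell expands part_* in alphabetical order,
# which is the same order split created them in
cat part_* > merged.txt

# Check that the merged file is identical to the original
cmp big.txt merged.txt && echo "identical"
```

Splitting by lines (-l) rather than by bytes keeps every record whole, which matters if you want to process each piece separately.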

Parsing the first column of a csv file to a new file

cut -d, -f1 in.csv > first_column.csv # first_column.csv is the new file

http://stackoverflow.com/a/2652526/1613297

http://lowfatlinux.com/linux-columns-cut.html
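The same idea works for several columns at once. This is only a sketch on a made-up file (people.csv and its contents are my own example). Note that cut splits on every comma, so it is only safe when no field contains a quoted comma.

```shell
# A tiny made-up CSV file
printf 'id,name,city\n1,Ann,Austin\n2,Bob,Boston\n' > people.csv

# Keep columns 1 and 3 only and write them to a new file
cut -d, -f1,3 people.csv > id_city.csv

cat id_city.csv
```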

View contents of a file: cat, head and tail

$ cat filename.ext | less # To view the file page by page until the end; press q to quit
$ head -n 20 filename.ext # To view the first 20 lines of the file
$ tail -n 30 filename.ext # To view the last 30 lines of the file

More information can be found here:

tail: http://kb.iu.edu/data/acrj.html
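A few related one-liners I find handy for peeking into the middle of a file. This is a small sketch; numbers.txt is a made-up file created on the spot.

```shell
# A made-up file with 100 lines
seq 1 100 > numbers.txt

# Count the lines in the file
wc -l < numbers.txt

# Print only lines 41 to 43
sed -n '41,43p' numbers.txt

# Print from line 99 to the end of the file
tail -n +99 numbers.txt
```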

Sort unique file by some id

http://stackoverflow.com/questions/13454464/sort-unix-file-by-id
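As a rough sketch, assuming the id is the first comma-separated field and that you want one line per id (records.csv and its contents are made up):

```shell
# A made-up file where id 1 appears twice
printf '3,carol\n1,alice\n2,bob\n1,alice-duplicate\n' > records.csv

# -t,     use a comma as the field separator
# -k1,1n  sort on field 1 only, comparing numerically
# -u      keep a single line for each distinct key
sort -t, -k1,1n -u records.csv > records_by_id.csv

cat records_by_id.csv
```

Drop the -u if you want to keep every line and only change the order.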

Sort text file by the last field value/attribute

http://stackoverflow.com/a/3832236/1613297
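One common way to do this (not necessarily the one in the linked answer) is to copy the last field to the front of each line, sort on it, and then strip it off again. A sketch on a made-up, whitespace-separated file rows.txt, assuming the last field is numeric and the lines contain no tabs:

```shell
# Made-up rows with a different number of fields per line
printf 'alice x 30\nbob 10\ncarol y z 20\n' > rows.txt

# 1. awk puts the last field ($NF) and a tab in front of every line
# 2. sort -n orders the lines by that leading number
# 3. cut -f2- removes the helper field (cut splits on tabs by default)
awk '{print $NF "\t" $0}' rows.txt | sort -n | cut -f2- > rows_sorted.txt

cat rows_sorted.txt
```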

Search string in a text file

Often, you will want to search for a string in a huge text file just to double-check that your data was obtained correctly. The grep command is super useful for this.

grep --color 'CO.,LTD' someCompanyList.csv

http://www.cyberciti.biz/faq/howto-search-find-file-for-text-string/

http://www.cyberciti.biz/faq/howto-use-grep-command-in-linux-unix/
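One thing to watch out for: grep treats the pattern as a regular expression, so the dot in 'CO.,LTD' matches any character. The sketch below, on a made-up file companies.csv, shows the difference that -F (fixed string) makes, plus two options I use a lot: -c to count matching lines and -n to show line numbers.

```shell
# A made-up company list
printf 'ACME CO.,LTD\nACME COX,LTD\nFoo Inc\n' > companies.csv

# Regular expression: "." matches any character, so both ACME lines match
grep -c 'CO.,LTD' companies.csv

# Fixed string: only the line with a literal "CO.,LTD" matches
grep -c -F 'CO.,LTD' companies.csv

# Show the line number of each match
grep -n -F 'CO.,LTD' companies.csv
```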

String replacement

sed is very good for this.

Example

My input.txt looks like this

12345 Myroad, Seattle, TX 76134

and I want to replace ', ' (comma+space) with '$$$***'. I will use the command

$ sed -e 's/, /$$$***/g' input.txt > output.txt

and the output file output.txt would look like

12345 Myroad$$$***Seattle$$$***TX 76134

However, converting it back to the original is not quite as easy. Here is how:

$ sed -e 's/\$\$\$\*\*\*/, /g' output.txt > output_back.txt

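That's because $ and * are both special characters in a regular expression ($ matches the end of a line, and * means "repeat the previous item"), so each one needs a preceding backslash '\' to be matched literally. In the replacement part of the first command they have no special meaning, which is why no backslashes were needed there.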

http://stackoverflow.com/questions/8495465/using-sed-with-regular-expressions-for-sub-string-replacement

http://www.grymoire.com/Unix/Sed.html#uh-6

Edit file contents

vim is dearly loved! I remember that my first time using vim was not a smooth ride, but I love it very much now. I bet you will too.

http://www.tuxfiles.org/linuxhelp/vimcheat.html

To log/record all the input and output from the terminal

To start logging use the command

$ script mylog.text

and to end, use

$ exit

http://askubuntu.com/questions/161935/how-do-i-log-all-input-and-output-in-a-terminal-session

To connect to a remote computer

Use ssh:

ssh mylogin@host.machine.name

To transfer files to and from a remote computer

We can use scp to transfer files between two computers.

To copy a file from the remote machine to the local machine:

scp mylogin@host.machine.name:/path/on/remote/machine/myfile.csv /path/on/local/machine/

To copy a file from the local machine to the remote machine:

scp /path/on/local/machine/myfile.csv mylogin@host.machine.name:/path/on/remote/machine/

http://ged.msu.edu/angus/tutorials/using-ssh-scp-terminal-macosx.html

http://stackoverflow.com/questions/14325956/using-scp-in-terminal

To view and kill running jobs

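# To list the process IDs (PIDs) of all jobs started from this shell, use
$ jobs -p
# To stop the process with PID 12345, try a plain kill first
$ kill 12345
# If it refuses to stop, force it with signal 9 (SIGKILL)
$ kill -9 12345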

http://www.cyberciti.biz/faq/linux-unix-appleosx-bsd-kill-all-jobs-under-bash-ksh-sh-shell/
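To see the whole cycle in one place, here is a small sketch that starts a throwaway background job, stops it, and checks that it is gone. The sleep command is only a stand-in for a real long-running job.

```shell
# Start a long-running background job (a stand-in for a real one)
sleep 300 &

# $! holds the PID of the most recently started background job
pid=$!

# Ask the process to stop (SIGTERM); use kill -9 only if this fails
kill "$pid"

# Wait for the shell to collect the finished job
wait "$pid" 2>/dev/null || true

# kill -0 sends no signal; it only tests whether the process still exists
if kill -0 "$pid" 2>/dev/null; then
  echo "still running"
else
  echo "stopped"
fi
```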

Perl

Most of the time, you will have to deal with text data, and Perl is a great tool for that.

Convert between attribute-value table and flat table

This is a very good example.

http://stackoverflow.com/questions/17600187/how-to-convert-a-list-of-attribute-value-pairs-into-a-flat-table-whose-columns-a

How do you know the version of your Perl?

$ perl -v

Install a Perl module

First, you might want to install cpanm, which makes installation easier in the long run:

$ cpan App::cpanminus

When you know which module to install, just use:

$ cpanm Module::Name

For example, in my code, I have

use Text::CSV;

which means my program requires the module called Text::CSV. To install the module, I type this in the terminal:

$ cpanm Text::CSV

Please refer to http://www.cpan.org/modules/INSTALL.html for more details.

Some nice tutorials on Perl can be found here: http://perl-tutorial.org/

If Perl still cannot find the module on your computer (a "Can't locate" error that mentions @INC), try these:

http://perlmaven.com/how-to-change-inc-to-find-perl-modules-in-non-standard-locations

http://stackoverflow.com/questions/4716918/cant-locate-file-glob-pm-in-inc-inc-contains-d-tools-lib-at-directory-p
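One of the fixes described in those links is to add the module's directory to Perl's search path through the PERL5LIB environment variable. A minimal sketch, assuming the modules were installed under ~/perl5/lib/perl5 (that path is my assumption; replace it with wherever your modules actually live):

```shell
# Prepend the assumed module directory to PERL5LIB,
# keeping whatever was already there
export PERL5LIB="$HOME/perl5/lib/perl5${PERL5LIB:+:$PERL5LIB}"

# Print the directories Perl will search; the one above should be listed
perl -e 'print "$_\n" for @INC'
```

To make the setting permanent, the export line can go into your shell start-up file.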

Others

How to set up window switching (Alt + Tab) in NoMachine

http://www.nomachine.com/ar/view.php?ar_id=AR04C00174

Cron

If you want to run some scripts periodically at specific times, you might want to use cron. Here is a short tutorial on how to use cron.

http://www.thesitewizard.com/general/set-cron-job.shtml

http://stackoverflow.com/questions/8986554/fire-job-every-90-min-continuously?lq=1
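To give a feel for the format, here is a sketch that writes two example entries to a file without installing anything. The script path /path/to/myscript.sh and the file name my_jobs.cron are placeholders of mine. The five time fields are minute, hour, day of month, month and day of week; the */15 step syntax is supported by the common cron implementations but may not work everywhere.

```shell
# Write two example crontab entries to a file (nothing is installed yet)
cat > my_jobs.cron <<'EOF'
# minute hour day-of-month month day-of-week command
# Run every day at 02:00
0 2 * * * /path/to/myscript.sh
# Run every 15 minutes
*/15 * * * * /path/to/myscript.sh
EOF

cat my_jobs.cron
```

To actually schedule the jobs, paste the entries into the editor opened by crontab -e. Running crontab my_jobs.cron also works, but be careful: it replaces your whole existing crontab.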
