Tools for data science
Working with a large data set is rarely easy. Say you have a data set with 150 million records (observations), each of which has attributes of different types (Boolean, double, text, level, etc.). The file can be as large as 5 GB, and opening it in a spreadsheet might take a long while. I have found that the command line in a terminal can be our best friend in such a situation. Here I list some nice tools that data scientists commonly use in their routine jobs.
To deal with data in terminal...
Basic commands in terminal
This URL contains a very nice list of commands with concise descriptions and examples.
http://www.ee.surrey.ac.uk/Teaching/Unix/unix2.html
Split and merge files
Use split to break a large file into smaller parts, and cat to merge the parts back together.
http://askubuntu.com/questions/54579/how-to-split-larger-files-into-smaller-parts
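As a rough sketch of the split-and-merge round trip, the snippet below splits a small stand-in file by line count and then joins the pieces again. The file names (big.csv, merged.csv), the part_ prefix and the 300-line chunk size are made up for illustration; for a real 5 GB file you would pick a much larger chunk size.

```shell
# Work in a scratch directory so nothing else is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Build a small stand-in for a large file: 1000 numbered lines.
seq 1 1000 > big.csv

# Split into pieces of at most 300 lines each.
# The pieces are named part_aa, part_ab, part_ac, ...
split -l 300 big.csv part_

# Merge the pieces back; the shell expands part_* in sorted order.
cat part_* > merged.csv

# cmp exits with status 0 only when the two files are identical.
cmp big.csv merged.csv && echo "round trip OK"
```

Splitting by line count (-l) rather than by byte count keeps every record whole, which matters if each piece is processed on its own.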
Parsing the first column of a csv file to a new file
cut -d, -f1 in.csv > first_column.csv
http://stackoverflow.com/a/2652526/1613297
http://lowfatlinux.com/linux-columns-cut.html
View contents of a file: cat, head and tail
$ cat filename.ext | less # To view the file page by page until the end. You can quit by pressing q
$ head -n 20 filename.ext # To view the first 20 lines of the file
$ tail -n 30 filename.ext # To view the last 30 lines of the file
More information can be found here:
tail: http://kb.iu.edu/data/acrj.html
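head and tail can also be chained to look at a slice in the middle of a file. The following is a small sketch on a made-up file, numbers.txt, that holds the numbers 1 to 100, one per line.

```shell
# Work in a scratch directory so nothing else is touched.
workdir=$(mktemp -d)
cd "$workdir"

# A stand-in file with 100 lines: 1, 2, ..., 100.
seq 1 100 > numbers.txt

# Lines 16 to 20: take the first 20 lines, then keep the last 5 of those.
head -n 20 numbers.txt | tail -n 5 > middle.txt
cat middle.txt

# Skip the first line (for example a CSV header row) and show the next 3 lines.
tail -n +2 numbers.txt | head -n 3 > after_header.txt
cat after_header.txt
```

The first command prints 16 through 20, and the second prints 2, 3 and 4.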
Sort unique file by some id
http://stackoverflow.com/questions/13454464/sort-unix-file-by-id
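Here is a minimal sketch of sorting by an id column, assuming a comma-separated file whose first field is a numeric id. The file names and the sample rows are invented for illustration, and this is not necessarily the approach taken in the linked answer.

```shell
# Work in a scratch directory so nothing else is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Made-up records: id,name (with some repeated rows).
printf '%s\n' '3,carol' '1,alice' '2,bob' '1,alice' '3,carol' > records.csv

# -t, sets the field separator; -k1,1n sorts on field 1 only, numerically.
sort -t, -k1,1n records.csv > sorted.csv

# Adding -u keeps a single line for each distinct id.
sort -t, -k1,1n -u records.csv > sorted_unique.csv

cat sorted_unique.csv
```

The last command prints 1,alice then 2,bob then 3,carol. Without the n modifier the ids would be compared as text, so 10 would sort before 2.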
Sort text file by the last field value/attribute
http://stackoverflow.com/a/3832236/1613297
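One common way to sort on the last field, when the number of fields may vary from line to line, is to copy the last field to the front, sort on it, and then strip it off again. The sketch below assumes a comma-separated file with a numeric last field and no quoted commas; the file names and rows are made up.

```shell
# Work in a scratch directory so nothing else is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Made-up records: name,city,score
printf '%s\n' 'alice,seattle,30' 'bob,austin,5' 'carol,boston,12' > people.csv

# 1. awk prepends the last field ($NF) to every line.
# 2. sort orders the lines numerically on that prepended field.
# 3. cut removes the prepended field again.
awk -F, '{ print $NF "," $0 }' people.csv \
    | sort -t, -k1,1n \
    | cut -d, -f2- > by_last_field.csv

cat by_last_field.csv
```

This prints the bob row first (5), then carol (12), then alice (30).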
Search string in a text file
Often you will want to search for some string in a huge text file, just to double-check whether your data was obtained correctly. The command grep is super useful for this.
grep --color 'CO.,LTD' someCompanyList.csv
http://www.cyberciti.biz/faq/howto-search-find-file-for-text-string/
http://www.cyberciti.biz/faq/howto-use-grep-command-in-linux-unix/
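One thing to keep in mind is that grep treats the search string as a regular expression, so a dot matches any single character. The sketch below, on a made-up companies.csv, compares the default behaviour with -F, which searches for the string literally; -c counts the matching lines instead of printing them.

```shell
# Work in a scratch directory so nothing else is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Made-up company list; the second row has an X where the dot would be.
printf '%s\n' 'ACME CO.,LTD' 'ACME COX,LTD' 'OTHER INC' > companies.csv

# Default: the pattern is a regular expression, so '.' matches any character.
regex_hits=$(grep -c 'CO.,LTD' companies.csv)

# -F: the pattern is a fixed string, so '.' only matches a literal dot.
literal_hits=$(grep -c -F 'CO.,LTD' companies.csv)

echo "regex: $regex_hits, literal: $literal_hits"
```

This prints "regex: 2, literal: 1". When checking data for an exact string, -F avoids such accidental matches.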
String replacement
sed is very good for this.
Example
My input.txt looks like this:
12345 Myroad, Seattle, TX 76134
and I want to replace ', ' (comma+space) with '$$$***'. To do that, I use the command
$ sed -e 's/, /$$$***/g' input.txt > output.txt
and the output file output.txt would look like
12345 Myroad$$$***Seattle$$$***TX 76134
Converting it back to the original is not quite as easy, but here is how:
$ sed -e 's/\$\$\$\*\*\*/, /g' output.txt > output_back.txt
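That's because $ and * are special characters in a sed search pattern (a regular expression), so each needs a preceding backslash '\' to be matched literally. In the replacement part of the first command they are ordinary characters, which is why no backslashes were needed there.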
http://www.grymoire.com/Unix/Sed.html#uh-6
Edit file contents
vim is dearly loved! I remember that my first time using vim was not a smooth ride, but I love it very much now. I bet you will too.
http://www.tuxfiles.org/linuxhelp/vimcheat.html
To log/record all the input and output from the terminal
To start logging use the command
$ script mylog.text
and to end, use
$ exit
http://askubuntu.com/questions/161935/how-do-i-log-all-input-and-output-in-a-terminal-session
To connect to a remote computer
Use ssh:
ssh mylogin@host.machine.name
To transfer files to and from a remote computer
We might want to use scp to transfer files between two computers.
To copy a file from the remote machine to the local machine:
scp mylogin@host.machine.name:/path/on/remote/machine/myfile.csv /path/on/local/machine/
To copy a file from the local machine to the remote machine:
scp /path/on/local/machine/myfile.csv mylogin@host.machine.name:/path/on/remote/machine/
http://ged.msu.edu/angus/tutorials/using-ssh-scp-terminal-macosx.html
http://stackoverflow.com/questions/14325956/using-scp-in-terminal
To view and kill running jobs
# To list the process IDs of all the jobs started from this shell use
$ jobs -p
# To forcibly kill the process with ID 12345 use
$ kill -9 12345
http://www.cyberciti.biz/faq/linux-unix-appleosx-bsd-kill-all-jobs-under-bash-ksh-sh-shell/
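The whole cycle can be sketched with a harmless background job. Here sleep 300 stands in for a long-running script. Plain kill sends SIGTERM, which lets the process shut down cleanly, so it is usually worth trying before kill -9.

```shell
# Start a long-running stand-in job in the background.
sleep 300 &

# $! holds the process ID of the most recent background job.
pid=$!
echo "background PID: $pid"

# List the process IDs of the jobs started from this shell.
jobs -p

# Ask the process to terminate (SIGTERM), then wait for it to be reaped.
kill "$pid"
wait "$pid" 2>/dev/null || true

# kill -0 sends no signal; it only checks whether the process still exists.
# Fall back to the forceful kill -9 only if the process survived.
if kill -0 "$pid" 2>/dev/null; then
    kill -9 "$pid"
fi
```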
Perl
Most of the time, you will have to deal with text data, and Perl is a great tool for that.
Convert between attribute-value table and flat table
This is a very good example.
How do you know the version of your Perl?
$ perl -v
Install some module to Perl
First, you might want to install cpanm, which makes installation easier in the long run:
$ cpan App::cpanminus
When you know which module to install, just use:
$ cpanm Module::Name
For example, in my code, I have
use Text::CSV;
This means my program requires the module called Text::CSV. To install the module, I type this in the terminal:
$ cpan Text::CSV
Please refer to http://www.cpan.org/modules/INSTALL.html for more details.
Some nice tutorials on Perl can be found here: http://perl-tutorial.org/
If Perl cannot find the module on your computer, try this:
http://perlmaven.com/how-to-change-inc-to-find-perl-modules-in-non-standard-locations
Others
How to set the page switching (Alt + Tab) in the NoMachine
http://www.nomachine.com/ar/view.php?ar_id=AR04C00174
Cron
If you want to run some scripts periodically at specific times, you might want to use cron. Here is a short tutorial on how to use cron.
http://www.thesitewizard.com/general/set-cron-job.shtml
http://stackoverflow.com/questions/8986554/fire-job-every-90-min-continuously?lq=1
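A single cron entry cannot express "every 90 minutes" directly, because each field repeats within its own hour or day. One way around this is a pair of entries that together fire at 00:00, 01:30, 03:00 and so on. The sketch below only writes the two entries to a file, and the script path /path/to/myscript.sh is a placeholder.

```shell
# Work in a scratch directory so nothing else is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Write the crontab entries to a file (nothing is installed at this point).
cat > every90min.cron <<'EOF'
# minute hour day-of-month month day-of-week command
0  0,3,6,9,12,15,18,21  * * * /path/to/myscript.sh
30 1,4,7,10,13,16,19,22 * * * /path/to/myscript.sh
EOF

# Show the entries without the comment line.
grep -v '^#' every90min.cron
```

Running crontab every90min.cron would install the file as your crontab, replacing whatever is there now, so it is safer to check the current entries with crontab -l first.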