Working with a large data set is rarely easy. Say you have a data set with 150 million records (observations), each of which has attributes of different types (Boolean, double, text, level, etc.). The file can be as large as 5GB, and opening it in a spreadsheet might take a while. I found that the command line in a terminal can be our best friend in such a situation. Here I list some nice tools that data scientists commonly use in their routine jobs.
Basic commands in terminal
This URL contains a very nice list of commands with concise descriptions and examples.
http://www.ee.surrey.ac.uk/Teaching/Unix/unix2.html
Split and merge files
Use split to break a file into pieces and cat to merge them back.
http://askubuntu.com/questions/54579/how-to-split-larger-files-into-smaller-parts
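Here is a minimal sketch of a split-and-merge round trip. The file names (big.csv, the part_ prefix, merged.csv) and the tiny 10-line input are made up for illustration; with a real 5GB file you would pick a much larger line count.

```shell
# Work in a throwaway directory so nothing important is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a small stand-in for a large file: 10 numbered records.
seq 1 10 > big.csv

# Split into pieces of 4 lines each: part_aa, part_ab, part_ac.
split -l 4 big.csv part_

# Merge the pieces back; the shell expands part_* in sorted order.
cat part_* > merged.csv

# cmp exits with status 0 when the two files are identical.
cmp big.csv merged.csv && echo "round trip OK"
```

Splitting by line count (-l) keeps every record whole. split -b splits by byte count instead, which can cut a record in half at a piece boundary.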
Parsing the first column of a csv file to a new file
$ cut -d, -f1 in.csv > out.csv
http://stackoverflow.com/a/2652526/1613297
http://lowfatlinux.com/linux-columns-cut.html
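The same idea extends to several columns at once. This is a small sketch on invented sample data (the column names and the file id_city.csv are assumptions for the demo):

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A tiny sample CSV with a header row (made-up data).
printf 'id,name,city\n1,Ann,Austin\n2,Bob,Boston\n' > in.csv

# -d, sets the field delimiter to a comma; -f1,3 keeps fields 1 and 3.
cut -d, -f1,3 in.csv > id_city.csv

cat id_city.csv
# id,city
# 1,Austin
# 2,Boston
```

Keep in mind that cut splits on every comma, so it gives wrong columns when a quoted field contains a comma. For such files a real CSV parser is the safer choice.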
View contents of a file: cat, head and tail
$ cat filename.ext | less   # view the file page by page until the end; quit by pressing q
$ head -n 20 filename.ext   # view the first 20 lines of the file
$ tail -n 30 filename.ext   # view the last 30 lines of the file
More information can be found here:
tail: http://kb.iu.edu/data/acrj.html
Sort a file and keep unique records by some ID
http://stackoverflow.com/questions/13454464/sort-unix-file-by-id
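As a rough sketch of one way to do this with sort alone, assuming the ID is the first comma-separated field and is numeric (the sample records and file names are invented):

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample records: "id,name", with id 1 appearing twice.
printf '3,carol\n1,alice\n2,bob\n1,alice-duplicate\n' > records.csv

# -t,     use a comma as the field separator
# -k1,1n  the sort key is field 1 only, compared numerically
# -u      keep a single line for each distinct key
sort -t, -k1,1n -u records.csv > unique_by_id.csv

cat unique_by_id.csv
```

The output has one line each for IDs 1, 2 and 3. Which of the two lines with ID 1 survives is not something I would rely on across sort implementations, so deduplicate this way only when any one of the duplicates will do.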
Sort text file by the last field value/attribute
http://stackoverflow.com/a/3832236/1613297
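One common approach, sketched here and not necessarily the one in the linked answer, is to copy the last field to the front of each line, sort on it, and then strip it off again. The sample file is invented, and I assume whitespace-separated fields with a numeric last field and no tab characters in the data:

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Lines with different numbers of fields; the last field is the sort key.
printf 'b x 30\na 5\nc y z 20\n' > rows.txt

# 1. awk prepends the last field ($NF) and a tab to every line.
# 2. sort orders the lines numerically on that prepended first field.
# 3. cut drops the prepended field again (tab is cut's default delimiter).
awk '{ print $NF "\t" $0 }' rows.txt | sort -k1,1n | cut -f2- > sorted_by_last.txt

cat sorted_by_last.txt
# a 5
# c y z 20
# b x 30
```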
Search string in a text file
Often, you will want to search for some string in a huge text file just to double-check that your data was obtained correctly. The command grep is super useful for this.
$ grep --color 'CO.,LTD' someCompanyList.csv
http://www.cyberciti.biz/faq/howto-search-find-file-for-text-string/
http://www.cyberciti.biz/faq/howto-use-grep-command-in-linux-unix/
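One detail worth knowing: grep treats the pattern as a regular expression, so the dot in 'CO.,LTD' matches any character. The sketch below, on an invented companies.csv, shows the difference that the -F (fixed string) option makes:

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up company list; note the second line has an X where the dot would be.
printf 'ACME CO.,LTD\nACME COX,LTD\nBeta Inc\nGamma CO.,LTD\n' > companies.csv

# As a regular expression, "." matches any character, so COX,LTD also counts.
regex_hits=$(grep -c 'CO.,LTD' companies.csv)

# -F treats the pattern as a literal string, so only real dots match.
fixed_hits=$(grep -c -F 'CO.,LTD' companies.csv)

echo "regex: $regex_hits, fixed: $fixed_hits"
# regex: 3, fixed: 2

# -n prefixes each matching line with its line number in the file.
grep -n -F 'CO.,LTD' companies.csv
# 1:ACME CO.,LTD
# 4:Gamma CO.,LTD
```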
String replacement
sed is very good for this.
Example
My input.txt looks like this:
12345 Myroad, Seattle, TX 76134
and I want to replace ', ' (comma+space) with '$$$***'. I will use the command
$ sed -e 's/, /$$$***/g' input.txt > output.txt
and the output file output.txt would look like
12345 Myroad$$$***Seattle$$$***TX 76134
However, converting it back to the original is not as easy. Here is how:
$ sed -e 's/\$\$\$\*\*\*/, /g' output.txt > output_back.txt
That's because $ and * are both special characters in a regular expression, so each needs a preceding backslash '\' to be matched literally.
http://www.grymoire.com/Unix/Sed.html#uh-6
Edit file contents
vim is dearly loved! I remember that my first time using vim was not a smooth ride, but I'm loving it very much right now. I bet you will too.
http://www.tuxfiles.org/linuxhelp/vimcheat.html
To log/record all the input and output from the terminal
To start logging use the command
$ script mylog.text
and to end, use
$ exit
http://askubuntu.com/questions/161935/how-do-i-log-all-input-and-output-in-a-terminal-session
To connect to a remote computer
Use ssh:
$ ssh mylogin@host.machine.name
To transfer files to and from a remote computer
We might want to use scp to transfer files between two computers.
To copy a file from the remote machine to the local machine:
$ scp mylogin@host.machine.name:/path/on/remote/machine/myfile.csv /path/on/local/machine/
To copy a file from the local machine to the remote machine:
$ scp /path/on/local/machine/myfile.csv mylogin@host.machine.name:/path/on/remote/machine/
http://ged.msu.edu/angus/tutorials/using-ssh-scp-terminal-macosx.html
http://stackoverflow.com/questions/14325956/using-scp-in-terminal
To view and kill running jobs
# To list the process IDs of the jobs started from the current shell, use
$ jobs -p
# To kill the process with ID 12345, use
$ kill -9 12345
http://www.cyberciti.biz/faq/linux-unix-appleosx-bsd-kill-all-jobs-under-bash-ksh-sh-shell/
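To try this safely, here is a small sketch that starts a harmless background job (sleep stands in for a long-running script) and then stops it. It sends the default TERM signal instead of -9, since -9 cannot be caught and gives the program no chance to clean up; I would keep -9 as a last resort.

```shell
# Start a harmless long-running job in the background.
sleep 300 &
pid=$!   # $! holds the process ID of the most recent background job

# Print the process IDs of this shell's background jobs.
jobs -p

# Ask the process to terminate (default signal is TERM).
kill "$pid"

# Wait for it to exit so the shell can reap it.
wait "$pid" 2>/dev/null

# kill -0 sends no signal; it only checks whether the process still exists.
if kill -0 "$pid" 2>/dev/null; then
    echo "still running"
else
    echo "terminated"
fi
```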
Most of the time, you will have to deal with text data, and Perl is a great tool for that.
Convert between attribute-value table and flat table
This is a very good example.
How do you know the version of your Perl?
$ perl -v
Install a Perl module
First, you might want to install cpanm, which makes installation easier in the long run:
$ cpan App::cpanminus
When you know which module to install, just use:
$ cpanm Module::Name
For example, in my code, I have
use Text::CSV;
which means my program requires the module called Text::CSV. To install the module, I type this in the terminal:
$ cpanm Text::CSV
Please refer to http://www.cpan.org/modules/INSTALL.html for more details.
Some nice tutorials on Perl can be found here: http://perl-tutorial.org/
If Perl cannot find a module on your computer, try this:
http://perlmaven.com/how-to-change-inc-to-find-perl-modules-in-non-standard-locations
How to set the page switching (Alt + Tab) in NoMachine
http://www.nomachine.com/ar/view.php?ar_id=AR04C00174
Cron
If you want to run some scripts periodically at specific times, you might want to use cron. Here is a short tutorial on how to use cron, followed by a question about running a job every 90 minutes.
http://www.thesitewizard.com/general/set-cron-job.shtml
http://stackoverflow.com/questions/8986554/fire-job-every-90-min-continuously?lq=1
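A single cron entry cannot express "every 90 minutes" directly, but two entries can. The sketch below writes them to a file without installing anything; the script path /path/to/myscript.sh is a placeholder, and I am assuming a cron that accepts range/step syntax such as 0-21/3, as the common Linux crons do.

```shell
# Write the entries to a scratch file; nothing is installed yet.
workdir=$(mktemp -d)
cat > "$workdir/every_90_min.cron" <<'EOF'
# minute hour day-of-month month day-of-week command
0  0-21/3 * * * /path/to/myscript.sh
30 1-22/3 * * * /path/to/myscript.sh
EOF

# Review the entries before installing them.
cat "$workdir/every_90_min.cron"

# To install, run the line below. Careful: it REPLACES your current crontab.
# crontab "$workdir/every_90_min.cron"
```

The first entry fires at 00:00, 03:00, ..., 21:00 and the second at 01:30, 04:30, ..., 22:30, which together gives a run every 90 minutes around the clock.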