Post date: Aug 25, 2017 7:03:42 PM
Our goal here is to evaluate data quality across multiple data providers. At the end of the day we want to be able to recommend which data provider(s) to subscribe to.
A data provider normally shares files with us via FTP or SFTP. We want to copy the data into our own storage; in this case I create S3 buckets to hold it. My plan is to use lftp (please refer to Appendix A: LFTP) to copy the data from the remote directory (the provider end) to an EC2 instance, where the data is processed and reformatted before being moved to S3 using the boto package (in Python).
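As a sketch of the EC2-to-S3 step, the upload can be driven from Python. The helper below only assumes an object exposing boto3's `upload_file(Filename, Bucket, Key)` method; the bucket name, prefix, and directory layout in the docstring are hypothetical:

```python
import os

def upload_dir_to_s3(s3_client, local_dir, bucket, prefix):
    """Upload every file under local_dir to s3://bucket/prefix/,
    preserving the relative directory layout. Returns the S3 keys
    that were uploaded."""
    uploaded = []
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            path = os.path.join(root, name)
            # Build the S3 key from the path relative to local_dir
            rel = os.path.relpath(path, local_dir)
            key = "/".join([prefix] + rel.split(os.sep))
            s3_client.upload_file(path, bucket, key)
            uploaded.append(key)
    return uploaded
```

With boto3 installed and credentials configured, `boto3.client("s3")` provides the required `upload_file` method.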
The following scripts can be helpful for data processing:
# Screen, so that you can leave the working console without terminating the running tasks
> screen
> screen -ls
> screen -r <id>

# Install lftp
> sudo yum install lftp

# lftp interactive mode
> lftp sftp://<user>:<password>@<host>
lftp> mirror --use-pget-n=5 <folder>

# Scripted (non-interactive) mode
lftp sftp://<user>:<password>@<host> -e "mirror --use-pget-n=20 <folder>; bye"
lftp sftp://<user>:<password>@<host> -e "pget -n <number_of_segments> <folder>; bye"

# Copy data to S3
aws s3 cp <folder> s3://<bucket_name>/<folder> --recursive --region us-west-2

# Copy files with a specific extension to S3
aws s3 cp ./ s3://<bucket_name>/<folders>/ --recursive --region us-west-2 --exclude '*' --include '*.gz'

# Copy data from S3
aws s3 cp s3://<bucket_name>/<path>/ ./ --recursive --region us-west-2

# How to split files into small pieces
split -C 50M -a 8 -d <your_big_file> <output_prefix>

# How to remove the header
head -n 1 <original_file> > <header_file>
tail -n +2 <original_file> > <no_header_file>

# How to gzip them
gzip <filename>

# List files in a directory
aws s3 ls --human-readable s3://<your_s3_path>

# Split each file and gzip the pieces
search_dir=/tmp/raw_data/
for eachfile in `ls $search_dir | grep .TXT`; do
    filename=$(basename "$eachfile")
    filename="${filename%.TXT}"
    echo splitting $filename.TXT
    split -C 100M -a 8 -d $search_dir$filename.TXT <output_prefix>_ --verbose
    for entry in `ls | grep "<output_prefix>_"`; do
        entryname=$(basename "$entry")
        echo gzipping $entryname
        gzip $entryname
    done
done

- how to store the data
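The split-and-gzip loop above can also be sketched in Python; the chunk size (by line count rather than bytes) and the 8-digit suffix scheme are illustrative choices, not taken from the original script:

```python
import gzip

def split_and_gzip(path, chunk_lines=100000):
    """Split a large text file into numbered pieces of at most
    chunk_lines lines each, gzipping each piece as it is written.
    Returns the list of pieces created (e.g. data.TXT_00000000.gz)."""
    pieces = []
    with open(path, "r") as src:
        part, buf = 0, []
        for line in src:
            buf.append(line)
            if len(buf) >= chunk_lines:
                pieces.append(_flush(path, part, buf))
                part, buf = part + 1, []
        if buf:  # write the final, possibly short, piece
            pieces.append(_flush(path, part, buf))
    return pieces

def _flush(path, part, buf):
    # 8-digit numeric suffix, mirroring `split -a 8 -d`
    out = "%s_%08d.gz" % (path, part)
    with gzip.open(out, "wt") as dst:
        dst.writelines(buf)
    return out
```

Concatenating the gunzipped pieces reproduces the original file, so the header-stripping step above can be applied either before or after splitting.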
- This can be difficult if the data format is fixed width; refer to this post on fixed-width files in Hive:
http://www.hadooptechs.com/hive/fixed-width-files-in-hive
- Evaluate each provider's data on precision and coverage
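For the evaluation itself, one possible convention (an assumption on my part, not spelled out above): precision is the share of delivered records whose values agree with a trusted reference, and coverage is the share of reference records the provider delivers at all. The toy records in the test are invented:

```python
def precision_and_coverage(provider, reference):
    """provider and reference map a record key to its value.
    Precision: of the provider's keys that appear in the reference,
    the fraction whose value agrees with the reference.
    Coverage: the fraction of reference keys the provider supplies."""
    shared = [k for k in provider if k in reference]
    correct = sum(1 for k in shared if provider[k] == reference[k])
    precision = correct / len(shared) if shared else 0.0
    coverage = len(shared) / len(reference) if reference else 0.0
    return precision, coverage
```

Computing these per provider against the same reference sample gives a comparable basis for the subscription recommendation.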
Appendix A: LFTP
https://whatbox.ca/wiki/lftp
https://docs.salixos.org/wiki/How_to_send/get_a_file_to/from_a_remote_server_via_command_line_or_script
https://linoxide.com/linux-how-to/setup-lftp-command-line-ftp/