Evaluating data quality from multiple providers

Post date: Aug 25, 2017 7:03:42 PM

Our goal here is to evaluate the data quality from multiple data providers. Ultimately, we want to be able to recommend which data provider(s) to subscribe to.

Get the data from data providers to our database

A data provider normally shares files with us via FTP or SFTP. We want to copy the data into our own storage; in this case I create S3 buckets to hold it. My plan is to use lftp (please refer to Appendix A: LFTP) to copy the data from the remote directory (the provider's end) to an EC2 instance, where the data is processed and reformatted before being moved to S3 using the boto package (in Python).
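
As a sketch of that final upload step, assuming boto3 as the boto package (the bucket name, key prefix, and local directory below are placeholders):

import os
import boto3  # assumes boto3 is installed and AWS credentials are configured

def upload_dir_to_s3(local_dir, bucket, prefix, region="us-west-2"):
    """Upload every file under local_dir to s3://bucket/prefix/, keeping relative paths."""
    s3 = boto3.client("s3", region_name=region)
    for root, _, files in os.walk(local_dir):
        for name in files:
            local_path = os.path.join(root, name)
            rel_path = os.path.relpath(local_path, local_dir).replace(os.sep, "/")
            key = prefix.rstrip("/") + "/" + rel_path
            s3.upload_file(local_path, bucket, key)  # handles multipart upload for large files
            print("uploaded %s -> s3://%s/%s" % (local_path, bucket, key))

# hypothetical usage:
# upload_dir_to_s3("/tmp/processed_data", "<bucket_name>", "provider_a/2017-08-25")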

The following commands and scripts can be helpful for data transfer and processing:

# Screen, so that you can detach from the working console without terminating the running tasks
> screen              # start a new session
> screen -ls          # list existing sessions
> screen -r <id>      # reattach to a session
# install lftp
> sudo yum install lftp
# lftp interactive mode
> lftp sftp://<user>:<password>@<host>
lftp> mirror --use-pget-n=5 <folder>
# scripted one-liners: mirror copies a directory; pget copies a single file in parallel segments
lftp sftp://<user>:<password>@<host> -e "mirror --use-pget-n=20 <folder>; bye"
lftp sftp://<user>:<password>@<host> -e "pget -n <number_of_segments> <file>; bye"
# Copy data to S3
aws s3 cp <folder> s3://<bucket_name>/<folder> --recursive --region us-west-2
# copy only files with a specific extension to S3
aws s3 cp ./ s3://<bucket_name>/<folders>/ --recursive --region us-west-2 --exclude '*' --include '*.gz'
# Copy data from S3
aws s3 cp s3://<bucket_name>/<path>/ ./ --recursive --region us-west-2
# how to split files into small pieces (-C keeps lines intact; -a 8 -d gives 8-digit numeric suffixes)
split -C 50M -a 8 -d <your_big_file> <output_prefix>
# how to remove the header
head -n 1 <original_file> > <header_file>    # save the header row separately
tail -n +2 <original_file> > <no_header>     # keep everything from line 2 onward
# how to gzip them
gzip <filename>
# list files under an S3 path
aws s3 ls --human-readable s3://<bucket_name>/<path>/
# split each .TXT file into ~100MB pieces and gzip the pieces
search_dir=/tmp/raw_data/
for eachfile in "$search_dir"*.TXT; do
    filename=$(basename "$eachfile")
    filename="${filename%.TXT}"
    echo "splitting $filename.TXT"
    split -C 100M -a 8 -d --verbose "$search_dir$filename.TXT" "${filename}_"
    # split writes the pieces to the current directory; gzip each one
    for entry in "${filename}_"*; do
        echo "gzipping $entry"
        gzip "$entry"
    done
done

Using Hive to analyze the data

- How to store the data

- It can be difficult if the data format is fixed width; refer to this post (and see the preprocessing sketch below):

http://www.hadooptechs.com/hive/fixed-width-files-in-hive
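
One common workaround, offered here as an assumption rather than necessarily the linked post's approach, is to reformat fixed-width files into a delimited format on the EC2 instance before loading them into Hive. A minimal Python sketch, where the column names and byte offsets are made up for illustration:

import csv

# hypothetical fixed-width layout: (name, start, end) byte offsets are placeholders
LAYOUT = [("record_id", 0, 10), ("name", 10, 40), ("zip_code", 40, 45)]

def fixed_width_to_tsv(in_path, out_path):
    """Slice each fixed-width line into fields and write them tab-delimited."""
    with open(in_path) as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst, delimiter="\t")
        for line in src:
            writer.writerow([line[start:end].strip() for (_, start, end) in LAYOUT])

# fixed_width_to_tsv("raw_fixed_width.TXT", "delimited.tsv")

Hive can then read the tab-delimited output with a plain delimited-row table definition.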

Methodology

- Precision and coverage: roughly, precision measures how often a provider's records agree with a trusted reference, and coverage measures how much of our universe of interest the provider supplies (see the sketch below).
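
As an illustration of how these two metrics could be computed, using toy records and a hypothetical trusted reference set:

def precision_and_coverage(provider_records, ground_truth):
    """provider_records and ground_truth are dicts mapping a record key to its value."""
    matched = set(provider_records) & set(ground_truth)
    # precision: of the records both sides have, how often does the provider agree?
    correct = sum(1 for k in matched if provider_records[k] == ground_truth[k])
    precision = correct / len(matched) if matched else 0.0
    # coverage: how much of the trusted universe does the provider supply at all?
    coverage = len(matched) / len(ground_truth) if ground_truth else 0.0
    return precision, coverage

# toy example: 1 of the 2 overlapping records is correct; 2 of 3 reference keys are covered
provider = {"a": 1, "b": 2, "c": 3}
truth = {"a": 1, "b": 9, "d": 4}
print(precision_and_coverage(provider, truth))  # (0.5, 0.666...)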

Appendix A: LFTP

https://whatbox.ca/wiki/lftp

https://docs.salixos.org/wiki/How_to_send/get_a_file_to/from_a_remote_server_via_command_line_or_script

https://linoxide.com/linux-how-to/setup-lftp-command-line-ftp/