Evaluating data quality from multiple providers

Post date: Aug 25, 2017 7:03:42 PM

Our goal here is to evaluate the data quality from multiple data providers. Ultimately, we want to be able to recommend which data provider(s) to subscribe to.

Get the data from data providers to our database

A data provider normally shares files with us via FTP or SFTP. We want to copy the data into our own storage; in this case I create S3 buckets to hold it. My plan is to use lftp (please refer to Appendix A: LFTP) to copy the data from the remote directory (the provider's end) to an EC2 instance, where the data is processed and reformatted before being moved to S3 using the boto package (in Python).
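
As a sketch of that final upload step, assuming boto3 as the boto package (the bucket name, key prefix, and local directory below are placeholders):

import os
import boto3  # assumes boto3 is installed and AWS credentials are configured

def upload_dir_to_s3(local_dir, bucket, prefix, region="us-west-2"):
    """Upload every file under local_dir to s3://bucket/prefix/, keeping relative paths."""
    s3 = boto3.client("s3", region_name=region)
    for root, _, files in os.walk(local_dir):
        for name in files:
            local_path = os.path.join(root, name)
            rel_path = os.path.relpath(local_path, local_dir).replace(os.sep, "/")
            key = prefix.rstrip("/") + "/" + rel_path
            s3.upload_file(local_path, bucket, key)  # handles multipart upload for large files
            print("uploaded %s -> s3://%s/%s" % (local_path, bucket, key))

# hypothetical usage:
# upload_dir_to_s3("/tmp/processed_data", "<bucket_name>", "provider_a/2017-08-25")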

The following commands and scripts can be helpful for data transfer and processing:

# Screen, so that you can detach from the working console without terminating the running tasks
> screen              # start a new session
> screen -ls          # list existing sessions
> screen -r <id>      # reattach to a session
# install lftp
> sudo yum install lftp
# lftp interactive mode
> lftp sftp://<user>:<password>@<host>
lftp> mirror --use-pget-n=5 <folder>
# scripted one-liners: mirror copies a directory; pget copies a single file in parallel segments
lftp sftp://<user>:<password>@<host> -e "mirror --use-pget-n=20 <folder>; bye"
lftp sftp://<user>:<password>@<host> -e "pget -n <number_of_segments> <file>; bye"
# Copy data to S3
aws s3 cp <folder> s3://<bucket_name>/<folder> --recursive --region us-west-2
# copy only files with a specific extension to S3
aws s3 cp ./ s3://<bucket_name>/<folders>/ --recursive --region us-west-2 --exclude '*' --include '*.gz'
# Copy data from S3
aws s3 cp s3://<bucket_name>/<path>/ ./ --recursive --region us-west-2
# how to split files into small pieces (-C keeps lines intact; -a 8 -d gives 8-digit numeric suffixes)
split -C 50M -a 8 -d <your_big_file> <output_prefix>
# how to remove the header
head -n 1 <original_file> > <header_file>    # save the header row separately
tail -n +2 <original_file> > <no_header>     # keep everything from line 2 onward
# how to gzip them
gzip <filename>
# list files under an S3 path
aws s3 ls --human-readable s3://<bucket_name>/<path>/
# split each .TXT file into ~100MB pieces and gzip the pieces
search_dir=/tmp/raw_data/
for eachfile in "$search_dir"*.TXT; do
    filename=$(basename "$eachfile")
    filename="${filename%.TXT}"
    echo "splitting $filename.TXT"
    split -C 100M -a 8 -d --verbose "$search_dir$filename.TXT" "${filename}_"
    # split writes the pieces to the current directory; gzip each one
    for entry in "${filename}_"*; do
        echo "gzipping $entry"
        gzip "$entry"
    done
done

Using Hive to analyze the data

- How to store the data

- It can be difficult if the data format is fixed width; refer to this post (and see the preprocessing sketch below):

http://www.hadooptechs.com/hive/fixed-width-files-in-hive
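
One common workaround, offered here as an assumption rather than necessarily the linked post's approach, is to reformat fixed-width files into a delimited format on the EC2 instance before loading them into Hive. A minimal Python sketch, where the column names and byte offsets are made up for illustration:

import csv

# hypothetical fixed-width layout: (name, start, end) byte offsets are placeholders
LAYOUT = [("record_id", 0, 10), ("name", 10, 40), ("zip_code", 40, 45)]

def fixed_width_to_tsv(in_path, out_path):
    """Slice each fixed-width line into fields and write them tab-delimited."""
    with open(in_path) as src, open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst, delimiter="\t")
        for line in src:
            writer.writerow([line[start:end].strip() for (_, start, end) in LAYOUT])

# fixed_width_to_tsv("raw_fixed_width.TXT", "delimited.tsv")

Hive can then read the tab-delimited output with a plain delimited-row table definition.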

Methodology

- Precision and coverage: roughly, precision measures how often a provider's records agree with a trusted reference, and coverage measures how much of our universe of interest the provider supplies (see the sketch below).
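
As an illustration of how these two metrics could be computed, using toy records and a hypothetical trusted reference set:

def precision_and_coverage(provider_records, ground_truth):
    """provider_records and ground_truth are dicts mapping a record key to its value."""
    matched = set(provider_records) & set(ground_truth)
    # precision: of the records both sides have, how often does the provider agree?
    correct = sum(1 for k in matched if provider_records[k] == ground_truth[k])
    precision = correct / len(matched) if matched else 0.0
    # coverage: how much of the trusted universe does the provider supply at all?
    coverage = len(matched) / len(ground_truth) if ground_truth else 0.0
    return precision, coverage

# toy example: 1 of the 2 overlapping records is correct; 2 of 3 reference keys are covered
provider = {"a": 1, "b": 2, "c": 3}
truth = {"a": 1, "b": 9, "d": 4}
print(precision_and_coverage(provider, truth))  # (0.5, 0.666...)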

Appendix A: LFTP

https://whatbox.ca/wiki/lftp

https://docs.salixos.org/wiki/How_to_send/get_a_file_to/from_a_remote_server_via_command_line_or_script

https://linoxide.com/linux-how-to/setup-lftp-command-line-ftp/