Progress & timeline

*The report is not strictly in chronological order; activities and progress are organized by task.

Jan 11 - 29

Topic selection
Paper / relevant material reading
Set up the template of the progress report website
Get familiar with NCBI GenBank

Project Data set webpage: https://www.ncbi.nlm.nih.gov/bioproject/678522

Doc: https://www.ncbi.nlm.nih.gov/genbank/

Learn about different data types in the NCBI GenBank

Reference: https://www.jianshu.com/p/4948292d8120

Get familiar with the SRA Toolkit's installation, configuration, and command-line usage

Doc: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc

Configuration: https://github.com/ncbi/sra-tools/wiki/05.-Toolkit-Configuration

Other useful resources: https://hpc.nih.gov/apps/sratoolkit.html
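As a rough illustration of the command-line usage mentioned above, the sketch below builds the two SRA Toolkit calls typically used to fetch a run and convert it to FASTQ (`prefetch`, then `fasterq-dump`). The accession `SRR0000000`, the `fastq` output folder, and the helper names are placeholders made up for this example, not values from this project; the toolkit is assumed to be installed and configured as described in the docs linked above.

```python
import shlex
import subprocess


def sra_commands(accession, fastq_dir="fastq", threads=4):
    """Build the prefetch and fasterq-dump command lines for one run accession."""
    return [
        # Download the run archive into the current working directory.
        ["prefetch", accession],
        # Convert the downloaded run to FASTQ files under fastq_dir.
        ["fasterq-dump", accession, "--outdir", fastq_dir, "--threads", str(threads)],
    ]


def run_sra(accession, work_dir=".", dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    commands = sra_commands(accession)
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if the tool exits with an error.
            subprocess.run(cmd, cwd=work_dir, check=True)
    return commands
```

Calling `run_sra("SRR0000000")` only prints the commands; passing `dry_run=False` would actually run them, which needs enough free disk space for both the archive and the FASTQ output.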

Getting the data set
- Because the data set is so large, a cloud service is needed for data storage and manipulation
  - NCBI provides GCP and AWS options for direct data delivery; I chose AWS
  - 55 GB of raw data in total
- Learn how to transfer data from NCBI GenBank to a personal AWS S3 bucket as the main data storage
- Get familiar with AWS S3 for data storage
  - S3 Browser as the data set viewer

S3 Browser view for the first data set

Feb 1 - 12

Create an AWS EC2 instance and connect to it with PuTTY on Windows for command-line file manipulation
Study the basic use of the AWS CLI to work on S3 data from the EC2 instance

Main tutorial & doc: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonS3.html
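To make the AWS CLI step more concrete, here is a minimal sketch that builds two common `aws s3` calls from the EC2 instance: listing what is under a prefix and syncing it to a local folder. The bucket name, prefix, and local directory are hypothetical placeholders, and the instance is assumed to already have credentials or an IAM role with access to the bucket.

```python
import shlex
import subprocess


def s3_uri(bucket, key=""):
    """Return the s3:// URI for a bucket and optional key or prefix."""
    return f"s3://{bucket}/{key}"


def aws_s3_commands(bucket, prefix, local_dir):
    """Build 'aws s3' command lines to inspect and copy data under a prefix."""
    source = s3_uri(bucket, prefix)
    return [
        # List every object under the prefix with readable sizes and a total.
        ["aws", "s3", "ls", source, "--recursive", "--human-readable", "--summarize"],
        # Copy anything missing or changed from S3 into the local directory.
        ["aws", "s3", "sync", source, local_dir],
    ]


def run_commands(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

For example, `run_commands(aws_s3_commands("my-example-bucket", "raw/", "data"))` prints the two commands without running them, which is a convenient way to double-check the paths before moving tens of gigabytes.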

Getting the data set
- Set up the AWS CLI and link AWS S3 storage with an AWS EC2 instance for basic data manipulation
- Use the SRA Toolkit for data pre-processing
- Learn the basic operations of boto3, the AWS SDK for Python, for file manipulation on AWS

https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html

https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html

- Create the first snippet to load data with Jupyter Notebook
  - Will consider uploading the ipynb file to Google Colab for more fancy stuff
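The boto3 operations listed above could look roughly like the sketch below: list the objects under a prefix, then download one of them to local disk so a notebook can read it. This is not the notebook code from this project; the function names are invented for illustration, and any bucket or key passed in would be your own. The functions take the client as an argument so they can be tried with any S3 client object, and `boto3` is imported only inside `make_client`.

```python
from pathlib import Path


def make_client():
    """Create an S3 client using the default AWS credentials (needs boto3 installed)."""
    import boto3
    return boto3.client("s3")


def list_keys(s3_client, bucket, prefix=""):
    """Return (key, size_in_bytes) for every object under prefix."""
    # The paginator follows continuation tokens, so large listings are not cut off.
    paginator = s3_client.get_paginator("list_objects_v2")
    found = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty prefix have no "Contents" entry.
        for obj in page.get("Contents", []):
            found.append((obj["Key"], obj["Size"]))
    return found


def download(s3_client, bucket, key, dest_dir="."):
    """Download one object into dest_dir and return the local path."""
    dest = Path(dest_dir) / Path(key).name
    dest.parent.mkdir(parents=True, exist_ok=True)
    s3_client.download_file(bucket, key, str(dest))
    return dest
```

In a Jupyter cell, something like `s3 = make_client()` followed by `list_keys(s3, "my-example-bucket")` gives a quick overview of what is stored before deciding which file to pull down with `download`.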

File import example (SRA-database-header file)

Feb 16 - 26

If you run into disk space limit errors on EC2, see:

https://stackoverflow.com/a/54347393/6269153
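Separately from the fix in the link above, a quick free-space check before a large download or conversion can catch the problem early. The sketch below is only an assumption about how such a check might be written; the helper name and the 20% safety margin are arbitrary choices for illustration.

```python
import shutil


def has_room(path, needed_bytes, safety_factor=1.2):
    """Return True if the volume holding path has room for needed_bytes plus a margin."""
    # disk_usage reports total, used, and free bytes for the filesystem containing path.
    free = shutil.disk_usage(path).free
    return free >= needed_bytes * safety_factor
```

For instance, `has_room(".", 55 * 10**9)` asks whether the current volume could hold the 55 GB of raw data with some headroom; extracted FASTQ files are usually larger than the archives they come from, so the real requirement may be higher.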

Further data pre-processing and feature engineering

Mar 1 - 12

*Aside

Some extra fancy stuff that will be implemented if time allows

Tutorial on how to add Google Colab code to Google Sites

https://medium.com/@lzhou1110/how-to-embed-google-colaboratory-into-medium-in-3-steps-487b525b103c
