Progress & timeline

*The report is not strictly in chronological order; activities and progress are organized by task.

Jan 11 - 29

  • Topic selection

  • Paper / relevant material reading

  • Set up the template of the progress report website

  • Get familiar with NCBI GenBank

Project Data set webpage: https://www.ncbi.nlm.nih.gov/bioproject/678522

Doc: https://www.ncbi.nlm.nih.gov/genbank/

  • Learn about different data types in the NCBI GenBank

Reference: https://www.jianshu.com/p/4948292d8120

  • Get familiar with the SRA Toolkit's installation, configuration, and command-line usage (a short usage sketch follows the links below)

Doc: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc

Configuration: https://github.com/ncbi/sra-tools/wiki/05.-Toolkit-Configuration

Other useful resources: https://hpc.nih.gov/apps/sratoolkit.html
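
A short usage sketch, driving the two commands used most (prefetch and fasterq-dump) from Python with subprocess; the accession and output directory are placeholders, and the toolkit is assumed to be installed and on the PATH:

```python
# Minimal sketch of the SRA Toolkit commands for fetching runs, driven from
# Python via subprocess. Assumes sra-tools (prefetch, fasterq-dump) is
# installed and on the PATH; the accession and output directory are placeholders.
import subprocess

def fetch_run(accession: str, out_dir: str = "sra_data") -> None:
    """Download an SRA run and convert it to FASTQ files."""
    # prefetch downloads the .sra archive into out_dir/<accession>/
    subprocess.run(["prefetch", accession, "--output-directory", out_dir], check=True)
    # fasterq-dump converts the downloaded archive into FASTQ files
    subprocess.run(["fasterq-dump", f"{out_dir}/{accession}", "--outdir", out_dir], check=True)

fetch_run("SRR000001")  # placeholder accession, not one from this project
```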

  • Getting the data set

    • Because the data set is so large, a cloud service is needed for data storage and manipulation

      • NCBI provides both GCP and AWS options for direct data delivery; I chose the latter (AWS)

      • 55 GB of raw data in total

    • Learn how to transfer data from NCBI GenBank to a personal AWS S3 bucket, used as the main data storage

    • Get familiar with AWS S3 for data storage

        • S3 Browser, used as the data set viewer (a boto3-based check of the delivered data is sketched below)

S3 Browser view for the first data set
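
To double-check the delivered files (the same information S3 Browser shows), a boto3 sketch like the one below can list the objects and total their size; the bucket name is a placeholder and AWS credentials are assumed to be configured already:

```python
# Sketch: checking the delivered data with boto3, mirroring what S3 Browser
# shows. The bucket name and prefix are placeholders; credentials are assumed
# to be configured (e.g. via `aws configure`).
import boto3

def summarize_bucket(bucket: str, prefix: str = "") -> None:
    """List objects under a prefix and report the total size."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    n_objects, total_bytes = 0, 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            n_objects += 1
            total_bytes += obj["Size"]
            print(f"{obj['Key']}\t{obj['Size'] / 1e9:.2f} GB")
    print(f"{n_objects} objects, {total_bytes / 1e9:.1f} GB in total")

summarize_bucket("my-sra-project-bucket")  # placeholder bucket name
```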


Feb 1 - 12

  • Create an AWS EC2 instance and connect to it with PuTTY on Windows for command-line file manipulation

  • Study the basic use of the AWS CLI to work on S3 data from the EC2 instance (example commands are sketched below)

Main tutorial & doc: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonS3.html

  • Getting the data set

    • Set up the AWS CLI and link the AWS S3 storage with the AWS EC2 instance for basic data manipulation

    • Use the SRA Toolkit for data pre-processing

    • Learn the basic operations of boto3, the AWS SDK for Python, used here for file manipulation on S3 (a minimal sketch follows the links below)

https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html

https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html
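
Based on those two guides, a minimal download/upload sketch looks roughly like this (bucket name and object keys are placeholders):

```python
# Minimal boto3 sketch following the linked quickstart and uploading-files
# guides: download an object to the EC2 instance and upload a result back.
import boto3

s3 = boto3.client("s3")
bucket = "my-sra-project-bucket"  # placeholder bucket name

# download an object from S3 to the local disk
s3.download_file(bucket, "raw/SRR000001.sra", "/data/SRR000001.sra")

# upload a local file back into the bucket
s3.upload_file("/data/SRR000001.fastq", bucket, "fastq/SRR000001.fastq")
```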

    • Create the first snippet to load data in a Jupyter Notebook (a sketch of such a snippet follows the figure below)

      • Will consider uploading the .ipynb file to Google Colab for its additional features

File import example (SRA-database-header file)
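
A sketch of what the loading snippet can look like, assuming the header/metadata file is stored as a CSV in the personal bucket (bucket name and key are placeholders):

```python
# Sketch of the notebook loading snippet: read the SRA header/metadata table
# from S3 into a pandas DataFrame. The bucket name, key, and the assumption
# that the file is CSV-formatted are placeholders.
from io import BytesIO

import boto3
import pandas as pd

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-sra-project-bucket", Key="metadata/sra_run_table.csv")
df = pd.read_csv(BytesIO(obj["Body"].read()))
df.head()
```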


Feb 16 - 26

  • If you encounter disk space limit errors on EC2, see:

https://stackoverflow.com/a/54347393/6269153

  • Further data pre-processing and feature engineering


Mar 1 - 12


*Aside

Some additional features that will be implemented if time allows