Progress & timeline
*The report is not strictly in chronological order; activities and progress are organized by task.
Jan 11 - 29
Topic selection
Paper / relevant material reading
Set up the template of the progress report website
Get familiar with NCBI GenBank
Project Data set webpage: https://www.ncbi.nlm.nih.gov/bioproject/678522
Doc: https://www.ncbi.nlm.nih.gov/genbank/
Learn about different data types in the NCBI GenBank
Reference: https://www.jianshu.com/p/4948292d8120
Get familiar with the installation, configuration, and command line usage of the SRA toolkit
Doc: https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc
Configuration: https://github.com/ncbi/sra-tools/wiki/05.-Toolkit-Configuration
Other useful resources: https://hpc.nih.gov/apps/sratoolkit.html
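As a minimal sketch of the command line usage above, the two SRA toolkit commands I expect to use most, `prefetch` and `fasterq-dump`, can be driven from Python. The accession `SRR0000000` and the output directory `fastq` are placeholders, not runs from this project, and the helper names are made up for illustration.

```python
import shutil
import subprocess


def prefetch_cmd(accession):
    """Build the command that downloads one run into the local SRA cache."""
    return ["prefetch", accession]


def fasterq_dump_cmd(accession, outdir, threads=4):
    """Build the command that converts a downloaded run to FASTQ files."""
    return ["fasterq-dump", accession, "--outdir", outdir, "--threads", str(threads)]


def run_if_available(cmd):
    """Run cmd only if its executable is on PATH; return True if it ran."""
    if shutil.which(cmd[0]) is None:
        print("not installed:", cmd[0])
        return False
    subprocess.run(cmd, check=True)
    return True


if __name__ == "__main__":
    accession = "SRR0000000"  # placeholder accession, replace with a real run
    # Only print the commands here; pass them to run_if_available to execute.
    for cmd in (prefetch_cmd(accession), fasterq_dump_cmd(accession, "fastq")):
        print(" ".join(cmd))
```

Keeping the command building separate from the execution makes it easy to check what will be run before any large download starts.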
Getting the data set
Because the data set is so large, a cloud service is needed for data storage and manipulation
NCBI provides GCP and AWS options for direct data delivery. I chose AWS.
55 GB of raw data in total
Learn how to transfer data from NCBI GenBank to a personal AWS S3 bucket, used as the main data storage
Get familiar with AWS S3 for data storage
S3 Browser - as data set viewer
[Image: S3 Browser view for the first data set]
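The same overview that S3 Browser gives can be approximated in code, which is handy for checking that the delivered data adds up to the expected size. This is only a sketch: it assumes boto3 is installed and AWS credentials are configured, and the function names are my own.

```python
def iter_objects(bucket, prefix=""):
    """Yield (key, size_in_bytes) for every object under prefix in bucket."""
    import boto3  # imported lazily so the helpers below work without boto3

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches the prefix.
        for obj in page.get("Contents", []):
            yield obj["Key"], obj["Size"]


def total_gib(objects):
    """Sum the sizes of (key, size) pairs and convert bytes to GiB."""
    return sum(size for _, size in objects) / 1024 ** 3


def largest(objects, n=5):
    """Return the n largest (key, size) pairs, biggest first."""
    return sorted(objects, key=lambda item: item[1], reverse=True)[:n]
```

For example, `total_gib(iter_objects("example-bucket"))` would report the total size of a bucket, where `example-bucket` is a placeholder name.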
Feb 1 - 12
Create an AWS EC2 instance and connect to it with PuTTY on Windows for command line file manipulation
Study the basic use of the AWS CLI to work on S3 data from the EC2 instance
Main tutorial & doc: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonS3.html
Getting the data set
Set up the AWS CLI and link the AWS S3 storage with an AWS EC2 instance for basic data manipulation
Use the SRA toolkit for data pre-processing
Learn the basic operations of boto3, the AWS SDK for Python, for file manipulation on AWS
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html
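A minimal sketch of the upload and download calls covered in the guides above. The helper names, bucket name, and prefix are placeholders of my own. The S3 client is passed in as an argument, so the helpers do not create it themselves.

```python
import os


def s3_key_for(local_path, prefix):
    """Map a local file path to an object key under the given prefix."""
    name = os.path.basename(local_path)
    prefix = prefix.strip("/")
    return prefix + "/" + name if prefix else name


def upload(client, local_path, bucket, prefix):
    """Upload one file and return the key it was stored under."""
    key = s3_key_for(local_path, prefix)
    client.upload_file(local_path, bucket, key)
    return key


def download(client, bucket, key, dest_dir):
    """Download one object into dest_dir and return the local path."""
    os.makedirs(dest_dir, exist_ok=True)
    local_path = os.path.join(dest_dir, os.path.basename(key))
    client.download_file(bucket, key, local_path)
    return local_path
```

With boto3 installed and credentials configured, the client would be created with `boto3.client("s3")` and used as `upload(client, "data/run1.fastq", "example-bucket", "raw")`.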
Create the first snippet to load data with Jupyter Notebook
Will consider uploading the ipynb file to Google Colab for more fancy stuff
[Image: File import example (SRA-database-header file)]
Feb 16 - 26
If you run into disk space limit errors on EC2, see:
https://stackoverflow.com/a/54347393/6269153
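To catch the space problem before a download fails halfway, free disk space can be checked up front with the standard library. This is a rough sketch: the 1.2 safety margin is an arbitrary assumption, and the function names are my own.

```python
import shutil


def free_gib(path="/"):
    """Return free space in GiB on the filesystem that contains path."""
    return shutil.disk_usage(path).free / 1024 ** 3


def has_room_for(required_gib, path="/", margin=1.2):
    """True if free space covers required_gib plus a safety margin."""
    return free_gib(path) >= required_gib * margin


if __name__ == "__main__":
    # 55 GB is the raw data size noted above; unpacked files may need more.
    print("free GiB:", round(free_gib(), 1))
    print("room for raw data:", has_room_for(55))
```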
Further data pre-processing and feature engineering
Mar 1 - 12
*Aside
Some extra fancy stuff that will be implemented if time allows
Tutorial on how to add Google Colab code to Google Sites
https://medium.com/@lzhou1110/how-to-embed-google-colaboratory-into-medium-in-3-steps-487b525b103c