Project modules
Apiospora Transcriptomics project (PID 1)
Mentor – Dr. Shashi Kant
Participant - Nalinikanta Choudhury
Machine IP. add. - sutripa@10.0.0.239
Background of project: Apiospora is a filamentous fungal species. Its genome was sequenced and assembled in our lab, and the assembly is near complete and fully annotated. Since the organism gives off a fruity smell, we wanted to investigate its secretome in the presence of cellulose in the media. Differential expression analysis was conducted under two conditions: cellulose-only media and YPD media.
Data provided:-
1. GFF file of the Apiospora species containing all the annotations.
2. SAM files of alignments of RNA seq reads with genome.
Objectives:-
1. To find the extent of overlap of assembled transcripts with the predicted gene models.
2. To curate transcripts not having any overlaps with predicted genes.
3. Understand the genes represented by transcripts after analysing their annotations
To do:
1. Assemble the transcripts using the SAM files from the two different conditions (Control and Treated).
2. Find the extent of overlap of the assembled transcripts with the genes.
3. Estimate the abundance of these transcripts.
4. Check both the genes that overlap transcripts and those that do not in JBrowse.
5. Study the annotation of the genes and formulate a hypothesis.
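The overlap step in the to-do list can be sketched as follows. This is a minimal sketch, assuming that both the predicted genes and the assembled transcripts are available as GFF/GTF-style tab-separated records; the function names and the feature types passed in are hypothetical, and the comparison ignores strand and exon structure.

```python
from collections import defaultdict


def parse_features(lines, feature_type):
    """Collect (seqid, start, end, strand, attributes) for one feature type."""
    feats = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature_type:
            continue
        feats.append((cols[0], int(cols[3]), int(cols[4]), cols[6], cols[8]))
    return feats


def overlap_fraction(transcripts, genes):
    """Map each transcript's attribute string to the largest fraction of its
    length covered by a single gene on the same sequence (0.0 if none)."""
    genes_by_seq = defaultdict(list)
    for seqid, start, end, _strand, _attr in genes:
        genes_by_seq[seqid].append((start, end))

    result = {}
    for seqid, t_start, t_end, _strand, attr in transcripts:
        t_len = t_end - t_start + 1  # GFF coordinates are 1-based, inclusive
        best = 0
        for g_start, g_end in genes_by_seq[seqid]:
            shared = min(t_end, g_end) - max(t_start, g_start) + 1
            best = max(best, shared)  # negative values mean no overlap
        result[attr] = best / t_len
    return result
```

Transcripts with a fraction of 0.0 are the candidates for manual curation in objective 2; any threshold for calling a transcript "overlapping" is a choice left to the analysis.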
Phytophthora effector Prediction (PID 2)
Effector proteins play a crucial role in the interaction between pathogens and their host plants, influencing the outcome of plant-pathogen interactions. This project aims to enhance our understanding of effector proteins in Phytophthora pathogens and contribute to the development of strategies for disease resistance in plants.
Mentor: Dr. Aditi Maulik
Participants: Shivangi Agrawal, Tanmoy Dey
Machine IP. add. - sutripa@10.0.0.239
Objective:
Build a Hidden Markov Model (HMM) of RXLRs to predict the motif in Phytophthora effectors.
Method:
Please follow the steps below to build an HMM of the RXLR motifs found in Phytophthora effectors.
1. All the effectors curated for 128 Phytophthora species will be provided.
2. Clustering results for these effectors, produced with CD-HIT, will be given.
3. Extract X clusters, each with a minimum of N sequences.
4. Perform a multiple sequence alignment (MSA) for each of the X clusters using any aligner.
5. Manually check for the presence and conservation of RXLR motifs in the alignment.
6. Build an HMM from each alignment; this generates X HMMs.
7. Merge all the models into a new, updated HMM.
8. Search for RXLRs in the effectors curated for the 128 Phytophthora species using the merged model.
9. Compare the number of RXLR motifs found with the number found using the PFAM HMM models.
10. Run steps 1-9 iteratively with different X and N until no new RXLRs are reported for the effectors of the 128 Phytophthora species.
11. You can start the iteration with N = 10 and X ~ 3000, then adjust the numbers to predict more motifs.
12. Make a Snakemake pipeline for the above process.
Expected outcome:
An improved HMM for RXLR prediction in Phytophthora effectors
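Step 3 of the method (selecting X clusters with at least N sequences) can be sketched as below. This is a minimal sketch, assuming the clustering result is a standard CD-HIT `.clstr` file; the function names are hypothetical, taking the largest clusters first is an assumption, and the `R.LR` regular expression is only a quick pre-screen to support the manual check in step 5, not a replacement for the HMM.

```python
import re

# In a .clstr member line the sequence name sits between '>' and '...'.
MEMBER_RE = re.compile(r">(.+?)\.\.\.")
# Arg - any residue - Leu - Arg, anywhere in the sequence (no position filter).
RXLR_RE = re.compile(r"R.LR")


def parse_clstr(lines):
    """Return {cluster name: [member sequence names]} from .clstr lines."""
    clusters = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            current = line[1:]
            clusters[current] = []
        elif line and current is not None:
            match = MEMBER_RE.search(line)
            if match:
                clusters[current].append(match.group(1))
    return clusters


def select_clusters(clusters, min_size, max_clusters):
    """Keep at most max_clusters (X) clusters with >= min_size (N) members,
    largest clusters first."""
    kept = [(name, members) for name, members in clusters.items()
            if len(members) >= min_size]
    kept.sort(key=lambda item: len(item[1]), reverse=True)
    return dict(kept[:max_clusters])


def has_rxlr(sequence):
    """True if the protein sequence contains an R-x-L-R stretch."""
    return RXLR_RE.search(sequence.upper()) is not None
```

With the suggested starting values, the call would be `select_clusters(clusters, min_size=10, max_clusters=3000)`; both numbers are then adjusted between iterations.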
Two-speed genome of all fungi (PID 3)
Mentor – Dr. Shashi Kant, Vaishnavi
Participants - Koushik Bardhan, Jigar Harishkumar Sheth
Machine IP. add. - sutripa@10.0.0.239
1. Download the near complete fungal genomes and their annotations from NCBI.
2. Choose the ones with annotations and high-quality genomes.
3. Categorize the genomes by lifestyle, e.g., parasites, saprophytes and endophytes.
4. Predict the effectors using the in-house pipeline.
5. Predict CAZymes using dbCAN3.
6. Calculate the FIRs (Flanking Intergenic Regions) of the genome, the CAZymes and the effectors.
7. Check the distribution of the Flanking Intergenic Regions of the genomes and of the predicted effectors and CAZymes.
8. Identify whether each genome is two-speed or one-speed from the overlapping regions.
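The FIR calculation in step 6 can be sketched as follows. This is a minimal sketch, assuming gene coordinates have already been read from the annotation as 1-based inclusive intervals; the function name and tuple layout are hypothetical. Nested genes are not handled specially, overlapping neighbours get an FIR of 0, and genes at the end of a contig get `None` on the open side.

```python
from collections import defaultdict


def flanking_intergenic_regions(genes):
    """genes: list of (gene_id, seqid, start, end, strand).
    Returns {gene_id: (five_prime_fir, three_prime_fir)} in base pairs."""
    by_seq = defaultdict(list)
    for gene in genes:
        by_seq[gene[1]].append(gene)

    firs = {}
    for seq_genes in by_seq.values():
        seq_genes.sort(key=lambda g: g[2])  # order genes by start coordinate
        for i, (gid, _seq, start, end, strand) in enumerate(seq_genes):
            left = None
            if i > 0:
                left = max(start - seq_genes[i - 1][3] - 1, 0)
            right = None
            if i < len(seq_genes) - 1:
                right = max(seq_genes[i + 1][2] - end - 1, 0)
            # On the minus strand the 5' side is the right-hand neighbour.
            firs[gid] = (left, right) if strand == "+" else (right, left)
    return firs


example = [
    ("g1", "chr1", 100, 200, "+"),
    ("g2", "chr1", 301, 400, "-"),
    ("g3", "chr1", 1001, 1100, "+"),
]
print(flanking_intergenic_regions(example))
```

Running the same function on the whole gene set and then filtering the result for effector or CAZyme identifiers gives the distributions compared in step 7.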
Text mining using Natural Language Processing Techniques (PID 4)
Participant - M NITHYA KRUTHI
Machine IP. add. - sutripa@10.0.0.239
Develop text mining algorithms based on NLP i.e., Natural Language Processing. NLP combines computational linguistics with statistical, machine learning, and deep learning models. It enables computers to process and understand human language in text or voice, facilitating tasks like language translation, response to spoken commands, and rapid text summarization, even in real-time.
Curate tree and plant genomes sequenced in India (PID 5)
Mentor: Dr. Sucheta Tripathy, Asharani Prusty
Participants: Konda Sameer,
The objective of this project is to systematically gather, organize, and analyse genomic data from tree and plant species that have been sequenced in India. The focus is on creating a detailed database of these genomes, showcasing important species, their genetic characteristics, and their significance in agriculture, ecology, and conservation.
Contamination detection in sequences (PID 6)
Machine IP. add. - sutripa@10.0.0.243
Mentor: Aditya Upadhyay
Participants: Shuvayan Dasgupta, Ayushman Kumar Banerjee
Description: This project applies machine learning techniques to read-level binning in order to identify and remove contamination from raw sequencing data. By training models on both contaminated and clean sequence reads, the aim is to accurately classify and separate unwanted sequences from target data at the read level. This method enhances the precision and efficiency of quality control, ensuring cleaner datasets for downstream genomic analyses such as metagenomics and transcriptomics.
Objective: Data Preparation and Baseline Model Setup
Dataset Collection and Labelling:
1. Collect and curate sequencing datasets containing both clean and contaminated reads.
2. You can use synthetic datasets, publicly available datasets (e.g., from NCBI SRA or ENA), or generate contaminated samples by mixing clean reads with known contaminants (e.g., bacterial sequences).
3. Label the reads based on whether they are contaminated or clean, creating a training set and test set for your model.
Feature Extraction:
1. Extract relevant features from the sequencing reads that can be fed into a machine learning model. This might include:
○ K-mer frequencies (commonly used in sequence classification).
○ GC content.
○ Read length and coverage depth.
2. Consider dimensionality reduction techniques like PCA if needed to reduce feature complexity.
Build a Baseline Machine Learning Model:
1. Develop a simple classifier as a proof-of-concept using basic machine learning algorithms like Random Forest or Support Vector Machine (SVM).
2. Train the model on a subset of the labelled reads and evaluate its performance using common metrics (e.g., accuracy, precision, recall).
3. Aim to establish a baseline performance metric that can be used to improve the model later.
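The per-read features and the evaluation metrics listed above can be sketched in plain Python as follows. This is a minimal sketch with hypothetical function names; it assumes DNA reads over the alphabet ACGT, skips k-mers containing other characters such as N, and leaves out coverage depth because that requires an alignment. The resulting vectors could then be passed to any classifier.

```python
from itertools import product


def gc_content(read):
    """Fraction of G and C bases in the read (0.0 for an empty read)."""
    read = read.upper()
    if not read:
        return 0.0
    return (read.count("G") + read.count("C")) / len(read)


def kmer_frequencies(read, k=3):
    """Normalized k-mer frequencies in a fixed order over all 4**k DNA k-mers."""
    read = read.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if kmer in counts:  # windows with non-ACGT characters are skipped
            counts[kmer] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


def read_features(read, k=3):
    """Feature vector: k-mer frequencies, then GC content, then read length."""
    return kmer_frequencies(read, k) + [gc_content(read), float(len(read))]


def precision_recall(true_labels, predicted_labels, positive=1):
    """Precision and recall for the class labelled `positive`."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

The feature vector grows as 4**k, which is one reason the dimensionality reduction mentioned above may become useful for larger k.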
Metagenomics (PID 7)
Mentor: Vaishnavi, Dr. Aditi Maulik
Participants: Niskarsh Vikram Singh
Objective: Assembly-based and/or read-based metagenomic data analysis to determine taxonomic and genome abundance using advanced metagenome tools.
Method:
Please follow the steps below for the metagenomic analysis.
1. A metagenomic read set will be provided.
2. Ensure that the raw sequencing reads are of high quality and remove low-quality sequences, adapters, and contaminants.
3. Identify the taxonomic composition of the microbial community in the metagenomic sample.
4. Assemble the metagenomic reads into contigs to study the microbial genome sequences.
5. Group assembled contigs into genome bins (if any) representing individual microbial genomes.
6. Predict genes from assembled contigs or bins and annotate them to understand their functional potential.
7. Gene Abundance Quantification: Quantify the abundance of predicted genes across metagenomic samples.
8. Functional Profiling: Determine the functional composition of the microbial community based on gene annotations.
9. Visualize the results of taxonomic and functional profiling for interpretation.
Expected outcome:
Taxonomic Profiles: Identification of microbial species in the community and their relative abundances.
Functional Profiles: Insights into the biological functions and metabolic pathways present in the community.
Gene Abundance: Quantification of genes or gene families and their functional roles.
Assembled Contigs and Bins: Draft genomes representing members of the microbial community.
Visualizations: Graphical representation of taxonomic and functional diversity.
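Two of the quantities behind these outcomes can be sketched in a few lines. This is a minimal sketch, assuming per-taxon read counts and contig lengths have already been extracted from whichever classification and assembly tools are used; the function names and input shapes are hypothetical.

```python
def relative_abundance(read_counts):
    """Convert {taxon: read count} into {taxon: fraction of all counted reads}."""
    total = sum(read_counts.values())
    if total == 0:
        return {taxon: 0.0 for taxon in read_counts}
    return {taxon: count / total for taxon, count in read_counts.items()}


def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of the
    total assembled bases (0 for an empty assembly)."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0
```

Relative abundances computed this way are read-based; abundance estimates for genome bins would instead normally account for genome length and coverage.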