Project modules

Mentor – Dr. Shashi Kant

Participant - Nalinikanta Choudhury

Machine IP. add. - sutripa@10.0.0.239

Background of project: Apiospora is a filamentous fungus. Its genome was sequenced and assembled in our lab; the assembly is near complete and fully annotated. Because the organism gives off a fruity smell, we wanted to investigate its secretome when cellulose is present in the medium. Differential expression analysis was conducted under two conditions: cellulose-only medium and YPD medium.

Data provided:-

1. GFF file of the Apiospora species containing all the annotations.

2. SAM files of RNA-seq read alignments to the genome.

Objectives:-

1. To find the extent of overlap of assembled transcripts with the predicted gene models.

2. To curate transcripts not having any overlaps with predicted genes.

3. To understand the genes represented by the transcripts after analysing their annotations.

To do:

1. Assemble the transcripts using the SAM files from the two different conditions (Control and Treated).

2. Find the extent of overlap of the assembled transcripts with the genes.

3. Find the abundance of these transcripts.

4. For genes that overlap transcripts and for those that do not, check the data in JBrowse.

5. Study the annotation of the genes and formulate a hypothesis.
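The overlap step (To do, item 2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the assembled transcripts are taken to be available as GTF/GFF-style lines with a "transcript" feature (as written by assemblers such as StringTie), the predicted gene models as GFF lines with a "gene" feature, strand is ignored, and all coordinates, IDs, and function names below are hypothetical. Transcripts whose best overlap fraction is 0 would be the candidates for curation under Objective 2.

```python
def parse_features(lines, feature_type):
    """Return {seqid: [(start, end, attributes), ...]} for one feature type.

    Expects tab-separated GFF/GTF lines with the usual nine columns.
    """
    features = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature_type:
            continue
        features.setdefault(cols[0], []).append((int(cols[3]), int(cols[4]), cols[8]))
    return features


def best_overlap_fraction(transcripts, genes):
    """For each transcript, report the largest fraction of its span covered by a single gene.

    Strand is ignored here; that is a simplification, not a recommendation.
    """
    results = []
    for seqid, spans in transcripts.items():
        for t_start, t_end, t_attr in spans:
            t_len = t_end - t_start + 1
            best = 0
            for g_start, g_end, _ in genes.get(seqid, []):
                # Overlap of two closed intervals; negative means no overlap.
                overlap = min(t_end, g_end) - max(t_start, g_start) + 1
                best = max(best, overlap)
            results.append((seqid, t_start, t_end, t_attr, best / t_len))
    return results


# Tiny made-up example (coordinates and IDs are hypothetical).
gff_lines = [
    "contig1\tpred\tgene\t100\t500\t.\t+\t.\tID=gene1",
]
gtf_lines = [
    'contig1\tasm\ttranscript\t300\t700\t.\t+\t.\ttranscript_id "T1";',
    'contig1\tasm\ttranscript\t1000\t1200\t.\t+\t.\ttranscript_id "T2";',
]

genes = parse_features(gff_lines, "gene")
transcripts = parse_features(gtf_lines, "transcript")
overlaps = best_overlap_fraction(transcripts, genes)

for seqid, start, end, attrs, fraction in overlaps:
    print(seqid, start, end, attrs, round(fraction, 3))
```

For a full genome, the nested loop would be slow; sorting the genes per contig or using an interval-aware tool would be the natural next step.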

 

Effector proteins play a crucial role in the interaction between pathogens and their host plants, influencing the outcome of plant-pathogen interactions. This project aims to enhance our understanding of effector proteins in Phytophthora pathogens and contribute to the development of strategies for disease resistance in plants.

Mentor: Dr. Aditi Maulik

Participants: Shivangi Agrawal, Tanmoy Dey

Machine IP. add. - sutripa@10.0.0.239

Objective:

Building a Hidden Markov Model (HMM) of RXLRs to predict the motif in Phytophthora effectors.

Method:

 

Please follow the steps below to build an HMM of the RXLR motifs found in Phytophthora effectors.

1. All the effectors curated for 128 Phytophthora species will be provided.

2. The clustering results for these effectors, produced with CD-HIT, will be given.

3. Extract X clusters, each containing at least N sequences.

4. Perform an MSA for each of the X clusters using any aligner.

5. Manually check for the presence and conservation of RXLR motifs in the alignment.

6. Build an HMM for each alignment. This will generate X HMMs.

7. Merge all the models to make a new updated HMM.

8. Search for RXLRs in the effectors curated for the 128 Phytophthora species using the merged model.

9. Compare the number of RXLR motifs found with the number obtained using the PFAM HMM models.

10. Run steps 1-9 iteratively using different X and N until no new RXLRs are reported for the effectors of the 128 Phytophthora species.

11. You can start the iteration with N = 10 and X ~ 3000, and then adjust the numbers to predict more motifs.

12. Make a Snakemake pipeline for the above process.
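Step 3 (extracting X clusters with at least N sequences) could be sketched as follows. This is a minimal illustration, assuming the clustering result is the plain-text .clstr file that CD-HIT writes, where each cluster starts with a ">Cluster <number>" line and each member line carries the sequence name between ">" and "...". The function names, the demo sequence IDs, and the choice to keep the largest clusters first are assumptions made for the example, not part of the project specification.

```python
import re

# Member lines look like: "0<TAB>120aa, >eff_001... *"
MEMBER_RE = re.compile(r">(.+?)\.\.\.")


def parse_clstr(lines):
    """Return {cluster_name: [sequence_id, ...]} from CD-HIT .clstr lines."""
    clusters = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            current = line[1:]  # e.g. "Cluster 0"
            clusters[current] = []
        elif line and current is not None:
            match = MEMBER_RE.search(line)
            if match:
                clusters[current].append(match.group(1))
    return clusters


def select_clusters(clusters, min_size, max_clusters):
    """Keep clusters with at least min_size (N) members, largest first, up to max_clusters (X)."""
    kept = [(name, members) for name, members in clusters.items() if len(members) >= min_size]
    kept.sort(key=lambda item: len(item[1]), reverse=True)
    return kept[:max_clusters]


# Made-up .clstr content; the sequence IDs are hypothetical.
clstr_lines = [
    ">Cluster 0",
    "0\t120aa, >eff_001... *",
    "1\t118aa, >eff_002... at 95.00%",
    "2\t119aa, >eff_003... at 92.50%",
    ">Cluster 1",
    "0\t90aa, >eff_004... *",
    ">Cluster 2",
    "0\t150aa, >eff_005... *",
    "1\t149aa, >eff_006... at 97.10%",
]

clusters = parse_clstr(clstr_lines)
selected = select_clusters(clusters, min_size=2, max_clusters=10)

for name, members in selected:
    print(name, len(members), members)
```

Each selected cluster's member IDs can then be used to pull the sequences from the effector FASTA file before alignment. Note that CD-HIT truncates long names in the .clstr file, so the IDs recovered here are only reliable if they fit within the description length used for clustering.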

Expected outcome:

An improved HMM for RXLR prediction in Phytophthora effectors.
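As a quick sanity check alongside the HMM searches, the literal motif (Arg, any residue, Leu, Arg) can be located with a regular expression. This is a different and much cruder technique than an HMM: it ignores sequence context and position within the protein, so it is only a rough baseline for spotting obvious hits or misses, not a replacement for the model. The function name and the example peptides are hypothetical.

```python
import re

# Lookahead so that overlapping occurrences (e.g. "RALRALR") are all reported.
RXLR_RE = re.compile(r"(?=R.LR)")


def find_rxlr(sequence):
    """Return 1-based start positions of every R-x-L-R occurrence in a protein sequence."""
    return [match.start() + 1 for match in RXLR_RE.finditer(sequence.upper())]


# Made-up peptides for illustration only.
examples = {
    "pep1": "MKTLAVRSLRDEERAAA",
    "pep2": "RALRALR",
    "pep3": "MKTAAAGGG",
}

hits = {name: find_rxlr(seq) for name, seq in examples.items()}
for name, positions in hits.items():
    print(name, positions)
```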

 

Mentor – Dr. Shashi Kant, Vaishnavi

Participant - Koushik Bardhan, Jigar Harishkumar Sheth

Machine IP. add. - sutripa@10.0.0.239

1. Download the near-complete fungal genomes and their annotations from NCBI.

2. Choose those that have annotations and high-quality assemblies.

3. Categorize the genomes by lifestyle, e.g., parasites, saprophytes, and endophytes.

4. Predict the effectors using the in-house pipeline.

5. Predict CAZymes using dbCAN3.

6. Calculate the FIRs (Flanking Intergenic Regions) for all genes in the genome, for the CAZymes, and for the effectors.

7. Check the distribution of the Flanking Intergenic Regions of the genomes and of the predicted effectors and CAZymes.

8. Identify whether each genome is two-speed or one-speed from the overlapping regions.
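The FIR calculation in step 6 can be illustrated with a short Python sketch. It assumes gene coordinates have already been extracted from the annotation as (seqid, start, end, strand, gene_id) tuples, defines the 5' and 3' FIR as the distance to the neighbouring gene on the upstream and downstream side with respect to the gene's own strand, and leaves the FIR undefined (None) for genes at the end of a contig. Overlapping neighbours are clamped to 0 and nested genes are not handled specially. All names and coordinates are hypothetical.

```python
def flanking_intergenic_regions(genes):
    """Return {gene_id: (five_prime_fir, three_prime_fir)}.

    genes: iterable of (seqid, start, end, strand, gene_id), 1-based inclusive coordinates.
    A value of None means there is no neighbouring gene on that side of the contig.
    """
    by_seq = {}
    for gene in genes:
        by_seq.setdefault(gene[0], []).append(gene)

    firs = {}
    for seq_genes in by_seq.values():
        seq_genes.sort(key=lambda g: g[1])  # order genes by start coordinate
        for i, (_, start, end, strand, gene_id) in enumerate(seq_genes):
            left = None
            right = None
            if i > 0:
                left = max(start - seq_genes[i - 1][2] - 1, 0)
            if i + 1 < len(seq_genes):
                right = max(seq_genes[i + 1][1] - end - 1, 0)
            # On the minus strand the 5' side is the right-hand neighbour.
            firs[gene_id] = (left, right) if strand == "+" else (right, left)
    return firs


# Made-up gene models on one contig.
demo_genes = [
    ("contig1", 100, 200, "+", "g1"),
    ("contig1", 300, 400, "-", "g2"),
    ("contig1", 1000, 1500, "+", "g3"),
]

firs = flanking_intergenic_regions(demo_genes)
for gene_id, (five_prime, three_prime) in firs.items():
    print(gene_id, five_prime, three_prime)
```

Running the same function on all genes, and then looking up only the effector or CAZyme IDs in the result, gives the three sets of distances whose distributions are compared in step 7.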

 

Participant - M NITHYA KRUTHI

Machine IP. add. - sutripa@10.0.0.239

Develop text-mining algorithms based on Natural Language Processing (NLP). NLP combines computational linguistics with statistical, machine learning, and deep learning models. It enables computers to process and understand human language in text or voice, facilitating tasks like language translation, response to spoken commands, and rapid text summarization, even in real time.


Mentor: Dr. Sucheta Tripathy, Asharani Prusty

Participants: Konda Sameer,

The objective of this project is to systematically gather, organize, and analyse genomic data from tree and plant species that have been sequenced in India. The focus is on creating a detailed database of these genomes, showcasing important species, their genetic characteristics, and their significance in agriculture, ecology, and conservation.

 

Machine IP. add. - sutripa@10.0.0.243

Mentor: Aditya Upadhyay

Participants: Shuvayan Dasgupta, Ayushman Kumar Banerjee

Description: This project focuses on applying machine learning techniques for read-level binning to identify and remove contamination from raw sequencing data. By training models on both contaminated and clean sequence reads, the aim is to accurately classify and separate unwanted sequences from target data at the read level. This method enhances the precision and efficiency of quality control, ensuring cleaner datasets for downstream genomic analyses like metagenomics and transcriptomics.

Objective: Data Preparation and Baseline Model Setup

Dataset Collection and Labelling:

1. Collect and curate sequencing datasets containing both clean and contaminated reads.

2. You can use synthetic datasets, publicly available datasets (e.g., from NCBI SRA or ENA), or generate contaminated samples by mixing clean reads with known contaminants (e.g., bacterial sequences).

3. Label the reads based on whether they are contaminated or clean, creating a training set and test set for your model.
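The mixing and labelling described above could look like the sketch below. It assumes the clean and contaminant reads have already been loaded as plain sequence strings, uses label 0 for clean and 1 for contaminated (an arbitrary convention chosen for this example), and performs a simple shuffled split that is not stratified by class. The function name, the reads, and the 20% test fraction are all hypothetical.

```python
import random


def build_labelled_sets(clean_reads, contaminant_reads, test_fraction=0.2, seed=0):
    """Mix clean and contaminant reads, label them, and split into (train, test).

    Each element is a (read, label) pair with label 0 = clean, 1 = contaminated.
    """
    labelled = [(read, 0) for read in clean_reads]
    labelled += [(read, 1) for read in contaminant_reads]

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(labelled)

    n_test = int(len(labelled) * test_fraction)
    return labelled[n_test:], labelled[:n_test]


# Made-up reads for illustration only.
clean = ["ACGTACGTAC", "TTGACCGTAA", "GGCATCGATC", "ATATCGCGTA",
         "CCGTAGCTAG", "TGCATGCAAT", "GATCGATCGA", "ACCGGTTAAC"]
contaminants = ["GGGGCCCCGG", "CCGGGGCCCG"]

train_set, test_set = build_labelled_sets(clean, contaminants)
print(len(train_set), "training reads,", len(test_set), "test reads")
```

With strongly imbalanced classes, a stratified split (keeping the clean-to-contaminated ratio the same in both sets) is usually preferable to the plain shuffle used here.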

Feature Extraction:

1. Extract relevant features from the sequencing reads that can be fed into a machine learning model. This might include:

○ K-mer frequencies (commonly used in sequence classification).

○ GC content.

○ Read length and coverage depth.

2. Consider dimensionality reduction techniques like PCA if needed to reduce feature complexity.

Build a Baseline Machine Learning Model:

1. Develop a simple classifier as a proof-of-concept using basic machine learning algorithms like Random Forest or Support Vector Machine (SVM).

2. Train the model on a subset of the labelled reads and evaluate its performance using common metrics (e.g., accuracy, precision, recall).

3. Aim to establish a baseline performance metric that can be used to improve the model later.
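The feature extraction and evaluation steps above can be sketched with the standard library alone. The example below computes normalized k-mer frequencies and GC content for a read, and computes accuracy, precision, and recall from true and predicted labels. It does not train a classifier; the resulting feature vectors are what a Random Forest or SVM implementation (for example, from scikit-learn) would take as input. The function names, k = 3 default, label convention (1 = contaminated), and demo values are assumptions for illustration.

```python
from itertools import product


def kmer_frequencies(read, k=3):
    """Return normalized frequencies of all 4**k DNA k-mers, in a fixed lexicographic order.

    K-mers containing characters other than A, C, G, T (e.g. N) are skipped.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    read = read.upper()
    total = 0
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


def gc_content(read):
    """Fraction of G and C bases in the read."""
    read = read.upper()
    return (read.count("G") + read.count("C")) / len(read) if read else 0.0


def read_features(read, k=3):
    """Feature vector for one read: k-mer frequencies followed by GC content and length."""
    return kmer_frequencies(read, k) + [gc_content(read), float(len(read))]


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, and recall for a binary labelling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up inputs for illustration.
dimer_freqs = kmer_frequencies("ACGTACGT", k=2)
features = read_features("ACGTACGT", k=2)
metrics = classification_metrics([1, 1, 0, 0], [1, 0, 0, 1])
print(len(features), "features per read")
print(metrics)
```

Coverage depth is not included because it cannot be derived from a single read in isolation; it would have to come from mapping or k-mer counting across the whole dataset.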

 

Mentor: Vaishnavi, Dr. Aditi Maulik

Participants: Niskarsh Vikram Singh

Objective: Assembly-based and/or read-based metagenomic data analysis to determine taxonomic and genome abundance using advanced metagenomics tools.

Method:

Please follow the steps below for the metagenomic analysis.

 

1. A metagenomic read set will be provided.

2. Ensure that the raw sequencing reads are of high quality and remove low-quality sequences, adapters, and contaminants.

3. Identify the taxonomic composition of the microbial community in the metagenomic sample.

4. Assemble the metagenomic reads into contigs to study the microbial genome sequences.

5. Group assembled contigs into genome bins (if any) representing individual microbial genomes.

6. Predict genes from assembled contigs or bins and annotate them to understand their functional potential.

7. Gene Abundance Quantification: Quantify the abundance of predicted genes across metagenomic samples.

8. Functional Profiling: Determine the functional composition of the microbial community based on gene annotations.

9. Visualize the results of taxonomic and functional profiling for interpretation.
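The two abundance calculations mentioned in the steps above (taxonomic relative abundance and gene abundance quantification) reduce to simple arithmetic once counts are available. The sketch below assumes read counts per taxon and per gene have already been produced by whichever classifier and read mapper are chosen, and uses a TPM-style length normalization for genes, which is one common option rather than a prescribed method. The taxon names, gene names, counts, and lengths are invented for the example.

```python
def relative_abundance(counts):
    """Convert raw counts per taxon into fractions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: value / total for name, value in counts.items()}


def tpm(read_counts, gene_lengths):
    """TPM-style normalization: reads per base, rescaled so all genes sum to one million."""
    rates = {gene: read_counts[gene] / gene_lengths[gene] for gene in read_counts}
    scale = sum(rates.values())
    if scale == 0:
        return {gene: 0.0 for gene in rates}
    return {gene: rate / scale * 1e6 for gene, rate in rates.items()}


# Made-up counts for illustration only.
taxon_counts = {"Bacillus": 60, "Pseudomonas": 30, "unclassified": 10}
gene_counts = {"geneA": 100, "geneB": 100}
gene_lengths = {"geneA": 1000, "geneB": 2000}

taxon_fractions = relative_abundance(taxon_counts)
gene_tpm = tpm(gene_counts, gene_lengths)

for name, fraction in taxon_fractions.items():
    print(name, round(fraction, 3))
for gene, value in gene_tpm.items():
    print(gene, round(value, 1))
```

In the example, geneA and geneB receive the same number of reads, but geneA is half as long, so its normalized abundance is twice that of geneB.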

Expected outcome:

    Taxonomic Profiles: Identification of microbial species in the community and their relative abundances.

    Functional Profiles: Insights into the biological functions and metabolic pathways present in the community.

    Gene Abundance: Quantification of genes or gene families and their functional roles.

    Assembled Contigs and Bins: Draft genomes representing members of the microbial community.

    Visualizations: Graphical representation of taxonomic and functional diversity.