Prediction of Novel Autism Risk Genes by Genomic Data Mining
Sam Buckley
Authors: Sam Buckley, Krishna Patel, Matt Mytych, Dr. Wang, Anqi Wei, and Jun Wang
Faculty Mentor: Dr. Liangjiang Wang
College: College of Science
ABSTRACT
Autism Spectrum Disorders (ASD) refer to a group of neurodevelopmental disorders characterized by cognitive and behavioral delays. Many of the underlying causes of ASD lie at the molecular level and involve both protein-coding and non-coding genes. Long non-coding RNAs (lncRNAs) are a class of non-coding RNAs that lack protein-coding capacity but have been linked to ASD. Traditional methods for identifying and validating ASD risk genes are time-consuming and costly, which motivates a machine learning approach. In this study, we built machine learning models to predict and prioritize candidate lncRNAs associated with ASD. Three different models were trained using brain gene expression data collected from BrainSpan. The performance of the Support Vector Machine (SVM) model was compared with that of the other classifiers, Logistic Regression (LR) and Random Forest (RF). From all three models, 564 lncRNAs and 6,093 protein-coding genes were predicted to be high-confidence ASD risk candidate genes. Developing a model to predict and prioritize autism-associated genes brings us one step closer to understanding the pathogenesis of ASD and, potentially, to finding treatments.
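The abstract does not state how predictions from the three models were combined into a single high-confidence list. As a minimal sketch of one plausible rule, the Python example below keeps only genes that every model scores at or above a cutoff, then ranks them by mean score. The gene names, scores, the 0.5 threshold, and the function `consensus_candidates` are all hypothetical and are not taken from the study.

```python
from typing import Dict, List

# Hypothetical per-model scores: estimated probability that a gene is
# ASD-associated. Gene IDs and values are invented for illustration only.
scores: Dict[str, Dict[str, float]] = {
    "SVM": {"geneA": 0.91, "geneB": 0.40, "geneC": 0.75, "geneD": 0.88},
    "LR":  {"geneA": 0.85, "geneB": 0.62, "geneC": 0.30, "geneD": 0.79},
    "RF":  {"geneA": 0.95, "geneB": 0.55, "geneC": 0.81, "geneD": 0.66},
}


def consensus_candidates(
    model_scores: Dict[str, Dict[str, float]], threshold: float = 0.5
) -> List[str]:
    """Return genes scored >= threshold by every model, ranked by mean score.

    This intersection rule is an assumption, not the authors' stated method.
    """
    models = list(model_scores.values())
    if not models:
        return []

    # Only genes scored by all models can be compared across them.
    shared = set(models[0])
    for model in models[1:]:
        shared &= set(model)

    # Keep a gene only if no model scores it below the cutoff.
    kept = [g for g in shared if all(m[g] >= threshold for m in models)]

    def mean_score(gene: str) -> float:
        return sum(m[gene] for m in models) / len(models)

    # Highest mean score first; gene name breaks ties deterministically.
    return sorted(kept, key=lambda g: (-mean_score(g), g))


candidates = consensus_candidates(scores, threshold=0.5)
print(candidates)
```

Requiring agreement across all models trades recall for precision: a gene missed by any single classifier is dropped, so the resulting list is shorter but each entry is supported by every model.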
Video Introduction
Sam Buckley 2020 Undergraduate Research Symposium