Prediction of Novel Autism Risk Genes by Genomic Data Mining

Sam Buckley

Authors: Sam Buckley, Krishna Patel, Matt Mytych, Dr. Wang, Anqi Wei, and Jun Wang

Faculty Mentor: Dr. Liangjiang Wang

College: College of Science

ABSTRACT

Autism spectrum disorders (ASD) are a group of neurodevelopmental disorders characterized by cognitive and behavioral delays. Many of the underlying causes of ASD lie at the molecular level and involve both protein-coding and non-coding genes. Long non-coding RNAs (lncRNAs) are non-coding RNAs that have no protein-coding capacity but have been linked to ASD. Traditional methods for identifying and validating ASD risk genes are time-consuming and costly, so a machine learning approach is needed to prioritize candidates. In this study, we built machine learning models to predict and prioritize candidate lncRNAs associated with ASD. Three different models were trained using brain gene expression data collected from BrainSpan. The performance of the Support Vector Machine (SVM) model was compared with that of two other classifiers, Logistic Regression (LR) and Random Forest (RF). From all three models, 564 lncRNAs and 6,093 protein-coding genes were predicted to be high-confidence ASD risk candidate genes. Developing a model to predict and prioritize autism-associated genes brings us one step closer to understanding the pathogenesis of ASD and, potentially, to finding avenues for treatment.
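The abstract does not say how predictions from the three models were combined into a single high-confidence list. One plausible reading, offered here only as an illustrative sketch, is a consensus rule: a gene counts as high-confidence when every model scores it at or above a cutoff. The function name `high_confidence_candidates`, the 0.5 threshold, the gene names, and the scores below are all hypothetical and are not taken from the study.

```python
from typing import Dict, Set


def high_confidence_candidates(
    scores_by_model: Dict[str, Dict[str, float]],
    threshold: float = 0.5,  # assumed cutoff, not from the study
) -> Set[str]:
    """Return the genes that every model scores at or above `threshold`."""
    # Collect, per model, the set of genes that pass the cutoff.
    per_model_hits = [
        {gene for gene, score in scores.items() if score >= threshold}
        for scores in scores_by_model.values()
    ]
    if not per_model_hits:
        return set()
    # Consensus rule: keep only genes flagged by all models.
    return set.intersection(*per_model_hits)


# Hypothetical prediction scores; gene names are placeholders.
scores_by_model = {
    "SVM": {"gene_A": 0.91, "gene_B": 0.40, "gene_C": 0.77},
    "LR": {"gene_A": 0.85, "gene_B": 0.62, "gene_C": 0.71},
    "RF": {"gene_A": 0.88, "gene_B": 0.55, "gene_C": 0.30},
}

candidates = high_confidence_candidates(scores_by_model)
print(sorted(candidates))  # ['gene_A']
```

A strict intersection favors precision over recall; a majority vote or averaged score would be an equally reasonable alternative if the aim were a longer candidate list.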

Video Introduction

Sam Buckley 2020 Undergraduate Research Symposium