Colorectal cancer (CRC) is one of the most common cancers worldwide, with Western countries being the most affected. Among Hispanics, CRC accounted for 12% and 8% of all estimated new cancer cases in men and women, respectively, in 2018. Human CRC has often been attributed to various environmental and lifestyle factors. The association between oxidative stress and CRC has become an active area of study during the last decade, with the identification of numerous genetic and lifestyle factors that can affect an individual’s ability to respond to oxidative stress. The overall goals of this project are to a) link oxidative stress and survival pathway genes to CRC; b) identify whether these genes are differentially expressed in Hispanics compared with non-Hispanic whites (NHWs); and c) provide essential information for accurate interpretation of future research on oxidative stress and CRC risk in Hispanics. The CRC datasets are collected from GEO and TCGA. Differential expression analysis is performed using LIMMA in R, and the resulting genes are compared with the genes selected by machine learning approaches. The biological pathways are analyzed using various bioinformatics tools.
Programming/bioinformatics tools: R (Bioconductor, ggplot2), DAVID, KEGG, Cytoscape, and Ingenuity Pathway Analysis.
A combination of resampling-based least absolute shrinkage and selection operator (LASSO) feature selection (RLFS) and an ensemble of regularized regression models (ERRM), capable of dealing with data that have high correlation structures, was proposed. The ERRM boosts the prediction accuracy using the top-ranked features obtained from RLFS. The RLFS uses the LASSO penalty with the sure independence screening (SIS) condition to select the top k ranked features. The ERRM includes five individual penalty-based classifiers: LASSO, adaptive LASSO (ALASSO), elastic net (ENET), smoothly clipped absolute deviation (SCAD), and minimax concave penalty (MCP). It is built on the ideas of bagging and rank aggregation. We performed simulation studies, applied the method to smokers’ lung cancer gene expression data, and showed that the proposed combination of ERRM with RLFS achieved superior accuracy and geometric mean. The article can be found here.
Programming: R (doParallel, adabag, tidyverse, RankAggreg, Biocomb, praznik, randomForest, glmnet, ncvreg, e1071, caret, ggplot2)
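The two evaluation metrics named for the RLFS and ERRM project above, accuracy and geometric mean, can be illustrated with a short sketch. It is written in Python rather than the project's R, uses made-up labels, and assumes the common definition of the geometric mean as the square root of sensitivity times specificity; the article may define it differently.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def accuracy_and_gmean(y_true, y_pred, positive=1):
    """Return (accuracy, geometric mean of sensitivity and specificity)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return accuracy, math.sqrt(sensitivity * specificity)

# Hypothetical labels: 4 positive samples and 6 negative samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
accuracy, gmean = accuracy_and_gmean(y_true, y_pred)
print(f"accuracy={accuracy:.3f}, g-mean={gmean:.3f}")
```

Unlike accuracy, the geometric mean drops toward zero when either class is predicted poorly, which is why it is often reported alongside accuracy for imbalanced data.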
The performance of classification methods relies mainly on the selection of significant genes. Sparse regularized regression (SRR) models using the least absolute shrinkage and selection operator (lasso) and adaptive lasso (alasso) are popularly used for gene selection and classification. Nevertheless, selection becomes challenging when the genes are highly correlated. Here, we propose a modified adaptive lasso whose weights come from ranking-based feature selection (RFS) methods and which is capable of dealing with highly correlated gene expression data. First, RFS methods such as Fisher's score (FS), Chi-square (CS), and information gain (IG) are employed to discard the unimportant genes, and the top significant genes are chosen through the sure independence screening (SIS) criterion. The scores of the ranked genes are then normalized and assigned as the proposed weights in the alasso method. This yields the most significant genes, which were shown to be biologically related to the cancer type and helped in attaining higher classification performance. Using synthetic data and a real application to colon cancer microarray data, we demonstrated that the proposed alasso method with RFS weights performs better than other known methods, such as alasso with ridge or marginal maximum likelihood estimation (MMLE) filtering, and lasso and alasso without filtering. Accuracy, area under the receiver operating characteristic curve (AUROC), and geometric mean (GM-mean) are used to evaluate the performance of the models. The article can be found here.
Programming: R (glmnet, ncvreg, praznik, PredPsych, Biocomb, ggplot2, and mvtnorm)
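The screening step of the modified adaptive lasso project above (rank genes, keep the top ones, normalize their scores) can be sketched as follows. This is a minimal Python illustration, not the article's R implementation. The gene names and values are invented, the screening size n / log(n) is one commonly suggested SIS choice and is an assumption here, and the sketch stops at the normalized scores because the summary does not spell out how they enter the alasso penalty.

```python
import math

def fisher_score(values, labels):
    """Fisher score of one gene: between-class over within-class scatter."""
    overall_mean = sum(values) / len(values)
    between = within = 0.0
    for cls in sorted(set(labels)):
        group = [v for v, lab in zip(values, labels) if lab == cls]
        mean_c = sum(group) / len(group)
        between += len(group) * (mean_c - overall_mean) ** 2
        within += sum((v - mean_c) ** 2 for v in group)
    return between / within if within > 0 else 0.0

def screen_and_normalize(expression, labels):
    """Keep the top-k genes by Fisher score and normalize scores to sum to 1.

    k = floor(n / log(n)) is an assumed screening size, with n samples.
    """
    n = len(labels)
    k = max(1, int(n / math.log(n)))
    scores = {g: fisher_score(v, labels) for g, v in expression.items()}
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    total = sum(scores[g] for g in top)
    if total == 0:
        return {g: 1.0 / len(top) for g in top}
    return {g: scores[g] / total for g in top}

# Hypothetical expression values for 6 samples (3 per class).
labels = [0, 0, 0, 1, 1, 1]
expression = {
    "geneA": [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],
    "geneB": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
    "geneC": [0.5, 0.7, 0.6, 1.5, 1.4, 1.6],
    "geneD": [5.0, 1.0, 3.0, 4.0, 2.0, 4.0],
}
weights = screen_and_normalize(expression, labels)
for gene, w in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{gene}: {w:.4f}")
```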
The high-throughput, correlated DNA methylation (DNAmeth) dataset was generated with the Illumina Infinium HumanMethylation27 (IIHM 27K) BeadChip assay. In DNAmeth data, there are several CpG sites for every gene, and these grouped CpG sites are highly correlated. Most current filtering-based ranking (FBR) methods do not consider such group correlation structures. Obtaining the significant features with FBR methods and passing them to classifiers to attain the best classification accuracy is therefore challenging in highly correlated DNAmeth data. In this research, we introduce a resampling-based group least absolute shrinkage and selection operator (glasso) FBR method that discards unrelated features while accounting for the group correlation among the features. Classifiers such as random forests (RF), Naive Bayes (NB), and support vector machines (SVM) achieved higher classification accuracy when given the significant CpGs obtained from the proposed resampling of group lasso-based ranking (RGLR) method. Simulated and experimental prostate DNAmeth data were used to show the higher performance of the RGLR method. The article can be found here.
Programming: R (gglasso, SIS, glmnet, ncvreg, caret, PredPsych, randomForest, praznik, varbvs, e1071, Biocomb, and ggplot2)
Methylation data from the Illumina Infinium HumanMethylation27 BeadChip assay (Illumina 27K) are less commonly used in bioinformatics than gene expression data. This creates a critical need to find the optimal feature ranking (FR) method for handling such high-dimensional data. The optimal FR method for a given classifier is not well known, and choosing the best-performing FR method becomes more challenging in the high-dimensional setting. Therefore, identifying the statistical methods that boost inference is of crucial importance in this context. This work describes in detail the performance of FR methods such as Fisher score, information gain, chi-square, and minimum redundancy maximum relevance on different classification methods such as AdaBoost, Random Forest, Naive Bayes, and Support Vector Machines. The article can be found here.
Programming: R (adabag, glmnet, PredPsych, randomForest, praznik, e1071, Biocomb, and ggplot2)
The purpose of this project is to develop an efficient and user-friendly bioinformatics pipeline for processing raw RNA-seq sample data from Illumina sequencers and converting them to a format that is easily accessible and interpretable by biomedical researchers. A typical computational workflow for RNA-seq is as follows: trimming the sequences, aligning them to a reference genome, normalizing transcript levels, merging the assemblies, obtaining summary statistics, and testing for significance using a t-test with Benjamini-Hochberg correction. The developed Python script takes the generated results and creates an Excel file with various levels of filtering, such as significance level and fold change. Based on the list of genes obtained from these filters, three more steps are performed. 1) Gene ontology terms are retrieved, showing the cellular components, molecular functions, and biological processes. 2) The up- and down-regulated genes are queried in LINCS L1000 to analyze the gene expression profile for connectivity to known perturbations. 3) Venn diagrams are generated to display counts of differentially expressed genes across multiple comparative studies among different samples. The detailed RNA-seq workflow can be found here.
Programming/bioinformatics tools: The open-source RNA-seq tools Trimmomatic, HISAT2, and Cufflinks (Cuffmerge and Cuffdiff) were used to generate the summary statistics of the data. Further statistical analysis was carried out using Python scripts.
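The filtering stage of the RNA-seq pipeline above (Benjamini-Hochberg correction followed by significance and fold-change cutoffs) can be sketched in plain Python. This is an illustration under stated assumptions rather than the project's actual script: the gene names, p-values, and the cutoffs of 0.05 and an absolute log2 fold change of 1 are all hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

def filter_genes(results, alpha=0.05, min_abs_log2fc=1.0):
    """Keep genes that pass both the adjusted p-value and fold-change cutoffs.

    results: list of (gene, log2 fold change, raw p-value) tuples.
    """
    q_values = benjamini_hochberg([p for _, _, p in results])
    return [
        (gene, lfc, q)
        for (gene, lfc, _), q in zip(results, q_values)
        if q <= alpha and abs(lfc) >= min_abs_log2fc
    ]

# Hypothetical differential expression results.
results = [
    ("GENE1", 2.3, 0.001),
    ("GENE2", -1.7, 0.008),
    ("GENE3", 0.4, 0.039),
    ("GENE4", 1.2, 0.041),
    ("GENE5", -3.0, 0.6),
]
significant = filter_genes(results)
for gene, lfc, q in significant:
    direction = "up" if lfc > 0 else "down"
    print(f"{gene}\t{direction}\tlog2FC={lfc}\tq={q:.4f}")
```

The resulting up- and down-regulated lists are the kind of input that the later gene ontology, LINCS L1000, and Venn diagram steps would consume.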
A two-stage approach is proposed for multi-class classification: variables are first filtered, and those that pass the filter are given to an ensemble classifier. In the first stage, marginal statistical tests filter informative variables based on a familywise error rate correction. In the second stage, these variables are input to an ensemble classifier that combines random forest (RF) and support vector machines (SVM), two popular nonparametric methods, to predict the class. Nonparametric methods are less sensitive to highly correlated data structures. The ensemble classifier is implemented in R and uses accuracy, sensitivity, and specificity as performance metrics. We show that our ensemble method for multinomial classification predicts better on the test set of samples than RF or SVM alone. The classifiers were applied to three microarray datasets obtained from the Gene Expression Omnibus (GEO) website. The presentation can be found here.
Programming: R packages (Bioconductor, randomForest, e1071, msgl)
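The first stage of the two-stage approach above, filtering variables under familywise error rate control, can be sketched as follows. The summary does not say which correction was used, so this Python example assumes Holm's step-down procedure as one standard choice; the probe names and p-values are hypothetical.

```python
def holm_filter(p_values, alpha=0.05):
    """Return the variables retained by Holm's step-down procedure.

    p_values: dict mapping variable name -> marginal test p-value.
    """
    m = len(p_values)
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    kept = []
    for i, (name, p) in enumerate(ordered):
        # The i-th smallest p-value (0-based) is compared with alpha / (m - i).
        if p > alpha / (m - i):
            break  # stop at the first failure; all larger p-values fail too
        kept.append(name)
    return kept

# Hypothetical marginal test results for four probes.
marginal_p = {
    "probe_a": 1e-6,
    "probe_b": 0.004,
    "probe_c": 0.02,
    "probe_d": 0.3,
}
selected = holm_filter(marginal_p)
print(selected)
```

Only the variables in `selected` would then be passed on to the second-stage ensemble classifier.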
A convolutional neural network (CNN) model was applied to high-resolution microscopic images collected at different stages of the estrous cycle in female rats. The data were pre-processed and split into training and validation sets, and the network was built with TensorFlow (Keras) to classify the images into four stages: Metestrus, Diestrus, Proestrus, and Estrus. The model achieved an accuracy of 90%. The project report can be found here.
Programming: Python (TensorFlow (Keras), Matplotlib) and Go.
The Ph.D. Progress Tracking Tool tracks Ph.D. students' progress at The University of Texas at El Paso (UTEP) during their degree and after its conferral. The intended users are current UTEP CS faculty and Ph.D. students, including administrators, advisors, current students, and alumni. The project has two end goals. The first is to ensure that students who have not yet graduated are progressing toward their degrees in a timely manner. The second is to track graduates' professional careers after graduation. The website follows the basic guidelines set by UTEP. The project report can be found here.
Programming: PHP and SQL.
A computational method for predicting protein subcellular localization was developed using the associative classification technique from data mining. Protein sequences were modeled as document sets: each protein sequence is divided into short k-mer sequence fragments, which map to word features in document classification. A large number of class association rules were then mined from protein sequence examples, covering fragments from the N-terminus to the C-terminus. A boosting algorithm was then applied to those rules to build the final classifier. Experimental results on benchmark data sets show that this method performs well in terms of both classification performance and test coverage. The results also imply that the k-mer sequence features that determine subcellular location do not necessarily occur at specific positions in a protein sequence.
Programming: Perl and Shell scripts
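The k-mer step of the localization method above, splitting a protein sequence into overlapping fragments that act like words in a document, can be sketched as follows. This is a Python illustration rather than the project's Perl code; the example sequence and the choice of k = 3 are hypothetical.

```python
from collections import Counter

def kmer_features(sequence, k=3):
    """Split a protein sequence into overlapping k-mers, N- to C-terminus."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def kmer_counts(sequence, k=3):
    """Bag-of-words view: how often each k-mer occurs, ignoring position."""
    return Counter(kmer_features(sequence, k))

# Hypothetical short amino acid sequence.
fragment = "MKTAYIAKQR"
kmers = kmer_features(fragment, k=3)
print(kmers)
print(kmer_counts("MKMKM", k=2))
```

Counting k-mers without their positions matches the observation above that the informative fragments need not sit at fixed positions in the sequence.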
Secure Electronic Transactions (SET) is an open protocol with the potential to become a dominant means of securing electronic transactions. SET relies on cryptography, the science of encoding and decoding messages. The front-end user interface is developed in C#, and a database management system (DBMS) stores the users' data at the back end. The data are encrypted with the AES algorithm, and the RSA algorithm provides additional security by generating the private and public keys. The app supports more than one user on the same device, and a single user can hold more than one account across multiple banks. The transaction logs are stored in the database in the form of a digital passbook.
Programming: C# and SQL