Statistical Interpretation for Word Embeddings in Natural Language Processing
ӿZitong Zhang, §Ashraf Yaseen, Hulin Wu
International Journal of Data Science and Analytics. (accepted) 8/2025
Word embeddings, while essential in natural language processing (NLP), lack a clear theoretical statistical foundation. It is challenging to interpret the specific quantity being optimized by the various training methods, or to understand the rationale behind their effectiveness in generating high-quality word representations.
In this study, we aim to bridge the gap between word embeddings and statistical methodology by demonstrating that popular NLP training methods, such as Word2Vec and fastText, can be seen as statistical estimates of the pointwise mutual information (PMI) matrix, a more interpretable and consistent text vectorization method.
To support this interpretation, we compared the performance of PMI matrix representation methods with that of Word2Vec methods on a text semantic classification task. We also proposed a Variational Bayesian Inference approach to enhance the low-rank estimation of the sparse PMI matrix in word embedding tasks, and compared it against the classic Shifted Positive PMI with Singular Value Decomposition (SPPMI-SVD) method.
Our results demonstrate the effectiveness of the PMI representation of word embedding models in real-world information retrieval scenarios, and show that the proposed empirical positive PMI matrix improves on the performance of the classic SPPMI-SVD method.
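As a rough illustration of the classic SPPMI-SVD baseline named in this abstract, the sketch below builds a shifted positive PMI matrix from a word-context co-occurrence count matrix and factorizes it with a truncated SVD. It is a minimal sketch, not the authors' implementation: the function name `sppmi_svd`, the shift `log(k)`, the square-root weighting of singular values, and the toy counts are all assumptions made for illustration.

```python
import numpy as np

def sppmi_svd(cooc, k=1.0, rank=2):
    """Return (SPPMI matrix, rank-`rank` word vectors) from co-occurrence counts.

    Assumed form: SPPMI = max(PMI - log(k), 0), then truncated SVD.
    """
    cooc = np.asarray(cooc, dtype=float)
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)   # word marginal counts
    col = cooc.sum(axis=0, keepdims=True)   # context marginal counts
    expected = row * col / total            # counts expected under independence

    # PMI is log(observed / expected); unseen pairs get -inf and are clipped to 0 below.
    pmi = np.full_like(cooc, -np.inf)
    seen = cooc > 0
    pmi[seen] = np.log(cooc[seen] / expected[seen])

    sppmi = np.maximum(pmi - np.log(k), 0.0)

    # Low-rank factorization; sqrt(S) weighting is one common choice (an assumption here).
    u, s, _ = np.linalg.svd(sppmi, full_matrices=False)
    vectors = u[:, :rank] * np.sqrt(s[:rank])
    return sppmi, vectors

# Hypothetical toy counts: two words that never share a context.
toy_counts = [[4, 0], [0, 4]]
sppmi, vectors = sppmi_svd(toy_counts, k=1.0, rank=1)
```

The variational Bayesian estimator proposed in the paper is not reproduced here, since the abstract does not specify it.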
Increasing COVID-19 Testing and Vaccination Uptake in the Take Care Texas Community-Based Randomized Trial: Adaptive Geospatial Analysis
Kehe Zhang, Jocelyn V Hunyadi, Marcia C de Oliveira Otto, Miryoung Lee, Zitong Zhang, Ryan Ramphul, Jose-Miguel Yamal, Ashraf Yaseen, Alanna C Morrison, Shreela Sharma, Mohammad Hossein Rahbar, Xu Zhang, Stephen Linder, Dritana Marko, Rachel White Roy, Deborah Banerjee, Esmeralda Guajardo, Michelle Crum, Belinda Reininger, Maria E Fernandez, Cici Bauer
Journal of Medical Internet Research (JMIR). Form Res 2025;9:e62802. PMID: 39935005 PMCID: PMC11835599 DOI: 10.2196/62802
Background: Geospatial data science can be a powerful tool to aid the design, reach, efficiency, and impact of community-based intervention trials. The project titled Take Care Texas aims to develop and test an adaptive, multilevel, community-based intervention to increase COVID-19 testing and vaccination uptake among vulnerable populations in 3 Texas regions: Harris County, Cameron County, and Northeast Texas.
Objective: We aimed to develop a novel procedure for adaptive selections of census block groups (CBGs) to include in the community-based randomized trial for the Take Care Texas project.
Methods: CBG selection was conducted across 3 Texas regions over a 17-month period (May 2021 to October 2022). We developed persistent and recent COVID-19 burden metrics, using real-time SARS-CoV-2 monitoring data to capture dynamic infection patterns. To identify vulnerable populations, we also developed a CBG-level community disparity index, using 12 contextual social determinants of health (SDOH) measures from US census data. In each adaptive round, we determined the priority CBGs based on their COVID-19 burden and disparity index, ensuring geographic separation to minimize intervention "spillover." Community input and feedback from local partners and health workers further refined the selection. The selected CBGs were then randomized into 2 intervention arms (multilevel intervention and just-in-time adaptive intervention) and 1 control arm, using covariate adaptive randomization, at a 1:1:1 ratio. We developed interactive data dashboards, which included maps displaying the locations of selected CBGs and community-level information, to inform the selection process and guide intervention delivery. Selection and randomization occurred across 10 adaptive rounds.
Results: A total of 120 CBGs were selected and followed the stepped planning and interventions, with 60 in Harris County, 30 in Cameron County, and 30 in Northeast Texas counties. COVID-19 burden presented substantial temporal changes and local variations across CBGs. COVID-19 burden and community disparity exhibited some common geographical patterns but also displayed distinct variations, particularly at different time points throughout this study. This underscores the importance of incorporating both real-time monitoring data and contextual SDOH in the selection process.
Conclusions: The novel procedure integrated real-time monitoring data and geospatial data science to enhance the design and adaptive delivery of a community-based randomized trial. Adaptive selection effectively prioritized the most in-need communities and allowed for a rigorous evaluation of community-based interventions in a multilevel trial. This methodology has broad applicability and can be adapted to other public health intervention and prevention programs, providing a powerful tool for improving population health and addressing health disparities.
Scholarly recommendation system for NIH funded grants based on biomedical word embedding models
ӿZitong Zhang, §Ashraf Yaseen, Hulin Wu.
Natural Language Processing Journal. August 2024. https://doi.org/10.1016/j.nlp.2024.100095
Objective: Research grants, which are available from several sources, are essential for scholars to sustain a good standing in academia. Although securing grant funds for research is very competitive, being able to locate previously funded grants and projects that are relevant to researchers’ interests would be very helpful. In this work, we developed a funded-grants/projects recommendation system for the National Institutes of Health (NIH) grants.
Methods: Our system aims to recommend funded grants to researchers based on their publications or input keywords. By extracting summary information from funded grants and their associated applications, we employed two embedding models for biomedical words and sentences (biowordvec and biosentvec), and compared multiple recommendation methods to recommend the most relevant funded grants for researchers’ input.
Results: Compared to a baseline method, the recommendation system based on biomedical word embedding models provided higher performance. The system also received an average rating of 3.53 out of 5, based on the relevancy evaluation results from biomedical researchers.
Conclusion: Both internal and external evaluation results prove the effectiveness of our recommendation system. The system would be helpful for biomedical researchers to locate and find previously funded grants related to their interests.
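One common way to turn embeddings into recommendations is to rank items by cosine similarity to a query vector. The sketch below shows that idea under stated assumptions: the abstract does not name the recommendation methods that were compared, the vectors are assumed to be precomputed (for example by a sentence embedding model), and `rank_by_cosine` is a hypothetical helper, not part of the described system.

```python
import numpy as np

def rank_by_cosine(query_vec, item_vecs, top_n=3):
    """Return [(item_index, similarity)] for the top_n items most similar to the query."""
    query = np.asarray(query_vec, dtype=float)
    items = np.asarray(item_vecs, dtype=float)
    eps = 1e-12  # guards against division by zero for all-zero vectors
    sims = items @ query / (np.linalg.norm(items, axis=1) * np.linalg.norm(query) + eps)
    order = np.argsort(-sims)[:top_n]       # indices sorted by descending similarity
    return [(int(i), float(sims[i])) for i in order]

# Hypothetical 2-dimensional embeddings for three grants and one researcher query.
grant_vecs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
top = rank_by_cosine([1.0, 0.0], grant_vecs, top_n=2)
```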
Factors Associated with Elevated SARS-CoV-2 Immune Response in Children and Adolescents
Sarah Messiah, Rhiana Abbas, Emma Bergqvist, Harold W Kohl, Michael D Swartz, Yashar Talebi, Rachit Sabharwal, Haoting Han, Melissa A Valerio-Shewmaker, Stacia M Desantis, Ashraf Yaseen, Henal A Gandhi, Ximena Flandes Amavisca, Jessica Ross, Lindsay N Padilla, Michael O Gonzalez, Leqing Wu, Mark A Silberman, David Lakey, Jennifer A Shuford, Stephen Pont and Eric Boerwinkle.
Frontiers in Pediatrics. 14 August 2024. Volume 12 - 2024 | https://doi.org/10.3389/fped.2024.1393321
Background: Understanding the distinct immunologic responses to SARS-CoV-2 infection among pediatric populations is pivotal in navigating the COVID-19 pandemic and informing future public health strategies. This study aimed to identify factors associated with heightened antibody responses in children and adolescents to identify potential unique immune dynamics in this population.
Methods: Data collected between July and December 2023 from the Texas Coronavirus Antibody REsponse Survey (Texas CARES), a statewide prospective population-based antibody survey among 1-to-19-year-old participants, were analyzed. Each participant had the following data available for analysis: (1) Roche Elecsys® Anti-SARS-CoV-2 Immunoassay for Nucleocapsid protein antibodies (Roche N-test), (2) qualitative and semi-quantitative detection of antibodies to the SARS CoV-2 spike protein receptor binding domain (Roche S-test), and (3) self-reported antigen/PCR COVID-19 test results, vaccination, and health status. Statistical analysis identified associations between participant characteristics and spike antibody quartile group.
Results: The analytical sample consisted of 411 participants (mean age 12.2 years, 50.6% female). Spike antibody values ranged from a low of 6.3 U/ml in the lowest quartile to a maximum of 203,132.0 U/ml in the highest quartile in the aggregate sample. Older age at test date (OR = 1.22, 95% CI: 1.12, 1.35, p < .001) and vaccination status (primary series/partially vaccinated, one or multiple boosters) showed significantly higher odds of being in the highest spike antibody quartile compared to younger age and unvaccinated status. Conversely, fewer days since the last immunity challenge showed decreased odds (OR = 0.98, 95% CI: 0.96, 0.99, p = 0.002) of being in the highest spike antibody quartile vs. more days since last immunity challenge. Additionally, one out of every three COVID-19 infections were asymptomatic.
Conclusions: Older age, duration since the last immunity challenge (vaccine or infection), and vaccination status were associated with heightened spike antibody responses, highlighting the nuanced immune dynamics in the pediatric population. A significant proportion of children/adolescents continue to have asymptomatic infection, which has important public health implications.
Baseline characteristics of SARS-CoV-2 vaccine non-responders in a large population-based sample
†Ashraf Yaseen, Stacia M. DeSantis, Rachit Sabharwal, Yashar Talebi, Michael D. Swartz, Shiming Zhang, Luis Leon Novelo, Cesar L Pinzon-Gomez, Sarah E. Messiah, Melissa Valerio-Shewmaker, Harold W. Kohl, Jessica Ross, David Lakey, Jennifer A. Shuford, Stephen J. Pont and Eric Boerwinkle.
PLoS One. 2024 May 13;19(5):e0303420. https://doi.org/10.1371/journal.pone.0303420 PMID: 38739625; PMCID: PMC11090326.
Introduction: Studies indicate that individuals with chronic conditions and specific baseline characteristics may not mount a robust humoral antibody response to SARS-CoV-2 vaccines. In this paper, we used data from the Texas Coronavirus Antibody REsponse Survey (Texas CARES), a longitudinal state-wide seroprevalence program that has enrolled more than 90,000 participants, to evaluate the role of chronic diseases as the potential risk factors of non-response to SARS-CoV-2 vaccines in a large epidemiologic cohort.
Methods: A participant needed to complete an online survey and a blood draw to test for SARS-CoV-2 circulating plasma antibodies at four time points spaced at least three months apart. Chronic disease predictors of vaccine non-response were evaluated using logistic regression with non-response as the outcome and each chronic disease + age as the predictors.
Results: As of April 24, 2023, 18,240 participants met the inclusion criteria; 0.58% (N = 105) of these are non-responders. Adjusting for age, our results show that participants with self-reported immunocompromised status, kidney disease, cancer, and “other” non-specified comorbidity were 15.43, 5.11, 2.59, and 3.13 times more likely to fail to mount a complete response to a vaccine, respectively. Furthermore, having two or more chronic diseases doubled the prevalence of non-response.
Conclusion: Consistent with smaller targeted studies, a large epidemiologic cohort bears out the same conclusion and demonstrates that immunocompromised status, cancer, kidney disease, and the number of diseases are associated with vaccine non-response. This study suggests that individuals with chronic diseases that can affect their immune system response may need increased doses or repeated doses of COVID-19 vaccines to develop a protective antibody level.
Long-term immune response to SARS-CoV-2 infection and vaccination in children and adolescents
Sarah E. Messiah, Yashar Talebi, Michael D. Swartz, Rachit Sabharwal, Haoting Han, Emma Bergqvist, Harold W. Kohl III, Melissa Valerio-Shewmaker, Stacia M. DeSantis, Ashraf Yaseen, Steven H. Kelder, Jessica Ross, Lindsay N. Padilla, Michael O. Gonzalez, Leqing Wu, David Lakey, Jennifer A. Shuford, Stephen J. Pont & Eric Boerwinkle.
Pediatric Research 96, 525–534 (2024). https://doi.org/10.1038/s41390-023-02857-y
Background: This analysis examined the durability of antibodies present after SARS-CoV-2 infection and vaccination in children and adolescents.
Methods: Data were collected over 4 time points between October 2020-November 2022 as part of a prospective population-based cohort aged 5-to-19 years (N = 810). Results of the (1) Roche Elecsys® Anti-SARS-CoV-2 Immunoassay for detection of antibodies to the SARS-CoV-2 nucleocapsid protein (Roche N-test); and (2) qualitative and semi-quantitative detection of antibodies to the SARS CoV-2 spike protein receptor binding domain (Roche S-test); and (3) self-reported antigen/PCR COVID-19 test results, vaccination and symptom status were analyzed.
Results: N antibody levels reached a median of 84.10 U/ml (IQR: 20.2, 157.7) cutoff index (COI) ~ 6 months post-infection and increased slightly to a median of 85.25 (IQR: 28.0, 143.0) COI at 12 months post-infection. Peak S antibody levels were reached at a median of 2500 U/mL ~6 months post-vaccination and remained for ~12 months (mean 11.6 months, SD 1.20).
Conclusions: This analysis provides evidence of robust durability of nucleocapsid and spike antibodies in a large pediatric sample up to 12 months post-infection/vaccination. This information can inform pediatric SARS-CoV-2 vaccination schedules.
Impact
This study provided evidence of robust durability of both nucleocapsid and spike antibodies in a large pediatric sample up to 12 months after infection.
Little is known about the long-term durability of natural and vaccine-induced SARS-CoV-2 antibodies in the pediatric population. Here, we determined that anti–SARS-CoV-2 spike (S-test) and nucleocapsid protein (N-test) antibodies in children/adolescents last at least up to 12 months after SARS-CoV-2 infection and/or vaccination.
This information can inform future SARS-CoV-2 vaccination schedules in this age group.
Integrating, Harmonizing, and Curating Studies with High-Frequency and Hourly Physiological Data: Proof of Concept from Seven Traumatic Brain Injury Datasets.
†Ashraf Yaseen, Claudia Robertson, Jovany Cruz Navarro, Jingxiao Chen, Brian Heckler, Stacia DeSantis, Nancy Temkin, Jason Barber, Brandon Foreman, Ramon Diaz-Arrastia, Randall Chesnut, Geoff Manley, David Wright, Mary Vassar, Adam Ferguson, Amy Markowitz, Jose-Miguel Yamal.
Journal of Neurotrauma. 2023 Aug 16. https://doi.org/10.1089/neu.2023.0023 PMID: 37341031.
Abstract
Research in severe traumatic brain injury (TBI) has historically been limited by studies with relatively small sample sizes that result in low power to detect small, yet clinically meaningful outcomes. Data sharing and integration from existing sources hold promise to yield larger, more robust sample sizes that improve the potential signal and generalizability of important research questions. However, curation and harmonization of data of different types and of disparate provenance is challenging. We report our approach and experience integrating multiple TBI data sets containing collected physiological data, including both expected and unexpected challenges encountered in the integration process. Our harmonized data set included data on 1536 patients from the Citicoline Brain Injury Treatment Trial (COBRIT), Effect of erythropoietin and transfusion threshold on neurological recovery after traumatic brain injury: a randomized clinical trial (EPO Severe TBI), BEST-TRIP, Progesterone for the Treatment of Traumatic Brain Injury III Clinical Trial (ProTECT III), Transforming Research and Clinical Knowledge in Traumatic Brain Injury (TRACK-TBI), Brain Oxygen Optimization in Severe Traumatic Brain Injury Phase-II (BOOST-2), and Ben Taub General Hospital (BTGH) Research Database studies. We conclude with process recommendations for data acquisition for future prospective studies to aid integration of these data with existing studies. These recommendations include using common data elements whenever possible, a standardized recording system for labeling and timing of high-frequency physiological data, and secondary use of studies in systems such as Federal Interagency Traumatic Brain Injury Research Informatics System (FITBIR), to engage investigators who collected the original data.
An Interactive Online Dashboard with Covid-19 Trends and Data Analysis in Northeast and South Texas
ӿZitong Zhang, ӿRachit Sabharwal, Miryoung Lee, Kehe Zhang, Paul McGaha, Michelle Crum, Cici Bauer, Susan P. Fisher-Hoch, Joseph B. McCormick, Belinda M Reininger, Samantha Thomas, Esmeralda Guajardo, Daniel Pinon, §Ashraf Yaseen.
The Texas Public Health Journal (TPHJ). Volume 76 Issue 2. 2023.
Abstract
In response to the COVID-19 pandemic, we developed online interactive dashboards to track and visualize the reported cases in real time in certain geographical areas in Texas. In collaboration with local public health departments, we established an automated process of data collection/processing and then developed dashboards with more detailed tracking and analysis results. The dashboards aimed to provide Texans in those areas with health disparities with clear, detailed, and timely tracking and analysis results of COVID-19 data and to help the public understand the current status of COVID-19 in their areas. The dashboards were evaluated by surveying our collaborators in those areas, who deemed them highly effective and useful. We believe that the framework used in developing the dashboards and the data processing pipeline presented in this work could be used in tracking emerging infectious diseases and planning for emergency management in the future.
A Content-Based Dataset Recommendation System for Biomedical Datasets.
ӿZitong Zhang, §Ashraf Yaseen.
International Conference on Information and Computer Technologies (ICICT), Raleigh, NC, USA, 2023 pp. 198-202. doi: 10.1109/ICICT58900.2023.00040
Abstract:
Nowadays, with the rapid development of cloud data and online collaboration platforms, there is a growing trend among researchers to make their data publicly available for experimental reproducibility and data reusability. On one hand, sharing data with collaborators increases the visibility of the work. On the other hand, the abundance of data on multiple platforms makes it hard for researchers to find data relevant to their own research. To overcome this challenge, a dataset recommendation system capable of finding relevant datasets from multiple resources would be helpful. In the past two decades, few dataset recommendation methods have been implemented, and those are mostly domain-specific or simply recommend datasets based on keywords. We believe a general dataset recommender system that recommends datasets with information either extracted from another dataset or supplied by researchers can enhance researchers’ efficiency in searching for relevant data and significantly improve their research efficiency. This work adopts an information retrieval (IR) paradigm for dataset recommendation. By extracting summary information from each dataset and generating a profile for each, we use and compare multiple content-based recommendation methods to recommend the most-relevant datasets in GEO, SRA, and several other repositories. Our results and evaluations prove the usefulness of and need for such a system.
Incorporating uncertainty quantification for actionable insights and performance improvement of academic recommenders.
ӿJie Zhu, §Ashraf Yaseen, Luis Leon-Novelo.
Knowledge 2023, 3, 293-306. https://doi.org/10.3390/knowledge3030020
Abstract
Deep learning is widely used in many real-life applications. Despite their remarkable performance accuracies, deep learning networks are often poorly calibrated, which could be harmful in risk-sensitive scenarios. Uncertainty quantification offers a way to evaluate the reliability and trustworthiness of deep-learning-based model predictions. In this work, we introduced uncertainty quantification to our virtual research assistant recommender platform through both Monte Carlo dropout and ensemble techniques. We also proposed a new formula to incorporate the uncertainty estimates into our recommendation models. The experiments were carried out on two different components of the recommender platform (i.e., a BERT-based grant recommender and a temporal graph network (TGN)-based collaborator recommender) using real-life datasets. The recommendation results were compared in terms of both recommender metrics (AUC, AP, etc.) and the calibration/reliability metric (ECE). With uncertainty quantification, we were able to better understand the behavior of our regular recommender outputs; while our BERT-based grant recommender tends to be overconfident with its outputs, our TGN-based collaborator recommender tends to be underconfident in producing matching probabilities. Initial case studies also showed that our proposed model with uncertainty quantification adjustment from ensemble gave the best-calibrated results together with the desirable recommender performance.
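The calibration metric mentioned above, expected calibration error (ECE), can be sketched as follows. This is a generic equal-width-bin formulation, assumed here for illustration; the number of bins, the function name `expected_calibration_error`, and the toy inputs are not taken from the paper, and the paper's own uncertainty-adjustment formula is not reproduced because the abstract does not state it.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted average of |accuracy - mean confidence| over equal-width bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Right-closed bins (lo, hi]; a confidence of exactly 0.0 falls in the first bin.
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap      # in_bin.mean() is the fraction of samples in the bin
    return float(ece)

# Hypothetical overconfident model: 90% confidence but only 75% accuracy.
ece_overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

In this toy case the gap between confidence and accuracy is 0.15, which is the kind of overconfidence the abstract reports for the BERT-based recommender.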
Scholarly Recommendation Systems: A Literature Survey.
ӿZitong Zhang, Braja Gopal Patra, §Ashraf Yaseen, ӿJie Zhu, ӿRachit Sabharwal, Kirk Roberts, Tru Cao, and Hulin Wu.
Knowledge and Information Systems (2023), https://doi.org/10.1007/s10115-023-01901-x
Abstract
A scholarly recommendation system is an important tool for identifying prior and related resources such as literature, datasets, grants, and collaborators. A well-designed scholarly recommender significantly saves the time of researchers and can provide information that would not otherwise be considered. The usefulness of scholarly recommendations, especially literature recommendations, has been established by the widespread acceptance of web search engines such as CiteSeerX, Google Scholar, and Semantic Scholar. This article discusses different aspects and developments of scholarly recommendation systems. We searched the ACM Digital Library, DBLP, IEEE Xplore, and Scopus for publications in the domain of scholarly recommendations for literature, collaborators, reviewers, conferences and journals, datasets, and grant funding. In total, 225 publications were identified in these areas. We discuss methodologies used to develop scholarly recommender systems. Content-based filtering is the most commonly applied technique, whereas collaborative filtering is more popular among conference recommenders. The implementation of deep learning algorithms in scholarly recommendation systems is rare among the screened publications. We found fewer publications in the areas of the dataset and grant funding recommenders than in other areas. Furthermore, studies analyzing users’ feedback to improve scholarly recommendation systems are rare. This survey provides background knowledge regarding existing research on scholarly recommenders and aids in developing future recommendation systems in this domain.
Incidence and predictors of breakthrough and severe breakthrough infections of SARS-CoV-2 after primary series vaccination in adults: A population-based survey of 90,000 participants
Stacia DeSantis, †Ashraf Yaseen, Tianyao Hao, Luis León-Novelo, Yashar Talebi, Melissa Valerio-Shewmaker, Cesar Pinzon Gomez, Sarah Messiah, Harold Kohl, Steven Kelder, Jessica Ross, Lindsay Padilla, Mark Silberman, Samantha Tuzo, David Lakey, Jennifer Shuford, Stephen Pont, Eric Boerwinkle, Michael Swartz.
Journal of Infectious Diseases. 2023 May 12;227(10):1164-1172. doi: 10.1093/infdis/jiad020. PMID: 36729177
Abstract
Background: Breakthrough infections of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) are well documented. The current study estimates breakthrough incidence across pandemic waves, and evaluates predictors of breakthrough and severe breakthrough infections (defined as those requiring hospitalization).
Methods: In total, 89 762 participants underwent longitudinal antibody surveillance. Incidence rates were calculated using total person-days contributed. Bias-corrected and age-adjusted logistic regression determined multivariable predictors of breakthrough and severe breakthrough infection, respectively.
Results: The incidence was 0.45 (95% confidence interval [CI], .38–.50) during pre-Delta, 2.80 (95% CI, 2.25–3.14) during Delta, and 11.2 (95% CI, 8.80–12.95) during Omicron, per 10 000 person-days. Factors associated with elevated odds of breakthrough included Hispanic ethnicity (vs non-Hispanic white, OR = 1.243; 95% CI, 1.073–1.441), larger household size (OR = 1.251 [95% CI, 1.048–1.494] for 3–5 vs 1 and OR = 1.726 [95% CI, 1.317–2.262] for more than 5 vs 1 person), rural versus urban living (OR = 1.383; 95% CI, 1.122–1.704), receiving Pfizer or Johnson & Johnson versus Moderna, and multiple comorbidities. Of the 1700 breakthrough infections, 1665 reported on severity; 112 (6.73%) were severe. Higher body mass index, Hispanic ethnicity, vaccine type, asthma, and hypertension predicted severe breakthroughs.
Conclusions: Breakthrough infection was 4–25 times more common during the Omicron-dominant wave versus earlier waves. Higher burden of severe breakthrough infections was identified in subgroups.
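The incidence figures above are rates per 10,000 person-days, and the wave-to-wave comparison in the conclusions follows from their ratios. The short sketch below illustrates the arithmetic; the helper `incidence_per_10k` and the event and person-day counts passed to it are hypothetical, while the three rates are the ones reported in the results.

```python
def incidence_per_10k(events, person_days):
    """Incidence rate per 10,000 person-days of follow-up."""
    if person_days <= 0:
        raise ValueError("person_days must be positive")
    return events / person_days * 10_000

# Hypothetical counts, for illustration only: 45 events over 1,000,000 person-days.
example_rate = incidence_per_10k(45, 1_000_000)

# Rates reported in the abstract, per 10,000 person-days.
reported = {"pre-Delta": 0.45, "Delta": 2.80, "Omicron": 11.2}
ratio_vs_delta = reported["Omicron"] / reported["Delta"]
ratio_vs_pre_delta = reported["Omicron"] / reported["pre-Delta"]
```

The two ratios come out to about 4 and about 24.9, consistent with the statement that breakthrough infection was 4 to 25 times more common during the Omicron-dominant wave.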
RE: Incidence of SARS-CoV-2 Breakthrough Infections After Vaccination in Adults: A Population-Based Survey Through 1 March 2023.
Stacia M DeSantis, †Ashraf Yaseen, Tianyao Hao, Luis León-Novelo, Yashar Talebi, Melissa A Valerio-Shewmaker, Cesar L Pinzon Gomez, Sarah E Messiah, Harold W Kohl, Steven H Kelder, Jessica A Ross, Lindsay N Padilla, Mark Silberman, Samantha Wylie, David Lakey, Jennifer A Shuford, Stephen J Pont, Eric Boerwinkle, Michael D Swartz.
Open Forum Infectious Diseases, Volume 10, Issue 12, December 2023, ofad564, https://doi.org/10.1093/ofid/ofad564
A novel NIH research grant recommender using BERT.
ӿJie Zhu, Braja Patra, Hulin Wu, §Ashraf Yaseen.
PLoS One. 2023 Jan 17;18(1):e0278636. https://doi.org/10.1371/journal.pone.0278636
Abstract
Research grants are important for researchers to sustain a good position in academia. There are many grant opportunities available from different funding agencies. However, finding relevant grant announcements is challenging and time-consuming for researchers. To resolve the problem, we proposed a grant announcements recommendation system for the National Institutes of Health (NIH) grants using researchers’ publications. We formulated the recommendation as a classification problem and proposed a recommender using a state-of-the-art deep learning technique, Bidirectional Encoder Representations from Transformers (BERT), to capture intrinsic, non-linear relationships between researchers’ publications and grant announcements. Internal and external evaluations were conducted to assess the system’s usefulness. During internal evaluations, the grant citations were used to establish grant-publication ground truth, and results were evaluated against Recall@k, Precision@k, Mean reciprocal rank (MRR) and Area under the Receiver Operating Characteristic curve (ROC-AUC). During external evaluations, researchers’ publications were clustered using Dirichlet Process Mixture Model (DPMM), grants recommended by our model were then aggregated per cluster through Recency Weight, and finally researchers were invited to rate the recommendations to calculate Precision@k. For comparison, baseline recommenders using Okapi Best Matching (BM25), Term-Frequency Inverse Document Frequency (TF-IDF), doc2vec, and Naïve Bayes (NB) were also developed. Both internal and external evaluations (all metrics) revealed favorable performances of our proposed BERT-based recommender.
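The ranking metrics named in this abstract have standard definitions, sketched below for a single ranked list and for a set of queries. The function names and the grant identifiers are hypothetical and serve only to illustrate the calculations; they are not the evaluation code used in the paper.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommendations that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant item, or 0.0 if none is retrieved."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (ranked_list, relevant_set) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)

# Hypothetical example: grants g1 and g2 are the ground truth for one researcher.
ranked = ["g3", "g1", "g7", "g2"]
relevant = {"g1", "g2"}
p_at_2 = precision_at_k(ranked, relevant, 2)
r_at_2 = recall_at_k(ranked, relevant, 2)
mrr = mean_reciprocal_rank([(ranked, relevant), (["g9", "g8"], {"g9"})])
```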
A Recommender for Research Collaborators Using Graph Neural Networks.
ӿJie Zhu, §Ashraf Yaseen.
Frontiers in Artificial Intelligence. 2022 Aug 1;5:881704. https://doi.org/10.3389/frai.2022.881704 . PMID: 35978654; PMCID: PMC9376356.
Abstract
As most great discoveries and advancements in science and technology invariably involve the cooperation of a group of researchers, effective collaboration is the key factor. Nevertheless, finding suitable scholars and researchers to work with is challenging and, mostly, time-consuming for many. A recommender who is capable of finding and recommending collaborators would prove helpful. In this work, we utilized a life science and biomedical research database, i.e., MEDLINE, to develop a collaboration recommendation system based on novel graph neural networks, i.e., GraphSAGE and Temporal Graph Network, which can capture intrinsic, complex, and changing dependencies among researchers, including temporal user–user interactions. The baseline methods based on LightGCN and gradient boosting trees were also developed in this work for comparison. Internal automatic evaluations and external evaluations through end-users' ratings were conducted, and the results revealed that our graph neural networks recommender exhibits consistently encouraging results.
Sensitivity Analysis of a BERT-based scholarly recommendation system.
ӿJie Zhu, Hulin Wu, §Ashraf Yaseen.
FLAIRS Conference Proceedings, 35. 2022. https://doi.org/10.32473/flairs.v35i.130595.
Abstract
With the exponential growth of publicly available datasets, a scholarly recommendation system of datasets would be an essential tool in the field of information filtering. Recommending datasets to users can be formulated as a classification problem where deep learning models can be carefully trained. In such a case, when preparing training data for the learning models, one needs to consider different ratios of false and true pairs. Therefore, a sensitivity analysis is necessary. In this work, we conduct a sensitivity analysis using different class ratios on a deep learning model (BERT) for recommending datasets. We found that our BERT-based recommender model is relatively robust under recommender metrics such as Mean Reciprocal Rank (MRR)@k, Recall@k, etc., except for the extreme class imbalance case (1:5000). Therefore, we conclude that a moderate ratio for the random negative sampling scheme (in our case 1:10) is reasonable, sufficient, and time-efficient in recommendation system training.
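To make the class-ratio idea concrete, the sketch below builds labeled training pairs with a fixed number of random negatives per positive pair. It is a minimal sketch of random negative sampling in general; the function `sample_training_pairs`, the user and dataset identifiers, and the sampling details are assumptions, not the procedure used in the paper.

```python
import random

def sample_training_pairs(positives, all_items, neg_ratio=10, seed=0):
    """Return (user, item, label) triples with up to `neg_ratio` random negatives per positive."""
    rng = random.Random(seed)               # fixed seed for reproducible sampling
    positive_set = set(positives)
    pairs = []
    for user, item in positives:
        pairs.append((user, item, 1))
        # Candidate negatives: items with no known positive link to this user.
        candidates = [i for i in all_items if (user, i) not in positive_set]
        for negative in rng.sample(candidates, min(neg_ratio, len(candidates))):
            pairs.append((user, negative, 0))
    return pairs

# Hypothetical data: 2 true user-dataset pairs and a pool of 20 datasets.
positives = [("u1", "d1"), ("u2", "d2")]
all_items = [f"d{i}" for i in range(1, 21)]
pairs = sample_training_pairs(positives, all_items, neg_ratio=10)
```

With `neg_ratio=10` this yields the 1:10 ratio of true to false pairs that the abstract describes as a reasonable setting.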
SARS-CoV-2 Serostatus and COVID-19 Illness Characteristics by Variant Time Period in Non-Hospitalized Children and Adolescents.
Messiah, Sarah E., Michael D. Swartz, Rhiana A. Abbas, Yashar Talebi, Harold W. Kohl, III, Melissa Valerio-Shewmaker, Stacia M. DeSantis, Ashraf Yaseen, Steven H. Kelder, Jessica A. Ross, et al.
Children 10, no. 5: 818. 2023. https://doi.org/10.3390/children10050818
Abstract
Objective: To describe COVID-19 illness characteristics, risk factors, and SARS-CoV-2 serostatus by variant time period in a large community-based pediatric sample. Design: Data were collected prospectively over four timepoints between October 2020 and November 2022 from a population-based cohort ages 5 to 19 years old. Setting: State of Texas, USA. Participants: Participants ages 5 to 19 years were recruited from large pediatric healthcare systems, Federally Qualified Healthcare Centers, urban and rural clinical practices, health insurance providers, and a social media campaign. Exposure: SARS-CoV-2 infection. Main Outcome(s) and Measure(s): SARS-CoV-2 antibody status was assessed by the Roche Elecsys® Anti-SARS-CoV-2 Immunoassay for detection of antibodies to the SARS-CoV-2 nucleocapsid protein (Roche N-test). Self-reported antigen or PCR COVID-19 test results and symptom status were also collected. Results: Over half (57.2%) of the sample (N = 3911) was antibody positive. Symptomatic infection increased over time from 47.09% during the pre-Delta variant time period, to 76.95% during Delta, to 84.73% during Omicron, and to 94.79% during the Omicron BA.2 period. Those who were not vaccinated were more likely (OR 1.71, 95% CI 1.47, 2.00) to be infected versus those fully vaccinated. Conclusions: Results show an increase in symptomatic COVID-19 infection among non-hospitalized children with each progressive variant over the past two years. Findings here support the public health guidance that eligible children should remain up to date with COVID-19 vaccinations.
Methodology to estimate natural- and vaccine-induced antibodies to SARS-CoV-2 in a large geographic region.
Stacia DeSantis, Luis Leon-Novelo, Michael Swartz, Ashraf Yaseen, Melissa Valerio, Yashar Talebi, Frances Brito, Jessica Ross, Harold Kohl III, Sarah Messiah, Steve Kelder, Leqing Wu, Shiming Zhang, Kimberly Aguillard, Michael Gonzalez, Onyinye Omega-Njemnobi, David Lakey, Jennifer Shuford, Stephen Pont, Eric Boerwinkle.
PLOS ONE, 2022 Sep 9. PMID: 36084125 PMCID: PMC9462720 https://doi.org/10.1371/journal.pone.0273694
Abstract
Accurate estimates of natural and/or vaccine-induced antibodies to SARS-CoV-2 are difficult to obtain. Although model-based estimates of seroprevalence have been proposed, they require inputting unknown parameters including viral reproduction number, longevity of immune response, and other dynamic factors. In contrast to a model-based approach, the current study presents a data-driven detailed statistical procedure for estimating total seroprevalence (defined as antibodies from natural infection or from full vaccination) in a region using prospectively collected serological data and state-level vaccination data. Specifically, we conducted a longitudinal statewide serological survey with 88,605 participants 5 years or older with 3 prospective blood draws beginning September 30, 2020. Along with state vaccination data, as of October 31, 2021, the estimated percentage of those 5 years or older with naturally occurring antibodies to SARS-CoV-2 in Texas is 35.0% (95% CI = (33.1%, 36.9%)). This is 3× higher than state-confirmed COVID-19 cases (11.83%) for all ages. The percentage with naturally occurring or vaccine-induced antibodies (total seroprevalence) is 77.42%. This methodology is integral to pandemic preparedness as accurate estimates of seroprevalence can inform policy-making decisions relevant to SARS-CoV-2.
Comparison of Persistent Symptoms Following SARS-CoV-2 Infection by Antibody Status in Nonhospitalized Children and Adolescents.
Sarah Messiah, Tianyao Hao, Stacia DeSantis, Michael Swartz, Yashar Talebi, Harold Kohl, Shiming Zhang, Melissa Valerio-Shewmaker, Ashraf Yaseen, Steven Kelder, Jessica Ross, Michael Gonzalez, Leqing Wu, Lindsay Padilla, Kourtney Lopez, David Lakey, Jennifer Shuford, Stephen Pont, Eric Boerwinkle.
The Pediatric Infectious Disease Journal. 2022;INF.0000000000003653. doi:10.1097/INF.0000000000003653
Abstract
Background: The prevalence of long-term symptoms of coronavirus disease 2019 (COVID-19) in nonhospitalized pediatric populations in the United States is not well described. The objective of this analysis was to examine the presence of persistent COVID symptoms in children by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) antibody status.
Methods: Data were collected between October 2020 and May 2022 from the Texas Coronavirus Antibody REsponse Survey, a statewide prospective population-based survey among 5-90 years old. Serostatus was assessed by the Roche Elecsys Anti-SARS-CoV-2 Immunoassay for detection of antibodies to the SARS-CoV-2 nucleocapsid protein. Self-reported antigen/polymerase chain reaction COVID-19 test results and persistent COVID symptom status/type/duration were collected simultaneously. Risk ratios for persistent COVID symptoms were calculated versus adults and by age group, antibody status, symptom presence/severity, variant, body mass index and vaccine status.
Results: A total of 82 (4.5% of the total sample [n = 1813], 8.0% pre-Delta, 3.4% Delta and beyond) participants reported persistent COVID symptoms (n = 27 [1.5%] 4–12 weeks, n = 58 [3.3%] >12 weeks). Compared with adults, all pediatric age groups had a lower risk for persistent COVID symptoms regardless of length of symptoms reported. Additional increased risk for persistent COVID symptoms >12 weeks included severe symptoms with initial infection, not being vaccinated and having unhealthy weight (body mass index ≥85th percentile for age and sex).
Conclusions: These findings highlight the existence of nonhospitalized youth who may also experience persistent COVID symptoms. Children and adolescents are less likely to experience persistent COVID symptoms than adults and more likely to be symptomatic, experience severe symptoms and have unhealthy weight compared with children/adolescents without persistent COVID symptoms.
Antibody duration after infection from SARS-CoV-2 in the Texas Coronavirus Antibody Response Survey.
Michael Swartz, Stacia DeSantis, Ashraf Yaseen, Frances Brito, Melissa Valerio-Shewmaker, Sarah E Messiah, Luis G Leon-Novelo, Harold Kohl, Cesar Pinzon-Gomez, Tianyao Hao, Shiming Zhang, Yashar Talebi, Joy Yoo, Jessica Ross, Michael O Gonzalez, Leqing Wu, Steven H Kelder, Mark Silberman, Samantha Tuzo, Stephen J Pont, Jennifer Shuford, David Lakey, Eric Boerwinkle.
Journal of Infectious Diseases. 2022;jiac167. doi:10.1093/infdis/jiac167
Abstract
Understanding the duration of antibodies to the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus that causes COVID-19 is important to controlling the current pandemic. Participants from the Texas Coronavirus Antibody Response Survey (Texas CARES) with at least 1 nucleocapsid protein antibody test were selected for a longitudinal analysis of antibody duration. A linear mixed model was fit to data from participants (n = 4553) with 1 to 3 antibody tests over 11 months (1 October 2020 to 16 September 2021), and models fit showed that expected antibody response after COVID-19 infection robustly increases for 100 days postinfection, and predicts individuals may remain antibody positive from natural infection beyond 500 days depending on age, body mass index, smoking or vaping use, and disease severity (hospitalized or not; symptomatic or not).
Durability of SARS-CoV-2 Antibodies From Natural Infection in Children and Adolescents.
Sarah Messiah, Stacia DeSantis, Luis Leon-Novelo, Yashar Talebi, Frances Brito, Harold Kohl, Melissa Valerio-Shewmaker, Jessica Ross, Michael Swartz, Ashraf Yaseen, Steven Kelder, Shiming Zhang, Onyinye Omega-Njemnobi, Michael Gonzalez, Leqing Wu, Eric Boerwinkle, David Lakey, Jennifer Shuford, Stephen Pont.
Pediatrics June 2022; 149 (6): e2021055505. doi: 10.1542/peds.2021-055505
Strategies to Estimate Prevalence of SARS-CoV-2 Antibodies in a Texas Vulnerable Population: Results From Phase I of the Texas Coronavirus Antibody Response Survey.
Melissa Valerio-Shewmaker, Stacia DeSantis, Michael Swartz, Ashraf Yaseen, Michael Gonzalez, Harold Kohl, Steven Kelder, Sarah Messiah, Kimberly Aguillard, Camille Breaux, Leqing Wu, Jennifer Shuford, Stephen Pont, David Lakey, Eric Boerwinkle.
Frontiers in Public Health. 2021 Dec 14;9:753487. https://doi.org/10.3389/fpubh.2021.753487 . PMID: 34970525; PMCID: PMC8712464.
Introduction: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection and immunity remain uncertain in populations. The state of Texas ranks 2nd in infection with over 2.71 million cases and has seen a disproportionate rate of death across the state. The Texas CARES project was funded by the state of Texas to estimate the prevalence of SARS-CoV-2 antibody status in children and adults. Identifying strategies to understand natural as well as vaccine-induced antibody response to COVID-19 is critical.
Materials and Methods: The Texas CARES (Texas Coronavirus Antibody Response Survey) is an ongoing prospective population-based convenience sample from the Texas general population that commenced in October 2020. Volunteer participants are recruited across the state to participate in a 3-time point data collection Texas CARES to assess antibody response over time. We use the Roche Elecsys® Anti-SARS-CoV-2 Immunoassay to determine SARS-CoV-2 antibody status.
Results: The crude antibody positivity prevalence in Phase I was 26.1% (80/307). The fully adjusted seroprevalence of the sample was 31.5%. Specifically, 41.1% of males and 21.9% of females were seropositive. For age categories, 33.5% of those 18–34; 24.4% of those 35–44; 33.2% of those 45–54; and 32.8% of those 55+ were seropositive. In this sample, 42.2% (89/211) of those negative for the antibody test reported having had a COVID-19 test.
Conclusions: In this survey we enrolled and analyzed data for 307 participants, demonstrating a high survey and antibody test completion rate, and ability to implement a questionnaire and SARS-CoV-2 antibody testing within clinical settings. We were also able to determine our capability to estimate the cross-sectional seroprevalence within Texas's federally qualified community centers (FQHCs). The crude positivity prevalence for SARS-CoV-2 antibodies in this sample was 26.1% indicating potentially high exposure to COVID-19 for clinic employees and patients. Data will also allow us to understand sex, age and chronic illness variation in seroprevalence by natural and vaccine induced. These methods are being used to guide the completion of a large longitudinal survey in the state of Texas with implications for practice and population health.
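The crude percentages quoted in the Results above follow directly from the reported counts. The short sketch below reproduces that arithmetic only; the helper name `percent` is an assumption, and the adjusted seroprevalence (31.5%) depends on survey weights that are not given here, so it is not reproduced.

```python
def percent(numerator: int, denominator: int, digits: int = 1) -> float:
    """Express numerator/denominator as a percentage rounded to `digits` places."""
    return round(100.0 * numerator / denominator, digits)


# Counts as reported in the abstract.
crude_positivity = percent(80, 307)        # antibody-positive participants in Phase I
tested_among_negative = percent(89, 211)   # antibody-negative participants reporting a prior COVID-19 test

print(crude_positivity, tested_among_negative)
```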
Estimated Prevalence of SARS-CoV-2 Antibodies in the Texas Pediatric Population.
Sarah Messiah, Melissa Valerio-Shewmaker, Stacia DeSantis, Michael Swartz, Ashraf Yaseen, Frances Brito, Harold Kohl, Steven Kelder, Kimberly Aguillard, Onyinye Omega-Njemnobi, Camille Breaux, Jessica Ross, Michael Gonzalez, Shiming Zhang, Leqing Wu, David Lakey, Jennifer Shuford, Stephen Pont, Eric Boerwinkle.
The Lancet 2021, available at SSRN: https://ssrn.com/abstract=3868061 or http://dx.doi.org/10.2139/ssrn.3868061
Abstract
Background: The extent of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection in children and adolescents at a population level in 2021 remains unclear. Children and adolescents have been considered to play an important role in transmission regardless of presence of symptoms. The objective of this study was to estimate the prevalence of SARS-CoV-2 antibody status in children ages 5-to-19 years in a sample from the State of Texas in the presence and absence of symptoms.
Methods: The TX CARES (Texas Coronavirus Antibody Response Survey) is an ongoing prospective population-based sample from the Texas general population (~29·8 million) that commenced in October 2020. For the current analysis, volunteers ages 5-to-19 years were recruited throughout the state from large pediatric healthcare systems, Federally Qualified Healthcare Centers, urban and rural pediatric and family medicine practices, health insurance providers, and a social media campaign. SARS-CoV-2 antibody status was measured via the Roche Elecsys® Anti-SARS-CoV-2 Immunoassay. Prevalence of IgM, IgA, or IgG antibodies was adjusted using sampling weights and post-stratification of age and sex. We report here findings from January 1-June 1, 2021.
Findings: This analysis, which included 503 children ages 5-to-19 years, showed that 38·7% (95% CI, 34·5-43·1) of the sample was SARS-CoV-2 antibody positive in fully adjusted estimates. A total of 30·1% (95% CI, 29·4-31·5) of adults (N=6992) in the sample were antibody positive. Over half (55·8%) of those with positive antibody status were reportedly asymptomatic. Headache (25·4%), congestion/runny nose (20·8%), and fatigue (20·3%) were the most frequently reported symptoms among those with a positive antibody result. The odds of having a positive antibody status were 37% higher in 15-to-19 year olds versus adults (OR 1·37, 95% CI, 1·03-1·83).
Identifying Individualized Risk Profiles for Radiotherapy-Induced Lymphopenia Among Patients With Esophageal Cancer Using Machine Learning.
Cong Zhu, Radhe Mohan, Steven Hsesheng Lin, Goo Jun, Ashraf Yaseen, Xiaoqian Jiang, Han Chen, Qianxia Wang, Wenhua Cao, Brian Hobbs.
JCO Clinical Cancer Informatics. 2021 Sep;5:1044-1053. doi: 10.1200/CCI.21.00098. PMID: 34665662; PMCID: PMC8812653.
Abstract
Purpose: Radiotherapy (RT)-induced lymphopenia (RIL) is commonly associated with adverse clinical outcomes in patients with cancer. Using machine learning techniques, a retrospective study was conducted for patients with esophageal cancer treated with proton and photon therapies to characterize the principal pretreatment clinical and radiation dosimetric risk factors of grade 4 RIL (G4RIL) as well as to establish G4RIL risk profiles.
Methods: A single-institution retrospective data of 746 patients with esophageal cancer treated with photons (n = 500) and protons (n = 246) was reviewed. The primary end point of our study was G4RIL. Clustering techniques were applied to identify patient subpopulations with similar pretreatment clinical and radiation dosimetric characteristics. XGBoost was built on a training set (n = 499) to predict G4RIL risks. Predictive performance was assessed on the remaining n = 247 patients. SHapley Additive exPlanations were used to rank the importance of individual predictors. Counterfactual analyses compared patients' risk profiles assuming that they had switched modalities.
Results: Baseline absolute lymphocyte count and volumes of lung and spleen receiving ≥ 15 and ≥ 5 Gy, respectively, were the most important G4RIL risk determinants. On the testing set, the model achieved a sensitivity of 0.798 and a specificity of 0.667, with an area under the receiver operating characteristic curve (AUC) of 0.783. The G4RIL risk for an average patient receiving protons increased by 19% had the patient switched to photons. Reductions in G4RIL risk were maximized with proton therapy for patients with older age, lower baseline absolute lymphocyte count, and higher lung and heart dose.
Conclusion: G4RIL risk varies for individual patients with esophageal cancer and is modulated by radiotherapy dosimetric parameters. The framework for machine learning presented can be applied broadly to study risk determinants of other adverse events, providing the basis for adapting treatment strategies for mitigation.
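For readers less familiar with the testing-set metrics reported above, the sketch below shows how sensitivity and specificity are computed from binary labels and predictions. The function name and the toy label vectors are hypothetical and are not the study's data.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels coded 1 = event, 0 = no event."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    # Sensitivity: share of true events detected; specificity: share of non-events correctly cleared.
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy example: 4 events and 6 non-events.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)
```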
Recommender system of scholarly papers using public datasets.
ӿJie Zhu, Braja Patra, Hulin Wu, §Ashraf Yaseen.
AMIA Jt Summits Transl Sci Proc. 2021 May 17;2021:672-679. PMID: 34457183; PMCID: PMC8378599
Abstract
The exponential growth of public datasets in the era of Big Data demands new solutions for making these resources findable and reusable. Therefore, a scholarly recommender system for public datasets is an important tool in the field of information filtering. It will aid scholars in identifying prior and related literature to datasets, saving their time, as well as enhance the datasets reusability. In this work, we developed a scholarly recommendation system that recommends research-papers, from PubMed, relevant to public datasets, from Gene Expression Omnibus (GEO). Different techniques for representing textual data are employed and compared in this work. Our results show that term-frequency based methods (BM25 and TF-IDF) outperformed all others including popular Natural Language Processing embedding models such as doc2vec, ELMo and BERT.
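To make the term-frequency approach concrete, here is a minimal sketch of Okapi BM25 scoring in plain Python. It assumes whitespace tokenization, the common defaults k1 = 1.5 and b = 0.75, and a smoothed, non-negative IDF variant; the example texts are invented, and none of these choices are confirmed by the paper.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each tokenized document against the tokenized query with Okapi BM25."""
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for d in docs:
        doc_freq.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            # Smoothed IDF (assumed variant) that stays non-negative for common terms.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical dataset description (query) and paper abstracts (documents).
dataset_text = "time course gene expression".split()
papers = [
    "gene expression time course analysis of yeast".split(),
    "protein structure prediction with neural networks".split(),
    "gene regulation in cancer cells".split(),
]
scores = bm25_scores(dataset_text, papers)
ranking = sorted(range(len(papers)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Papers are then recommended in descending score order; a paper sharing no terms with the dataset description scores zero.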
An Informatics Research Platform to Make Public Gene Expression Time-Course Datasets Reusable for More Scientific Discoveries.
Braja Patra, Babak Soltanalizadeh, Nan Deng, Leqing Wu, Vahed Maroufy, Wenjin Jim Zheng, Kirk Roberts, Hulin Wu, §Ashraf Yaseen.
Database, Volume 2020, 2020, PMID: 33247935 PMCID: PMC7698665 DOI: 10.1093/database/baaa074
Abstract
The exponential growth of genomic/genetic data in the era of Big Data demands new solutions for making these data findable, accessible, interoperable and reusable. In this article, we present a web-based platform named Gene Expression Time-Course Research (GETc) Platform that enables the discovery and visualization of time-course gene expression data and analytical results from the NIH/NCBI-sponsored Gene Expression Omnibus (GEO). The analytical results are produced from an analytic pipeline based on the ordinary differential equation model. Furthermore, in order to extract scientific insights from these results and disseminate the scientific findings, close and efficient collaborations between domain-specific experts from biomedical and scientific fields and data scientists is required. Therefore, GETc provides several recommendation functions and tools to facilitate effective collaborations. GETc platform is a very useful tool for researchers from the biomedical genomics community to present and communicate large numbers of analysis results from GEO. It is generalizable and broadly applicable across different biomedical research areas. GETc is a user-friendly and efficient web-based platform freely accessible at http://genestudy.org/.
A Novel Approach for Propensity Score Matching and Stratification in the Presence of Multiple Treatments: Application to an EHR-Derived Study of Subarachnoid Hemorrhage.
Derek W. Brown, Stacia M. DeSantis, Thomas J. Greene, Vahed Maroufi, Ashraf Yaseen, Hulin Wu, George Williams, Michael D. Swartz.
Statistics in Medicine. 39: 2308– 2323. 2020. https://doi.org/10.1002/sim.8540 . Epub 2020 Apr 16. PMID: 32297677; PMCID: PMC7334100.
Abstract
Currently, methods for conducting multiple treatment propensity scoring in the presence of high-dimensional covariate spaces that result from “big data” are lacking—the most prominent method relies on inverse probability treatment weighting (IPTW). However, IPTW only utilizes one element of the generalized propensity score (GPS) vector, which can lead to a loss of information and inadequate covariate balance in the presence of multiple treatments. This limitation motivates the development of a novel propensity score method that uses the entire GPS vector to establish a scalar balancing score that, when adjusted for, achieves covariate balance in the presence of potentially high-dimensional covariates. Specifically, the generalized propensity score cumulative distribution function (GPS-CDF) method is introduced. A one-parameter power function fits the CDF of the GPS vector and a resulting scalar balancing score is used for matching and/or stratification. Simulation results show superior performance of the new method compared to IPTW both in achieving covariate balance and estimating average treatment effects in the presence of multiple treatments. The proposed approach is applied to a study derived from electronic medical records to determine the causal relationship between three different vasopressors and mortality in patients with non-traumatic aneurysmal subarachnoid hemorrhage. Results suggest that the GPS-CDF method performs well when applied to large observational studies with multiple treatments that have large covariate spaces.
Vasopressor Treatment and Mortality Following Non-Traumatic Subarachnoid Hemorrhage: A Nationwide EHR Analysis.
George Williams, Vahed Maroufy, Laila Rasmy, Derek Brown, Duo Yu, Hai Zhu, Yashar Talebi, Xueying Wang, Emy Thomas, Gen Zhu, Ashraf Yaseen, Hongyu Miao, Luis Leon Novelo, Degui Zhi, Stacia DeSantis, Hongjian Zhu, Jose-Miguel Yamal, David Aguilar, and Hulin Wu.
Neurosurgical Focus. 2020 May 1;48(5):E4. PMID: 32357322 DOI: 10.3171/2020.2.FOCUS191002
Abstract
Objective: Subarachnoid hemorrhage (SAH) is a devastating cerebrovascular condition, not only due to the effect of initial hemorrhage, but also due to the complication of delayed cerebral ischemia (DCI). While hypertension facilitated by vasopressors is often initiated to prevent DCI, which vasopressor is most effective in improving outcomes is not known. The objective of this study was to determine associations between initial vasopressor choice and mortality in patients with nontraumatic SAH.
Methods: The authors conducted a retrospective cohort study using a large, national electronic medical record data set from 2000-2014 to identify patients with a new diagnosis of nontraumatic SAH (based on ICD-9 codes) who were treated with the vasopressors dopamine, phenylephrine, or norepinephrine. The relationship between the initial choice of vasopressor therapy and the primary outcome, which was defined as in-hospital death or discharge to hospice care, was examined.
Results: In total, 2634 patients were identified with nontraumatic SAH who were treated with a vasopressor. In this cohort, the average age was 56.5 years, 63.9% were female, and 36.5% of patients developed the primary outcome. The incidence of the primary outcome was higher in those initially treated with either norepinephrine (47.6%) or dopamine (50.6%) than with phenylephrine (24.5%). After adjusting for possible confounders using propensity score methods, the adjusted OR of the primary outcome was higher with dopamine (OR 2.19, 95% CI 1.70-2.81) and norepinephrine (OR 2.24, 95% CI 1.80-2.80) compared with phenylephrine. Sensitivity analyses using different variable selection procedures, causal inference models, and machine-learning methods confirmed the main findings.
Conclusions: In patients with nontraumatic SAH, phenylephrine was significantly associated with reduced mortality in SAH patients compared to dopamine or norepinephrine. Prospective randomized clinical studies are warranted to confirm this finding.
Gene expression dynamic analysis reveals co-activation of Sonic Hedgehog and epidermal growth factor followed by dynamic silencing.
Vahed Maroufy, Pankil Shah, Arvand Asghari, Nan Deng, Rosemarie Le, Juan Camilo Ramírez, Ashraf Yaseen, W. Zheng, Michihisa Umetani, Hulin Wu.
Oncotarget. 2020 Apr 14;11(15):1358-1372. PMID: 32341755 PMCID: PMC7170495 DOI: 10.18632/oncotarget.27547
Abstract
Aberrant activation of the Sonic Hedgehog (SHH) gene is observed in various cancers. Previous studies have shown a "cross-talk" effect between the canonical Hedgehog signaling pathway and the Epidermal Growth Factor (EGF) pathway when SHH is active in the presence of EGF. However, the precise mechanism of the cross-talk effect on the entire gene population has not been investigated. Here, we re-analyzed publicly available data to study how SHH and EGF cooperate to affect the dynamic activity of the gene population. We used genome dynamic analysis to explore the expression profiles under different conditions in a human medulloblastoma cell line. Ordinary differential equations, equipped with solid statistical and computational tools, were exploited to extract the information hidden in the dynamic behavior of the gene population. Our results revealed that EGF stimulation plays a dominant role, overshadowing most of the SHH effects. We also identified cross-talk genes that exhibited expression profiles dissimilar to that seen under SHH or EGF stimulation alone. These unique cross-talk patterns were validated in a cell culture model. These cross-talk genes identified here may serve as valuable markers to study or test for EGF co-stimulatory effects in an SHH+ environment. Furthermore, these cross-talk genes may play roles in cancer progression, thus they may be further explored as cancer treatment targets.
A Load-Balancing Workload Distribution Scheme for Three-Body Interaction Computation on Graphics Processing Units (GPU).
†Ashraf Yaseen, Hao Ji, and Yaohang Li.
Journal of Parallel and Distributed Computing, 87: 91–101, 2016. https://doi.org/10.1016/j.jpdc.2015.10.003
Abstract
Three-body effects play an important role for obtaining quantitatively high accuracy in a variety of molecular simulation applications. However, evaluation of three-body potentials is computationally costly, generally of O(N^3) where N is the number of particles in a system. In this paper, we present a load-balancing workload distribution scheme for calculating three-body interactions by taking advantage of the Graphics Processing Units (GPU) architectures. Perfect load-balancing is achieved if N is not divisible by 3 and nearly perfect load-balancing is obtained if N is divisible by 3. The workload distribution scheme is particularly suitable for the GPU’s Single Instruction Multiple Threads (SIMT) architecture, where particle’s data accessed by threads can be coalesced into efficient memory transactions. We use two potential energy functions with three-body terms, the Axilrod–Teller potential and the Context-based Secondary Structure Potential, as examples to demonstrate the effectiveness of our workload distribution scheme.
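The cubic cost mentioned above comes from visiting every unordered triple of particles. The following serial reference sketch sums the Axilrod–Teller triple-dipole term over all triples; it is not the paper's GPU workload distribution scheme, and the function names and the unit coefficient `nu` are assumptions for illustration.

```python
import math
from itertools import combinations
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def triple_count(n: int) -> int:
    """Number of unordered particle triples, n choose 3, which grows as O(n^3)."""
    return n * (n - 1) * (n - 2) // 6


def axilrod_teller_energy(positions: Sequence[Point], nu: float = 1.0) -> float:
    """Sum nu * (1 + 3 cos(a) cos(b) cos(c)) / (r_ab * r_ac * r_bc)^3 over all triples."""
    total = 0.0
    for a, b, c in combinations(positions, 3):
        r_ab, r_ac, r_bc = math.dist(a, b), math.dist(a, c), math.dist(b, c)
        # Interior angles of the triangle via the law of cosines.
        cos_a = (r_ab**2 + r_ac**2 - r_bc**2) / (2.0 * r_ab * r_ac)
        cos_b = (r_ab**2 + r_bc**2 - r_ac**2) / (2.0 * r_ab * r_bc)
        cos_c = (r_ac**2 + r_bc**2 - r_ab**2) / (2.0 * r_ac * r_bc)
        total += nu * (1.0 + 3.0 * cos_a * cos_b * cos_c) / (r_ab * r_ac * r_bc) ** 3
    return total


# Equilateral triangle with unit sides: every angle is 60 degrees.
triangle = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.5, math.sqrt(3.0) / 2.0, 0.0)]
energy = axilrod_teller_energy(triangle)
print(triple_count(len(triangle)), energy)
```

A parallel implementation has to split these triples evenly across threads, which is the load-balancing problem the abstract addresses.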
FLEXc: protein flexibility prediction using context-based statistics, predicted structural features, and sequence information.
†Ashraf Yaseen, Mais Nijim, Brandon Williams, Lei Qian, Min Li, Jianxin Wang, and Yaohang Li.
BMC Bioinformatics, vol. 17 Suppl 8, pp. 281, 2016. PMID: 27587065 PMCID: PMC5009531 DOI: 10.1186/s12859-016-1117-3
Abstract
Background: The fluctuation of atoms around their average positions in protein structures provides important information regarding protein dynamics. This flexibility of protein structures is associated with various biological processes. Predicting flexibility of residues from protein sequences is significant for analyzing the dynamic properties of proteins which will be helpful in predicting their functions.
Results: In this paper, an approach of improving the accuracy of protein flexibility prediction is introduced. A neural network method for predicting flexibility in 3 states is implemented. The method incorporates sequence and evolutionary information, context-based scores, predicted secondary structures and solvent accessibility, and amino acid properties. Context-based statistical scores are derived, using the mean-field potentials approach, for describing the different preferences of protein residues in flexibility states taking into consideration their amino acid context. The 7-fold cross validated accuracy reached 61% when context-based scores and predicted structural states are incorporated in the training process of the flexibility predictor.
Conclusions: Incorporating context-based statistical scores with predicted structural states are important features to improve the performance of predicting protein flexibility, as shown by our computational results. Our prediction method is implemented as web service called “FLEXc” and available online at: http://hpcr.cs.odu.edu/flexc.
HuBum: Energy Efficient Hybrid Mobile Storage Systems using Solid States and Buffer Disks.
Mais Nijim and Ashraf Yaseen.
Journal of Computer Communication and Collaboration, 2015. (DOIC: 2292-1036-2015-04-001-59)
Context-based Features Enhance Protein Secondary Structure Prediction Accuracy.
†Ashraf Yaseen and Yaohang Li.
Journal of Chemical Information and Modeling, 54 (3), pp 992–1002, 2014. PMID: 24571803 DOI: 10.1021/ci400647u
Abstract
We report a new approach of using statistical context-based scores as encoded features to train neural networks to achieve secondary structure prediction accuracy improvement. The context-based scores are pseudo-potentials derived by evaluating statistical, high-order inter-residue interactions, which estimate the favorability of a residue adopting certain secondary structure conformation within its amino acid environment. Encoding these context-based scores as important training and prediction features provides a way to address a long-standing difficulty in neural network-based secondary structure predictions of taking interdependency among secondary structures of neighboring residues into account. Our computational results have shown that the context-based scores are effective features to enhance the prediction accuracy of secondary structure predictions. An overall 7-fold cross-validated Q3 accuracy of 82.74% and Segment Overlap (SOV) accuracy of 86.25% are achieved on a set of more than 7987 protein chains with, at most, 25% sequence identity. The Q3 prediction accuracy on benchmarks of CB513, Manesh215, Carugo338, as well as CASP9 protein chains is higher than popularly used secondary structure prediction servers, including Psipred, Profphd, Jpred, Porter (ab initio), and Netsurf. More significant improvement is observed in the SOV accuracy, where more than 4% enhancement is observed, compared to the server with the best SOV accuracy. A Q8 accuracy of >70% (71.5%) is also found in eight-state secondary structure prediction. The majority of the Q3 accuracy improvement is contributed from correctly identifying β-sheets and α-helices. When the context-based scores are incorporated, there are 15.5% more residues predicted with >90% confidence. These high-confidence predictions usually have a rather high accuracy (~95% on average). The three- and eight-state prediction servers (SCORPION) implementing our methods are available online.
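Q3 accuracy, as used above, is the percentage of residues whose three-state label (helix H, strand E, coil C) is predicted correctly. A minimal sketch follows, with a hypothetical function name and made-up label strings rather than any benchmark data.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted three-state label matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    correct = sum(1 for p, o in zip(predicted, observed) if p == o)
    return 100.0 * correct / len(observed)


# Hypothetical 10-residue example with one helix residue mispredicted as coil.
observed = "CHHHHEEECC"
predicted = "CHHHCEEECC"
q3 = q3_accuracy(predicted, observed)
print(q3)
```

Q8 accuracy is computed the same way over eight states, while SOV measures overlap at the level of whole segments and is not covered by this sketch.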
Template-based C8-SCORPION: a protein 8-state secondary structure prediction method using structural information and context-based features.
†Ashraf Yaseen and Yaohang Li.
BMC Bioinformatics,15(Suppl 8):S3, 2014. PMID: 25080939 PMCID: PMC4120151 DOI: 10.1186/1471-2105-15-S8-S3
Abstract
Background: Secondary structures prediction of proteins is important to many protein structure modeling applications. Correct prediction of secondary structures can significantly reduce the degrees of freedom in protein tertiary structure modeling and therefore reduces the difficulty of obtaining high resolution 3D models.
Methods: In this work, we investigate a template-based approach to enhance 8-state secondary structure prediction accuracy. We construct structural templates from known protein structures with certain sequence similarity. The structural templates are then incorporated as features with sequence and evolutionary information to train two-stage neural networks. In case of structural templates absence, heuristic structural information is incorporated instead.
Results: After applying the template-based 8-state secondary structure prediction method, the 7-fold cross-validated Q8 accuracy is 78.85%. Even templates from structures with only 20%~30% sequence similarity can help improve the 8-state prediction accuracy. More importantly, when good templates are available, the prediction accuracy of less frequent secondary structures, such as 3-10 helices, turns, and bends, are highly improved, which are useful for practical applications.
Conclusions: Our computational results show that the templates containing structural information are effective features to enhance 8-state secondary structure predictions. Our prediction algorithm is implemented on a web server named "C8-SCORPION" available at: http://hpcr.cs.odu.edu/c8scorpion.
Software Defined Radio Laboratory Platform for Enhancing Undergraduate Communication and Networking Curricula.
Zhiqiang Wu, Bin Wang, Chi-Hao Cheng, Dr. Deng Cao, and Ashraf Yaseen.
ASEE Conference, 2014.
Dinosolve: A Protein Disulfide Bonding Prediction Server using Context-based Features to Enhance Prediction Accuracy.
†Ashraf Yaseen and Yaohang Li.
BMC Bioinformatics, 14(Suppl 13):S9, 2013. PMID: 24267383 PMCID: PMC3849605 DOI: 10.1186/1471-2105-14-S13-S9
Abstract
Background: Disulfide bonds play an important role in protein folding and structure stability. Accurately predicting disulfide bonds from protein sequences is important for modeling the structural and functional characteristics of many proteins.
Methods: In this work, we introduce an approach of enhancing disulfide bonding prediction accuracy by taking advantage of context-based features. We first derive the first-order and second-order mean-force potentials according to the amino acid environment around the cysteine residues from a large number of cysteine samples. The mean-force potentials are integrated as context-based scores to estimate the favorability of a cysteine residue in disulfide bonding state as well as a cysteine pair in disulfide bond connectivity. These context-based scores are then incorporated as features together with other sequence and evolutionary information to train neural networks for disulfide bonding state prediction and connectivity prediction.
Results: The 10-fold cross validated accuracy is 90.8% at residue-level and 85.6% at protein-level in classifying an individual cysteine residue as bonded or free, which is around 2% accuracy improvement. The average accuracy for disulfide bonding connectivity prediction is also improved, which yields overall sensitivity of 73.42% and specificity of 91.61%.
Conclusions: Our computational results have shown that the context-based scores are effective features to enhance the prediction accuracies of both disulfide bonding state prediction and connectivity prediction. Our disulfide prediction algorithm is implemented on a web server named "Dinosolve" available at: http://hpcr.cs.odu.edu/dinosolve.
Pareto-based Optimal Sampling Method and Its Applications in Protein Structural Conformation Sampling.
Yaohang Li and Ashraf Yaseen.
AAAI Workshop on Artificial Intelligence and Robotics Methods in Computational Biology, Bellevue, 2013. https://cdn.aaai.org/ocs/ws/ws1041/7112-30495-1-PB.pdf
Abstract
Efficiently sampling the protein conformation space is a critical step in de novo protein structure modeling. One of the important challenges in sampling is the inaccuracy of available scoring functions, i.e., a scoring function is not always sufficiently accurate to distinguish the correct conformations from the alternatives, and therefore exploring the very minimum of a scoring function does not necessarily reveal correct conformations. In this paper, we present a Pareto optimal sampling (POS) method to address the inaccuracy problem of scoring functions. The POS method adopts a new computational sampling strategy by exploring diversified conformations on the Pareto optimal front in the function space consisting of multiple scoring functions, representing consensus with different trade-offs among multiple scoring functions. Our computational results in protein loop structure sampling and protein backbone structure sampling have demonstrated the effectiveness of the POS method, where near-natives are found in the ensemble of Pareto-optimal conformations.
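As an illustration of the Pareto optimal front mentioned in this abstract, the following is a minimal sketch of how non-dominated conformations could be selected when each one is scored by several scoring functions. It is not the POS sampling method itself: the function names, the example scores, and the convention that lower scores are better are all assumptions made for this sketch.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if score vector a dominates b.

    Assumption for this sketch: lower is better for every scoring function.
    a dominates b when it is no worse on every score and strictly better
    on at least one.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(scores: List[Sequence[float]]) -> List[int]:
    """Return the indices of score vectors that no other vector dominates."""
    return [
        i
        for i, s in enumerate(scores)
        if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)
    ]


# Hypothetical data: one row per conformation, one column per scoring function.
conformation_scores = [(1.0, 5.0), (2.0, 2.0), (3.0, 3.0), (5.0, 1.0), (4.0, 4.0)]

# Conformations 0, 1 and 3 represent different trade-offs between the two
# scores; conformations 2 and 4 are both dominated by conformation 1.
front = pareto_front(conformation_scores)
print(front)
```

The quadratic pairwise comparison is adequate for small ensembles; larger ensembles would call for a sort-based non-dominated filtering routine.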
Predicting Protein Solvent Accessibility with Sequence, Evolutionary Information and Context-based Features.
†Ashraf Yaseen and Yaohang Li.
Biotechnology and Bioinformatics Symposium, (BIOT2013) Provo, 2013.
Abstract: Solvent-accessible surface areas of residues in proteins are key factors in protein folding. Predicting solvent accessibility from protein sequences is significant for modeling the structural and functional characteristics of many proteins. In this work, we introduce an approach to enhancing solvent accessibility prediction accuracy. We derive pseudo-potentials, by considering high-order inter-residue interactions, according to the amino acid environment around protein residues from a large number of protein samples.
These context-dependent pseudo-potentials are integrated as scores to estimate the favorability of a residue in solvent accessibility state. The context-based scores are then incorporated as features together with other sequence and evolutionary information to train 2-stage neural networks for solvent accessibility prediction. Our computational results have shown that the context-based scores are effective features to enhance the prediction accuracies of protein solvent accessibility. The 7-fold cross-validated Q2 accuracy reached 80.76% when context-based scores are incorporated in the training process of the solvent accessibility predictor.
Template-based Prediction of Protein 8-states Secondary Structures.
†Ashraf Yaseen and Yaohang Li.
IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS2013), New Orleans 2013. DOI: 10.1109/ICCABS.2013.6629216
Abstract:
Accurately predicting protein secondary structures is important to many protein structure modeling applications. In this paper, we investigate a template-based approach to enhance 8-state secondary structure prediction accuracy. The rationale is to construct structural templates from known protein structures with certain sequence similarity. The information contained in templates is then incorporated as features with sequence, evolutionary, and heuristic information to train neural networks. Our computational results show that templates containing structural information are effective features to enhance 8-state secondary structure prediction. A 7-fold cross-validated Q8 score of 78.85% is obtained.
Enhancing Protein Disulfide Bonding Prediction Accuracy with Context-based Features.
†Ashraf Yaseen and Yaohang Li.
Proceedings of Biotechnology and Bioinformatics Symposium, (BIOT2012), Provo, 2012.
Abstract: Accurately predicting protein disulfide bonds from sequences is important for modeling the structural and functional characteristics of many proteins. In this paper, we introduce a new approach to enhance disulfide bonding prediction accuracy. We first generate the first-order and second-order mean-force potentials according to the amino acid environment around cysteine residues from a large number of cysteine samples. The mean-force potentials are integrated as context-based scores to estimate the favorability of a cysteine residue in disulfide bonding state as well as a cysteine pair in disulfide bond connectivity. These context-based scores are then incorporated as features together with other protein sequence and evolutionary information to train neural networks for disulfide bonding state prediction and connectivity prediction. Our computational results have shown that the context-based scores are effective features to enhance the prediction accuracies of both disulfide bonding state prediction and connectivity prediction. The 10-fold cross-validated accuracy is 90.8% at the residue level and 85.6% at the protein level in classifying an individual cysteine residue as bonded or free, an improvement of around 2%. The average accuracy for disulfide bonding connectivity prediction is improved as well, yielding an overall sensitivity of 73.42% and specificity of 91.61%.
Accelerating Knowledge-based Energy Evaluation in Protein Structure Modeling with Graphics Processing Units.
†Ashraf Yaseen and Yaohang Li.
Journal of Parallel and Distributed Computing, 72(2): 297-307, 2012. https://doi.org/10.1016/j.jpdc.2011.10.005
Abstract
Evaluating the energy of a protein molecule is one of the most computationally costly operations in many protein structure modeling applications. In this paper, we present an efficient implementation of knowledge-based energy functions by taking advantage of recent Graphics Processing Unit (GPU) architectures. We use DFIRE, a knowledge-based all-atom potential, as an example to demonstrate our GPU implementations on the latest NVIDIA Fermi architecture. A load-balancing workload distribution scheme is designed to assign computations of pair-wise atom interactions to threads to achieve perfect or near-perfect load balancing in the symmetric N-body problem in DFIRE. Reorganizing atoms in the protein also improves cache efficiency on the Fermi GPU architecture, which is particularly effective for small proteins. Our DFIRE implementation on GPU (GPU-DFIRE) has exhibited a speedup of up to ~150 on NVIDIA Quadro FX3800M and ~250 on NVIDIA Tesla M2050 compared to the serial DFIRE implementation on CPU. Furthermore, we show that protein structure modeling applications, including a Monte Carlo sampling program and a local optimization program, can benefit from GPU-DFIRE with little programming modification but significant computational performance improvement.
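To make the load-balancing idea in this abstract concrete, the sketch below shows one common way to split the N(N-1)/2 unordered atom pairs of a symmetric N-body problem evenly among workers. This is not the GPU-DFIRE scheme from the paper, whose details are not given here; it is a CPU-side Python illustration, and the function names and the contiguous-range assignment are assumptions made for this sketch.

```python
from math import isqrt
from typing import List, Tuple


def pair_from_index(k: int) -> Tuple[int, int]:
    """Map a linear pair index k >= 0 to an atom pair (i, j) with j < i.

    Pairs are enumerated as (1,0), (2,0), (2,1), (3,0), ... so that row i
    starts at linear index i * (i - 1) // 2. Integer square root avoids
    floating-point rounding errors.
    """
    i = (1 + isqrt(1 + 8 * k)) // 2
    j = k - i * (i - 1) // 2
    return i, j


def pairs_for_worker(n_atoms: int, worker: int, n_workers: int) -> List[Tuple[int, int]]:
    """Return the contiguous block of pairs assigned to one worker.

    Each symmetric interaction is visited exactly once, and the number of
    pairs per worker differs by at most one.
    """
    total_pairs = n_atoms * (n_atoms - 1) // 2
    start = worker * total_pairs // n_workers
    stop = (worker + 1) * total_pairs // n_workers
    return [pair_from_index(k) for k in range(start, stop)]


# Hypothetical example: 5 atoms (10 pairs) split among 3 workers.
for w in range(3):
    print(w, pairs_for_worker(5, w, 3))
```

Assigning by linear pair index rather than by atom row keeps the work per worker nearly equal, whereas a per-row split would give the worker holding the last rows far more pairs than the one holding the first.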
DEMCMC-GPU: An Efficient Multi-Objective Optimization Method with GPU Acceleration on the Fermi Architecture.
Weihang Zhu, Ashraf Yaseen and Yaohang Li.
New Generation Computing, 29(2): 163-184, 2011.
In this paper, we present an efficient method implemented on Graphics Processing Unit (GPU), DEMCMC-GPU, for multi-objective continuous optimization problems. The DEMCMC-GPU kernel is the DEMCMC algorithm, which combines the attractive features of Differential Evolution (DE) and Markov Chain Monte Carlo (MCMC) to evolve a population of Markov chains toward a diversified set of solutions at the Pareto optimal front in the multi-objective search space. With parallel evolution of a population of Markov chains, the DEMCMC algorithm is a natural fit for the GPU architecture. The implementation of DEMCMC-GPU on the pre-Fermi architecture can lead to a speedup of ~25 on a set of multi-objective benchmark function problems, compared to the CPU-only implementation of DEMCMC. By taking advantage of the new cache mechanism in the emerging NVIDIA Fermi GPU architecture, an efficient sorting algorithm on GPU, and efficient parallel pseudorandom number generators, the speedup of DEMCMC-GPU can be further improved to ~100.
Performance Evaluation of Oracle Semantic Technologies with respect to User Defined Rules.
†Ashraf Yaseen, Kurt J. Maly, Steven J. Zeil and Mohammad Zubair.
Proceedings of Database and Expert Systems Applications, DEXA, International Workshops, Toulouse, France, August 29, 2011. DOI: 10.1109/DEXA.2011.65
Abstract:
Ontology-based reasoning systems have a native rule base but also allow for the addition of application domain-specific rules. Previous work comparing the performance of these systems mainly considered performance with supported rule bases. In this paper, we present an evaluation of Oracle as an ontology reasoning system with respect to domain-specific rule bases, in the context of a question/answer system called Science Web.
Books
Statistics and Machine Learning Methods for EHR Data, From Data Extraction to Data Analytics.
Hulin Wu, Jose-Miguel Yamal, Ashraf Yaseen, and Vahed Maroufy.
United States: CRC Press, 2020.
Editor & Co-author of chapters:
Ch2: EHR Project Management
Ch3: EHR Databases and Data Management: Data Query and Extraction
Ch9: Neural Network and Deep Learning Methods for EHR Data
Ch10: EHR Data Analytics and Predictions: Machine Learning Methods
Invited Talks
Lack of antibody response in those vaccinated or with natural exposure. [Session: Examining SARS-CoV-2 Response Over Time Using a Longitudinal Design: Texas CARES Survey].
American Public Health Association (APHA). November 7, 2022. Boston, MA.
Texas CARES Community Update, Lessons Learned and Next Steps.
Healthier Texas Summit. October 21, 2022. Austin, TX.
Epidemiology Special Session: Understanding the Human Antibody Response to SARS-CoV-2 in Diverse Populations: The Texas Coronavirus Antibody Response Survey (CARES). Data Management and Visualization.
American Public Health Association (APHA). October 25, 2021, Denver, Colorado.
Understanding the Human Antibody Response to SARS-CoV-2 in Diverse Populations: The Texas Coronavirus Antibody Response Survey (CARES).
World Health Organization (WHO) Solidarity II. August 27, 2021.
Texas C.A.R.E.S. Coronavirus Antibody REsponse Survey. Texas CARES Portal: An Interactive Platform with Visualizations, Maps, and Summary Statistics to Illustrate and Understand the Human Response to COVID-19.
Texas Department of State Health Services (DSHS) Grand Rounds. June 16, 2021.
Posters
*Jie Zhu, Braja Patra, Hulin Wu and Ashraf Yaseen. Recommender system of scholarly papers using public datasets. ICSA Applied Statistics Symposium. Houston, Texas. December 13-16, 2020.
*Praveenraj Uthamarajan and Ashraf Yaseen, “Analysis of Systems using Distributed Consensus Algorithms”. College of Engineering-TAMUK, 2017.
*Megha Lalluvadia and Ashraf Yaseen, "Applications of Text Classification". College of Engineering-TAMUK, 2017.
*Varun Agrawal, Gaurav Dokania, and Ashraf Yaseen, “Predicting protein flexibility and disorder”. Texas A&M University System 12th Annual Pathways Student Research Symposium, Corpus Christi, TX, 2015.
*Anurag Gupta, Hridya Gopalakrishna, and Ashraf Yaseen, “Predicting protein solvent accessibility”. Texas A&M University System 12th Annual Pathways Student Research Symposium, Corpus Christi, TX, 2015.
†Ashraf Yaseen, Mais Nijim, Brandon Williams, Lei Qian, and Yaohang Li, “Predicting Protein Flexibility using Context-based Statistics, Predicted Structural Features, and Sequence Information”. 11th International Symposium on Bioinformatics Research and Applications (ISBRA), Norfolk, Virginia, 2015. (Awarded #1 best poster).
†Ashraf Yaseen, Akeem Edwards and Yaohang Li, “Improving Intermediate Steps in ab initio Protein Molding”, 14th Annual Tidewater Student Research Poster Session at Christopher Newport University. Nov, 2012.