Are you ADEPT at explaining complex data science concepts to non-data scientists? If you want to improve your communication skills, this is the workshop for you. Last week I was in a meeting where Virginia Tech data scientists were presenting results to a company that is very early in its data science journey. One of the project leaders commented, “Well, we barely understand some of the concepts you are discussing, so there is no way that our lower-level operations team will get this, especially since English is their second language.” This comment illustrates the importance of communication skills in team-based data science. The end users of your modeling and analysis have to understand how the information you provide them was derived and how it should and shouldn’t be used. You will also have ongoing team dialogues throughout projects that rely on some level of shared understanding of data science concepts.
But aren’t great communicators “just born with it”? Certainly, some people are naturally talented communicators, but everyone can learn tips and techniques to improve. ADEPT is one such framework: it stands for analogy, diagram, example, plain English, and technical definition. Data scientists typically excel at the technical-definition element of ADEPT but have far less practice developing the other four components. This workshop includes an introduction to the framework, examples, and a group activity where you design your own ADEPT explanation for the data science concept of your choice.
This workshop will cover the fundamentals of using Git to track changes across scripts and folders, along with the steps needed to commit and push to GitHub. Before the workshop, participants should have RStudio (Posit), Git, and GitHub Desktop installed and should have created a GitHub username. During the workshop, participants will connect a local folder to GitHub and make their first commit and push to their online repository.
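For orientation, the core sequence participants will walk through looks roughly like the following. This is a minimal sketch expressed as a small Python wrapper around the Git command line; the repository URL, file name, and branch name are placeholders that each participant replaces with their own.

```python
# Minimal sketch of the workshop workflow: connect a local folder to GitHub,
# then make a first commit and push. URL, file name, and branch are placeholders.
import subprocess

def git(*args):
    """Run a git command in the current folder and echo it."""
    print("$ git", *args)
    subprocess.run(["git", *args], check=True)

git("init")                                                   # turn the local folder into a repository
git("remote", "add", "origin",
    "https://github.com/YOUR-USERNAME/YOUR-REPO.git")         # connect to the online repository (placeholder URL)
git("add", "analysis.R")                                      # stage a script (placeholder file name)
git("commit", "-m", "First commit: add analysis script")      # record the change locally
git("push", "-u", "origin", "main")                           # publish to GitHub (default branch may be 'main' or 'master')
```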
Data sources and the volume of data available for driving discovery and informing decisions have increased substantially over time. This increase has produced an evolving data landscape ripe for the expertise of statisticians and data scientists. We must play a key role in ensuring the appropriate use of data and the soundness of conclusions drawn from analyses of those data. In this talk, I will explore the data landscape, identifying challenges and opportunities and highlighting our contributions and impact.
With the growing reliance on open-world or outsourced training data for building machine learning models, the risk of malicious data manipulation has increased. This talk will explore backdoor attacks, in which an attacker manipulates the training data to implant hidden malicious functionality into a model. We will discuss our research on understanding vulnerabilities to backdoor attacks and on developing effective countermeasures to protect machine learning systems.
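As a toy illustration of the kind of manipulation involved (not the specific attacks or defenses studied in our research), the sketch below stamps a small trigger patch onto a fraction of training images and relabels them to an attacker-chosen class; the array shapes, trigger pattern, and poisoning rate are illustrative assumptions.

```python
# Toy sketch of training-data poisoning for a backdoor: a small fraction of
# images receives a fixed "trigger" patch and is relabeled to a target class.
import numpy as np

def poison(images, labels, target_class=0, rate=0.05, rng=np.random.default_rng(0)):
    """images: (n, H, W) arrays with values in [0, 1]; labels: (n,) integer classes."""
    images, labels = images.copy(), labels.copy()
    idx = rng.choice(len(images), size=int(rate * len(images)), replace=False)
    images[idx, -3:, -3:] = 1.0          # stamp a 3x3 white patch in the corner (the trigger)
    labels[idx] = target_class           # associate the trigger with the attacker-chosen class
    return images, labels

# A model trained on the poisoned set behaves normally on clean inputs but
# tends to predict `target_class` whenever the trigger patch is present.
```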
Code translation is a fundamental task that involves converting source code from one programming language to another. While traditional rule-based code translation is time-consuming and demands substantial domain expertise, recent advances in machine learning, particularly neural machine translation (NMT) methods, have shown promising results in automating code translation. However, the scarcity of parallel code data, which is essential for training and evaluating code translation models, remains a significant challenge. To address this challenge, this paper formalizes the data augmentation approach for code translation as a filtered back-translation framework, which consists of two modules: a hypothesis generator and a filterer. Unlike existing work, the proposed approach optimizes for parallelity and correctness separately, achieving a better balance between data quality and cost. The approach leverages static code analysis for parallelity and compilation for correctness, enabling the generation of high-quality pseudo-parallel code data at low cost. Extensive experiments show that the proposed approach is particularly effective for low-resource languages and can improve the performance of a weak generator in an iterative manner without requiring unit tests or strong domain expertise.
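A schematic sketch of the filtered back-translation loop is given below. The `generator`, `is_parallel`, and `compiles` objects are stand-ins for the hypothesis generator, the static-analysis parallelity check, and the compilation check; they are not APIs from the paper, only placeholders showing the control flow.

```python
# Schematic sketch of filtered back-translation: generate candidate source-side
# translations for monolingual target-language code, keep only candidates that
# pass the parallelity and correctness filters, and retrain the generator.
def filtered_back_translation(monolingual_target, generator, is_parallel, compiles, rounds=3):
    """Grow a pseudo-parallel corpus and iteratively improve a weak generator."""
    corpus = []
    for _ in range(rounds):
        for tgt in monolingual_target:
            src = generator.translate(tgt, direction="target->source")  # hypothesis generation
            if is_parallel(src, tgt) and compiles(src):                 # filter: parallelity + correctness
                corpus.append((src, tgt))
        generator.train(corpus)   # use the filtered pseudo-parallel data to retrain the generator
    return corpus
```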
This project works in conjunction with BlackBRAND and their 150 Year Plan to bolster economic and social progress for Black communities in the Hampton Roads region. We sought to expand the data and analysis for their Media and Entertainment pillar by using natural language processing to measure positive and negative sentiment, as well as racial diversity, in media across multiple local news sources. With guidance from Dr. Wenskovitch, we wrote an algorithm to measure the sentiment of nearly 200 articles about arts and entertainment from local sources. We found a strong positive sentiment across these sources that consistently outweighed the negative sentiment, and the algorithm lets us track how diversity and inclusion in local media change over time. Finally, we expanded the BlackBRAND dashboard to display our findings. We hope that this expanded dashboard will allow BlackBRAND to further their mission through a visualized narrative and to use these tools to build similar narratives for their other pillars.
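For readers unfamiliar with sentiment scoring, the snippet below is a generic illustration using NLTK's VADER analyzer; it is not the project's actual algorithm, and the article texts shown are placeholders.

```python
# Generic illustration of scoring article sentiment with NLTK's VADER analyzer.
# `articles` is a placeholder list standing in for the scraped local-news texts.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

articles = ["Local artists celebrated at the Hampton Roads arts festival ..."]  # placeholder text
scores = [sia.polarity_scores(text) for text in articles]

# Each score contains 'pos', 'neg', 'neu', and 'compound' components; averaging
# the positive and negative components across sources is one way to compare them.
mean_pos = sum(s["pos"] for s in scores) / len(scores)
mean_neg = sum(s["neg"] for s in scores) / len(scores)
print(f"mean positive: {mean_pos:.3f}, mean negative: {mean_neg:.3f}")
```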
The purpose of this work is to design a modeling framework to hindcast subterranean nest temperatures back to an initial lay date, given hourly nest temperature data from the date of nest discovery forward. The hindcasts from this model will be used in a future project investigating the effect of the nest thermal environment on hatching success. Publicly available climatic data from the Daymet Daily Surface Weather and Climatological Summaries and the PERSIANN (Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks) system developed by the Center for Hydrometeorology and Remote Sensing (CHRS) at the University of California, Irvine (UCI) were used to improve the quality of hindcasts relative to those produced from the incomplete series of iButton data logger temperatures alone. Several models were fitted and tested on one nest to find the best approach. The process of model fitting using the most informative predictors was then automated to expand fitting and testing to the other 199 nests. Hindcasts produced from a linear model and adjusted using residual hindcasts from an AR(2) model of the residuals proved reasonably accurate for most of the nests.
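A minimal sketch of the final modeling approach is shown below, assuming an hourly data frame with a nest temperature column and Daymet/PERSIANN covariate columns; the file and column names are placeholders, and the hindcasting step is outlined only in comments.

```python
# Minimal sketch: linear model for nest temperature plus an AR(2) model of its
# residuals. Column and file names are placeholders, not the study's actual data.
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA

df = pd.read_csv("nest_001.csv", parse_dates=["datetime"], index_col="datetime")  # placeholder file

# 1. Linear model over the observed (post-discovery) period, using climate covariates.
X = sm.add_constant(df[["air_temp", "precip"]])
linear_fit = sm.OLS(df["nest_temp"], X).fit()

# 2. AR(2) model of the linear-model residuals.
resid_fit = ARIMA(linear_fit.resid, order=(2, 0, 0)).fit()

# 3. Hindcast idea: run the AR(2) "forecasts" backward in time toward the lay
#    date and add them to linear-model predictions built from the pre-discovery
#    climate covariates, yielding adjusted hindcasts of nest temperature.
```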
Glacial isostatic adjustment (GIA) describes the response of the solid Earth, oceans, and gravitational field to the spatio-temporal evolution of global ice sheets during a glacial cycle. It is proposed that the Chesapeake Bay is subsiding due to the collapse of a glacial forebulge in response to the melting of the Laurentide Ice Sheet following the Last Glacial Maximum. As a result, the Chesapeake Bay is a hot spot of relative sea-level rise along the North American Atlantic Coast. We evaluate the influence of GIA on vertical land motion and sea-level change in the Chesapeake Bay using at least three glacial retreat models and a range of Earth model structure parameters. We hypothesize that GIA is contributing to land subsidence and sea-level rise in the Chesapeake Bay, with a range of estimates depending on input parameters and structural differences among GIA models. We use the open-source software SELEN4.0 (a SealEveL EquatioN solver) to investigate the effects of GIA using a suite of radial viscosity structures and glacial retreat models. We analyze ‘glacial isostatic adjustment fingerprints’ of vertical displacement and sea-level change. Further, we evaluate the uncertainties of GIA modeling associated with input parameters and structural differences among GIA models using an ensemble approach. Estimated rates of vertical displacement and present-day sea-level change range from approximately -0.5 to -2.0 mm/yr and 0.5 to 2.0 mm/yr, respectively, depending on the ice sheet model. We conclude that, overall, GIA is producing negative vertical land motion in the Chesapeake Bay, contributing to accelerated rates of sea-level rise in this region; however, a number of uncertainties should be taken into account in GIA modeling. Accurate estimates of vertical displacement and sea-level rise carry important economic, ecological, and coastal hazard implications for the densely populated Chesapeake Bay region.
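The ensemble bookkeeping behind the reported ranges can be sketched as follows; the ice-model and viscosity-profile names are placeholders, and the per-member rates (here drawn at random from the reported range purely for illustration) would in practice come from the individual SELEN4.0 runs.

```python
# Schematic sketch of the ensemble summary: one vertical-land-motion rate per
# (ice model, viscosity profile) member at a Chesapeake Bay site, then the spread.
import itertools
import numpy as np

rng = np.random.default_rng(0)
ice_models = ["ICE-A", "ICE-B", "ICE-C"]          # placeholder names for glacial retreat models
viscosity_profiles = ["VM-1", "VM-2", "VM-3"]     # placeholder radial viscosity structures

rates = {}
for ice, visc in itertools.product(ice_models, viscosity_profiles):
    # Placeholder value drawn from the reported range; in practice this would
    # be read from the SELEN4.0 output for this ensemble member.
    rates[(ice, visc)] = rng.uniform(-2.0, -0.5)

values = np.array(list(rates.values()))
print(f"vertical land motion rate: {values.min():.2f} to {values.max():.2f} mm/yr across the ensemble")
```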
There is a fundamental tension between the calibration and boldness of probability predictions about forthcoming events. Predicted probabilities are considered well calibrated when they are consistent with the relative frequency of the events they aim to predict. However, well calibrated predictions are not necessarily useful. Predicted probabilities are considered bolder when they are further from the base rate and closer to the extremes of 0 or 1. Predictions that are reasonably bold while maintaining calibration are more useful for decision making than those that achieve only one or the other. We develop Bayesian estimation and hypothesis-testing methodology with a likelihood suited to the probability calibration problem. Our approach effectively identifies and corrects miscalibration. Additionally, it allows users to maximize boldness while maintaining a user-specified level of calibration, providing an interpretable tradeoff between the two. While we demonstrate the practical capabilities of this methodology by comparing hockey pundit predictions, the approach is widely applicable across many fields.
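The tension itself is easy to see numerically. The sketch below (which is not the Bayesian methodology developed in this work) simulates calibrated predictions and then exaggerates their distance from the base rate: boldness rises while the calibration gap grows.

```python
# Small numerical illustration of the calibration/boldness tension.
import numpy as np

rng = np.random.default_rng(1)
true_p = rng.uniform(0.05, 0.95, size=5000)                    # hypothetical event probabilities
outcomes = rng.binomial(1, true_p)                             # observed events
calibrated = true_p                                            # calibrated by construction
base = outcomes.mean()
bolder = np.clip(base + 2.0 * (calibrated - base), 0.01, 0.99) # exaggerate distance from the base rate

def boldness(p):
    return np.abs(p - base).mean()                             # average distance from the base rate

def calibration_gap(p, bins=10):
    """Mean absolute gap between predicted probability and observed frequency, per equal-width bin."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    idx = np.clip(np.digitize(p, edges[1:-1]), 0, bins - 1)
    gaps = [abs(p[idx == b].mean() - outcomes[idx == b].mean())
            for b in range(bins) if np.any(idx == b)]
    return float(np.mean(gaps))

for name, p in [("calibrated", calibrated), ("bolder", bolder)]:
    print(f"{name:10s} boldness={boldness(p):.3f}  calibration gap={calibration_gap(p):.3f}")
```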
Factor models are widely used to identify meaningful latent structures in multivariate data. Here, we introduce a Bayesian Clustering Factors Model (BCFM) that defines clusters through a Bayesian factor model combined with a Gaussian mixture model. BCFM can be applied to high-dimensional multivariate data to reduce its dimensionality. In our model, the means and covariances of the common factors differ across clusters, but the factor loadings remain the same. To prevent the clusters from swapping labels during MCMC, we constrain the factor covariance of the largest cluster at initialization to be diagonal, while the other clusters are allowed non-diagonal covariance structures. In this way, the common factors are defined with respect to the largest cluster. In the applications we consider, it is reasonable to assume that the interpretations of the factors are identical across clusters, and therefore the clusters share the same factor loadings.
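A generative sketch of this structure (simulation only, not the MCMC estimation) is shown below: clusters share one loading matrix, factor means and covariances differ by cluster, and the largest (reference) cluster has a diagonal factor covariance. All dimensions and parameter values are illustrative.

```python
# Simulate data from a two-cluster mixture of factor models with shared loadings.
import numpy as np

rng = np.random.default_rng(42)
p, k, n = 20, 3, 1000                                   # observed dimensions, factors, samples
Lambda = rng.normal(size=(p, k))                        # factor loadings shared across clusters
Psi = np.diag(rng.uniform(0.2, 0.5, size=p))            # idiosyncratic error variances

weights = [0.6, 0.4]                                    # the first (largest) cluster anchors the factors
factor_means = [np.zeros(k), np.array([1.5, -1.0, 0.5])]
factor_covs = [np.diag([1.0, 0.8, 0.6]),                # diagonal for the largest cluster (identifiability)
               np.array([[1.0, 0.3, 0.0],
                         [0.3, 1.0, 0.2],
                         [0.0, 0.2, 1.0]])]             # non-diagonal allowed for the other cluster

z = rng.choice(len(weights), size=n, p=weights)         # cluster labels
factors = np.stack([rng.multivariate_normal(factor_means[c], factor_covs[c]) for c in z])
y = factors @ Lambda.T + rng.multivariate_normal(np.zeros(p), Psi, size=n)   # observed data
```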
The northern Western Branch of the East African Rift System (EARS) consists of two segments: a magma-rich segment and a magma-poor segment. The magma-poor segment is found in the western and northwestern regions of Uganda, whereas the magma-rich segment is located in the southwestern region of Uganda. In this study, we investigate magma-poor rifting processes using the mantle convection and lithospheric dynamics code ASPECT together with GNSS (GPS) measurements. Our study has three main objectives. First, we investigate sources of melt below the lithosphere by modeling melt generation and lithospheric modulated convection; results indicate that melt is unlikely to be the weakening mechanism for the magma-poor segment. Second, in our ongoing work, we investigate the role of pre-existing structures in the initiation of magma-poor rifting segments through comparisons between predicted (modeled) and observed fault offsets. Fault locations are constrained by observed fault traces and assumed average dips, which are parameterized in the Geodynamic World Builder (GWB) software package. The GWB’s mesh-independent representation of the faults provides the initial conditions for the ASPECT simulations, which achieve high resolution near the faults using adaptive mesh refinement. Third, future work will focus on constraining the kinematics of the northern Western Branch using GNSS (GPS) data and block kinematic modeling with the TDEFNODE software. This research will help advance our understanding of magma-poor continental rifting processes.
Figure: Map showing the location of the study site. The plate boundaries are shown as a red dashed line. Earthquakes from the National Earthquake Information Center catalog from 2000 to 2022 are shown as small white stars, triangles, and diamonds. Holocene volcanoes are red triangles. Major faults in the region are represented by black, red, magenta, blue, brown, and gray lines. BF = Bunia Fault; BWF = Bwamba Fault; TF = Tonya Fault; TBF = North Toro Bunyoro Fault; RWF = Rwimi-Wasa Fault; GF = George Fault.
One of the challenges in contrastive learning is the selection of appropriate hard negative examples in the absence of label information. Random sampling or importance sampling methods based on feature similarity often lead to sub-optimal performance. In this work, we introduce UnDiMix, a hard negative sampling strategy that takes into account anchor similarity, model uncertainty, and diversity. Experimental results on several benchmarks show that UnDiMix improves negative sample selection and, subsequently, downstream performance when compared to state-of-the-art contrastive learning methods.
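To make the three ingredients concrete, the sketch below combines anchor similarity, predictive uncertainty, and diversity into a single hard-negative score. This is a hypothetical illustration of the idea, not the actual UnDiMix formulation; the weighting scheme and diversity measure are assumptions.

```python
# Hypothetical hard-negative scoring combining similarity, uncertainty, and diversity.
import numpy as np

def hard_negative_scores(anchor, candidates, probs, alpha=1.0, beta=1.0, gamma=1.0):
    """anchor: (d,); candidates: (n, d) L2-normalized embeddings; probs: (n, c) predictive probabilities."""
    similarity = candidates @ anchor                                   # closer to the anchor = harder negative
    uncertainty = -(probs * np.log(probs + 1e-12)).sum(axis=1)         # predictive entropy per candidate
    centroid = candidates.mean(axis=0)
    diversity = np.linalg.norm(candidates - centroid, axis=1)          # spread relative to the candidate pool
    return alpha * similarity + beta * uncertainty + gamma * diversity # higher score = preferred negative

# Negatives could then be drawn, for example, in proportion to softmax(scores).
```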
Genome-wide association studies (GWAS) aim to identify genetic variants associated with a given phenotype. Single marker analysis (SMA) based on linear mixed models (LMMs) is a common approach to GWAS analysis. However, LMM-based SMA has a high false positive rate and cannot be directly applied to non-Gaussian phenotypes. We present a novel Bayesian method to find single nucleotide polymorphisms (SNPs) associated with non-Gaussian phenotypes in GWAS. To analyze non-Gaussian phenotypes we use generalized linear mixed models (GLMMs), and we call our method Bayesian GLMMs for GWAS (BG2). To deal with the high dimensionality of GWAS analysis, we propose novel nonlocal priors specifically tailored to GLMMs and develop related fast approximate computations for Bayesian model selection. To search through hundreds of thousands of possible SNPs, BG2 uses a two-step procedure: first, BG2 screens for candidate SNPs; second, BG2 performs a model search that considers all screened candidate SNPs as possible regressors.
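The two-step control flow can be sketched as follows. This is a generic illustration only: plain logistic regressions and BIC stand in for BG2's GLMMs, nonlocal priors, and fast approximate Bayesian model selection, purely to show the screen-then-search structure.

```python
# Generic screen-then-search sketch for a binary phenotype (stand-in methods only).
import numpy as np
import statsmodels.api as sm

def screen_and_search(snps, phenotype, screen_keep=50):
    """snps: (n, m) genotype matrix; phenotype: (n,) binary outcome."""
    n, m = snps.shape
    # Step 1: screen SNPs one at a time and keep the most promising candidates.
    pvals = []
    for j in range(m):
        X = sm.add_constant(snps[:, j])
        pvals.append(sm.Logit(phenotype, X).fit(disp=0).pvalues[1])
    candidates = np.argsort(pvals)[:screen_keep]
    # Step 2: greedy forward model search over the screened candidates.
    selected, best_bic, improved = [], np.inf, True
    while improved:
        improved = False
        for j in candidates:
            if j in selected:
                continue
            X = sm.add_constant(snps[:, selected + [j]])
            bic = sm.Logit(phenotype, X).fit(disp=0).bic
            if bic < best_bic:
                best_bic, best_j, improved = bic, j, True
        if improved:
            selected.append(best_j)
    return selected
```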