Day 1 (3/5) 10:00 - 10:50
An information geometrical structure of a determinantal point process
We investigate the information geometrical structure of a determinantal point process (DPP). We show that a DPP is embedded in the exponential family of log-linear models. The extent of deviation from an exponential family is analyzed using the e-embedding curvature tensor, which identifies the partially flat parameters of a DPP. On the basis of this embedding structure, an information-geometrical relationship between a marginal kernel and an L-ensemble kernel is discovered. This is joint work with Hideitsu Hino (ISM).
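For context, the standard relationship between an L-ensemble kernel $L$ and its marginal kernel $K$ (a background fact, not the information-geometric result itself) is

\[
K \;=\; L(L+I)^{-1} \;=\; I - (L+I)^{-1},
\qquad
P(\mathbf{X} = A) \;=\; \frac{\det(L_A)}{\det(L+I)},
\qquad
P(A \subseteq \mathbf{X}) \;=\; \det(K_A).
\]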
Algorithms of finding independent holes in a network
Betti numbers are the best known topological indices and are the numbers of independent holes of specified dimensions. Counting Betti numbers has become a common practice in computational topology, but identifying the independent holes of an arbitrary dimension remains largely unexplored. We propose three algorithms to find independent holes in an undirected network. The global algorithm utilizes the entire boundary matrices of the network and identifies a complete list of independent holes. Both semi-local and local algorithms divide the network into many small neighborhood subgraphs, and employ the global algorithm to find the locally independent holes in selected neighborhood subgraphs but apply different criteria to filter out the dependent holes on the list. The semi-local algorithm exploits the block structure of the boundary matrices and finds a (not necessarily complete) list of globally independent holes. The local algorithm narrows down locally independent holes without using information about the entire boundary matrices and finds a list of putatively independent holes. We prove the accuracy of both global and semi-local algorithms and completeness of the global algorithm, assess their computational complexity, and experimentally validate these algorithms on five networks of varying sizes in dimensions 0-5.
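As background for the rank computations that Betti numbers rest on (a minimal sketch with real coefficients, not the authors' three algorithms): the k-th Betti number can be read off from the ranks of the boundary matrices.

import numpy as np

def betti(boundary_k, boundary_k1, n_k):
    """k-th Betti number: dim ker(d_k) - rank(d_{k+1}).

    boundary_k  : boundary matrix d_k mapping k-chains to (k-1)-chains (None for k = 0)
    boundary_k1 : boundary matrix d_{k+1} mapping (k+1)-chains to k-chains (None if absent)
    n_k         : number of k-simplices (columns of d_k)
    """
    rank_k = 0 if boundary_k is None else np.linalg.matrix_rank(boundary_k)
    rank_k1 = 0 if boundary_k1 is None else np.linalg.matrix_rank(boundary_k1)
    return n_k - rank_k - rank_k1

# Hollow triangle (3-cycle): one connected component, one 1-dimensional hole.
d1 = np.array([[-1,  0, -1],
               [ 1, -1,  0],
               [ 0,  1,  1]])        # edges -> vertices
print(betti(None, d1, 3))            # beta_0 = 1
print(betti(d1, None, 3))            # beta_1 = 1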
Day 1 (3/5) 11:10 - 12:00
An image comparison method based on local pixel clustering
Image comparison is a stepping stone for image monitoring, which has wide applications in satellite imaging, medical research, defense, and many other areas. Since image intensity functions are discontinuous and the observed images often contain noise, image comparison is a challenging problem. Most state-of-the-art methods in the literature are intensity-based, which is often inappropriate in real-life situations where minor changes in the background do not indicate a meaningful change in the images as long as the boundaries of the image objects remain unaltered. In this talk, I will discuss a feature-based image comparison method based on local pixel clustering. Numerical examples and statistical properties show that this method works well in many real-life applications.
AI-driven Integration of Multimodal Imaging Pixel Data and Genome-wide Genotype Data Enhances Precision Health for Type 2 Diabetes: Insights from a Large-scale Biobank Study
The rising prevalence of Type 2 Diabetes (T2D) presents a critical global health challenge. Effective risk assessment and prevention strategies improve patient quality of life and alleviate national healthcare expenditures. Integrating medical imaging and genetic data from extensive biobanks, driven by artificial intelligence (AI), revolutionizes precision and smart health initiatives. This study applied these principles to T2D by analyzing medical images (abdominal ultrasonography and bone density scans) alongside whole-genome single nucleotide variations in 17,785 Han Chinese participants from the Taiwan Biobank. Rigorous data cleaning and preprocessing procedures were applied. Imaging analysis utilized densely connected convolutional neural networks, augmented by graph neural networks to account for intra-individual image dependencies, while genetic analysis employed Bayesian statistical learning to derive polygenic risk scores (PRS). These modalities were integrated through eXtreme Gradient Boosting (XGBoost), yielding several key findings. First, pixel-based image analysis outperformed feature-centric image analysis in accuracy, automation, and cost efficiency. Second, multi-modality analysis significantly enhanced predictive accuracy compared to single-modality approaches. Third, this comprehensive approach, combining medical imaging, genetic, and demographic data, represents a promising frontier for fusion modeling, integrating AI and statistical learning techniques in disease risk assessment. Our model achieved an Area under the Receiver Operating Characteristic Curve (AUC) of 0.944, with an accuracy of 0.875, sensitivity of 0.882, specificity of 0.875, and a Youden index of 0.754. Additionally, the analysis revealed significant positive correlations between the multi-image risk score (MRS) and T2D, as well as between the PRS and T2D, identifying high-risk subgroups within the cohort. This study pioneers the integration of multimodal imaging pixels and genome-wide genetic variation data for precise T2D risk assessment, advancing the understanding of precision and smart health.
This is joint work (doi: https://doi.org/10.1101/2024.07.25.24310650) with Yi-Jia Huang and Chun-houh Chen.
Day 1 (3/5) 13:15 - 14:30
Two-stage Circular-circular Regression with Zero-inflation: Application to Medical Sciences
This work considers the modeling of zero-inflated circular measurements concerning real case studies from medical sciences. Circular-circular regression models have been discussed in the statistical literature and illustrated with various real-life applications. However, there are no models to deal with zero-inflated response as well as a covariate simultaneously. The Möbius transformation based two-stage circular-circular regression model is proposed, and the Bayesian estimation of the model parameters is suggested using the MCMC algorithm. Simulation results show the superiority of the performance of the proposed method over the existing competitors. The method is applied to analyse real datasets on astigmatism due to cataract surgery and abnormal gait related to orthopaedic impairment. The methodology proposed can assist in efficient decision making during treatment or postoperative care.
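For reference, a commonly used Möbius-transformation link for circular-circular regression (in the style of Kato, Shimizu and Shieh; whether this is the exact parametrization adopted in the two-stage model is my assumption) writes the angles as points on the unit circle in the complex plane,

\[
e^{i\mu(\theta)} \;=\; \beta_0\,\frac{e^{i\theta} + \beta_1}{1 + \overline{\beta_1}\, e^{i\theta}},
\qquad |\beta_0| = 1,\ \ \beta_1 \in \mathbb{C},
\]

where $\mu(\theta)$ denotes the conditional mean direction of the response given covariate angle $\theta$.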
Spatial curriculum learning for modeling non-stationary processes in regression coefficients
This study develops a curriculum learning algorithm for modeling non-stationary spatial processes in regression coefficients. Curriculum learning is a machine learning approach in which the model is trained on increasingly complex data or tasks over time. Following this idea, we propose a boosting algorithm that learns coarser spatial processes first, followed by finer ones. In each learning step, a local model, which may capture anisotropic patterns, is estimated and added to the ensemble. The performance of the developed method is verified by simulation experiments and an application to residential land price data.
Reconstructing East Asian Temperatures from 1368 to 1911 Using Historical Documents, Climate Models, and Data Assimilation
We present a novel approach for reconstructing annual temperature profiles in East Asia from 1403 to 1911 by leveraging the Reconstructed East Asian Climate Historical Encoded Series (REACHES) database, which comprises climatic information extracted from historical Chinese documents. Due to the absence of instrumental data during this period, the REACHES database provides temperature data recorded at four ordinal levels. However, these index-based data are biased toward extreme weather events, resulting in gaps corresponding to normal weather conditions. To address this bias and reconstruct historical temperatures, we employ a three-tiered statistical framework. First, we apply kriging on an annual basis to interpolate temperature data across East Asia, assuming a zero mean to account for missing information. Second, we adjust the kriged REACHES data to Celsius scales using quantile mapping based on the Last Millennium Ensemble (LME) reanalysis data. Finally, we integrate the adjusted REACHES data with LME simulations using a Bayesian data assimilation technique. Specifically, we model the LME data at each location with a flexible, nonstationary autoregressive time series model, estimated via regularized maximum likelihood with a fused lasso penalty. This model provides a dynamic prior distribution, which we update using Kalman filtering to incorporate the adjusted REACHES time series, yielding posterior temperature estimates. This comprehensive integration of historical documentation, climate models, and advanced statistical methods not only enhances the accuracy of historical temperature reconstructions but also provides a valuable resource for future environmental and climate studies.
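As a reference point for the assimilation step, the scalar Kalman recursion for an AR(1)-type prior $x_t = a_t x_{t-1} + \eta_t$, $\eta_t \sim N(0, q_t)$, with observation $y_t = x_t + \varepsilon_t$, $\varepsilon_t \sim N(0, r_t)$, is given below (a generic sketch; the exact nonstationary state-space form and the fused-lasso estimation used in the study are not reproduced here):

\[
\hat{x}_{t\mid t-1} = a_t\, \hat{x}_{t-1\mid t-1},
\qquad
P_{t\mid t-1} = a_t^{2}\, P_{t-1\mid t-1} + q_t,
\]
\[
K_t = \frac{P_{t\mid t-1}}{P_{t\mid t-1} + r_t},
\qquad
\hat{x}_{t\mid t} = \hat{x}_{t\mid t-1} + K_t\,\bigl(y_t - \hat{x}_{t\mid t-1}\bigr),
\qquad
P_{t\mid t} = (1 - K_t)\, P_{t\mid t-1}.
\]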
Day 1 (3/5) 14:50 - 16:05
High-dimensional Inference using Random Projections
With the increasing availability of data, we now often encounter high-dimensional data from various fields of study. Almost all standard multivariate approaches fail to analyze such data effectively, both theoretically and computationally. Dimension reduction techniques are an essential pre-processing step for dealing with such data, and among existing methods, the random projection ensemble approach has proven promising. In this talk, we discuss the fundamentals of random projections and some of the improvements that have been proposed in the literature. We will also explore applications of random projections to classification, testing, and regression problems.
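A minimal sketch of the random-projection-ensemble idea for binary classification (an illustration only, not the specific ensemble discussed in the talk; the Gaussian projections, logistic base classifier, and majority vote are my choices):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def rp_ensemble_predict(X_train, y_train, X_test, d=5, B=100):
    """Majority vote over B base classifiers, each fit on a random d-dimensional projection."""
    votes = np.zeros((X_test.shape[0], B))
    p = X_train.shape[1]
    for b in range(B):
        A = rng.normal(size=(p, d)) / np.sqrt(d)    # Gaussian random projection
        clf = LogisticRegression().fit(X_train @ A, y_train)
        votes[:, b] = clf.predict(X_test @ A)
    return (votes.mean(axis=1) > 0.5).astype(int)   # majority vote for 0/1 labels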
Variable Selection for High-Dimensional Heteroscedastic Regression and Its Applications
We examine variable selection in high-dimensional linear heteroscedastic models. Drawing inspiration from the connection between the linear heteroscedastic function and the interaction model, we develop a two-stage algorithm to identify the relevant variables in such models. We establish the selection consistency of the proposed two-stage method and demonstrate its efficacy through numerical simulations. Furthermore, we apply our method to pinpoint defective tools in the semiconductor manufacturing process.
Near-perfect Clustering Based on Recursive Binary Splitting Using Maximum Mean Discrepancy
In this talk, we shall discuss some novel clustering methods for functional data when the number of clusters K is not specified and when it is specified. In these algorithms, we use a binary splitting strategy recursively to partition the dataset into two subgroups such that they are maximally separated in terms of an appropriate weighted maximum mean discrepancy (MMD) measure. When K is not specified, the proposed clustering algorithm additionally verifies whether a group of observations, obtained after a binary splitting step, consists of observations from a single population. This algorithm provides a bona fide estimator of K as well. When K is prefixed, a modification of the previous algorithm is proposed which is computationally cheaper. We investigate the theoretical properties of the proposed algorithms in an oracle scenario where knowledge of the empirical distributions of the observations from different populations is assumed. In such an oracle setting, we show that the algorithm proposed when K is unknown achieves perfect clustering, while the algorithm proposed when K is prefixed has the perfect order preserving (POP) property. The near-perfect clustering performance of both algorithms will be shown by analyzing a variety of simulated datasets generated from models having location differences as well as scale differences.
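For reference, a standard (biased, V-statistic) estimate of the squared MMD with a Gaussian kernel between two samples is sketched below; this is only the basic quantity that such splitting criteria build on, and the weighting scheme and bandwidth choice used in the talk are not reproduced here.

import numpy as np

def mmd2(X, Y, sigma=1.0):
    """Biased estimate of MMD^2 between samples X (n x d) and Y (m x d), Gaussian kernel."""
    def gram(A, B):
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma ** 2))
    return gram(X, X).mean() + gram(Y, Y).mean() - 2 * gram(X, Y).mean()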
Day 1 (3/5) 16:30 -
Robust Inference for linear regression models with skewed error distribution
Traditional methods for linear regression generally assume that the underlying error distribution, equivalently the distribution of the responses, is normal. Yet real-life response data sometimes exhibit a skewed pattern, and assuming normality would not give reliable results in such cases. This is often observed for biomedical, behavioral, socio-economic and other variables. Here we propose to use the class of skew normal (SN) distributions, which includes the ordinary normal distribution as a special case, as the model for the errors in a linear regression setup, and to perform the subsequent statistical inference using the popular and robust minimum density power divergence approach to obtain stable insights in the presence of possible data contamination (e.g., outliers). In this poster presentation at the Three Institutes Meeting, we discuss the usefulness of our method with the help of a real-data example.
On high-dimensional modifications of the nearest neighbor classifier
The nearest neighbor classifier is a widely utilized nonparametric classifier, yet it often encounters significant challenges in high-dimensional, low-sample size (HDLSS) scenarios, particularly when the differences in scale among classes exceed those in location. In this article, we propose modifications to the nearest neighbor classifier aimed at improving its efficacy in classifying high-dimensional data. Our classifiers based on minimum L2 distances demonstrate effectiveness in situations where traditional methods fail, specifically when location differences are masked by scale differences. Furthermore, our minimum L1 distance based classifier is proficient at discriminating classes that exhibit differences extending beyond their first two moments. Moreover, by analyzing the distances to the first few neighbors, rather than relying solely on a single neighbor from each class, we can significantly improve the performance of the classifier. Our experiments conducted on various simulated and benchmark datasets indicate that our proposed classifiers yield competitive or superior results in high-dimensional settings.
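A hypothetical sketch of the "first few neighbors per class" idea (an illustration under my own simplification: average the k smallest distances to each class; the actual modified classifiers use minimum L1/L2-type distances and are not reproduced here):

import numpy as np

def few_neighbors_classify(X_train, y_train, x, k=3):
    """Assign x to the class with the smallest mean distance over its k nearest members."""
    scores = {}
    for c in np.unique(y_train):
        d = np.linalg.norm(X_train[y_train == c] - x, axis=1)
        scores[c] = np.sort(d)[:k].mean()
    return min(scores, key=scores.get)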
Elephant Random Walks with Graph-Based Shared Memory: First- and Second-Order Asymptotics
We introduce a generalized model of the elephant random walk, featuring multiple elephants moving along the integer line, Z, and interacting through a shared memory structure governed by a directed graph. In this framework, each elephant’s next step depends not only on its own past trajectory but also on the past steps of other elephants, based on the graph structure. Each vertex in the graph represents an elephant, and directed edges indicate that an elephant considers the previous steps of its in-neighbors when determining its next move. This model thus represents a system of reinforced random walks evolving under graph-based interdependencies. The first- and second-order asymptotic behaviour of the joint walks will be briefly discussed, and, time permitting, an outline of the proof techniques and connections to other network-based reinforced processes will also be given.
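A hypothetical simulation sketch under one concrete set of assumptions (mine, not the paper's exact reinforcement rule): at every step each elephant draws a uniformly random past step from the pooled history of itself and its in-neighbors, repeats it with probability p, and reverses it otherwise.

import numpy as np

rng = np.random.default_rng(1)

def simulate(in_neighbors, p=0.7, n_steps=1000):
    """in_neighbors[v] = list of vertices whose past steps elephant v can consult."""
    n = len(in_neighbors)
    steps = [[rng.choice([-1, 1])] for _ in range(n)]      # independent first steps
    for _ in range(n_steps - 1):
        new = []
        for v in range(n):
            memory = [s for u in [v] + list(in_neighbors[v]) for s in steps[u]]
            past = rng.choice(memory)                       # uniformly random remembered step
            new.append(past if rng.random() < p else -past)
        for v in range(n):
            steps[v].append(new[v])
    return np.array([np.cumsum(s) for s in steps])          # trajectories on Z

paths = simulate({0: [], 1: [0], 2: [0, 1]})                # a small directed memory graph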
Reframing cross-world independence for identifying path-specific effects
This study addresses the challenge of identifying causal mechanisms in real-world problems, which often involve multiple factors and necessitate the evaluation of path-specific effects within a multimediator model. This task requires not only classical causal assumptions but also the unappealing "cross-world independence" assumption. Lin et al. (2017) introduced an alternative causal framework using an "interventional approach," which fulfills cross-world independence by redefining path-specific effects. Later, Stensrud et al. (2021) proposed the "dismissible component conditions" assumption to identify "separable effects" in scenarios involving competing events.
In this study, we investigated the underlying causal concepts of the three causal frameworks in the context of identifying path-specific effects.
We reframed the cross-world independence as the exchangeability between counterfactual worlds, suggesting that, similar to how we achieve exchangeability between actual and counterfactual worlds by controlling sufficient confounders, we can achieve exchangeability between counterfactual worlds by considering sufficient mediators.
Estimation of Nonlinear Structures in Linear Plasma Dynamics
The relationship between zonal flows and turbulence in plasmas has been explored using the nonlinear predator-prey model. In this framework, zonal flows act as "predators" that extract energy from turbulence, while turbulence serves as the "prey" by providing the energy source for zonal flows. This interaction leads to a phenomenon where turbulence promotes the formation of zonal flows, but once established, zonal flows suppress turbulence energy, resulting in a reduction of turbulence intensity. To deepen the understanding of these dynamic interactions, both simulation-based studies and experimental approaches have been actively pursued (P. H. Diamond et al., 2005). This study aims to separate zonal flows and turbulence from data obtained using a linear plasma experimental device and to estimate their nonlinear structures.
To achieve this, parametric models, including the exponential autoregressive (EAR) model, as well as RBF-based semiparametric models, were employed to capture nonlinear characteristics. These models are particularly suited for detecting nonlinear causal relationships that are not adequately represented by conventional linear models. Specifically, the EAR model and its derivatives were utilized to analyze the nonlinearities inherent in plasma dynamics. Moreover, a comparison of the results with those obtained from linear models demonstrated the superior fit of nonlinear models to the data, suggesting the existence of nonlinear causality.
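For reference, the classical exponential autoregressive (EAR) model of Haggan and Ozaki has the form (the derivatives used here and the RBF-based semiparametric extensions are not reproduced):

\[
X_t \;=\; \sum_{j=1}^{p}\Bigl(\phi_j + \pi_j\, e^{-\gamma X_{t-1}^{2}}\Bigr) X_{t-j} \;+\; \varepsilon_t,
\qquad \varepsilon_t \ \text{i.i.d.}\ (0, \sigma^2),
\]

where the amplitude-dependent coefficients allow the effective autoregressive dynamics to change with the state, which is what makes the model useful for detecting nonlinear structure.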
These findings highlight the effectiveness of advanced modeling techniques in elucidating the complex interactions between zonal flows and turbulence in plasmas, paving the way for a more comprehensive understanding of their underlying dynamics.
Relativistic Gaussian Mixture Model
The Gaussian mixture model (GMM) is known to be useful for analyzing non-relativistic plasma. When analyzing relativistic plasma, on the other hand, it is physically unreasonable to adopt a Gaussian distribution as a component distribution of a mixture model. This is because, while plasma particles must travel slower than light, a Gaussian distribution assigns positive probability to particles traveling faster than light. We therefore need to take the relativistic effects into account in the component distribution of the mixture model.
We have developed a mixture model whose components are the distribution obtained by applying a Lorentz transformation to the plasma energy represented by the Gaussian distribution, known as the relativistic Maxwell distribution or the Jüttner-Synge distribution. Here, we call this component distribution the relativistic Gaussian distribution.
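For reference, in its standard single-component, zero-bulk-velocity form the Maxwell–Jüttner distribution of the Lorentz factor $\gamma \ge 1$ is

\[
f(\gamma) \;=\; \frac{\gamma\sqrt{\gamma^{2}-1}}{\theta\, K_{2}(1/\theta)}\; e^{-\gamma/\theta},
\qquad \theta = \frac{k_B T}{m c^{2}},
\]

where $K_2$ is the modified Bessel function of the second kind; the mixture components described above are boosted (Lorentz-transformed) versions of this distribution.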
We propose a relativistic Gaussian mixture model (R-GMM), represented by a weighted sum of relativistic Gaussian distributions, and develop an EM algorithm for estimating the parameters (the mixing proportion, bulk velocity, and temperature of each component distribution). In particular, for the M-step of the EM algorithm we derived equations whose solutions are guaranteed to maximize the conditional expectation of the complete-data log-likelihood. To initialize the parameters for the EM algorithm, we divide the data into as many groups as there are components and, within each group, use maximum-likelihood estimates or method-of-moments estimates.
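The E-step takes the usual mixture form; writing $f_k$ for the $k$-th relativistic Gaussian component with parameters $\Theta_k$ and $\pi_k$ for its mixing proportion, the responsibilities and the standard mixing-proportion update are

\[
r_{ik} \;=\; \frac{\pi_k\, f_k(\mathbf{x}_i \mid \Theta_k)}{\sum_{l} \pi_l\, f_l(\mathbf{x}_i \mid \Theta_l)},
\qquad
\hat{\pi}_k \;=\; \frac{1}{n}\sum_{i=1}^{n} r_{ik},
\]

with the remaining M-step updates obtained from the estimating equations described above.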
We apply a two-component R-GMM to a distribution function obtained from a particle-in-cell (PIC) simulation of relativistic pair plasma, and separate the simulated distribution function into two components. We find that one component has a large bulk velocity while the other is almost stagnant, and that the two components have almost the same temperatures, which is also consistent with the initial temperature of the PIC simulation. Based on the parameters, we can infer large-scale plasma environments such as shocks and discontinuities.
Approximate Maximum Likelihood Estimation For Threshold Jump Processes
We introduce an Approximate Maximum Likelihood Estimation (AMLE) method for parameter estimation in a two-state threshold jump-diffusion model. The threshold mechanism adds a layer of complexity to the estimation process, particularly with discretely sampled data. The AMLE method is designed to overcome these challenges by providing an efficient framework for parameter estimation. The finite-sample performance of the proposed method is assessed through simulation studies, and its practical applicability is demonstrated using two real-world financial time series datasets.
First-order quasi-linear partial differential equations in statistical inferences and solving them with differential geometry
The maximum likelihood estimator (MLE) is the zero of the derivative of the log-likelihood function. We are interested in finding a prior distribution that reduces the bias of the MLE asymptotically to a higher order (Firth 1993, Biometrika) or that yields a Bayesian estimator asymptotically matching the MLE to a higher order under squared loss (Ghosh--Liu 2011, Sankhya). We can formulate these problems by considering the zero of a penalized log-likelihood and denormalizing the statistical model manifold. Finding an appropriate prior then reduces to solving a first-order quasi-linear partial differential equation. Solving partial differential equations with differential geometry is a classical topic. Moreover, first-order quasi-linear partial differential equations form a special class in the theory of partial differential equations: solving them reduces to integrating a system of first-order ordinary differential equations, and the existence and uniqueness of the solution are immediate. We embed the solution as a surface in a space one dimension higher than the original manifold, and the prior determines the embedding. Integrating the ordinary differential equations gives the solution as parametric curves on the surface, and recovering an implicit representation is the implicitization problem studied in computational algebraic geometry. This is part of joint work (arXiv: 2011.14747; 2023, Calcutta Stat. Assoc. Bull.) with a former member of the ISM, Masayo Y. Hirose, from Kyushu University.
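Concretely, for a first-order quasi-linear equation $\sum_{i} a_i(x, u)\,\partial_{x_i} u = b(x, u)$ the characteristic system is

\[
\frac{d x_i}{d t} = a_i\bigl(x(t), u(t)\bigr),
\qquad
\frac{d u}{d t} = b\bigl(x(t), u(t)\bigr),
\]

and integrating these ordinary differential equations traces the solution surface as a family of parametric curves, as described above.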
Direct sampling from conditional distributions by sequential maximum likelihood estimations
We can directly sample from the conditional distribution of any log-affine model (2017, Electron. J. Stat.; Joint Conf. 2019; +Takayama, arXiv: 2110.14922). The algorithm is a Markov chain on a bounded integer lattice, and its transition probability is the ratio of the UMVUE (uniformly minimum variance unbiased estimator) of the expected counts to the total number of counts. The computation of the UMVUE accounts for most of the computational cost, which makes the implementation challenging. Here, we investigate an approximate algorithm that replaces the UMVUE with the MLE (maximum likelihood estimator). Although it is generally not exact, it is efficient and easy to implement; no preliminary work, such as computing the connection matrices of the holonomic ideal required by the original algorithm, is needed. The preprint is arXiv: 2502.00812.
Cross Sectional Regression with Cluster Dependence: Estimation and Testing
We consider cross-sectional dependence within clusters, allowing cluster sizes to be either finite or infinite. In this paper, we propose a new estimator that is consistent and asymptotically normal under such cross-sectional dependence. We also allow the regression parameter to vary across clusters. A detailed simulation study examines the efficacy of the proposed estimator.
Parametric inference for the Mann-Whitney effect under survival copula models
The Mann-Whitney effect is a measure for comparing the survival of two groups. Under the assumption that the two survival times are independent, the Mann-Whitney effect can be estimated by Efron’s classical estimator. However, survival times are generally not independent, and the classical estimator is then biased. In this study, we use parametric copulas to model the dependence and propose an inference procedure for the Mann-Whitney effect. We also derive the asymptotic variance estimator of the Mann-Whitney effect under various copulas and parametric distributions, and conduct simulation studies to evaluate the accuracy and bias of the proposed estimators. Finally, the proposed inference procedures are illustrated using a real dataset.
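For continuous survival times, the quantity of interest and the independence-based representation targeted by the classical estimator are

\[
p \;=\; P(T_1 > T_2),
\qquad
p_{\text{indep}} \;=\; \int_0^{\infty} S_1(t)\, dF_2(t),
\]

where $S_1$ is the survival function of $T_1$ and $F_2$ the distribution function of $T_2$ (notation mine); under dependence, $p$ must instead be computed from the joint law implied by the assumed survival copula linking the two margins.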
Predictive Modeling of Self-reported Diseases Using Medical Image Reports and Multi-omic Data: An XGBoost Study of Taiwan Biobank
Introduction
The Taiwan Biobank (TWB) represents one of Asia's largest and most comprehensive biomedical databases, offering unique insights through extensive medical imaging data collection. While TWB's structured image reports hold substantial potential for disease prediction, their predictive capabilities, particularly when integrated with other biomedical data, remain underexplored. This study investigates the predictive power of standardized medical image reports as the primary focus, while integrating polygenic scores and demographic factors through XGBoost, a powerful machine learning algorithm known for handling complex medical data.
Methods
We analyzed data from 22,067 TWB participants (7,895 male and 14,172 female; mean age 56.08 years, range 32.52-79.60) who completed their first round of follow-up examinations. The study integrated three types of data: five imaging modalities (abdominal ultrasound, bone densitometry, electrocardiography, carotid vascular ultrasound, and thyroid ultrasound) yielding 155 standardized features, 3,636 polygenic scores, and demographic factors (age and sex). XGBoost implementation included optimized parameters (learning rate=0.3, maximum depth=6) with L1/L2 regularization and class imbalance correction, employing three-phase validation (64% training, 16% validation, 20% testing).
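A minimal sketch of an XGBoost configuration matching the stated settings (learning rate 0.3, maximum depth 6, L1/L2 regularization, class-imbalance correction, 64%/16%/20% split). The synthetic data, the regularization strengths, and the imbalance weight are placeholders of mine, not the study's values.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=50, weights=[0.8], random_state=0)

# 64% train / 16% validation / 20% test
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.20, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.20, stratify=y_tmp, random_state=0)

model = XGBClassifier(
    learning_rate=0.3,
    max_depth=6,
    reg_alpha=1.0,                                                  # L1 regularization (placeholder)
    reg_lambda=1.0,                                                 # L2 regularization (placeholder)
    scale_pos_weight=(y_train == 0).sum() / (y_train == 1).sum(),   # class-imbalance correction
    eval_metric="auc",
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print(model.score(X_test, y_test))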
Results
Medical image reports demonstrated strong predictive capabilities across multiple disease domains. Vascular imaging features showed remarkable performance in cardiovascular disease prediction, achieving 74% accuracy for hypertension. Bone densitometry and vascular imaging predicted diabetes with 72% accuracy, representing a 12% improvement from imaging features. Multiple imaging modalities effectively predicted age-related conditions, with accuracies of 71% for cataracts and 73% for osteoporosis. The integration of different imaging modalities enhanced prediction accuracy by 8-15%. Feature importance analysis revealed key imaging markers: carotid measurements for cardiovascular predictions, bone density parameters for metabolic diseases, and tissue characteristics for age-related conditions. The integration of polygenic scores and demographic factors provided additional predictive value, while imaging features maintained strong independent predictive power.
Discussion and Conclusion
This study establishes the substantial predictive capabilities of TWB's medical image reports through XGBoost analysis. The findings demonstrate that structured imaging features can achieve robust disease prediction, with multi-source data integration further enhancing accuracy. These results provide a foundation for developing comprehensive screening protocols, highlighting the potential of integrated medical data analysis in advancing precision medicine for Asian populations.
Keywords: Taiwan Biobank, disease prediction, medical imaging, polygenic scores, XGBoost, biobank analysis
Statistical Modeling of Financial Data with Skew-Symmetric Error Distributions
Based on corporate financial data for almost all companies listed on the Prime Market of the Tokyo Stock Exchange in fiscal year 2021, we gradually refine a model to explain firms' sales by the number of employees and total assets. Starting from a Cobb-Douglas type functional form linearized by a log transformation, the assumption of a skew-symmetric distribution in the error structure and the introduction of industry dummies are shown to be useful not only for searching for a good-fitting model, but also for ensuring the accuracy of important parameters such as the labor share. The introduction of industry dummies helps to improve the accuracy of the model as well as to allow for interpretation as sector-wise Total Factor Productivity.
Developing an information criterion under the double-descent phenomenon
The focused information criterion (FIC) is obtained by measuring the estimation error (mean squared error) of the parameter of interest, unlike AIC-type information criteria, which attempt to measure the overall divergence between the estimated and true distributions. Since the FIC targets the mean squared error, we consider settings where the explanatory variables are high-dimensional and the 'double-descent phenomenon' occurs, and examine whether an FIC-type information criterion captures the phenomenon.
Model selection method for spatially varying models via Bayesian generalized lasso
Spatially varying coefficient (SVC) models are widely used in geographical data analysis. In the models, the regression coefficients vary in each location. In geographical data, variables belonging to adjacent locations tend to have similar roles in prediction. The generalized lasso captures this characteristic by shrinking the difference between the regression coefficients corresponding to variables in adjacent locations. The Bayesian generalized lasso makes this possible by assuming a Laplace prior on these differences.
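In symbols, with $D$ a difference matrix whose rows pick out coefficients of the same variable at adjacent locations, the generalized lasso estimate and its Bayesian counterpart (a Laplace prior on the differences) are

\[
\hat{\beta} \;=\; \arg\min_{\beta}\ \|y - X\beta\|_2^{2} + \lambda \|D\beta\|_1,
\qquad
\pi(\beta \mid \lambda) \;\propto\; \exp\bigl(-\lambda \|D\beta\|_1\bigr),
\]

so that shrinking $\|D\beta\|_1$ pulls the coefficients of adjacent locations toward one another.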
The number of hyper-parameters in the Laplace prior is related to the complexity of the estimated model. Using the same hyper-parameter for the regression coefficients of different variables corresponds to a simpler model, whereas using different hyper-parameters for different variables corresponds to a more complex model.
To select the better SVC model, WAIC is a well-known and powerful tool, but it always favors using more hyper-parameters. To avoid this problem, the prior intensified information criterion (PIIC) has been proposed. However, PIIC considers only the lasso problem and cannot be applied to the Bayesian generalized lasso. In this research, we propose a PIIC for the SVC model via the Bayesian generalized lasso and investigate the performance of our method through numerical studies.
A Generalized Mean Approach for Distributed-PCA
Principal component analysis (PCA) is a widely used dimension reduction technique. As datasets continue to grow in size and complexity, distributed-PCA (DPCA) has become an active research area. A fundamental challenge in DPCA involves efficiently processing and aggregating information across multiple machines or computing nodes while preserving the statistical characteristics of the original dataset. Fan et al. (2019) proposed a DPCA algorithm to estimate the leading rank-r eigenspace of the population covariance matrix for a given r. Although their approach is communication-efficient, it does not utilize eigenvalue information and relies solely on the leading eigenvectors in aggregation, potentially resulting in less accurate estimation. In this study, we propose a novel DPCA algorithm that incorporates eigenvalue information and oversamples eigenvectors to aggregate local results via the matrix β-mean, which we refer to as β-DPCA. Our proposal offers a flexible and robust aggregation method through the adjustable choice of β values. Moreover, β-DPCA is shown to be associated with the matrix β-divergence, a subclass of the Bregman matrix divergence. Some fundamental properties inherent to the Bregman matrix divergence also hold for the matrix β-divergence.
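For context, a minimal sketch of the eigenvector-only aggregation of Fan et al. (2019) that the proposal refines (the β-mean aggregation itself is not reproduced here; local sample covariances and numpy conventions are my assumptions): each node sends its leading r eigenvectors, and the center averages the corresponding projection matrices.

import numpy as np

def dpca_baseline(local_data, r):
    """Aggregate top-r eigenspaces from each node by averaging projection matrices."""
    p = local_data[0].shape[1]
    P_bar = np.zeros((p, p))
    for X in local_data:                      # X: n_l x p data held on one node
        S = np.cov(X, rowvar=False)
        _, vecs = np.linalg.eigh(S)
        V = vecs[:, -r:]                      # leading r eigenvectors (eigh sorts ascending)
        P_bar += V @ V.T / len(local_data)
    _, vecs = np.linalg.eigh(P_bar)
    return vecs[:, -r:]                       # estimated leading rank-r eigenspace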
Day 2 (3/6) 10:00 - 10:50
Outlier-Robust Neural Network Training: Efficient Optimization of Transformed Trimmed Loss with Variation Regularization
In this study, we consider outlier-robust predictive modeling using highly expressive neural networks. To this end, we employ (1) a transformed trimmed loss (TTL), a computationally feasible variant of the classical trimmed loss, and (2) a higher-order variation regularization (HOVR) of the prediction model. Note that training the neural network with TTL alone may remain vulnerable to outliers, as the network's high expressive power allows it to fit even the outliers perfectly. Simultaneously introducing HOVR, however, constrains the effective degrees of freedom and thereby avoids fitting the outliers. We provide a new efficient stochastic algorithm for the optimization, together with a theoretical convergence guarantee.
A preprint for this study is publicly available at: https://arxiv.org/abs/2308.02293
Fatty Liver Classification via Risk Controlled Neural Networks Trained on Grouped Ultrasound Image Data
Ultrasound imaging is a widely used technique for fatty liver diagnosis as it is practically affordable and can be quickly deployed using suitable devices. When it is applied to a patient, multiple images of the targeted tissues are produced. We propose a machine learning model for fatty liver diagnosis from multiple ultrasound images. The machine learning model extracts features of the ultrasound images using a pre-trained image encoder. It further produces a summary embedding from these features using a graph neural network. The summary embedding is used as input for a classifier on fatty liver diagnosis. We train the machine learning model on an ultrasound image dataset collected by the Taiwan Biobank. We also carry out risk control on the machine learning model using conformal prediction. Under the risk control procedure, the classifier can improve its results with high probabilistic guarantees.
Day 2 (3/6) 11:10 - 12:00
Sim2Real Machine Learning in Data-Driven Materials Research
We are developing the world's largest computational database for polymer materials, based on first-principles calculations and molecular dynamics simulations, to overcome the lack of data resources in polymer research [1]. In this talk, I will explain the methodologies and practical applications of Sim2Real machine learning to bridge the gap between the incomplete computational world and the uncertain, complex real-world systems. Specifically, I will discuss scaling laws in Sim2Real transfer learning and related topics [2].
Molecule Discovery and Optimization via Evolutionary Swarm Intelligence
Since the advent of computational analysis and visualization of chemical compounds, Computer-Aided Drug Design has made significant contributions to drug discovery. Recently, de novo drug design and molecular optimization have garnered considerable attention. Traditional optimization methods often struggle with the discrete nature of molecular space, but evolutionary computations have demonstrated their versatility across various optimization problems, regardless of the nature of the objective functions. This paper introduces a novel evolutionary algorithm, the Swarm Intelligence-Based Method for Single-Objective Molecular Optimization. Several experiments were conducted to showcase the efficiency of the proposed method, which identifies near-optimal solutions in a remarkably short time. The results were then compared with those of other state-of-the-art methods in the field. This is a joint work with Ms. Hsin-Ping Liu (DSDP, NTU) and Mr. Shen-Ching Feng (ISSAS).
Day 3 (3/7) 10:00 - 10:50
Subgraph Counting under Local Differential Privacy
Subgraph counts, such as triangle and k-star counts, are useful for analyzing connection patterns or clustering tendencies in graph data. However, sensitive data (e.g., sensitive friendship information) can be included in a graph and leaked from the subgraph counts. To prevent such data leakage, we present algorithms for counting subgraphs under a strong privacy notion called LDP (Local Differential Privacy). For k-stars, we propose a one-round algorithm that achieves an order optimal estimation error among all one-round LDP algorithms. For triangles, we propose a two-round algorithm and show that an additional round significantly reduces the estimation error. We also present a new lower bound on the estimation error for general graph functions, including k-star and triangle counts. We show through experiments that the proposed algorithms can accurately estimate k-star and triangle counts under LDP.
This is joint work with Jacob Imola and Kamalika Chaudhuri. https://www.usenix.org/conference/usenixsecurity21/presentation/imola
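As background (not the one-round or two-round algorithms themselves), local algorithms in this setting typically build on randomized response applied to each user's adjacency bits: with privacy budget ε, each bit is flipped with probability 1/(e^ε + 1), and aggregate counts are then debiased. A minimal sketch:

import numpy as np

rng = np.random.default_rng(0)

def randomize_bits(bits, eps):
    """Flip each adjacency bit with probability 1 / (exp(eps) + 1) for eps-LDP per bit."""
    q = 1.0 / (np.exp(eps) + 1.0)
    flips = rng.random(len(bits)) < q
    return np.where(flips, 1 - bits, bits)

def debias_count(noisy_bits, eps):
    """Unbiased estimate of the number of 1s among the true bits."""
    q = 1.0 / (np.exp(eps) + 1.0)
    return (noisy_bits.sum() - q * len(noisy_bits)) / (1 - 2 * q)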
Bivariate Analysis of Distribution Functions Under Biased Sampling
We compare distribution functions among pairs of locations in their domains, in contrast to the typical approach of univariate comparison across individual locations. This bivariate approach is studied in the presence of sampling bias, which has been gaining attention in infectious disease studies that over-represent more symptomatic people. In cases with either known or unknown sampling bias, we introduce Anderson--Darling-type tests based on both the univariate and bivariate formulation. A simulation study shows the superior performance of the bivariate approach over the univariate one. We illustrate the proposed methods using real data on the distribution of the number of symptoms suggestive of COVID-19.
Day 3 (3/7) 11:10 - 12:00
On Exact Feature Screening in Ultrahigh-dimensional Classification
In this talk, we first motivate and analyze the well-known average distance classifier and its variants in the high-dimensional scenario. We will then discuss a new model-free feature screening method based on energy distances for ultrahigh-dimensional binary classification problems. Unlike existing methods, the cut-off involved in our procedure is data adaptive. With a high probability, our procedure retains only relevant features after discarding all the noise variables. The proposed screening method is also extended to identify pairs of variables that are marginally undetectable but have differences in their joint distributions. Finally, we build a classifier that maintains coherence between the proposed feature selection criteria and discrimination method and also establish its risk consistency. A numerical study shows clear and convincing advantages of our classifier over existing state-of-the-art methods.
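For reference, the energy distance between distributions $F$ and $G$ (Székely and Rizzo), on which the screening statistic is based, is

\[
\mathcal{E}(F, G) \;=\; 2\,\mathbb{E}\|X - Y\| \;-\; \mathbb{E}\|X - X'\| \;-\; \mathbb{E}\|Y - Y'\|,
\]

where $X, X' \sim F$ and $Y, Y' \sim G$ are independent; it is nonnegative and equals zero if and only if $F = G$.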
Signature for response to PD-L1 inhibitor in metastatic Urothelial Cancer
To date, immune checkpoint inhibitors (ICIs) are among the frontier treatments that have improved the survival of metastatic cancer patients with few side effects. However, the objective response rate for ICIs is low, only ~30% in urothelial carcinoma (UC), highlighting the need to identify signatures for response prediction. Several state-of-the-art signatures have been published in first-tier journals, demonstrating the area's importance. As the number of genes (features; ~20,000) greatly exceeds the sample sizes of training sets (≤300), we first developed feature selection procedures to reduce the features to a few hundred. Next, we trained several classifiers using the selected genes and IMvigor210, which comprises RNA-seq and clinical data of ~298 patients with metastatic UC (mUC). In particular, our predictor (LogitDA) with the revealed signature achieved a prediction AUC of 0.75; our signature outperformed the known signatures compared, e.g., the PD-L1, PD-1, IFNG, tGE8, T exhaust, and T inflamed signatures. Together, our findings show that LogitDA and our signature predict the immunotherapy response well in mUC.
Key words: biomarker, cancer, machine learning, regression, prediction
Day 3 (3/7) 13:15 - 14:30
Imaging the Black Hole Shadow
In 2019 and 2022, the EHTC (Event Horizon Telescope collaboration) released the image of the black hole shadow of M87 and that of our Milky Way galaxy, respectively. The EHT is a VLBI (very long baseline interferometer), which differs from optical telescopes in that a large amount of computation is required to obtain a single image. The EHTC has more than 300 members from different backgrounds and countries. Black hole imaging is an interesting problem from the data scientific viewpoint. In the presentation, I will explain how the new imaging techniques have been developed and how the final images were created through our discussions.
A generalized Pegram's operator based autoregressive process for modelling categorical time series
This paper considers the problem of modelling categorical time series data, with an application to air quality data in India. Markov chains are widely used to model categorical time series. However, they suffer from a large number of parameters when the order of the Markov chain and/or the number of categories is large. As alternatives, the mixture transition distribution (MTD) model, the multinomial logistic regression model, and Pegram’s operator based autoregressive (PAR) process are useful models that require relatively fewer parameters. However, they also have their limitations. For example, the PAR process of order p involves an indicator kernel that gives weight to the transition probability only when the same previous category occurs at time t−1, …, t−p. Here, a new model, namely the generalized PAR (GPAR) process, is proposed using a generalized kernel function that gives weight to all possible cases. The proposed model is mainly defined for ordinal categorical time series, where the categories are ordered in nature, like the air quality of a city observed as Healthy, Unhealthy, and Hazardous. We study the distributional properties and h-step ahead forecasting features of the proposed process, along with the estimation of model parameters. Extensive simulation experiments are carried out to investigate the utility of the proposed process. Finally, the method is illustrated through a real dataset on the air quality of Mumbai city.
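For context, the classical Pegram's operator based AR(1) for a categorical series with marginal probabilities $\pi_1, \dots, \pi_K$ mixes the marginal distribution with an indicator of the previous category,

\[
P(X_t = j \mid X_{t-1} = i) \;=\; \phi\, I(i = j) \;+\; (1 - \phi)\, \pi_j,
\qquad 0 \le \phi < 1,
\]

and the GPAR process proposed here replaces the indicator $I(\cdot)$ with a generalized kernel that assigns weight to all categories.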
Statistical modeling for earthquake early warning and its applications
Earthquake Early Warning (EEW) systems are vital tools for mitigating the impacts of seismic events by providing timely alerts to populations and infrastructure at risk. Japan’s EEW system has undergone significant changes since its first public operation in 2007, including the expansion of seismic sensor networks and enhancements in data processing algorithms. In this talk, we will review the problem setup and statistical models for EEW and present recent improvements to the systems and their related applications.
Day 3 (3/7) 14:50 - 16:05
Interacting Urn Schemes: A simple and solvable model of "Self-Organized Criticality (SOC)"
In this talk, we will introduce a novel model of "interactive urn schemes" with the goal of obtaining a limiting distribution, which may be considered a simple and solvable example of "Self-Organized Criticality (SOC)". The interactions will be defined via a network (possibly infinite). We will show that the limit exists under fairly general conditions if the underlying graph has a specific directed structure. We will further indicate how limits can be proved for more general structures, including undirected graphs.
[This is joint work with Deborshi Das.]
A measure-on-graph-valued diffusion: a particle system with collisions and their application
First collision time of random walks - convergence to the Brownian web
We consider three independent simple symmetric random walks starting from $-2i$, $0$, and $2j$, respectively. We show that the expected time of their first collision is finite and in fact equals $4ij$. If time permits, we will discuss the use of this result in computing the density of coalescing simple symmetric random walks starting from all even integers and its application to the Brownian web.
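A hypothetical Monte Carlo sketch under one reading of the setup (discrete time, ±1 steps, "first collision" meaning the first time any two of the three walks occupy the same site); these assumptions and the step cap are mine, and the empirical mean can be compared with the stated value $4ij$.

import numpy as np

rng = np.random.default_rng(0)

def first_collision_time(i, j, max_steps=10**6):
    pos = np.array([-2 * i, 0, 2 * j])
    for t in range(1, max_steps + 1):
        pos += rng.choice([-1, 1], size=3)
        if pos[0] == pos[1] or pos[1] == pos[2] or pos[0] == pos[2]:
            return t
    return np.nan                          # did not collide within the cap

times = [first_collision_time(1, 2) for _ in range(10000)]
print(np.nanmean(times))                   # compare with 4 * 1 * 2 = 8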