Workshop MAST
MAthematical STatistics for complex data
Program
08:15 - 08:40 Welcome
08:40 - 08:50 Opening, Ernesto De Vito, Coordinator of the degree programme in SMID
08:50 - 09:10 Davide Risso (Università di Padova), Conformal inference for cell type annotation with graph-structured constraints
Abstract: Conformal prediction is a framework for constructing prediction sets for machine learning models, relying solely on the exchangeability of training and test data and without requiring the specification of a parametric distribution. Despite its wide applicability and popularity, its application to the omic sciences remains underexplored. Here, we present an approach that leverages the rich information about cell-type relations, encoded in the graph structure of cell ontologies, to enhance the interpretability of reference-based cell-type annotation in single-cell transcriptomics. Leveraging conformal risk control, we develop a novel conformal algorithm for graph-structured predictions and demonstrate how incorporating graph constraints can improve the interpretation of cell-type predictions.
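Conformal prediction is easiest to see in its split (inductive) form. The Python sketch below builds marginal prediction sets for a generic probabilistic classifier; it does not reproduce the graph-structured risk-control extension described in the talk, and all names (`clf`, `X_cal`, etc.) are illustrative.

```python
import numpy as np

def conformal_prediction_sets(clf, X_cal, y_cal, X_test, alpha=0.1):
    """Split conformal prediction for classification (illustrative sketch).

    Nonconformity score: 1 - predicted probability of the true class.
    Under exchangeability of calibration and test data, each returned
    set contains the true label with probability >= 1 - alpha.
    Assumes clf has a scikit-learn-style predict_proba method and that
    y_cal holds integer class labels.
    """
    # Calibration scores: one per calibration point
    probs_cal = clf.predict_proba(X_cal)
    scores = 1.0 - probs_cal[np.arange(len(y_cal)), y_cal]

    # Conformal quantile with the finite-sample correction (n+1)/n
    n = len(scores)
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    qhat = np.quantile(scores, min(q_level, 1.0), method="higher")

    # Prediction set: all labels whose score falls below the threshold
    probs_test = clf.predict_proba(X_test)
    return [np.where(1.0 - p <= qhat)[0] for p in probs_test]
```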
09:10 - 09:30 Alberto Caimo (University College Dublin), Separable models for dynamic signed networks
Abstract: Signed networks are essential for understanding systems where both supportive and antagonistic interactions shape collective behaviour. This talk introduces a separable temporal framework based on multi-layer exponential random graph models. By assuming conditional independence between interaction existence and its polarity, our model maintains flexibility while remaining grounded in balance theory.
Following a Bayesian paradigm, we utilise an adaptive exchange algorithm to perform inference on the model parameters. We illustrate the practical utility of this approach by analysing relations among U.S. Senators during Ronald Reagan’s second term (1985–1989), uncovering whether political alliances during this era were driven by stable balance or shifting antagonistic blocks.
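As a reading aid (not taken from the talk itself), the separability assumption can be written as a factorisation of the signed-edge process: tie existence and tie polarity are modelled by two conditionally independent ERGM layers, each with its own sufficient statistics g and h.

```latex
% Schematic separable signed-network model:
% Y_t = binary adjacency at time t, S_t = signs of the existing ties.
\[
  P(Y_t, S_t \mid Y_{t-1}, S_{t-1})
  = \underbrace{P(Y_t \mid Y_{t-1})}_{\text{tie existence}}
    \times
    \underbrace{P(S_t \mid Y_t, S_{t-1})}_{\text{tie polarity}},
\]
\[
  P(Y_t \mid Y_{t-1}) \propto \exp\{\theta^\top g(Y_t, Y_{t-1})\},
  \qquad
  P(S_t \mid Y_t, S_{t-1}) \propto \exp\{\psi^\top h(S_t, Y_t, S_{t-1})\}.
\]
```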
09:30 - 09:50 Ilaria Buselli (Zenabyte s.r.l.), No Metric Is an Island: How Algorithmic Fairness Interacts with Other AI Properties
Abstract: The rapid integration of Artificial Intelligence (AI) across diverse societal domains has intensified concerns about the trustworthiness of automated systems. Beyond accuracy and efficiency, AI must now also satisfy broader ethical and technical desiderata, such as safety, reliability, equity, and transparency. Yet these properties are not independent: their intersections often involve trade-offs or interplays that challenge both theoretical analysis and practical deployment. Stemming from this context, this talk specifically explores the interaction of algorithmic fairness (i.e., the tendency of algorithms to perpetuate or amplify historical biases against sensitive groups) with two critical dimensions of trustworthy AI: robustness (i.e., the ability of models to maintain their performance under data perturbations) and regressiveness (i.e., the tendency of updated models to fail on predictions that older versions of the model performed correctly). In doing so, it provides both conceptual insights and practical methodologies for developing AI systems that are not only effective but also equitable and trustworthy.
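For readers unfamiliar with how group-fairness metrics are operationalised, here is a minimal, hypothetical Python sketch of two standard fairness gaps (demographic parity and equalized odds); it is generic background, not the speaker's methodology.

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    """Absolute difference in positive-prediction rates across two groups."""
    g = np.asarray(group, dtype=bool)
    return abs(y_pred[g].mean() - y_pred[~g].mean())

def equalized_odds_gap(y_true, y_pred, group):
    """Max gap in true-positive and false-positive rates across two groups."""
    g = np.asarray(group, dtype=bool)
    gaps = []
    for label in (1, 0):  # label=1 gives the TPR gap, label=0 the FPR gap
        mask = (y_true == label)
        gaps.append(abs(y_pred[g & mask].mean() - y_pred[~g & mask].mean()))
    return max(gaps)
```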
09:50 - 10:10 Amerigo Novaro (Università di Padova), Encoding Dependence in Multivariate Extremes through Graphical Structures
Abstract: Modeling extreme values is useful in several practical fields, because many of the relevant risks today are related to rare events. In climate science, for instance, it is used to assess extreme heatwaves, intense rainfall, or coastal storms. Similarly, in finance, techniques of extreme value theory quantify tail risk such as market crashes. In multivariate settings, the focus is on the joint behavior of variables in the tails. Pairs of variables can show asymptotic dependence or asymptotic independence, depending on whether their relationship survives in the limit. In the first case, dependence remains even asymptotically, while in the second case, the effect vanishes in the tail, even if residual dependencies may still be present in sub-asymptotic regions of the tails. A convenient way to treat high-dimensional multivariate structures is to encode variables through a graph, representing the independence structure via graphical models. Recent works in extreme value theory have explored the advantages offered by a graphical structure, focusing mainly on regimes with full asymptotic dependence, such as those based on the Hüsler-Reiss distribution. This talk explores how a graphical structure can also offer practical advantages in a regime of asymptotic independence. The discussed multivariate construction is based on a Gamma convolution model, leading to Gamma-driven Pareto vectors that provide a flexible and parametric class of models under asymptotic independence. This choice allows for the representation of the marginal dependencies through a bi-directed graph.
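The distinction between asymptotic dependence and independence mentioned above is usually formalised through the tail-dependence coefficient; a standard statement (general background, not specific to this talk) is:

```latex
% Tail-dependence coefficient for a pair (X_1, X_2) with marginals F_1, F_2:
\[
  \chi = \lim_{u \to 1^-} P\bigl(F_1(X_1) > u \,\big|\, F_2(X_2) > u\bigr).
\]
% chi > 0: asymptotic dependence (extremes tend to occur together);
% chi = 0: asymptotic independence, possibly with residual dependence
%          remaining at sub-asymptotic levels of the tails.
```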
10:10 - 10:30 Lorenzo Ferri (Scuola Politecnica Federale di Losanna, EPFL), Network autoregressive time series: estimation and the effect of network structure
Abstract: Generalized Network AutoRegressive (GNAR) models bridge classical multivariate time series analysis and network science, and are gaining popularity in computational social science and econometrics. Given a network, each node is associated with an autoregressive time series that is influenced by its own past values as well as by past values of neighbouring nodes.
This talk introduces an extension of existing GNAR models that explicitly incorporates the effect of node neighbourhood size into the model specification, leading to improved convergence rates of the parameter estimates. This result highlights the influence of network structure on these simple yet powerful models and opens new research directions on the relationship between network topology and parameter estimation in network time series.
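For orientation, a basic (lag-one, single neighbourhood stage) GNAR specification can be written as below; this is a textbook form, and the talk's extension additionally lets the neighbourhood size |N(i)| enter the model specification.

```latex
% Lag-one GNAR model: node i's series depends on its own past value
% and on the average past value of its network neighbours N(i).
\[
  X_{i,t} = \alpha \, X_{i,t-1}
  + \beta \, \frac{1}{|\mathcal{N}(i)|} \sum_{j \in \mathcal{N}(i)} X_{j,t-1}
  + \varepsilon_{i,t},
  \qquad \varepsilon_{i,t} \overset{\text{iid}}{\sim} (0, \sigma^2).
\]
```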
10:30 - 11:00 Coffee Break
11:00 - 11:20 Marta Ponzano (Link Campus University), Wqsreg: a Stata command for Weighted Quantile Sum regression
Abstract: Weighted Quantile Sum (WQS) regression is a statistical method for quantifying the association between multiple, possibly correlated predictors and a health outcome, estimating both the joint effect of the predictors and their individual contributions to the total effect. WQS comprises two steps: i) aggregating the exposures via the estimation of a summary score – the WQS index – and ii) estimating the association between the WQS index and the outcome of interest via regression. Implementation: We present wqsreg, the first Stata command for WQS regression, implemented for continuous, binary and count outcomes. We present an application of the command to exposome data exploring the association between 38 exposures and a continuous outcome. General features: wqsreg provides a user-friendly command for WQS regression that integrates several flexible components of the framework. A common approach is to use a 40/60 training/validation ratio and to estimate weights using at least 100 bootstrap sample replications. A recent extension has further introduced the use of repeated holdouts to improve generalizability by eliminating the dependency of the results on the random seed that splits the data. Implementation of this statistical method requires transforming the original predictors into quantiles and introducing the following constraints: a) weights must lie between 0 and 1 and sum to 1; b) the user has to define a priori whether to estimate a positive or a negative index. wqsreg returns regression estimates as well as graphical displays of the individual weights. It requires Stata version 11 or higher (StataCorp, College Station, TX, USA). Availability: The wqsreg command is freely available on GitHub [https://github.com/PonzanoMarta/wqsreg] and is published under the GNU General Public License version 3.
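The two-step structure of WQS is simple to sketch. The Python fragment below is illustrative only (the wqsreg implementation itself is in Stata) and shows step (i): quantile-transforming the exposures and forming the constrained weighted index.

```python
import numpy as np

def wqs_index(X, weights, n_quantiles=4):
    """Weighted Quantile Sum index (illustrative sketch).

    X        : (n, p) matrix of exposures.
    weights  : length-p vector, each weight in [0, 1], summing to 1
               (in practice estimated by averaging over bootstrap fits).
    Returns the WQS index, used as a single regressor in step (ii).
    """
    # Step (i): map each exposure to quantile scores 0..n_quantiles-1
    Q = np.empty_like(X, dtype=float)
    for j in range(X.shape[1]):
        cuts = np.quantile(X[:, j], np.linspace(0, 1, n_quantiles + 1)[1:-1])
        Q[:, j] = np.searchsorted(cuts, X[:, j])

    # Constraint (a): weights lie in [0, 1] and sum to 1
    assert np.all(weights >= 0) and np.isclose(weights.sum(), 1.0)
    return Q @ weights  # weighted sum across exposures
```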
11:20 - 11:50 Sara Colleoni (Valos S.r.L.), Bayesian dynamic borrowing: improving treatment effect precision in clinical trials with limited sample sizes by borrowing information from historical clinical trials
Abstract:
11:50 - 12:10 Matteo Petrosino (Università degli studi di Milano-Bicocca), Modeling complex clinical longitudinal data: Joint models and Ordinal State Transition models
Abstract: Dynamic information is crucial for monitoring and predicting patients’ health status. The Time-Dependent Cox Model (TDCM) is widely used to analyze longitudinal biomarkers and time-to-event data. However, when dealing with endogenous variables, such as biomarkers, the terminal event truncates the observation of their full trajectory, leading to Missing Not At Random (MNAR) data. Joint Models (JMs) accommodate this type of MNAR data by explicitly modeling the missingness process due to the terminating event. They integrate a longitudinal sub-model for the biomarker trajectory, through a mixed-effects framework, with an event sub-model. However, challenges remain when biomarker measurements are subject to other MNAR sources. We used simulations to assess JMs’ robustness under different missing data mechanisms, comparing them to the TDCM. Simulation results indicated that the JM remains robust despite truncated or intermittently missing markers, but the presence of elevated measurement error increases uncertainty and can cause moderate-to-large bias. The TDCM remains effective with minimal measurement error, but joint modeling is preferable when measurement error is substantial, especially under MNAR missingness in longitudinal data beyond just informative censoring. Nonetheless, as shown, correct modeling of the trajectory is essential for conducting a robust analysis. Another complex context arises when biomedical researchers encounter situations where multiple endpoints are of clinical interest, such as mortality, hospitalization, disease progression, or treatment failure. A common approach is to combine them into a single time-to-event composite outcome. Despite their widespread use, traditional composite outcomes suffer from several important limitations: composite endpoints assign equal importance to events of varying clinical significance, consider only the first occurrence of any component event, ignoring the sequential nature of disease progression, and fail to capture how patients transition between different health states over time. To address these limitations we propose the use of ordinal transition models based on a sensible ordering of events or combinations of events. Ordinal transition models extend the cumulative probability model for ordinal data to longitudinal outcomes through transition modeling. Transition modeling accounts for within-subject correlation by conditioning the mean model for each outcome on the previously observed outcomes for that subject. By modeling transitions between ordered health states, these models maintain the clinical significance of different event types and their temporal sequence. These models capture the dynamic nature of disease progression and accommodate censoring and absorbing states (such as death). By depicting the different outcomes as stages of disease, State Occupancy Probabilities can be derived, offering an intuitive visualization of progression pathways.
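The joint-model structure referred to above is standard and can be summarised as follows; this is a generic textbook formulation, not the exact specification used in the simulations.

```latex
% Longitudinal sub-model (mixed effects) for the true biomarker value:
\[
  y_i(t) = m_i(t) + \epsilon_i(t), \qquad
  m_i(t) = x_i(t)^\top \beta + z_i(t)^\top b_i, \qquad
  b_i \sim N(0, D), \;\; \epsilon_i(t) \sim N(0, \sigma^2).
\]
% Event sub-model: the hazard depends on the current (error-free)
% biomarker value m_i(t) through the association parameter alpha:
\[
  h_i(t) = h_0(t) \exp\{ \gamma^\top w_i + \alpha \, m_i(t) \}.
\]
```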
12:00 - 12:20 Marzia Pedemonte (Data Science Institute - UHasselt, Belgium), Covariate Adjustment in Multivariate Outcomes
Abstract: In health care, the approval of new treatments relies on clinical trials. Traditional methods focus on single endpoints, but this is often insufficient to capture the full benefit in diseases, to jointly evaluate benefit-risk, and to take account of patient-reported outcomes (PROs). Statistical methods for analyzing multiple outcomes have been developed, but they are limited in several ways, not least in the number and types of outcomes that can be combined. Recently, the Generalized Pairwise Comparisons (GPC) method has been suggested; it addresses these concerns, has gained traction in clinical applications, and has even led to drug approvals. The disadvantage of GPC is that it lacks covariate adjustment. Probabilistic Index Models (PIMs) have been developed independently, but cannot deal with multivariate outcomes, nor with missing values. The objective of the project is to extend the PIM methodology to multivariate outcomes, to handle missing data, and to extend inference to rare diseases, so as to make these models useful for clinical practice. The advantages of PIMs will be investigated by comparing them to alternative methods for covariate adjustment (e.g. joint models and semiparametric ANCOVA).
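For context, a basic Probabilistic Index Model specifies the probability that one outcome exceeds another as a function of the two covariate patterns; a common univariate form (not the multivariate extension under development in this project) is:

```latex
% Probabilistic Index Model for outcomes (Y, Y') with covariates (X, X'):
\[
  P\bigl(Y \preccurlyeq Y' \mid X, X'\bigr)
  = \operatorname{expit}\bigl\{ (X' - X)^\top \beta \bigr\},
\]
% where P(Y \preccurlyeq Y') abbreviates P(Y < Y') + 0.5\, P(Y = Y')
% and expit denotes the inverse logit link; covariate adjustment enters
% directly through the contrast X' - X.
```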
12:20 - 12:40 Cristina Marelli (Université Paris-Saclay), Using permutation-based tests to evaluate adaptations to molecular treatment algorithms in randomized precision oncology trials
Abstract: In precision oncology, several randomized trials have evaluated molecular treatment algorithms - which assign targeted treatments to subgroups of patients matched on biomarkers (the experimental arm) - against standards of care. In such multi-marker, multi-treatment, 2-sample trials, the molecular treatment algorithm may need to be modified during the course of the study. Two types of adaptive modifications were studied in this work: (i) one of the experimental treatments in the algorithm is dropped due to an event external to the trial, or (ii) an interim futility analysis leads to dropping the least effective experimental treatment. In both scenarios, we explored whether participants were excluded from the trial or oriented to an alternative experimental treatment after the adaptation. We proposed using permutation-based tests (i.e., permutation and randomization tests) to control the Type I error rate under such adaptive changes to the algorithm and compared their performance with that of classical tests in both simulation studies and real French oncology clinical trials. We first conducted simulation studies to assess the Type I error rate and power of an unstratified t-test and an F-test, using both asymptotic and restricted permutation methods, with normally distributed outcomes. Different settings were established with varying biomarker prognostic effects, treatment effect magnitudes, and shifts in prognosis and/or treatment effects resulting from changes in the case-mix after adaptation. In scenario (ii), three values of the futility threshold were explored. Three specifications of the F-statistic were used to evaluate the global null hypothesis of no overall treatment effect, either through a permutation test or an asymptotic F-test. The proposed approach was then applied to time-to-event data from the SAFIR02 trials, in which the p-values of the asymptotic and the randomization log-rank tests were compared. The simulation results showed that permutation tests maintained the Type I error at the nominal level in both cases of external and data-driven modifications to the algorithm. In contrast, classical tests exhibited deflated or inflated Type I error rates in the presence of unaccounted variations in the case-mix and selection biases introduced by the interim decision. Proper specification of the underlying test statistic was fundamental to preserving statistical power. The results of the clinical trial applications further confirmed the feasibility of using such tests in both scenarios. In conclusion, permutation-based tests are useful in the case of both external and data-driven modifications to the algorithm, although careful attention must be paid to restricting the permutations to account for interim decisions.
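As a minimal illustration of the testing machinery, here is a generic two-sample permutation test in Python; it is far simpler than the restricted permutation schemes the talk evaluates and is included only as background.

```python
import numpy as np

def permutation_test(x, y, n_perm=10_000, rng=None):
    """Two-sample permutation test for a difference in means.

    Under the null of exchangeability between arms, relabelling the
    pooled observations preserves the distribution of the statistic.
    """
    rng = np.random.default_rng(rng)
    pooled = np.concatenate([x, y])
    observed = x.mean() - y.mean()

    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        stat = perm[:len(x)].mean() - perm[len(x):].mean()
        count += abs(stat) >= abs(observed)

    # Add-one correction keeps the p-value valid in finite samples
    return (count + 1) / (n_perm + 1)
```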
12:40 - 14:00 Lunch Break
14:00 - 14:20 Sara Muzzì (Università degli studi di Milano-Bicocca), From Waiting Rooms to Emergency Rooms: How GP Switching Influences ED Demand
Abstract: This study investigates the determinants of emergency department (ED) utilization, distinguishing between medically appropriate and potentially avoidable visits. It quantifies the relative importance of patient-specific characteristics—such as health status and care preferences—versus features of the primary care environment in shaping ED use. To identify these effects, the analysis exploits exogenous changes in residence and general practitioner (GP) assignment among “movers.” When patients relocate across healthcare districts and are assigned to a new GP, their exposure to different primary care contexts changes, providing a natural experiment to disentangle individual from area-level influences. Using administrative data on ED visits and GP assignments in Lombardy, movers are tracked over time to analyze changes in total ED visits, visits for ambulatory care sensitive conditions (ACSCs), and clinically appropriate visits. The results indicate that a substantial share of variation in potentially avoidable ED use is associated with the primary care environment, while clinically appropriate visits are largely explained by persistent patient characteristics. These findings highlight the role of primary care quality and organization in reducing avoidable ED utilization and inform policies aimed at improving the efficiency of emergency services.
14:20 - 14:40 Alessia De Crescenzo (Politecnico di Torino), Robust scheduling in stochastic flowshops
Abstract: Static scheduling has been the main focus in the literature for decades, providing a robust foundation of algorithms to optimize performance under deterministic parameters. However, in real-world industrial environments, these assumptions rarely hold: stochastic variations often render optimal and suboptimal deterministic schedules impossible to execute. In a flow shop configuration, the impact of such disruptions is amplified by the sequential nature of the process, where deviations propagate through the system to create a cascade of delays. To mitigate these risks, we propose a generalization of an existing methodology originally designed for single-machine scheduling using the Value-at-Risk (VaR) of maximum lateness as the primary objective. We extend this approach to a multi-stage flow shop environment with stochastic processing times and release dates. The core of this extension lies in the recursive estimation of the Cumulative Distribution Function (CDF) of completion times across sequential stages, effectively modeling the release date distribution for a job on a given machine as its completion time distribution on the preceding one. By leveraging this relationship, we extend the original bounding strategies to propagate uncertainty throughout the entire system. Furthermore, we explore the results obtained when considering the Conditional Value-at-Risk (CVaR) instead of VaR as the primary objective. By accounting for the entire tail of the distribution rather than a single percentile, CVaR provides a more robust and risk-averse framework that effectively mitigates the impact of extreme, low-probability delays. This shift ensures the resulting schedules are better equipped to handle the worst-case scenarios inherent in volatile industrial settings.
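The two risk measures contrasted in the abstract have the standard definitions below, stated here for a generic loss random variable L (the talk applies them to maximum lateness):

```latex
% Value-at-Risk at level alpha: the alpha-quantile of the loss L.
\[
  \mathrm{VaR}_\alpha(L) = \inf\{\ell \in \mathbb{R} : P(L \le \ell) \ge \alpha\}.
\]
% Conditional Value-at-Risk: the expected loss in the alpha-tail,
% hence sensitive to the whole tail rather than a single percentile.
\[
  \mathrm{CVaR}_\alpha(L) = \mathbb{E}\bigl[L \mid L \ge \mathrm{VaR}_\alpha(L)\bigr].
\]
```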
14:40 - 15:00 Carola Di Meo (IRCCS Ospedale Giannina Gaslini), Statistical Approaches for Genetic and Epigenetic Studies
Abstract: The analysis of genetic and epigenetic data relies on statistical methods specifically designed to address high dimensionality, where the number of variables largely exceeds the sample size (p ≫ N; p ≈ 10⁶–10⁷ in the human genome and N often in the hundreds), as well as complex correlation structures. The main statistical methods used in genetics and epigenetics are introduced, with particular attention to their applications in association studies and in the modeling of complex molecular architectures. Genetic data from a cohort of individuals affected by post-COVID Multisystem Inflammatory Syndrome in Children (MIS-C) are used to illustrate broadly adopted methods for genetic association analysis, including genome-wide association studies (GWAS). The analysis introduces the concepts of common and rare genetic variation and outlines the main strategies used to study their role in disease susceptibility and complex traits. Population-based and family-based designs are employed to assess disease susceptibility, illustrating how different approaches address confounding, inheritance patterns, and limited sample sizes. Aggregation methods for rare variants are emphasized for their ability to increase statistical power in sequencing-based studies. As a second example, additional complexity arises when genetic variation is considered alongside environmental exposures, giving rise to epigenetic regulation. In a cohort of Sardinian long-lived individuals, epigenome-wide association analyses, epigenetic clocks, and methylation-based surrogate markers are used to estimate biological age and infer key biological traits. These markers complement genetic analyses by integrating environmental imprinting and age-related molecular changes. Finally, multi-omics integration is presented as a framework for jointly analyzing genetic and epigenetic data to improve the characterization of complex phenotypes. Potential future developments are also discussed, highlighting how large language models and foundation models could be leveraged to model genetic data and support the implementation of new analytical tasks in omics research. These examples highlight the potential of combining data from statistical genetics, epigenetics, and multi-omics analyses to dissect the molecular basis of complex traits, improve predictive insights, further optimize clinical decision-making, and advance translational research in human health and aging.
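A single-variant association test, the building block of a GWAS, can be sketched as follows. This is illustrative only (real pipelines add covariates, population-structure correction, and specialised software) and assumes the statsmodels Python package is available.

```python
import numpy as np
import statsmodels.api as sm

def gwas_single_variant(genotypes, phenotype):
    """Per-variant logistic-regression association test (illustrative).

    genotypes : (n, p) matrix of 0/1/2 allele counts, one column per variant.
    phenotype : length-n binary case/control vector.
    Returns one p-value per variant; genome-wide significance is
    conventionally declared at p < 5e-8 to account for the ~10^6 tests.
    """
    pvals = []
    for j in range(genotypes.shape[1]):
        X = sm.add_constant(genotypes[:, j])     # intercept + genotype
        fit = sm.Logit(phenotype, X).fit(disp=0)
        pvals.append(fit.pvalues[1])             # p-value of genotype effect
    return np.array(pvals)
```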
15:20 - 15:40 Vincenzo Gioia (Università degli studi di Trieste), Bayesian Markov-Switching for Heat Waves
Abstract: Heat waves are climate extremes that deserve attention for their potential effects on human health and ecological systems. They are informally defined as prolonged periods of extremely high temperatures, so operative definitions jointly account for both the abnormality of the temperatures and their persistence over time. Moreover, when the geographical domain is explicitly considered, the highly local nature of the phenomenon makes heat wave modelling very challenging, as climate models often lack the spatial resolution required to reproduce local-scale dynamics.
In this context, the Markov-switching model provides a flexible and interpretable framework for jointly modelling local summer temperatures and classifying heat wave periods probabilistically. The proposed approach has been developed progressively, starting from a two-regime specification and extending to multi-state models that allow a finer characterisation of temperature dynamics. In particular, daily maximum summer temperatures are modelled through additive Markov-switching models, where the observable process incorporates smooth seasonal components and covariates for capturing long-term trends. The latent regimes represent distinct thermal conditions, ranging from moderate to extreme, and the underlying process is modelled using a first-order Markov chain with suitable assumptions on the transition dynamics. For the hottest regime, the use of generalized extreme value distributions is explored to better capture tail behaviour. Statistical inference is carried out within a Bayesian framework, enabling the derivation of predictive distributions for heat-wave measures (frequency and duration) via simulation from the posterior distribution. Under the two-state model, regime-membership probabilities are calibrated using a popular quantile-based approach, and threshold probabilities are used for classification purposes. Instead, when adopting a multi-state model, the classification is directly based on the probability of being in the hottest regime. The proposed model can then be used to describe historical patterns of heat wave episodes in a given area, to relate them to a large-scale climate index, and to analyse climate-based scenarios using temperature simulations.
The proposed approach is illustrated using the daily summer temperature data of the last three decades from multiple stations located in the northeastern Italian region Friuli Venezia Giulia. Overall, using minimal information, the Markov-switching model can capture heat wave dynamics over time, showing substantial interannual variability both in terms of frequency and duration, as well as marked differences across sites, highlighting both similarities and local heterogeneity driven by climatic and geographical features.
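For reference, a two-state version of the additive Markov-switching structure described above can be written schematically as follows (a simplified form, omitting the covariates and the GEV tail specification for the hottest regime):

```latex
% Latent regime S_t in {1 = ordinary, 2 = hot} follows a Markov chain:
\[
  P(S_t = j \mid S_{t-1} = i) = \Gamma_{ij}.
\]
% Observed daily maximum temperature, given the regime, combines a
% smooth seasonal term s_k(t) with regime-specific noise:
\[
  Y_t \mid S_t = k \;\sim\; N\bigl(\mu_k + s_k(t),\, \sigma_k^2\bigr),
  \qquad k \in \{1, 2\}.
\]
```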
15:45 - 17:00 Other voices from the graduates
17:00 - 19:30 Party