A workshop is held for members and alumni of the 다중척도연구실 (Multiscale Lab) to foster personal and academic exchange. Through the workshop, participants are expected to understand and discuss one another's research topics. Each speaker presents his or her research topic for 20–40 minutes, followed by a question-and-answer session.
Friday, February 14, 2025, 1:00 PM – Saturday, February 15, 2025, 12:00 PM (noon)
Tiered Lecture Room 609 (sessions) and Room 603 (evening program), Convention Center, Seoul National University Siheung Campus
Dinner on the 14th: 봄이보리밥 배곧점 (17:40)
Breakfast on the 15th: S-LOUNGE (에스라운지), Seoul National University Siheung Campus (7:30–8:30)
Lunch on the 15th: 투파인드피터 배곧점 (12:30)
TLRR-TF: A Fast Tensor Low-Rank Representation via Tri-Factorization (권영욱, Department of Statistics, Seoul National University) [13:30~14:00]
Low-rank representation (LRR) is effective in segmenting data points into their intrinsic linear subspaces and representing the original signal using the lowest-rank criterion. Recently, LRR methods for tensor data have gained increasing attention. Still, most tensor LRR algorithms require computing the singular value decomposition (SVD), which is computationally expensive when dealing with large tensor data. We propose a new robust tensor LRR model using a fast tri-factorization approach that approximates the representation tensor as the product of three smaller tensor components. The main advantage of the proposed method is that it mitigates the computational cost of applying nuclear norm minimization directly to the large original tensor. Furthermore, our method simultaneously handles sparse noise and dense Gaussian noise, using different norms to recover clean tensors effectively. Through extensive experiments on both synthetic data and real-world image data, we demonstrate the promising performance of the proposed method in both clustering/denoising accuracy and computing time.
Keywords: Low-rank representation, Tensor decomposition, Subspace clustering, Tensor recovery
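The computational advantage of tri-factorization can be illustrated in the simplest matrix setting: instead of taking an SVD of a large matrix, one alternates thin QR factorizations of small factor blocks. The sketch below is illustrative only — the TLRR-TF model operates on tensors and includes nuclear-norm and noise terms that are omitted here.

```python
import numpy as np

def tri_factorize(Z, r, n_iter=50, seed=0):
    """Alternating sketch of Z ~ U @ S @ V.T with U (n x r), S (r x r),
    V (m x r), r << min(n, m).  No SVD of the full matrix is computed,
    only QR factorizations of thin matrices, which is the computational
    point of tri-factorization.  Illustrative matrix analogue only."""
    rng = np.random.default_rng(seed)
    n, m = Z.shape
    V, _ = np.linalg.qr(rng.standard_normal((m, r)))
    U, _ = np.linalg.qr(Z @ V)
    for _ in range(n_iter):
        V, _ = np.linalg.qr(Z.T @ U)   # refine row-space basis
        U, _ = np.linalg.qr(Z @ V)     # refine column-space basis
    S = U.T @ Z @ V                    # small core linking the two bases
    return U, S, V
```

For an exactly low-rank matrix the iteration recovers the matrix to machine precision, while only ever factorizing thin n × r blocks.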
Composite Quantile Factor Modeling for High-Dimensional Data (박세은, Department of Statistics, Seoul National University) [14:00~14:30]
This study proposes a composite quantile factor model (CQFM), a novel approach that considers multiple quantile levels to identify common factors that explain the structures of high-dimensional data. By leveraging the strengths of the quantile factor model (QFM), which captures distributional characteristics by focusing on a single quantile, the proposed CQFM enhances accuracy in identifying factors that are relevant across different quantile levels and uncovers detailed features in data that are not effectively captured by only a single quantile. To estimate the quantile-dependent loadings and common factors of CQFM, we develop a practical algorithm that minimizes an objective function based on a weighted average of check functions across quantiles. The theoretical properties of these estimators are investigated, including their consistency and convergence rates. Furthermore, asymptotic distributions are derived using approximated estimators obtained from a kernel-smoothed objective function. In addition, we propose two consistent estimators for determining the number of factors. Simulation studies demonstrate that CQFM consistently outperforms QFMs fitted at individual quantiles across different data distributions. In particular, CQFM excels in identifying common factors that remain hidden when focusing on a specific quantile. Real data analyses on capturing factors in volatility and forecasting further validate the effectiveness of the proposed model.
Keywords: Multiple quantiles, Quantile factor model, Data structure, Hidden factors, Location-scale-shift model
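The objective described above, a weighted average of check functions across quantile levels, can be sketched as follows. This is a minimal illustration with a single fitted value per observation; the factor and loading structure of CQFM is not reproduced.

```python
import numpy as np

def check_loss(u, tau):
    """Koenker-Bassett check function: rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (u < 0))

def composite_objective(y, yhat, taus, weights=None):
    """Weighted average of mean check losses across quantile levels --
    the general form of the CQFM objective, shown here without the
    quantile-dependent loadings and factors of the actual model."""
    taus = np.asarray(taus, float)
    if weights is None:
        weights = np.full(len(taus), 1.0 / len(taus))   # equal weights assumed
    w = np.asarray(weights, float)
    return sum(wi * check_loss(y - yhat, t).mean() for wi, t in zip(w, taus))
```

The check loss penalizes over- and under-prediction asymmetrically at each level, so minimizing the composite objective pools information across quantiles rather than fitting each one in isolation.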
Adaptive Boosting on Linear Network (임승연, Department of Applied Statistics, Hanyang University) [14:30~14:50]
Classification is a supervised machine learning task that predicts a categorical response variable from several explanatory variables. If observations are sampled from a spatial point process, the x and y coordinates can also serve as explanatory variables. If the observations are instead sampled from a known linear network rather than the whole plane, the distance between two points is defined differently, and a classifier suited to such linearly structured data is required. In this study, we address the classification problem on a tree-shaped linear network. We first select split points on the edges of the network and construct a decision tree through recursive splitting. We then propose an adaptive boosting algorithm that uses this decision tree as a weak classifier. Finally, we present simulated examples and a real data analysis, comparing the proposed method with a decision tree based on Cartesian coordinates. The proposed method achieves better accuracy than the comparison method when observations are clustered on a linear network.
Keywords: Classification, Decision tree, Adaptive boosting, Linear network
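The boosting loop described above follows the standard AdaBoost scheme; only the weak learner is network-aware. The sketch below uses plain one-dimensional threshold stumps in place of the talk's network-edge splits, to show the reweighting and weighted-vote structure.

```python
import numpy as np

def adaboost_stumps(x, y, n_rounds=20):
    """Minimal AdaBoost with 1-D threshold stumps (y in {-1, +1}).
    A coordinate threshold stands in for the talk's split points on
    the edges of a linear network."""
    w = np.full(len(y), 1.0 / len(y))           # observation weights
    ensemble = []                               # (threshold, polarity, alpha)
    for _ in range(n_rounds):
        best = None
        for t in np.unique(x):
            for pol in (1, -1):
                pred = pol * np.where(x > t, 1, -1)
                err = w[pred != y].sum()        # weighted error
                if best is None or err < best[0]:
                    best = (err, t, pol)
        err, t, pol = best
        err = min(max(err, 1e-10), 1 - 1e-10)   # avoid log(0)
        alpha = 0.5 * np.log((1 - err) / err)   # stump weight
        pred = pol * np.where(x > t, 1, -1)
        w *= np.exp(-alpha * y * pred)          # upweight misclassified points
        w /= w.sum()
        ensemble.append((t, pol, alpha))
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * p * np.where(x > t, 1, -1) for t, p, a in ensemble)
    return np.sign(score)
```

Replacing the threshold search with a search over candidate split points on network edges, using network distance, gives the classifier proposed in the talk.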
Outlier Detection of Functional Data on the Semiconductor Manufacturing Process (김민주, Department of Applied Statistics, Hanyang University) [14:50~15:10]
Semiconductor manufacturing involves more than 250 process steps, and if any step involves equipment operating outside the allowable tolerance, the result can be performance degradation, yield reduction, or even product disposal. In this study, six critical sensors for monitoring the equipment in question were selected by domain experts. This presentation applies anomaly detection models to the functional data measured by these six sensors in order to detect anomalous wafers. It is anticipated that when new wafers undergo additional etching processes, abnormal data can be effectively detected using appropriate statistical models.
Keywords: Functional data, Anomaly detection, Outlier detection
Lightning Strikes Modeling in Korea using Nonparametric Space-Time Hawkes Processes (박선철, Department of Mathematics, Hanyang University) [15:30~16:10]
In this presentation, we analyze the spatio-temporal pattern of lightning strikes in Korea in 2022. Descriptive statistics show that most lightning strikes are concentrated within very short time periods. To analyze this phenomenon, we define accumulated lightning strike events and detect them automatically. For the spatio-temporal modeling of individual accumulated lightning strike events, we use nonparametric Hawkes process modeling, which allows flexible modeling of the self-exciting process. In addition, we examine the corresponding climate conditions to identify which components are closely related to lightning strike events.
Keywords: Spatio-temporal data, Hawkes process, Lightning strikes, Climate data
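The self-exciting structure of a Hawkes process can be seen in its conditional intensity: every past event temporarily raises the rate of future events. The talk uses a nonparametric space-time kernel; the temporal, parametric exponential-kernel version below is only meant to show that structure.

```python
import numpy as np

def hawkes_intensity(t, events, mu, alpha, beta):
    """Conditional intensity of a 1-D Hawkes process with exponential
    kernel: lambda(t) = mu + sum_i alpha * exp(-beta * (t - t_i)) over
    past events t_i < t.  mu is the background rate; each past event
    adds an exponentially decaying excitation."""
    events = np.asarray(events, float)
    past = events[events < t]
    return mu + alpha * np.exp(-beta * (t - past)).sum()
```

With no past events the intensity equals the background rate mu; bursts of events push the intensity up, which is how the model captures strikes clustering in short time periods.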
Bathtub Curve Transition Point Detection Using the DAEM Algorithm and Barrier Method (최지수, Department of Statistics, Seoul National University) [16:10~16:40]
The bathtub curve is a reliability model that represents the early failure, stable, and aging phases of a product's lifecycle. It is divided into three stages: the early period, characterized by a high failure rate; the constant period, where the hazard rate remains stable; and the wear-out period, during which the failure rate increases. Accurately analyzing the change points of these stages is crucial for quality and safety management. This study proposes a method to analyze change points using an EM (Expectation-Maximization) algorithm with a three-component mixture Weibull distribution model. In the M-step, the “barrier method” is employed to estimate parameters, while in the E-step, the “Deterministic Annealing EM Algorithm” is utilized to estimate latent variables. Using the estimated parameters, the posterior probability distribution is applied to determine the most likely change points in each failure stage. This approach effectively detects the change points in both the burn-in process and the wear-out phase.
Keywords: Bathtub curve, Deterministic annealing, EM, Barrier method
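The Deterministic Annealing EM idea referenced above replaces the ordinary E-step posterior with a tempered one, flattened at high temperature so the algorithm can escape poor local maxima. A minimal sketch of that tempered E-step is shown below; the Weibull mixture components and the barrier-method M-step from the talk are not reproduced.

```python
import numpy as np

def daem_responsibilities(log_dens, log_pi, beta):
    """Tempered E-step of the Deterministic Annealing EM algorithm.
    log_dens: (n, K) component log-densities; log_pi: (K,) log mixing
    weights; beta in (0, 1] is the inverse temperature.  beta = 1
    recovers the ordinary EM posterior; small beta flattens the
    responsibilities toward uniform."""
    a = beta * (log_dens + log_pi)            # tempered log joint
    a -= a.max(axis=1, keepdims=True)         # stabilize the exponentials
    r = np.exp(a)
    return r / r.sum(axis=1, keepdims=True)
```

DAEM starts with small beta and anneals it toward 1, so early iterations assign observations softly across the three Weibull components before the assignments sharpen.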
Clustering of Spatio-Temporal Trajectories of Mountain Hikers (백승연, 한만휘, Department of Mathematics, Hanyang University) [16:40~17:00]
In this paper, we suggest a clustering method to analyze the characteristics of mountain hiking patterns. The GPS trajectory data were obtained from an exercise app. Owing to the inaccuracy of GPS measurements, the data contain outliers, implausible velocities, and missing values. Therefore, we first suggest an automatic data-cleaning method for the analysis. To reflect the complex spatio-temporal patterns of GPS trajectories, the proposed clustering method combines geographical, temporal, and velocity similarities through a weighted average. We suggest a data-adaptive weight selection with appropriate constraints. A real data analysis with clustering results will be provided.
Keywords: Spatio-temporal trajectory, Similarity, Clustering
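The weighted combination of the three dissimilarity components can be sketched directly. The rescaling of each matrix to [0, 1] is a convention assumed here for comparability; the talk's data-adaptive selection of the weights is not reproduced.

```python
import numpy as np

def combined_dissimilarity(d_geo, d_time, d_vel, w):
    """Weighted average of geographical, temporal, and velocity
    dissimilarity matrices, the form of combination described in the
    talk.  Weights must be nonnegative and sum to one (the constraint
    under which they are selected)."""
    mats = [np.asarray(d, float) for d in (d_geo, d_time, d_vel)]
    mats = [m / m.max() if m.max() > 0 else m for m in mats]  # rescale to [0, 1]
    w = np.asarray(w, float)
    if w.min() < 0 or abs(w.sum() - 1.0) > 1e-12:
        raise ValueError("weights must be nonnegative and sum to one")
    return w[0] * mats[0] + w[1] * mats[1] + w[2] * mats[2]
```

The resulting matrix can then be fed to any standard distance-based clustering routine.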
An Introduction to Water Quality Data on Geum-River Network (박선철, Department of Mathematics, Hanyang University) [19:00~19:40]
In this presentation, we briefly introduce the Geum River Network Data, obtained from the Water Environment Information System operated by the National Institute of Environmental Research. Water quality data are an example of a spatio-temporal and multivariate dataset, since various measurements, such as Total Nitrogen (TN) and Total Organic Carbon (TOC), are obtained from water quality observation stations. The stations are located along the Geum River, which can be described as a linear network. Therefore, a better understanding of this dataset also raises research questions in multivariate and spatial statistics. Recent updates to the dataset are also included in this presentation.
Keywords: River network
Extreme Value Analysis 2025 Data Challenge (강승우, Department of Statistics, Seoul National University) [20:00~20:30]
The 14th International Conference on Extreme Value Analysis, hosted by the University of North Carolina at Chapel Hill, organizes a data competition on estimating extreme precipitation from a large ensemble of climate model runs. The competition requires predictions of three target quantities. This talk briefly introduces the problems, the datasets, and some papers to be considered for the data competition.
Keywords: Extreme value analysis
Expectile-Based Probabilistic Forecasting for Spatio-Temporal River Networks (김준표, Department of Mathematics and Statistics, Sejong University) [9:00~9:40]
In this paper, we present a novel approach to probabilistic forecasting based on expectile smoothing of river network data. The Miho River dataset, which is the focus of this study, contains spatio-temporal observations across a stream network. Since the inherent structure of the stream network must be considered, and the time points are irregular and vary across observation sites, developing a forecasting method poses significant challenges. To address this, we extend the flexible smoothing method using B-spline bases proposed by O'Donnell et al. (2014) by incorporating expectile regression, obtaining information beyond the mean response for river network data analysis. Furthermore, we propose a probabilistic forecasting method that predicts the expectile process using the functional data forecasting method of Aue et al. (2015). We demonstrate the proposed method on the Miho River data and evaluate its performance.
Keywords: River network, Spatio-temporal forecasting, Expectile, Probabilistic forecasting
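An expectile is the minimizer of an asymmetrically weighted squared loss, which is what lets the method extract information beyond the mean response. The tiny sample-expectile sketch below illustrates that definition via iteratively reweighted means; the B-spline smoothing and stream-network machinery of the talk are not reproduced.

```python
import numpy as np

def expectile_loss(u, tau):
    """Asymmetric squared loss; its minimizer is the tau-expectile
    (tau = 0.5 recovers the ordinary mean)."""
    return np.abs(tau - (u < 0)) * u ** 2

def expectile(y, tau, n_iter=100):
    """Sample tau-expectile by iteratively reweighted means: points
    above the current estimate get weight tau, points below get
    weight 1 - tau, and the weighted mean is recomputed to a fixed
    point."""
    y = np.asarray(y, float)
    m = y.mean()
    for _ in range(n_iter):
        w = np.abs(tau - (y < m))
        m = np.sum(w * y) / np.sum(w)
    return m
```

Evaluating several tau levels on the same data yields a monotone family of expectiles, which is the process forecast in the proposed method.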
Nitrogen Levels in the Miho River Network: A Quantile Regression Approach (이연제, Department of Mathematics and Statistics, Sejong University) [9:40~10:00]
This study investigates nitrogen levels in the Miho River network using quantile regression, based on spatial observations. The structure of the stream network induces challenges in the estimation process. To address this, O'Donnell et al. (2014) proposed a penalty approach that measures smoothness across each flow path at confluence points, with weights determined by relative flow volumes. Building on this framework, we explore whether O'Donnell's penalty can be altered to an L1 form, enabling its application in a quantile regression approach on the Miho River network.
Keywords: Quantile regression, L1 penalty
Graph Frequency-Domain Factor Modeling* (김규순, Department of Statistics, Seoul National University) [10:00~10:30]
We propose a new factor model in the graph frequency domain for multivariate data lying on the vertices of a graph, called a multivariate graph signal. Utilizing graph filters, our model extends the frequency-domain approach of the dynamic factor model from time series to graphs, enabling a multiscale interpretation of factors across graph frequencies. This approach reduces the dimensionality of graph signals and improves the understanding of their structure. It also allows the use of the extracted factors as the basis for subsequent analyses, such as clustering. We describe the estimation of factors and their loadings and investigate the consistency of the factor estimator. In addition, we propose two consistent estimators for determining the number of factors. The finite sample performance of the proposed method is demonstrated through simulation studies under different graph structures. Furthermore, we show the effectiveness of the proposed method by applying it to the water quality parameter data from the Miho-Cheon catchment on the Geum River network and passenger data from the Seoul Metropolitan Subway.
Keywords: Dimension reduction, Factor analysis, Frequency domain, Graph signal processing, Multivariate graph signal
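The graph frequency domain referenced above is defined by the eigendecomposition of the graph Laplacian: eigenvalues play the role of frequencies, and projecting a signal onto the eigenvectors gives its spectral coefficients. This minimal graph Fourier transform is only the entry point; the graph filters and factor structure of the talk are not reproduced.

```python
import numpy as np

def graph_fourier(signals, adjacency):
    """Graph Fourier transform of multivariate graph signals.
    signals: (n_nodes, n_obs); adjacency: symmetric (n_nodes, n_nodes).
    Returns the graph frequencies (Laplacian eigenvalues), the Fourier
    basis (eigenvectors), and the spectral coefficients."""
    L = np.diag(adjacency.sum(axis=1)) - adjacency   # combinatorial Laplacian
    lam, U = np.linalg.eigh(L)                       # frequencies, basis
    return lam, U, U.T @ signals                     # spectral coefficients
```

A constant signal concentrates entirely at graph frequency zero, mirroring how a constant time series concentrates at temporal frequency zero.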
A Robust Method for Reconstructing the Spatial Conformation of the 3D Genome from Hi-C Data (박민수, Department of Information and Statistics, Chungnam National University) [10:40~11:20]
The three-dimensional (3D) configuration of genomic architecture within the cell nucleus plays a pivotal role in biological processes such as transcriptional regulation and is implicated in pathogenesis through structural anomalies, including aberrant chromatin looping and genomic deletions. Before the advent of Chromosome Conformation Capture (3C) technologies, elucidating the 3D genome architecture was hampered by the inability to resolve complex spatial arrangements at a molecular level with high resolution. The introduction of high-throughput 3C methods, notably Hi-C, along with advancements in genomic sequencing, has empowered researchers to scrutinize the high-resolution 3D organization of the genome with unprecedented detail. Hi-C methodology produces a contact count map, a symmetric matrix that quantifies the frequency of interactions across genomic loci throughout the genome. This count data facilitates the use of distance metrics to infer genome-wide architecture. However, translating this proximity information into an accurate 3D structure remains a significant challenge in computational biology, particularly in terms of chromatin structure analysis and biological representation. To address these challenges, numerous methodologies utilizing Hi-C data have been developed; however, none of these approaches have adequately accounted for noise sensitivity, often leading to results that are prone to noise interference. Therefore, we propose a novel robust method for inferring the 3D structure of the genome, specifically engineered to be resilient against noise. Our model incorporates a combination of Thin Plate Spline (TPS) and Non-Metric Multi-Dimensional Scaling (nMDS), strategically utilized to mitigate the effects of noise and to produce a smoothly defined 3D genomic structure. 
The performance of our methodology was evaluated using both simulated data, comprising structures of 40 different sizes and shapes subjected to five levels of noise, and real Hi-C data derived from the IMR90 cell line. In comparative assessments, our method consistently outperformed existing models in terms of accuracy under all noise conditions. Additionally, the predictive validity of our approach was substantiated by comparing the results with 111 replicate conformations derived from fluorescence in situ hybridization (FISH) images, thereby providing substantial empirical support for our model.
Keywords: Chromatin 3D structure, Genome reconstruction, Hi-C assay, Robust optimization
A Study on Classification Models for Colon Cancer from Chromatin Contact using Sparse Graph Neural Network (고민규, Department of Information and Statistics, Chungnam National University) [11:20~11:40]
Colorectal cancer (CRC) is one of the most common types of cancer, causing significant physical and psychological burdens on patients. Early detection of CRC is critical for improving patient survival rates and quality of life. Recent advancements in deep learning have enabled the analysis of high-dimensional genomic data, offering more accurate predictions for cancer diagnosis. In this study, we utilize High-throughput chromosome conformation capture (Hi-C) data, which represents the three-dimensional folding structure of the genome, to develop a framework for predicting CRC status. Our framework focuses on generating a sparse weighted graph by identifying genomic regions with significantly high contact frequencies using optimal bandwidth selection. This sparse graph is then applied to a graph convolutional network (GCN) model, termed GCN_weight, for CRC prediction. We demonstrate that GCN_weight outperforms a traditional GCN model (GCN_binary), which utilizes a binary adjacency graph, and a graph attention network (GAT) model in terms of both prediction performance and computational cost. Specifically, GCN_weight achieved an accuracy of 92.2 ± 7.2% and an F1-score of 94.4 ± 5.3%, compared to GCN_binary’s accuracy of 69.4 ± 1.9% and F1-score of 81.9 ± 1.4% and GAT’s accuracy of 90.2 ± 7.1% and F1-score of 93.0 ± 5.2%. Furthermore, under identical experimental conditions, GCN_weight required only 28 minutes of training time, significantly less than the 2 hours and 27 minutes required by GCN_binary, while showing more stable loss convergence. These findings highlight the effectiveness of the proposed method for generating sparse weighted graphs and its ability to enhance both the performance and efficiency of GCN models for CRC prediction. Our proposed framework underscores the value of leveraging Hi-C data for cancer diagnostics, laying a foundation for broader applications in disease prediction and medical data analysis.
Keywords: High-throughput chromosome conformation capture data, Optimal bandwidth selection, Sparse weighted graph, Graph convolutional network
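The sparse weighted graph at the heart of the framework amounts to thresholding a symmetric Hi-C contact map so that only significantly high contact frequencies remain as edge weights. The sketch below shows that step with a caller-supplied cut-off; the talk's optimal bandwidth selection, which determines the cut-off, is not reproduced.

```python
import numpy as np

def sparse_weighted_graph(contact, thresh):
    """Build a sparse weighted adjacency from a symmetric Hi-C contact
    map by keeping only contact frequencies at or above `thresh`
    (chosen here by the caller, not by bandwidth selection).
    Self-contacts on the diagonal are removed."""
    A = np.where(contact >= thresh, contact, 0.0).astype(float)
    A = (A + A.T) / 2.0          # enforce symmetry
    np.fill_diagonal(A, 0.0)     # drop self-contacts
    return A
```

The resulting matrix is what a weighted GCN consumes in place of a dense or binary adjacency, which is where the reported efficiency gain comes from.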
A Study on Robust Outlier Detection with Skewness-Adjusted Fences: From Influence Function to Practical Applications (정윤채, Department of Information and Statistics, Chungnam National University) [11:40~12:00]
Outlier detection is a critical component of data analysis, as it ensures data integrity and enhances the reliability of statistical inferences. However, traditional approaches, such as Tukey's boxplot, often struggle to identify outliers in skewed distributions, leading to inaccuracies and compromised results. While alternatives like the adjusted boxplot (Hubert and Vandervieren, 2008) address some of these challenges, they suffer from limitations, including computational inefficiency and reduced robustness under extreme skewness. In this study, we propose an outlier detection framework that integrates skewness-adjusted fences into an enhanced boxplot design. By leveraging robust skewness measures, our method directly addresses the limitations posed by skewed distributions, offering a principled and computationally efficient approach to outlier detection. Through extensive simulations and real-world applications, we demonstrate that the proposed method consistently outperforms existing techniques in both accuracy and efficiency. This development provides a robust and practical solution for outlier detection in diverse data environments, with significant implications for statistical analysis and data-driven decision-making.
Keywords: Outlier Detection, Robust Skewness, Influence Function, Skewed Data
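The idea of skewness-adjusted fences can be made concrete with the adjusted boxplot of Hubert and Vandervieren (2008), the baseline the talk builds on, not the proposed method itself. It widens Tukey's fences on the long-tail side according to the medcouple, a robust skewness measure; the naive O(n^2) medcouple below assumes no ties with the median, which the full definition handles specially.

```python
import numpy as np

def medcouple(x):
    """Naive O(n^2) medcouple: the median of the kernel
    h(xi, xj) = ((xj - med) - (med - xi)) / (xj - xi) over pairs
    xi <= med <= xj with xi < xj.  Assumes no ties with the median."""
    x = np.sort(np.asarray(x, float))
    med = np.median(x)
    lo, hi = x[x <= med], x[x >= med]
    h = [((xj - med) - (med - xi)) / (xj - xi)
         for xi in lo for xj in hi if xj > xi]
    return float(np.median(h))

def adjusted_fences(x):
    """Skewness-adjusted fences of Hubert & Vandervieren (2008): with
    medcouple MC >= 0 the fences are
    [Q1 - 1.5 * exp(-4 MC) * IQR, Q3 + 1.5 * exp(3 MC) * IQR],
    and the exponents swap sign roles for MC < 0.  At MC = 0 this
    reduces to Tukey's ordinary fences."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    mc = medcouple(x)
    if mc >= 0:
        return q1 - 1.5 * np.exp(-4 * mc) * iqr, q3 + 1.5 * np.exp(3 * mc) * iqr
    return q1 - 1.5 * np.exp(-3 * mc) * iqr, q3 + 1.5 * np.exp(4 * mc) * iqr
```

For right-skewed data the medcouple is positive, so the upper fence moves out and fewer legitimate tail observations are flagged, which is exactly the failure mode of the plain boxplot that the talk's method also targets.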
* Presented as a pre-recorded video
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. 2021R1A2C1091357 & 2022M3J6A1084843).