Chuan Xiao - Research

Research

Current Projects

Smart Agent-Based Modeling

Smart agents are intelligent, adaptive, and computational entities. While humans are the canonical smart agents, the advent of foundation models - imbued with remarkable language, vision, and reasoning abilities that emulate human behavior - enables us to expand the concept of smart agents to agent-based modeling (ABM). This evolution leads to the introduction of smart agent-based modeling (SABM). Unlike traditional ABM, SABM incorporates foundation models as agents and formulates models using natural language. We employ SABM to investigate natural processes across various fields such as economics and behavioral science. We believe that SABM offers a more nuanced and realistic approach to enhancing our comprehension of natural systems.

Selected Publications

Large Language Models as Urban Residents: An LLM Agent Framework for Personal Mobility Generation. [paper] [source code]
Shall We Talk: Exploring Spontaneous Collaborations of Competing LLM Agents. arXiv preprint. [paper] [source code]
Smart Agent-Based Modeling: On the Use of Large Language Models in Computer Simulations. arXiv preprint. [paper] [slides] [source code]
"Guinea Pig Trials" Utilizing GPT: A Novel Smart Agent-Based Modeling Approach for Studying Firm Competition and Collusion. CIST 2023 (non-archived). [paper] [slides] [source code]

Data Lake Management

Open data movements by governments and the spread of data lake solutions in industry make it increasingly feasible to obtain large numbers of tables from data lakes and use them to enrich local data. We study several fundamental problems of data lake management, including data cleaning, data integration, and data augmentation. For example, our solutions can identify and suggest useful attributes to data science engineers.

Selected Publications

Jellyfish: A Large Language Model for Data Preprocessing. arXiv preprint. [paper] [slides] [7B model] [13B model]
BClean: A Bayesian Data Cleaning System. ICDE 2024. [paper] [source code]
DeepJoin: Joinable Table Discovery with Pre-trained Language Models. PVLDB 2023. [paper]
Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach. ICDE 2021. [paper]

Data Science Methods for Digital Humanities

Data science methods, leveraging computational techniques and analytical tools, are increasingly vital in digital humanities, spanning disciplines such as sociology, psychology, history, and education. By employing statistical analysis, machine learning, and data visualization, researchers can uncover patterns and insights in large datasets, ranging from historical documents to social media trends. This interdisciplinary approach enables a deeper analysis of cultural, social, and historical phenomena, transforming traditional humanities research into a more dynamic and data-driven field.

Selected Publications

Utilization of Information Entropy in Training and Evaluation of Students' Abstraction Performance and Algorithm Efficiency in Programming. ToE. [paper]

Similarity Query Processing

A similarity query finds similar objects in one or more datasets. It is an important operation in many applications, such as entity matching, plagiarism detection, and image retrieval. We target a variety of data types (sets, strings, high-dimensional data, etc.) and develop efficient query processing methods.
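As a minimal illustration of the operation, and not the method of any publication listed here, the sketch below runs a thresholded similarity search over sets by brute force. The choice of Jaccard similarity, the function names, and the sample records are all assumptions made for the example.

```python
def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_search(query, dataset, threshold):
    """Return (index, score) for every object whose similarity to query is >= threshold."""
    results = []
    for i, obj in enumerate(dataset):
        score = jaccard(query, obj)
        if score >= threshold:
            results.append((i, score))
    return results


# Hypothetical records, each represented as a set of tokens.
records = [
    {"data", "lake", "management"},
    {"similarity", "query", "processing"},
    {"set", "similarity", "query"},
]
matches = similarity_search({"similarity", "query"}, records, threshold=0.5)
```

This scan compares the query against every object, so its cost grows linearly with the dataset size; avoiding that full comparison is what efficient query processing methods aim for.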

Selected Publications

- Generic Algorithms -

Monotonic Cardinality Estimation of Similarity Selection: A Deep Learning Approach. SIGMOD 2020. [paper] [slides] [source code]
Pigeonring: A Principle for Faster Thresholded Similarity Search. PVLDB 2018. [paper] [slides]

- High-Dimensional Data -

Probabilistic Routing for Graph-Based Approximate Nearest Neighbor Search. ICML 2024. [paper] [source code]
MQH: Locality Sensitive Hashing on Multi-level Quantization Errors for Point-to-Hyperplane Distances. PVLDB 2022. [paper] [slides] [source code]
HVS: Hierarchical Graph Structure Based on Voronoi Diagrams for Solving Approximate Nearest Neighbor Search. PVLDB 2021. [paper] [slides] [source code]
High-Dimensional Similarity Query Processing for Data Science. KDD 2021 (tutorial). [slides]
Consistent and Flexible Selectivity Estimation for High-Dimensional Data. SIGMOD 2021. [paper] [slides] [source code]
Similarity Query Processing for High-Dimensional Data. PVLDB 2020 (tutorial). [slides]
GPH: Similarity Search in Hamming Space. ICDE 2018/TKDE. [conference paper] [journal extension] [slides] [source code]

- Sets -

Dynamic Set kNN Self-Join. ICDE 2019. [paper]
Set Similarity Query Processing. WISE 2017 (tutorial). [slides]
Local Similarity Search for Unstructured Text. SIGMOD 2016. [paper] [slides]
Top-k Set Similarity Joins. ICDE 2009. [paper] [slides]
Efficient Similarity Joins for Near Duplicate Detection. WWW 2008/TODS. [conference paper] [journal extension] [slides] [source code]

- Strings -

VChunkJoin: An Efficient Algorithm for Edit Similarity Joins. TKDE. [paper] [source code]
Asymmetric Signature Schemes for Efficient Exact Edit Similarity Query Processing. SIGMOD 2011/TODS. [conference paper] [journal extension] [slides] [source code]
Approximate Entity Extraction with Edit Constraints. SIGMOD 2009. [paper] [slides]
Ed-Join: An Efficient Algorithm for Similarity Join with Edit Distance Constraints. PVLDB 2008. [paper] [slides]

Past Projects

Trajectory Analysis in Road Networks

Querying and retrieving moving object trajectories in road networks is becoming important as they are key data in modern data-driven automotive applications, such as autonomous cars, cloud car navigation systems, and intelligent transportation systems. Our goal is to address fundamental problems of trajectory analysis in road networks.

Selected Publications

Fast Subtrajectory Similarity Search in Road Networks under Weighted Edit Distance Constraints. PVLDB 2020. [paper] [video]
Indexing Trajectories for Travel-Time Histogram Retrieval. EDBT 2019. [paper]
CiNCT: Compression and Retrieval for Massive Vehicular Trajectories via Relative Movement Labeling. ICDE 2018. [paper]
Enhanced Indexing and Querying of Trajectories in Road Networks via String Algorithms. TSAS. [paper]

Query Autocompletion

Autocompletion is an interactive feature that automatically completes an input while reducing the typing effort. It has been utilized in search engines, input methods (IMEs), integrated development environments (IDEs), and mobile applications. We develop novel autocompletion techniques that deliver high-quality suggestions efficiently for various online services.
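To make the basic operation concrete, here is a minimal sketch of exact prefix completion over a sorted vocabulary, using binary search to locate the first candidate. It is only an illustration under assumed names and sample data, and it does not cover the error-tolerant, location-aware, or abbreviation-aware settings studied in the publications below.

```python
import bisect


def complete(prefix, sorted_words, k=3):
    """Return up to k words from sorted_words that start with prefix."""
    # In a sorted list, all words sharing a prefix are contiguous,
    # and bisect_left finds where that run begins.
    start = bisect.bisect_left(sorted_words, prefix)
    suggestions = []
    for word in sorted_words[start:]:
        if not word.startswith(prefix):
            break
        suggestions.append(word)
        if len(suggestions) == k:
            break
    return suggestions


# Hypothetical vocabulary; it must be sorted before lookup.
vocabulary = sorted(["similar", "similarity", "simulation", "set", "search"])
suggestions = complete("sim", vocabulary)
```

Suggestions come back in lexicographic order here; a practical system would typically rank them, for example by popularity.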

Selected Publications

Autocompletion for Prefix-Abbreviated Input. SIGMOD 2019. [paper] [slides]
Scope-Aware Code Completion with Discriminative Modeling. JIP. [paper]
An Efficient Algorithm for Location-Aware Query Autocompletion. IEICE Transactions. [paper]
BEVA: An Efficient Query Processing Algorithm for Error Tolerant Autocompletion. TODS. [paper]
Efficient Error-Tolerant Query Autocompletion. PVLDB 2013/VLDBJ. [conference paper] [journal extension] [slides] [source code]

Graph Structural Search

Graphs are widely used to model complex data in many applications, such as bioinformatics, chemistry, social networks, and pattern recognition. A fundamental and critical query primitive is to search for structures in a large collection of graphs. We develop efficient methods to process advanced structural search in graph databases.

Selected Publications

Frequent Subgraph Mining Based on Pregel. The Computer Journal. [paper]
A Partition-Based Approach to Structure Similarity Search. PVLDB 2013/VLDBJ. [conference paper] [journal extension] [slides]
Improving Performance of Graph Similarity Joins Using Selected Substructures. DASFAA 2014. [paper]
Efficient Graph Similarity Joins with Edit Distance Constraints. ICDE 2012/VLDBJ. [conference paper] [journal extension] [slides]
Efficient Subgraph Similarity All-Matching. DASFAA 2012. [paper] [slides]