Publications

2024

[C34] RC-Mixup: A Data Augmentation Strategy against Noisy Data for Regression Tasks

S. Hwang, M. Kim, and S. E. Whang

Accepted to 30th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining (KDD), Barcelona, Spain, Aug. 2024. (Top Data Mining conference)

[C33] LEVI: Generalizable Fine-tuning via Layer-wise Ensemble of Different Views

Y. Roh, Q. Liu, H. Gui, Z. Yuan, Y. Tang, S. E. Whang, L. Liu, S. Bi, L. Hong, E. H. Chi, and Z. Zhao

Accepted to 41st Int'l Conf. on Machine Learning (ICML), Vienna, Austria, July 2024. (Top Machine Learning conference)

[ArXiv] ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models

J. Oh, S. Kim, J. Seo, J. Wang, R. Xu, X. Xie, and S. E. Whang

[J14] Special Issue on Data-centric Responsible AI

S. E. Whang ed.
IEEE Data Engineering Bulletin, Mar. 2024.

[C32] Falcon: Fair Active Learning using Multi-armed Bandits

K. Tae, H. Zhang, J. Park, K. Rong, and S. E. Whang

In Proc. 50th Int'l Conf. on Very Large Data Bases (VLDB), Guangzhou, China, Aug. 2024. (Top Database conference)

[C31] Quilt: Robust Data Segment Selection against Concept Drifts

M. Kim, S. Hwang, and S. E. Whang

In Proc. 38th AAAI Conference on Artificial Intelligence (AAAI), Vancouver, Canada, Feb. 2024. (Top AI conference)

2023

[C30] Improving Fair Training under Correlation Shifts [Talk][Slides]

Y. Roh, K. Lee, S. E. Whang, and C. Suh

In Proc. 40th Int'l Conf. on Machine Learning (ICML), Honolulu, Hawaii, July 2023. (Top Machine Learning conference)

[J13] Dr-Fairness: Dynamic Data Ratio Adjustment for Fair Training on Real and Generated Data 

Y. Roh, W. Nie, D. Huang, S. E. Whang, A. Vahdat, and A. Anandkumar

Transactions on Machine Learning Research (TMLR), 2023. (New Machine Learning journal)

[C29] iFlipper: Label Flipping for Individual Fairness [Talk][Slides]

H. Zhang, K. Tae, J. Park, X. Chu, and S. E. Whang

In Proc. 2023 ACM SIGMOD Int'l Conf. on Management of Data (SIGMOD), Seattle, Washington, June 2023. (Top Database conference)

[C28] Redactor: A Data-centric and Individualized Defense Against Inference Attacks [Talk]

G. Heo and S. E. Whang

In Proc. 37th AAAI Conference on Artificial Intelligence (AAAI), Washington, DC, Feb. 2023. (Top AI conference)

[C27] XClusters: Explainability-first Clustering [Talk]

H. Hwang and S. E. Whang

In Proc. 37th AAAI Conference on Artificial Intelligence (AAAI), Washington, DC, Feb. 2023. (Top AI conference)

[J12] Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective [Poster]

S. E. Whang, Y. Roh, H. Song, and J. Lee
The VLDB Journal, 2023. (Top Database journal)

2021

[C26] Sample Selection for Fair and Robust Training [Talk][Slides][Code]

Y. Roh, K. Lee, S. E. Whang, and C. Suh

In Proc. 35th Annual Conference on Neural Information Processing Systems (NeurIPS), Dec. 2021. (Top Machine Learning conference)

[T3] Machine Learning Robustness, Fairness, and their Convergence (Tutorial) [Talk][Homepage][Slides]

J. Lee, Y. Roh, H. Song, and S. E. Whang

In Proc. 27th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining (KDD), Aug. 2021. (Top Data Mining conference) 

[J11] Responsible AI Challenges in End-to-end Machine Learning

S. E. Whang, K. Tae, Y. Roh, and G. Heo

IEEE Data Engineering Bulletin, Mar. 2021. 

[C25] Slice Tuner: A Selective Data Acquisition Framework for Accurate and Fair Machine Learning Models [Talk][Slides][Code]

K. Tae and S. E. Whang

In Proc. 2021 ACM SIGMOD Int'l Conf. on Management of Data (SIGMOD), June 2021. (Top Database conference)

[C24] FairBatch: Batch Selection for Model Fairness [Talk][Slides][Code]

Y. Roh, K. Lee, S. E. Whang, and C. Suh

In Proc. 9th Int'l Conference on Learning Representations (ICLR), May 2021. (Top Machine Learning conference)

[C23] Inspector Gadget: A Data Programming-based Labeling System for Industrial Images [Talk]

G. Heo, Y. Roh, S. Hwang, D. Lee, and S. E. Whang

In Proc. 47th Int'l Conf. on Very Large Data Bases (VLDB), Copenhagen, Denmark, Aug. 2021. (Top Database conference)

[J10] A Survey on Data Collection for Machine Learning: a Big Data - AI Integration Perspective

Y. Roh, G. Heo, and S. E. Whang

In IEEE Transactions on Knowledge and Data Engineering (TKDE), April 2021. (Top Database / Data Mining journal)

2020

[W4] Open-world COVID-19 Data Visualization (Extended Abstract)

H. Hwang and S. E. Whang

In Proc. 6th Data Management and Analytics for Medicine and Healthcare (DMAH @ VLDB) Workshop, Tokyo, Japan, Sept. 2020.

[C22] FR-Train: A Mutual Information-based Approach to Fair and Robust Training [Talk][Slides]

Y. Roh, K. Lee, S. E. Whang, and C. Suh

In Proc. 37th Int'l Conf. on Machine Learning (ICML), July 2020. (Top Machine Learning conference)

[T2] Data Collection and Quality Challenges for Deep Learning (Tutorial) [Talk][Slides]

S. E. Whang and J. Lee

In Proc. 46th Int'l Conf. on Very Large Data Bases (VLDB), Tokyo, Japan, Sept. 2020. (Top Database conference)

[J9] Automated Data Slicing for Model Validation: A Big Data - AI Integration Approach

Y. Chung, T. Kraska, N. Polyzotis, K. Tae, and S. E. Whang

In IEEE Transactions on Knowledge and Data Engineering (TKDE), Dec. 2020. (Top Database / Data Mining journal)

2019

[W3] Data Cleaning for Accurate, Fair, and Robust Models: A Big Data - AI Integration Approach

K. Tae, Y. Roh, Y. Oh, H. Kim, and S. E. Whang

In 3rd Int'l Workshop on Data Management for End-to-End Machine Learning, DEEM @ ACM SIGMOD, June 2019. 

[C21] Slice Finder: Automated Data Slicing for Model Validation

Y. Chung, T. Kraska, N. Polyzotis, K. Tae, and S. E. Whang

In IEEE Int'l Conf. on Data Engineering (ICDE), Macau SAR, China, Apr. 2019. Short paper. [Poster] (Top-3 Database conference)

[C20] Data Validation for Machine Learning

E. Breck, N. Polyzotis, S. Roy, S. E. Whang, and M. Zinkevich

In MLSys Conference, Stanford, California, Mar. 2019. (First proceedings)

2018

[J8] Data Lifecycle Challenges in Production Machine Learning: A Survey

N. Polyzotis, S. Roy, S. E. Whang, and M. Zinkevich

ACM SIGMOD Record, vol. 47, no. 2, pp. 17-28, June 2018.

[C19] TFX Frontend: A Graphical User Interface for a Production-Scale Machine Learning Platform

P. Brandt, J. Cai, T. Gannert, P. Joshi, R. Khot, C. Koo, C. Kuang, S. Leong, C. Mewald, N. Polyzotis, H. Quiroz, S. Roy, P. Yang, J. Wexler, and S. E. Whang

MLSys Conference, Stanford, California, Feb. 2018.

[C18] Slice Finder: Automated Data Slicing for Model Interpretability

Y. Chung, T. Kraska, N. Polyzotis, and S. E. Whang

MLSys Conference, Stanford, California, Feb. 2018.

[C17] Data Infrastructure for Machine Learning

E. Breck, N. Polyzotis, S. Roy, S. E. Whang, and M. Zinkevich

MLSys Conference, Stanford, California, Feb. 2018.

2017 and before

[C16] TFX: A TensorFlow-Based Production-Scale Machine Learning Platform

D. Baylor, E. Breck, H. Cheng, N. Fiedel, C. Foo, Z. Haque, S. Haykal, M. Ispir, V. Jain, L. Koc, C. Koo, L. Lew, C. Mewald, A. Modi, N. Polyzotis, S. Ramesh, S. Roy, S. E. Whang, M. Wicke, J. Wilkiewicz, X. Zhang, and M. Zinkevich

In Proc. 23rd ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining (KDD), pp. 1387-1395, Halifax, Nova Scotia, Canada, Aug. 2017.

[T1] Data Management Challenges in Production Machine Learning

N. Polyzotis, S. Roy, S. E. Whang, and M. Zinkevich

In Proc. 2017 ACM SIGMOD Int'l Conf. on Management of Data (SIGMOD), pp. 1723-1726, Chicago, May 2017. Tutorial.

[J7] Managing Google's data lake: an overview of the Goods system

A. Halevy, F. Korn, N. Noy, C. Olston, N. Polyzotis, S. Roy, and S. E. Whang

IEEE Data Engineering Bulletin, vol. 39, no. 3, pp. 5-14, Sept. 2016.

[C15] Lonlies: Estimating Property Values for Long Tail Entities

M. Farid, I. Ilyas, S. E. Whang, and C. Yu

In Proc. 39th Int'l ACM SIGIR Conf. on Research and Development on Information Retrieval, pp. 1125-1128, Pisa, Italy, July 2016. Demonstration description.

[C14] Goods: Organizing Google's Datasets

A. Halevy, F. Korn, N. Noy, C. Olston, N. Polyzotis, S. Roy, and S. E. Whang

In Proc. 2016 ACM SIGMOD Int'l Conf. on Management of Data (SIGMOD), pp. 795-806, San Francisco, June 2016.

[C13] Discovering Structure in the Universe of Attribute Names

A. Halevy, N. Noy, S. Sarawagi, S. E. Whang, and X. Yu

In Proc. 25th Int'l Conf. on World Wide Web (WWW), pp. 939-949, Montreal, Canada, Apr. 2016.

[W2] Discovering Subsumption Relationships for Web-Based Ontologies

D. Movshovitz-Attias, S. E. Whang, N. Noy, and A. Halevy

In Proc. 18th Int'l Workshop on the Web and Databases (WebDB), pp. 62-69, Melbourne, Australia, May 2015. (Best Paper Award)

[C12] ReNoun: Fact Extraction for Nominal Attributes

M. Yahya, S. E. Whang, R. Gupta, and A. Halevy

In Proc. 2014 Conf. on Empirical Methods in Natural Language Processing (EMNLP), pp. 325-335, Doha, Qatar, Oct. 2014.

[C11] Biperpedia: An Ontology for Search Applications

R. Gupta, A. Halevy, X. Wang, S. E. Whang, and F. Wu

In Proc. 40th Int'l Conf. on Very Large Data Bases (VLDB), pp. 505-516, Hangzhou, China, Sept. 2014.

[J6] Incremental Entity Resolution on Rules and Data

S. E. Whang and H. Garcia-Molina

The VLDB Journal, vol. 23, no. 1, pp. 77-102, Jan. 2014.

[J5] Joint Entity Resolution on Multiple Datasets

S. E. Whang and H. Garcia-Molina

The VLDB Journal, vol. 22, no. 6, pp. 773-795, Nov. 2013.

[C10] Disinformation Techniques for Entity Resolution

S. E. Whang and H. Garcia-Molina

In Proc. 22nd ACM Int'l Conf. on Information and Knowledge Management (CIKM), pp. 715-720, San Francisco, California, Oct. 2013. Short Paper.

[C9] Question Selection for Crowd Entity Resolution

S. E. Whang, P. Lofgren, and H. Garcia-Molina

In Proc. 39th Int'l Conf. on Very Large Data Bases (VLDB), pp. 349-360, Trento, Italy, Aug. 2013.

[J4] Pay-As-You-Go Entity Resolution

S. E. Whang, D. Marmaros, and H. Garcia-Molina

IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 25, no. 5, pp. 1111-1124, May 2013.

[W1] A Model for Quantifying Information Leakage

S. E. Whang and H. Garcia-Molina

In Proc. 9th VLDB Workshop on Secure Data Management (SDM), pp. 25-44, Istanbul, Turkey, Aug. 2012.

[C8] Joint Entity Resolution

S. E. Whang and H. Garcia-Molina

In Proc. 28th IEEE Int'l Conf. on Data Engineering (ICDE), pp. 294-305, Washington, DC, Apr. 2012. Full Paper.

[J3] Developments in Generic Entity Resolution

S. E. Whang and H. Garcia-Molina

IEEE Data Engineering Bulletin, vol. 34, no. 3, pp. 51-59, Sept. 2011.

[C7] Managing Information Leakage

S. E. Whang and H. Garcia-Molina

In Proc. 5th Biennial Conf. on Innovative Data Systems Research (CIDR), pp. 79-84, Pacific Grove, California, Jan. 2011.

[C6] Entity Resolution with Evolving Rules

S. E. Whang and H. Garcia-Molina

In Proc. 36th Int'l Conf. on Very Large Data Bases (VLDB), pp. 1326-1337, Singapore, Sept. 2010.

[C5] Evaluating Entity Resolution Results

D. Menestrina, S. E. Whang, and H. Garcia-Molina

In Proc. 36th Int'l Conf. on Very Large Data Bases (VLDB), pp. 208-219, Singapore, Sept. 2010.

[C4] Indexing Boolean Expressions

S. E. Whang, C. Brower, J. Shanmugasundaram, S. Vassilvitskii, E. Vee, R. Yerneni, and H. Garcia-Molina

In Proc. 35th Int'l Conf. on Very Large Data Bases (VLDB), pp. 37-48, Lyon, France, Aug. 2009.

[C3] Entity Resolution with Iterative Blocking

S. E. Whang, D. Menestrina, G. Koutrika, M. Theobald, and H. Garcia-Molina

In Proc. 2009 ACM SIGMOD Int'l Conf. on Management of Data (SIGMOD), pp. 219-232, Providence, Rhode Island, June 2009.

[C2] QuickStart: an Upfront Client-based Design Advisor for Parallel Data Warehouses

M. Castellanos, I. Jimenez, N. Coddington, H. Zeller, S. Whang, and U. Dayal

In Proc. 25th Int'l Conf. on Data Engineering (ICDE), pp. 1543-1546, Shanghai, China, Mar. 2009. Demonstration description.

[J2] Generic Entity Resolution with Negative Rules

S. E. Whang, O. Benjelloun, and H. Garcia-Molina

The VLDB Journal, vol. 18, no. 6, pp. 1261-1277, Feb. 2009.

[J1] Swoosh: A Generic Approach to Entity Resolution

O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S. E. Whang, and J. Widom

The VLDB Journal, vol. 18, no. 1, pp. 255-276, Jan. 2009.

[C1] A Practitioner's Approach to Normalizing XQuery Expressions

K. Lee, S. Kim, E. Whang, and J. Lee

In Proc. 11th Int'l Symposium on Database Systems for Advanced Applications (DASFAA), pp. 437-453, Hilton Hotel, Singapore, Apr. 2006.

Thesis

Data Analytics: Integration and Privacy

S. E. Whang

Ph.D. Thesis, Stanford University, June 2012.