Publications
2024
[C34] RC-Mixup: A Data Augmentation Strategy against Noisy Data for Regression Tasks
S. Hwang, M. Kim, and S. E. Whang
Accepted to 30th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining (KDD), Barcelona, Spain, Aug. 2024. (Top Data Mining conference)
[C33] LEVI: Generalizable Fine-tuning via Layer-wise Ensemble of Different Views
Y. Roh, Q. Liu, H. Gui, Z. Yuan, Y. Tang, S. E. Whang, L. Liu, S. Bi, L. Hong, E. H. Chi, and Z. Zhao
Accepted to 41st Int'l Conf. on Machine Learning (ICML), Vienna, Austria, July 2024. (Top Machine Learning conference)
J. Oh, S. Kim, J. Seo, J. Wang, R. Xu, X. Xie, and S. E. Whang
[J14] Special Issue on Data-centric Responsible AI
S. E. Whang ed.
IEEE Data Engineering Bulletin, Mar. 2024.
[C32] Falcon: Fair Active Learning using Multi-armed Bandits
K. Tae, H. Zhang, J. Park, K. Rong, and S. E. Whang
In Proc. 50th Int'l Conf. on Very Large Data Bases (VLDB), Guangzhou, China, Aug. 2024. (Top Database conference)
[C31] Quilt: Robust Data Segment Selection against Concept Drifts
M. Kim, S. Hwang, and S. E. Whang
In Proc. 38th AAAI Conference on Artificial Intelligence (AAAI), Vancouver, Canada, Feb. 2024. (Top AI conference)
2023
[C30] Improving Fair Training under Correlation Shifts [Talk][Slides]
Y. Roh, K. Lee, S. E. Whang, and C. Suh
In Proc. 40th Int'l Conf. on Machine Learning (ICML), Honolulu, Hawaii, July 2023. (Top Machine Learning conference)
[J13] Dr-Fairness: Dynamic Data Ratio Adjustment for Fair Training on Real and Generated Data
Y. Roh, W. Nie, D. Huang, S. E. Whang, A. Vahdat, and A. Anandkumar
Transactions on Machine Learning Research (TMLR), 2023. (New Machine Learning journal)
[C29] iFlipper: Label Flipping for Individual Fairness [Talk][Slides]
H. Zhang, K. Tae, J. Park, X. Chu, and S. E. Whang
In Proc. 2023 ACM SIGMOD Int'l Conf. on Management of Data (SIGMOD), Seattle, Washington, June 2023. (Top Database conference)
[C28] Redactor: A Data-centric and Individualized Defense Against Inference Attacks [Talk]
G. Heo and S. E. Whang
In Proc. 37th AAAI Conference on Artificial Intelligence (AAAI), Washington, DC, Feb. 2023. (Top AI conference)
[C27] XClusters: Explainability-first Clustering [Talk]
H. Hwang and S. E. Whang
In Proc. 37th AAAI Conference on Artificial Intelligence (AAAI), Washington, DC, Feb. 2023. (Top AI conference)
[J12] Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective [Poster]
S. E. Whang, Y. Roh, H. Song, and J. Lee
The VLDB Journal, 2023. (Top Database journal)
2021
[C26] Sample Selection for Fair and Robust Training [Talk][Slides][Code]
Y. Roh, K. Lee, S. E. Whang, and C. Suh
In Proc. 35th Annual Conference on Neural Information Processing Systems (NeurIPS), Dec. 2021. (Top Machine Learning conference)
[T3] Machine Learning Robustness, Fairness, and their Convergence (Tutorial) [Talk][Homepage][Slides]
J. Lee, Y. Roh, H. Song, and S. E. Whang
In Proc. 27th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining (KDD), Aug. 2021. (Top Data Mining conference)
[J11] Responsible AI Challenges in End-to-end Machine Learning
S. E. Whang, K. Tae, Y. Roh, and G. Heo
IEEE Data Engineering Bulletin, Mar. 2021.
[C25] Slice Tuner: A Selective Data Acquisition Framework for Accurate and Fair Machine Learning Models [Talk][Slides][Code]
K. Tae and S. E. Whang
In Proc. 2021 ACM SIGMOD Int'l Conf. on Management of Data (SIGMOD), June 2021. (Top Database conference)
[C24] FairBatch: Batch Selection for Model Fairness [Talk][Slides][Code]
Y. Roh, K. Lee, S. E. Whang, and C. Suh
In Proc. 9th Int'l Conference on Learning Representations (ICLR), May 2021. (Top Machine Learning conference)
[C23] Inspector Gadget: A Data Programming-based Labeling System for Industrial Images [Talk]
G. Heo, Y. Roh, S. Hwang, D. Lee, and S. E. Whang
In Proc. 47th Int'l Conf. on Very Large Data Bases (VLDB), Copenhagen, Denmark, Aug. 2021. (Top Database conference)
[J10] A Survey on Data Collection for Machine Learning: a Big Data - AI Integration Perspective
Y. Roh, G. Heo, and S. E. Whang
In IEEE Transactions on Knowledge and Data Engineering (TKDE), April 2021. (Top Database / Data Mining journal)
2020
[W4] Open-world COVID-19 Data Visualization (Extended Abstract)
H. Hwang and S. E. Whang
In Proc. 6th Data Management and Analytics for Medicine and Healthcare (DMAH @ VLDB) Workshop, Tokyo, Japan, Sept. 2020.
[C22] FR-Train: A Mutual Information-based Approach to Fair and Robust Training [Talk][Slides]
Y. Roh, K. Lee, S. E. Whang, and C. Suh
In Proc. 37th Int'l Conf. on Machine Learning (ICML), July 2020. (Top Machine Learning conference)
[T2] Data Collection and Quality Challenges for Deep Learning (Tutorial) [Talk][Slides]
S. E. Whang and J. Lee
In Proc. 46th Int'l Conf. on Very Large Data Bases (VLDB), Tokyo, Japan, Sept. 2020. (Top Database conference)
[J9] Automated Data Slicing for Model Validation: A Big data - AI Integration Approach
Y. Chung, T. Kraska, N. Polyzotis, K. Tae, and S. E. Whang
In IEEE Transactions on Knowledge and Data Engineering (TKDE), Dec. 2020. (Top Database / Data Mining journal)
2019
[W3] Data Cleaning for Accurate, Fair, and Robust Models: A Big Data - AI Integration Approach
K. Tae, Y. Roh, Y. Oh, H. Kim, and S. E. Whang
In 3rd Int'l Workshop on Data Management for End-to-End Machine Learning, DEEM @ ACM SIGMOD, June 2019.
[C21] Slice Finder: Automated Data Slicing for Model Validation
Y. Chung, T. Kraska, N. Polyzotis, K. Tae, and S. E. Whang
In IEEE Int'l Conf. on Data Engineering (ICDE), Macau SAR, China, Apr. 2019. Short paper. [Poster] (Top-3 Database conference)
[C20] Data Validation for Machine Learning
E. Breck, N. Polyzotis, S. Roy, S. E. Whang, and M. Zinkevich
In MLSys Conference, Stanford, California, Mar. 2019. (First proceedings)
2018
[J8] Data Lifecycle Challenges in Production Machine Learning: A Survey
N. Polyzotis, S. Roy, S. E. Whang, and M. Zinkevich
ACM SIGMOD Record, vol. 47, no. 2, pp. 17-28, June 2018.
[C19] TFX Frontend: A Graphical User Interface for a Production-Scale Machine Learning Platform
P. Brandt, J. Cai, T. Gannert, P. Joshi, R. Khot, C. Koo, C. Kuang, S. Leong, C. Mewald, N. Polyzotis, H. Quiroz, S. Roy, P. Yang, J. Wexler, and S. E. Whang
MLSys Conference, Stanford, California, Feb. 2018.
[C18] Slice Finder: Automated Data Slicing for Model Interpretability
Y. Chung, T. Kraska, N. Polyzotis, and S. E. Whang
MLSys Conference, Stanford, California, Feb. 2018.
[C17] Data Infrastructure for Machine Learning
E. Breck, N. Polyzotis, S. Roy, S. E. Whang, and M. Zinkevich
MLSys Conference, Stanford, California, Feb. 2018.
2017 and before
[C16] TFX: A TensorFlow-Based Production-Scale Machine Learning Platform
D. Baylor, E. Breck, H. Cheng, N. Fiedel, C. Foo, Z. Haque, S. Haykal, M. Ispir, V. Jain, L. Koc, C. Koo, L. Lew, C. Mewald, A. Modi, N. Polyzotis, S. Ramesh, S. Roy, S. E. Whang, M. Wicke, J. Wilkiewicz, X. Zhang, and M. Zinkevich
In Proc. 23rd ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining (KDD), pp. 1387-1395, Halifax, Nova Scotia, Canada, Aug. 2017.
[T1] Data Management Challenges in Production Machine Learning
N. Polyzotis, S. Roy, S. E. Whang, and M. Zinkevich
In Proc. 2017 ACM SIGMOD Int'l Conf. on Management of Data (SIGMOD), pp. 1723-1726, Chicago, May 2017. Tutorial.
[J7] Managing Google's data lake: an overview of the Goods system
A. Halevy, F. Korn, N. Noy, C. Olston, N. Polyzotis, S. Roy, and S. E. Whang
IEEE Data Engineering Bulletin, vol. 39, no. 3, pp. 5-14, Sept. 2016.
[C15] Lonlies: Estimating Property Values for Long Tail Entities
M. Farid, I. Ilyas, S. E. Whang, and C. Yu
In Proc. 39th Int'l ACM SIGIR Conf. on Research and Development on Information Retrieval, pp. 1125-1128, Pisa, Italy, July 2016. Demonstration description.
[C14] Goods: Organizing Google's Datasets
A. Halevy, F. Korn, N. Noy, C. Olston, N. Polyzotis, S. Roy, and S. E. Whang
In Proc. 2016 ACM SIGMOD Int'l Conf. on Management of Data (SIGMOD), pp. 795-806, San Francisco, June 2016.
[C13] Discovering Structure in the Universe of Attribute Names
A. Halevy, N. Noy, S. Sarawagi, S. E. Whang, and X. Yu
In Proc. 25th Int'l Conf. on World Wide Web (WWW), pp. 939-949, Montreal, Canada, Apr. 2016.
[W2] Discovering Subsumption Relationships for Web-Based Ontologies
D. Movshovitz-Attias, S. E. Whang, N. Noy, and A. Halevy
In Proc. 18th Int'l Workshop on the Web and Databases (WebDB), pp. 62-69, Melbourne, Australia, May 2015. (Best Paper Award)
[C12] ReNoun: Fact Extraction for Nominal Attributes
M. Yahya, S. E. Whang, R. Gupta, and A. Halevy
In Proc. 2014 Conf. on Empirical Methods in Natural Language Processing (EMNLP), pp. 325-335, Doha, Qatar, Oct. 2014.
[C11] Biperpedia: An Ontology for Search Applications
R. Gupta, A. Halevy, X. Wang, S. E. Whang, and F. Wu
In Proc. 40th Int'l Conf. on Very Large Data Bases (VLDB), pp. 505-516, Hangzhou, China, Sept. 2014.
[J6] Incremental Entity Resolution on Rules and Data
S. E. Whang and H. Garcia-Molina
The VLDB Journal, vol. 23, no. 1, pp. 77-102, Jan. 2014.
[J5] Joint Entity Resolution on Multiple Datasets
S. E. Whang and H. Garcia-Molina
The VLDB Journal, vol. 22, no. 6, pp. 773-795, Nov. 2013.
[C10] Disinformation Techniques for Entity Resolution
S. E. Whang and H. Garcia-Molina
In Proc. 22nd ACM Int'l Conf. on Information and Knowledge Management (CIKM), pp. 715-720, San Francisco, California, Oct. 2013. Short Paper.
[C9] Question Selection for Crowd Entity Resolution
S. E. Whang, P. Lofgren, and H. Garcia-Molina
In Proc. 39th Int'l Conf. on Very Large Data Bases (VLDB), pp. 349-360, Trento, Italy, Aug. 2013.
[J4] Pay-As-You-Go Entity Resolution
S. E. Whang, D. Marmaros, and H. Garcia-Molina
IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 25, no. 5, pp. 1111-1124, May 2013.
[W1] A Model for Quantifying Information Leakage
S. E. Whang and H. Garcia-Molina
In Proc. 9th VLDB Workshop on Secure Data Management (SDM), pp. 25-44, Istanbul, Turkey, Aug. 2012.
S. E. Whang and H. Garcia-Molina
In Proc. 28th IEEE Int'l Conf. on Data Engineering (ICDE), pp. 294-305, Washington, DC, Apr. 2012. Full Paper.
[J3] Developments in Generic Entity Resolution
S. E. Whang and H. Garcia-Molina
IEEE Data Engineering Bulletin, vol. 34, no. 3, pp. 51-59, Sept. 2011.
[C7] Managing Information Leakage
S. E. Whang and H. Garcia-Molina
In Proc. 5th Biennial Conf. on Innovative Data Systems Research (CIDR), pp. 79-84, Pacific Grove, California, Jan. 2011.
[C6] Entity Resolution with Evolving Rules
S. E. Whang and H. Garcia-Molina
In Proc. 36th Int'l Conf. on Very Large Data Bases (VLDB), pp. 1326-1337, Singapore, Sept. 2010.
[C5] Evaluating Entity Resolution Results
D. Menestrina, S. E. Whang, and H. Garcia-Molina
In Proc. 36th Int'l Conf. on Very Large Data Bases (VLDB), pp. 208-219, Singapore, Sept. 2010.
[C4] Indexing Boolean Expressions
S. E. Whang, C. Brower, J. Shanmugasundaram, S. Vassilvitskii, E. Vee, R. Yerneni, and H. Garcia-Molina
In Proc. 35th Int'l Conf. on Very Large Data Bases (VLDB), pp. 37-48, Lyon, France, Aug. 2009.
[C3] Entity Resolution with Iterative Blocking
S. E. Whang, D. Menestrina, G. Koutrika, M. Theobald, and H. Garcia-Molina
In Proc. 2009 ACM SIGMOD Int'l Conf. on Management of Data (SIGMOD), pp. 219-232, Providence, Rhode Island, June 2009.
[C2] QuickStart: an Upfront Client-based Design Advisor for Parallel Data Warehouses
M. Castellanos, I. Jimenez, N. Coddington, H. Zeller, S. Whang, and U. Dayal
In Proc. 25th Int'l Conf. on Data Engineering (ICDE), pp. 1543-1546, Shanghai, China, Mar. 2009. Demonstration description.
[J2] Generic Entity Resolution with Negative Rules
S. E. Whang, O. Benjelloun, and H. Garcia-Molina
The VLDB Journal, vol. 18, no. 6, pp. 1261-1277, Feb. 2009.
[J1] Swoosh: A Generic Approach to Entity Resolution
O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S. E. Whang, and J. Widom
The VLDB Journal, vol. 18, no. 1, pp. 255-276, Jan. 2009.
[C1] A Practitioner's Approach to Normalizing XQuery Expressions
K. Lee, S. Kim, E. Whang, and J. Lee
In Proc. 11th Int'l Symposium on Database Systems for Advanced Applications (DASFAA), pp. 437-453, Hilton Hotel, Singapore, Apr. 2006.
Thesis
Data Analytics: Integration and Privacy
S. E. Whang
Ph.D. Thesis, Stanford University, June 2012.