This project was carried out from 2000 to 2009. It studied schema/ontology matching, which is fundamental to many data management applications, including data integration, warehousing, mining, e-commerce, e-science, and Web data processing.
The project was very timely. Shortly after it started around 2000, schema matching grew into a major research direction in data management, and it has received much attention ever since. The main contributions of this project:
We showed how to apply machine learning to this problem.
We showed that multiple types of domain knowledge must be exploited to maximize matching accuracy.
We introduced a highly modular, extensible system architecture, which is essentially the common matching architecture used today.
We showed how to exploit domain knowledge (e.g., in the form of other schemas) in matching.
We were among the first to develop clean solutions to several difficult problems, such as finding complex schema matches and matching ontologies.
One of the main lessons I learned from this project is that crowdsourcing could be ideal for such matching (and this in turn motivated my subsequent work on crowdsourcing).
People and Funding
AnHai Doan, Robert McCann, Robin Dhamankar, Yoonkyong Lee, Mayssam Sayyadian, Wensheng Wu, Xiaoyong Chai.
Collaborators: Alon Halevy, Pedro Domingos, Phil Bernstein, Jayant Madhavan, Arnon Rosenthal, Len Seligman, Chris Clifton, Luis Gravano, Natasha Noy, Clement Yu.
We gratefully acknowledge support from grants CAREER IIS-0347903 and ITR 0428168, MITRE, and Google.
Publications
PhD Dissertation
Learning to Map between Structured Representations of Data, A. Doan. Ph.D. Dissertation, Univ. of Washington-Seattle, 2002. Received the ACM Doctoral Dissertation Award in 2003.
Basic Matching Techniques
Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach, A. Doan, P. Domingos, and A. Halevy. SIGMOD-2001. ppt slides. Other versions:
Learning Source Descriptions for Data Integration, A. Doan, P. Domingos, and A. Levy. WebDB-2000. (a preliminary version of the above paper, ppt slides)
Learning Mappings between Data Schemas, A. Doan, P. Domingos, and A. Levy. Proc. of the AAAI-2000 Workshop on Learning Statistical Models from Relational Data, 2000. (preliminary version)
Data Integration: A "Killer App" for Multi-Strategy Learning, A. Doan, P. Domingos, and A. Levy. Proc. of the Workshop on Multi-Strategy Learning (MSL-00), 2000. (preliminary version)
Learning to Match the Schemas of Databases: A Multistrategy Approach, A. Doan, P. Domingos, and A. Halevy. Machine Learning Journal, 50, Pages 279-301, 2003. (invited journal version)
Learning to Map between Ontologies on the Semantic Web, A. Doan, J. Madhavan, P. Domingos, and A. Halevy. WWW-2002. ppt slides. Other versions:
Learning to Match Ontologies on the Semantic Web, A. Doan, J. Madhavan, R. Dhamankar, P. Domingos, and A. Halevy. VLDB Journal, Special Issue on the Semantic Web, 2003. (expanded version)
Ontology Matching: A Machine Learning Approach, A. Doan, J. Madhavan, P. Domingos, and A. Halevy. Handbook on Ontologies in Information Systems, S. Staab and R. Studer (eds.), Springer-Verlag, 2004. Invited paper. Pages 397-416.
iMAP: Discovering Complex Semantic Matches between Database Schemas, R. Dhamankar, Y. Lee, A. Doan, A. Halevy, and P. Domingos. SIGMOD-2004.
Crowdsourced Schema Matching
Building Data Integration Systems via Mass Collaboration, R. McCann, A. Doan, A. Kramnik, and V. Varadarajan. Proc. of the Int. Workshop on Web and Databases (WebDB-03).
Building Data Integration Systems: A Mass Collaboration Approach, A. Doan and R. McCann. Proc. of the IJCAI-03 Workshop on Information Integration on the Web.
Integrating Data from Disparate Sources: A Mass Collaboration Approach, R. McCann, A. Kramnik, W. Shen, V. Varadarajan, O. Sobulo, A. Doan. ICDE-05. Poster.
Matching Schemas in Online Communities: A Web 2.0 Approach, R. McCann, W. Shen, A. Doan. ICDE-08.
Matching Web Query Interfaces (on the Deep Web)
An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web, W. Wu, C. Yu, A. Doan, and W. Meng. SIGMOD-04.
Merging Interface Schemas on the Deep Web via Clustering Aggregation, W. Wu, A. Doan, and C. Yu. IEEE Int. Conf. on Data Mining (ICDM-05).
Bootstrapping Domain Ontology for Semantic Web Services from Source Web Sites, W. Wu, A. Doan, C. Yu, and W. Meng. In Proc. of the VLDB-05 Workshop on Technologies for E-Services.
Learning from the Web to Match Deep-Web Query Interfaces, W. Wu, A. Doan, C. Yu. ICDE-06. PPT slides.
Workshops, Special Issues, Surveys, Textbook Chapters
The Proceedings of the Semantic Integration Workshop at ISWC-03, edited by A. Doan, A. Halevy, and N. Noy.
Report on the Semantic Integration Workshop at the 2nd Int. Semantic Web Conf. (ISWC-03), A. Doan, A. Halevy, and N. Noy. SIGMOD Record, 33(1):138-140, 2004. A related version appeared in AI Magazine, Spring 2004.
Special Issue on Semantic Integration, A. Doan, N. Noy, A. Halevy (editors). ACM SIGMOD Record, 33(4), 2004.
Special Issue on Semantic Integration, N. Noy, A. Doan, A. Halevy (editors). AI Magazine, Spring 2005.
Semantic Integration Research in the Database Community: A Brief Survey, A. Doan and A. Halevy. AI Magazine, Special Issue on Semantic Integration, Spring 2005.
Chapter 5: Schema Matching and Mapping, in Principles of Data Integration, A. Doan, A. Halevy, Z. Ives, Morgan Kaufmann, 2012.
Others
Proposal to do privacy-preserving schema matching: Privacy Preserving Data Integration and Sharing, C. Clifton, A. Doan, A. Elmagarmid, M. Kantarcioglu, G. Schadow, D. Suciu, and J. Vaidya. Proc. of the 9th Int. Workshop on Data Mining and Knowledge Discovery (DMKD-04).
How to maintain the discovered semantic mappings over time (also related to the wrapper maintenance problem)? Maveric: Mapping Maintenance for Data Integration Systems, R. McCann, B. AlShelbi, Q. Le, H. Nguyen, L. Vu, A. Doan. VLDB-05. PPT slides.
How to exploit a corpus of schemas to match two schemas: Corpus-based Schema Matching, J. Madhavan, P. Bernstein, A. Doan, A. Halevy. ICDE-05.
Tuning matching software: how to select the right components to execute and correctly adjust their numerous "knobs" (e.g., thresholds, formula coefficients): eTuner: Tuning Schema Matching Software Using Synthetic Scenarios, Y. Lee, M. Sayyadian, A. Doan, A. Rosenthal. VLDB Journal Special Issue, Best Papers of VLDB-05. 2006. Invited. (An earlier paper: eTuner: Tuning Schema Matching Software Using Synthetic Scenarios, M. Sayyadian, Y. Lee, A. Doan, A. Rosenthal. VLDB-05. PPT slides.)
How to do keyword search across multiple RDBMSs? First we must do schema matching. Efficient Keyword Search across Heterogeneous Relational Databases, M. Sayyadian, H. LeKhac, A. Doan, L. Gravano. ICDE-07.
Designing schemas for interoperability: If a schema will often be matched against in the future, how can we design it in a way that helps schema matching? Analyzing and Revising Data Integration Schemas to Improve Their Matchability, X. Chai, M. Sayyadian, A. Doan, A. Rosenthal, L. Seligman. VLDB-08.
Selected Talk Slides
Learning to Map between Structured Representations of Data. @ UIUC, 2002, job talk.
Schema & Ontology Matching: Current Research Directions. Univ. of Southern California, 2004.