Walmart Labs
We have been actively collaborating in research and development with Walmart Labs since 2011; the pace picked up significantly in Summer 2013. Work in the past three years has focused on research challenges in data integration, data analytics, and knowledge graphs in Category Development, led by Vijay Raghavendra.
Selected Current Projects
Product Matching
This project examines matching e-commerce products. Significant results in the past year (9/2015-7/2016) include:
Investigations
examined the quality of training and testing data sets
examined the impact of attribute extraction on matching accuracy
Technical solutions and methodologies
solutions to match certain product departments with very high accuracy
solutions to scale up product matching
solutions to debug product matching
end-to-end methodologies to quickly build a matcher for any product department
Code and software packages
the project has contributed significantly to the following three software packages (on GitHub):
py_stringmatching (to tokenize and compute string similarity scores)
py_stringsimjoin (to quickly match across two large collections of strings)
Magellan (an entity-matching management system)
code for scaling up product matching is also available (not yet industrial strength)
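To give a flavor of the primitives the first package provides, here is a minimal pure-Python sketch of tokenizing two product titles and computing a Jaccard similarity score. This is illustrative code, not the actual py_stringmatching API; the function names are our own.

```python
def whitespace_tokenize(s):
    """Split a string into lowercase whitespace-delimited tokens."""
    return s.lower().split()

def jaccard(tokens_a, tokens_b):
    """Jaccard similarity: |A & B| / |A | B| over token sets."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

t1 = whitespace_tokenize("Apple iPhone 6 16GB Gold")
t2 = whitespace_tokenize("Apple iPhone 6 Gold 16 GB")
score = jaccard(t1, t2)  # 4 shared tokens out of 7 distinct tokens
```

Scores like this feed into blocking and matching steps; py_stringsimjoin additionally uses filtering techniques so such joins scale to large collections.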
Publications
Magellan: Toward Building Entity Matching Management Systems, P. Konda, S. Das, P. Suganthan G.C., A. Doan, A. Ardalan, J. Ballard, H. Li, F. Panahi, H. Zhang, J. Naughton, S. Prasad, G. Krishnan, R. Deep, V. Raghavendra, VLDB-2016.
Magellan: Toward Building Entity Matching Management Systems over Data Science Stacks, P. Konda, S. Das, P. Suganthan G.C., A. Doan, A. Ardalan, J. Ballard, H. Li, F. Panahi, H. Zhang, J. Naughton, S. Prasad, G. Krishnan, R. Deep, V. Raghavendra, VLDB-2016 (demonstration).
more work under preparation on scaling up and debugging entity matching.
Human resource training
helped train a large number of graduate students at UW-Madison (and a large number of undergraduate students in Fall 2016) in data science topics.
Data Cleaning
This project focuses on verifying and normalizing the values of product attributes (e.g., converting both "P&G" and "Procter & Gamble" into "Procter and Gamble Corp"). Significant results in the past year include:
Technical solutions and methodologies
A solution to verify attribute values using crowdsourcing, saving up to 60% of cost compared to baseline solutions.
A solution that drastically cuts down the time an analyst spends manually normalizing attribute values.
A solution that uses automatic clustering together with an analyst's manual effort to normalize attribute values.
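The clustering idea above can be sketched as follows: group attribute values whose normalized "fingerprint" matches, so an analyst only has to review one representative per cluster. This is a hedged illustration under our own assumptions, not the project's actual algorithm; the normalization rules shown are simplistic.

```python
import re
from collections import defaultdict

def fingerprint(value):
    """Lowercase, strip punctuation, dedupe and sort tokens -> canonical key."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", value.lower()).split()
    return " ".join(sorted(set(tokens)))

def cluster_values(values):
    """Group raw attribute values by their fingerprint key."""
    clusters = defaultdict(list)
    for v in values:
        clusters[fingerprint(v)].append(v)
    return list(clusters.values())

brands = ["Procter & Gamble", "procter gamble", "P&G", "p & g"]
groups = cluster_values(brands)  # two clusters: full name vs. abbreviation
```

The analyst then assigns a canonical form (e.g., "Procter and Gamble Corp") to each cluster instead of to each raw value.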
Code and software packages
Code has been developed for all three of the above solutions, though it is not yet industrial strength.
Publications
Several papers are under preparation.
Fostering a Data Integration/Preparation Eco-System
This project is new.
Selected Past Projects
Generic Rule Management
a solution to quickly detect data cleaning rules that overlap in their coverage.
a solution to help analysts write certain kinds of rules much faster
a solution to index a data corpus for fast rule execution
a solution to scale up the execution of certain kinds of rules over a Hadoop cluster
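The overlap-detection idea can be sketched simply: apply each rule's predicate to a data sample, record which rows each rule covers, and flag rule pairs whose coverage sets intersect. This is an illustrative sketch, not the deployed system; rule names and predicates here are hypothetical.

```python
from itertools import combinations

def find_overlaps(rules, records):
    """rules: {name: predicate}. Return rule pairs covering a common record."""
    coverage = {name: {i for i, rec in enumerate(records) if pred(rec)}
                for name, pred in rules.items()}
    return [(a, b) for a, b in combinations(coverage, 2)
            if coverage[a] & coverage[b]]

records = [{"brand": "P&G"}, {"brand": ""}, {"brand": "Sony"}]
rules = {
    "has_ampersand": lambda r: "&" in r["brand"],
    "brand_nonempty": lambda r: bool(r["brand"]),
}
overlaps = find_overlaps(rules, records)  # both rules cover record 0
```

Reporting such overlaps helps analysts spot redundant or conflicting cleaning rules before they are deployed.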
code has been developed for all of the above solutions.
publications:
"Why Big Data Industrial Systems Need Rules and What We Can Do About It", P. Suganthan G.C., C. Sun, K. Gayatri K., H. Zhang, F. Yang, N. Rampalli, S. Prasad, E. Arcaute, G. Krishnan, R. Deep, V. Raghavendra, A. Doan, SIGMOD-15 (industrial).
Product Classification using Learning, Rules, and Crowdsourcing
participated in the development of a solution to use learning, rules, and crowdsourcing to quickly and accurately classify millions of products into thousands of categories.
developed a solution to quickly generate a large number of classification rules (code was pushed into production).
code was deployed during 2013-2015.
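The combination of rules and learning can be sketched as follows: hand-written rules fire first, and a learned classifier handles products no rule covers. This is a heavily simplified, hedged illustration of the idea described in the Chimera paper; the function names, rule patterns, and the stubbed model are our own, not the production system.

```python
import re

def rule_classify(title, rules):
    """Return the first category whose rule pattern matches the title."""
    for pattern, category in rules:
        if re.search(pattern, title, re.IGNORECASE):
            return category
    return None

def classify(title, rules, learned_model):
    """Rules take precedence; fall back to the learned model otherwise."""
    label = rule_classify(title, rules)
    return label if label is not None else learned_model(title)

rules = [(r"\biphone\b", "cell phone accessories"),
         (r"\bdiaper", "baby care")]
model = lambda title: "general merchandise"  # stand-in for a trained classifier
label = classify("Apple iPhone 6 case", rules, model)
```

In the real system, crowdsourcing is additionally used to verify samples of the predicted labels and to feed corrections back into both the rules and the learned models.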
publications:
"Chimera: Large-Scale Classification using Machine Learning, Rules, and Crowdsourcing", C. Sun, N. Rampalli, F. Yang, A. Doan, VLDB-14 (industrial).
see also the paper "Why Big Data Industrial Systems Need Rules and What We Can Do About It".
Attribute Extraction using Learning, Rules, and Crowdsourcing
a solution that uses machine learning, rules, and crowdsourcing to extract attribute values for products.
code was pushed into production in 2014.
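The rule component of such an extractor can be sketched with simple regex rules that pull attribute values out of a product title; in the actual pipeline, learning and crowdsourcing supplement rules like these. The attribute names and patterns below are illustrative assumptions, not the production rules.

```python
import re

# Hypothetical extraction rules: attribute name -> capture pattern.
EXTRACTION_RULES = {
    "screen_size": r"(\d+(?:\.\d+)?)\s*(?:inch|in\.)",
    "capacity": r"(\d+)\s*GB",
}

def extract_attributes(title):
    """Apply each rule to the title; keep the first match per attribute."""
    found = {}
    for attr, pattern in EXTRACTION_RULES.items():
        m = re.search(pattern, title, re.IGNORECASE)
        if m:
            found[attr] = m.group(1)
    return found

attrs = extract_attributes("Samsung 55 inch LED TV")
```

Learned extractors then cover attributes and phrasings the rules miss, and crowd workers verify samples of the extracted values.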
Product Matching using Crowdsourcing
a solution that crowdsources the entire product matching pipeline, with no involvement from a developer ("hands-off crowdsourcing").
publications:
"Corleone: Hands-Off Crowdsourcing for Entity Matching", C. Gokhale, S. Das, A. Doan, J. Naughton, N. Rampalli, J. Shavlik, J. Zhu, SIGMOD-14.
Product Matching using Rules
a solution for lay analysts to quickly develop, debug, and refine product matching rules.
code was developed and deployed from Summer 2014 to Summer 2015.
Selected Earlier Projects
developed topic pages for Walmart.com that show integrated results for certain common topics (2013).
a solution to segment product titles (2013-2014)
participated in the development of solutions to build knowledge graphs from Wikipedia and to use such knowledge graphs to perform entity extraction, linking, classification, and tagging for social media.
participated in the development of a solution to run MapReduce-style processing of fast data (i.e., streaming data).
publications:
"Entity Extraction, Linking, Classification, and Tagging for Social Media: A Wikipedia-Based Approach", A. Gattani, D. Lamba, N. Garera, M. Tiwari, X. Chai, S. Das, S. Subramaniam, A. Rajaraman, V. Harinarayan, and A. Doan. VLDB-13 (industrial).
"Building, Maintaining, and Using Knowledge Bases: A Report from the Trenches", O. Deshpande, D. Lamba, M. Tourn, S. Das, S. Subramaniam, A. Rajaraman, V. Harinarayan, A. Doan. SIGMOD-13 (industrial).
"Muppet: MapReduce-Style Processing of Fast Data", W. Lam, L. Liu, S. Prasad, A. Rajaraman, Z. Vacheri, A. Doan. VLDB-12 (industrial).