Walmart Labs

We have been actively collaborating with Walmart Labs on research and development since 2011. The pace picked up significantly starting in Summer 2013. Work in the past three years has focused on research challenges in data integration, data analytics, and knowledge graphs in Category Development, led by Vijay Raghavendra.

Selected Current Projects

Product Matching

This project examines matching e-commerce products. Significant results in the past year (9/2015-7/2016):

  • Investigations

      • examined the quality of training and testing data sets

      • examined the impact of attribute extraction on matching accuracy

  • Technical solutions and methodologies

      • solutions to match certain product departments with very high accuracy

      • solutions to scale up product matching

      • solutions to debug product matching

      • end-to-end methodologies to quickly build a matcher for any product department

  • Code and software packages

      • the project has contributed significantly to the following three software packages (on GitHub):

      • code for scaling up product matching is also available (not yet industrial strength)

  • Publications

  • Human resource training

      • helped train a large number of graduate students at UW-Madison (and a large number of undergraduate students in Fall 2016) in data science topics.

Data Cleaning

This project focuses on verifying and normalizing the values of product attributes (e.g., converting both "P&G" and "Procter & Gamble" into "Procter and Gamble Corp"). Significant results in the past year include:

  • Technical solutions and methodologies

      • A solution to verify attribute values using crowdsourcing, reducing cost by up to 60% compared to baseline solutions.

      • A solution to drastically cut down the time an analyst spends manually normalizing attribute values.

      • A solution that uses automatic clustering together with an analyst's manual effort to normalize attribute values.

  • Code and software packages

      • Code has been developed for all three of the above solutions, but it is not yet industrial strength.

  • Publications

      • Several papers are under preparation.
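To make the normalization task above concrete, here is a minimal sketch of how automatic clustering and an analyst's manual input might be combined. It is not the project's code: the `fingerprint` key, the list of corporate suffixes, and the `canonical_by_key` mapping are all assumptions chosen for illustration. Values that differ only in case, punctuation, "&" versus "and", or a corporate suffix fall into the same cluster automatically; the analyst then names each cluster and merges the ones the key cannot connect (such as "P&G").

```python
import re
from collections import defaultdict

# Hypothetical list of suffixes to ignore when clustering (an assumption).
CORPORATE_SUFFIXES = {"corp", "inc", "co", "llc"}


def fingerprint(value):
    """Build a crude clustering key from an attribute value."""
    text = value.lower().replace("&", " and ")
    tokens = re.findall(r"[a-z0-9]+", text)
    tokens = [t for t in tokens if t not in CORPORATE_SUFFIXES]
    return " ".join(sorted(set(tokens)))


def cluster_values(values):
    """Group raw values that share the same fingerprint."""
    clusters = defaultdict(list)
    for value in values:
        clusters[fingerprint(value)].append(value)
    return dict(clusters)


def normalize(values, canonical_by_key):
    """Replace each value with the analyst-chosen canonical form, if any."""
    return [canonical_by_key.get(fingerprint(v), v) for v in values]


raw = ["P&G", "Procter & Gamble", "procter and gamble", "Procter and Gamble Corp"]
clusters = cluster_values(raw)

# The analyst reviews the clusters and assigns one canonical name to each.
# Mapping both keys to the same name is the manual merge step.
canonical_by_key = {
    fingerprint("Procter & Gamble"): "Procter and Gamble Corp",
    fingerprint("P&G"): "Procter and Gamble Corp",
}
cleaned = normalize(raw, canonical_by_key)
```

In this sketch the analyst makes one decision per cluster rather than one per raw value, which is where the time saving would come from.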

Fostering a Data Integration/Preparation Eco-System

This project is new.

Selected Past Projects

Generic Rule Management

    • a solution to quickly detect data cleaning rules that overlap in their coverage.

    • a solution to help analysts write certain kinds of rules much faster

    • a solution to index a data corpus for fast rule execution

    • a solution to scale up the execution of certain kinds of rules over a Hadoop cluster

    • code has been developed for all of the above solutions.

    • publications:

Product Classification using Learning, Rules, and Crowdsourcing

Attribute Extraction using Learning, Rules, and Crowdsourcing

    • a solution that uses machine learning, rules, and crowdsourcing to extract attribute values for products.

    • code was pushed into production in 2014.

Product Matching using Crowdsourcing

    • a solution that crowdsources the entire product matching pipeline, with no involvement from a developer ("hands-off crowdsourcing").

    • publications:

Product Matching using Rules

    • a solution for lay analysts to quickly develop, debug, and refine product matching rules.

    • code was developed and deployed from Summer 2014 to Summer 2015.
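As an illustration of what a product matching rule can look like, the following is a minimal sketch of a single hand-written rule. It is an assumption made for this example, not the deployed system: the product fields (`brand`, `title`), the Jaccard similarity measure, and the 0.6 threshold are all hypothetical choices.

```python
def jaccard(a, b):
    """Jaccard similarity between the word sets of two strings."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def title_brand_rule(p, q, threshold=0.6):
    """Declare a match when brands agree and titles are similar enough."""
    same_brand = p["brand"].lower() == q["brand"].lower()
    return same_brand and jaccard(p["title"], q["title"]) >= threshold


# Hypothetical product records used only to exercise the rule.
bottle_a = {"brand": "Acme", "title": "Acme Steel Water Bottle 32 oz"}
bottle_b = {"brand": "ACME", "title": "acme steel water bottle 32 oz blue"}
sleeve = {"brand": "Acme", "title": "Acme Laptop Sleeve"}

is_match = title_brand_rule(bottle_a, bottle_b)
is_non_match = title_brand_rule(bottle_a, sleeve)
```

A rule of this shape is easy for an analyst to read and adjust: if it produces false matches, the threshold can be raised or another attribute condition added.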

Selected Earlier Projects