KDD 2020 Tutorial

Image and Video Understanding for Recommendation and Spam Detection Systems

Ananth Sankar

Aman Gupta

Sirjan Kafle

Di Wen

Sumit Srivastava

Suhit Sinha

Nikita Gupta

Bharat Jain

Dylan Wang

Liang Zhang

Instructors (in-person) - Aman Gupta (LinkedIn), Sirjan Kafle (LinkedIn), Di Wen (LinkedIn), Ananth Sankar (LinkedIn), Sumit Srivastava (LinkedIn)

Tutors - Dylan Wang (LinkedIn), Suhit Sinha (LinkedIn), Nikita Gupta (LinkedIn), Bharat Jain (LinkedIn), Liang Zhang (LinkedIn)

Image and video content has become ubiquitous in domains such as news, entertainment, and education. Users typically discover and engage with this content through search and recommendation systems. It is also important to serve high-quality content to users by filtering out irrelevant or harmful items. There is therefore a growing need to leverage the rich information in images and videos to power search and recommendation systems. At the same time, the availability of large-scale labeled datasets and sophisticated deep learning models has made these systems more effective and more efficient.

This tutorial provides an overview of image and video understanding and its practical applications in industry. We focus on state-of-the-art deep learning techniques for tasks such as image classification and segmentation, image-based content retrieval, and video classification. We also cover applications of these techniques to large-scale recommendation and low-quality content detection systems. We present concrete examples from various LinkedIn production systems and discuss the associated practical challenges. The tutorial concludes with a discussion of emerging trends and future directions.

Questions? - Contact Aman Gupta at amagupta@linkedin.com

Outline

Introduction (Slides)

Theory

- Image understanding (Slides)
  - Tasks - image classification, object detection, semantic/instance segmentation, visual question answering, image captioning
  - Image representations
    - Before Deep Learning - HoG, SIFT, VLAD
    - Deep Learning and CNNs
    - Self-supervised learning
    - Optimization for CNNs - implicit regularization for SGD, double descent, flooding
    - Image embeddings
    - Metric learning for images
    - Visio-lingual representations
- Video understanding (Slides)
  - Tasks - video classification, action recognition, temporal topic localization, video captioning
  - Video embeddings and networks
    - Before Deep Learning - SIFT, Fisher Vectors, Optical Flow
    - 3D CNNs
    - Two-stream networks
    - Improvements on 3D CNNs and Two-stream
    - Non-local networks and SlowFast
    - Self-supervised video embeddings
    - DSGMM and deep cluster-and-aggregate method
  - Speech technologies for video understanding

Applications (Slides)

- Introduction - feed, ads, search and spam
- Multimedia Infrastructure @ LinkedIn
- Multimedia Search @ LinkedIn
- Common technologies for feed and ads recommendation @ LinkedIn
  - Video representations used in production
- Feed recommendation @ LinkedIn
- Ads recommendation @ LinkedIn
- Spam and low quality content detection @ LinkedIn