Chinese Pipeline

Introduction

Red Hen gathers Chinese broadcasts to build datasets for NLP, OCR, audio, and video pipelines. Work on these pipelines began as part of Red Hen's Google Summer of Code 2018. The broadcasts include TV news from CCTV1, CCTV13, HNTV1, and HNCSTV. We plan to use OCR to extract on-screen text. As of December 2018, a preliminary Automatic Speech Recognition pipeline is in place (see below), but it needs considerable improvement. We will also implement basic natural language processing tasks such as Chinese word segmentation, part-of-speech tagging, and named-entity recognition. Future work includes data analysis of the texts these technologies produce.
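To illustrate what Chinese word segmentation involves, here is a minimal sketch of forward maximum matching, a classic dictionary-based baseline. It is not the method used by this pipeline, which has not been specified here; the function name `forward_max_match`, the `max_len` parameter, and the toy vocabulary are all assumptions made up for this example.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation.

    At each position, take the longest substring (up to max_len characters)
    that appears in vocab; if none matches, emit a single character.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Toy vocabulary, invented for this example only.
VOCAB = {"中央", "电视台", "新闻", "播出"}

if __name__ == "__main__":
    print(forward_max_match("中央电视台播出新闻", VOCAB))
```

A greedy dictionary matcher like this cannot resolve ambiguous boundaries or handle out-of-vocabulary words, which is why production systems generally rely on statistical or neural segmenters instead.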

Getting started with Chinese Pipeline

Preliminary Automatic Speech Recognition pipeline in production

See Automatic Speech Recognition on Chinese (github)

Prerequisites

For Audio-only Speech Recognition:

For Audio-Visual Speech Recognition:

In addition to the above requirements, you will also need:

OpenCV 3.x for Python
scikit-image
Dlib for Python
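Before starting audio-visual work, it can help to confirm that the three packages above are importable. The following is a minimal sketch using only the Python standard library; the helper name `check_avsr_prerequisites` is hypothetical and not part of the pipeline. It only checks that each module can be located, so it does not verify versions such as OpenCV 3.x.

```python
import importlib.util

# Import names for the packages listed above. The import name can differ
# from the package name: OpenCV imports as cv2, scikit-image as skimage.
AVSR_MODULES = {
    "OpenCV": "cv2",
    "scikit-image": "skimage",
    "Dlib": "dlib",
}


def check_avsr_prerequisites(modules=AVSR_MODULES):
    """Return {package label: bool} saying whether each module can be found."""
    return {
        label: importlib.util.find_spec(name) is not None
        for label, name in modules.items()
    }


if __name__ == "__main__":
    for label, found in check_avsr_prerequisites().items():
        print(f"{label}: {'found' if found else 'MISSING'}")
```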

For Natural Language Processing:

Installation

To be added.

Data Preprocessing for Training

Audio-only Speech Recognition

Audio-Visual Speech Recognition (AVSR)

Training

Audio-only Model

Audio-Visual Model

Training Results
