Chinese Pipeline

Introduction

Red Hen gathers Chinese broadcasts to build datasets for NLP, OCR, audio, and video pipelines. Work on these pipelines began as part of Red Hen's Google Summer of Code 2018. The broadcasts include TV news from CCTV1, CCTV13, HNTV1, and other channels. We plan to use OCR to extract on-screen text, and deep learning systems (e.g. DeepSpeech2, from Baidu) for Automatic Speech Recognition (ASR). As of December 2018, a preliminary ASR pipeline is in place (see below), but it needs considerable improvement. We also plan to implement basic natural language processing tasks such as Chinese word segmentation, part-of-speech tagging, and named-entity recognition. Future projects include data analysis of the texts these technologies produce.

Preliminary Automatic Speech Recognition pipeline in production

Red Hen runs Chinese ASR in production at the Case HPC using Baidu's DeepSpeech2 with PaddlePaddle, inside a Singularity container built on Singularity Hub from a recipe. It starts with this command:

singularity exec -e --nv ../Chinese_Pipeline.simg bash infer.sh $DAY

The Slurm job submission requests GPU resources (two GPUs, per --gres=gpu:2):

#SBATCH -p gpu -C gpuk40 --mem=100gb --gres=gpu:2

The job's status can then be checked with squeue:

abc123@server:~/cp$ squeue -u abc123
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          12267389       gpu work.slu   abc123 PD       0:00      1 (Priority)
          12267373       gpu work.slu   abc123  R      27:51      1 gput025
          12267379       gpu work.slu   abc123  R      15:41      1 gput026
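
The command and the resource request shown above would normally sit together in one batch script. The following is a minimal sketch of how such a script could be generated for a given day. It is not the script used in production: the function name build_job_script and the example date format are assumptions made for illustration, and only the #SBATCH line and the singularity command come from this page.

```python
import shlex


def build_job_script(day, image="../Chinese_Pipeline.simg"):
    """Return the text of a Slurm batch script that runs ASR for one day.

    `day` is passed to infer.sh as its only argument. Its exact format
    (for example a date string) is an assumption here.
    """
    command = ["singularity", "exec", "-e", "--nv", image,
               "bash", "infer.sh", day]
    lines = [
        "#!/bin/bash",
        # Resource request copied from this page.
        "#SBATCH -p gpu -C gpuk40 --mem=100gb --gres=gpu:2",
        # shlex.join quotes each argument so the shell sees it unchanged.
        shlex.join(command),
    ]
    return "\n".join(lines) + "\n"


# Hypothetical day value, for illustration only.
print(build_job_script("2018-12-01"), end="")
```

The resulting text could be written to a file and submitted with sbatch. Generating the script rather than editing it by hand keeps the day argument as the only thing that changes between jobs.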

It takes about four minutes to run ASR on a standard one-hour recording. 

Pending tasks: Chinese Red Hens report that the output makes sense but contains copious errors and disfluencies. To improve it, the audio should be cut at pauses or word breaks rather than mechanically at ten-second intervals. A training dataset of news content would also help.
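
One possible way to cut at pauses is a simple energy threshold: treat low-amplitude stretches as silence and cut in the middle of any silence that lasts long enough. The sketch below illustrates the idea in plain Python. It is not the method used in the pipeline. The function names and all tuning values (frame length, threshold, minimum pause) are assumptions for illustration, and a real implementation would still need a fallback for stretches with no pause.

```python
def find_pause_cuts(samples, sample_rate, frame_ms=20,
                    threshold=0.01, min_pause_ms=300):
    """Return sample indices at the midpoints of pauses in `samples`.

    samples: list of floats in [-1.0, 1.0] (mono audio).
    A frame is silent when its mean absolute amplitude is below
    `threshold`; a pause is a run of silent frames lasting at least
    `min_pause_ms`. The default values are illustrative assumptions.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    min_frames = max(1, min_pause_ms // frame_ms)
    n_frames = len(samples) // frame_len

    silent = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        silent.append(sum(abs(x) for x in frame) / frame_len < threshold)

    cuts = []
    run_start = None
    # The trailing False is a sentinel that closes a final silent run.
    for i, is_silent in enumerate(silent + [False]):
        if is_silent and run_start is None:
            run_start = i
        elif not is_silent and run_start is not None:
            long_enough = i - run_start >= min_frames
            # Ignore leading and trailing silence: cutting there
            # would only produce an empty segment.
            if long_enough and run_start > 0 and i < n_frames:
                cuts.append((run_start + i) * frame_len // 2)
            run_start = None
    return cuts


def split_at_cuts(samples, cuts):
    """Split `samples` into consecutive segments at the cut indices."""
    bounds = [0] + list(cuts) + [len(samples)]
    return [samples[a:b] for a, b in zip(bounds, bounds[1:])]


# Synthetic example: 1 s of signal, 0.4 s of silence, 1 s of signal.
rate = 1000
audio = [0.5] * 1000 + [0.0] * 400 + [0.5] * 1000
cuts = find_pause_cuts(audio, rate)
print(cuts)                                      # [1200]
print([len(s) for s in split_at_cuts(audio, cuts)])  # [1200, 1200]
```

Cutting at the midpoint of the pause leaves a little silence on both sides of each segment, which avoids clipping the start or end of a word.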

Related pages

Resources

  • THCHS-30 (A Free Chinese Speech Corpus Released by CSLT@Tsinghua University)
  • DeepSpeech (A TensorFlow implementation of Baidu's DeepSpeech architecture)

Getting started with Chinese Pipeline

Prerequisites

  • For Audio-only Speech Recognition:
    • Git Large File Storage
    • TensorFlow 1.0 or above
    • SciPy
    • PyXDG
    • python_speech_features
    • python_soxs
    • pandas
    • FFmpeg
  • For Audio-Visual Speech Recognition (in addition to the requirements above):
    • OpenCV 3.x for Python
    • scikit-image
    • Dlib for Python
  • For Natural Language Processing:
    • jieba
    • pyltp
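
To show what Chinese word segmentation involves, here is a toy sketch of forward maximum matching, a classic dictionary-based approach. It is not how jieba or pyltp work internally, since both rely on much larger dictionaries and statistical models. The function name, the tiny vocabulary, and the example sentence are all made up for illustration.

```python
def forward_max_match(text, vocabulary, max_word_len=4):
    """Segment `text` greedily, always taking the longest dictionary word.

    Characters that start no dictionary word are emitted as
    single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        longest = min(max_word_len, len(text) - i)
        # Try the longest candidate first, then shorter ones.
        for size in range(longest, 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy vocabulary, for illustration only.
TOY_VOCABULARY = {"中央", "电视台", "中央电视台", "新闻", "播出"}

print(forward_max_match("中央电视台播出新闻", TOY_VOCABULARY, max_word_len=5))
# ['中央电视台', '播出', '新闻']
print(forward_max_match("中央电视台播出新闻", TOY_VOCABULARY))
# ['中央', '电视台', '播出', '新闻']
```

The two calls differ only in max_word_len, which shows how sensitive a purely dictionary-based segmenter is to its settings and is one reason to prefer an established library for real data.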

Installation

  • To be added

Data Preprocessing for Training

Audio-only Speech Recognition

Audio-Visual Speech Recognition (AVSR)

Training

Audio-only Model

Audio-Visual Model

Training Results