Blog

On the First Week (Oct. 24, 2018 - Oct. 29, 2018)

  1. Downloaded subtitles for three movies; each movie has one English subtitle and two Chinese subtitles. Merged the lines sharing the same subtitle ID into one line and converted the subtitles into Excel files. The script is on GitHub as LineMerge.py.
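A rough sketch of the merging step: a standard .srt file stores each subtitle as a block (numeric ID, timestamp line, one or more text lines) separated by blank lines, so merging amounts to joining each block's text lines. This is an assumption about what LineMerge.py does, not its actual code:

```python
import re

def merge_srt_lines(srt_text):
    """Merge the multi-line text under each subtitle ID into one line.

    Sketch only; assumes the standard .srt block layout
    (ID line, timestamp line, then the text lines).
    """
    entries = []
    # Blocks in an .srt file are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue
        sub_id, timestamp = lines[0].strip(), lines[1].strip()
        # Join every text line of the block into a single line.
        text = " ".join(line.strip() for line in lines[2:])
        entries.append((sub_id, timestamp, text))
    return entries
```

Each returned tuple (ID, timestamp, merged text) then maps onto one row of the Excel file.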
  2. Preprocessed the text (English: tokenization; Chinese: tokenization + word segmentation)
    • Used CoreNLP to tokenize the English text, and to tokenize and segment the Simplified Chinese text, from the subtitle files
    • Command used for English tokenization: java -cp "*" -Xmx500m edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize -outputFormat text -file text.txt
    • Command used for Simplified Chinese tokenization: java -cp "*" -mx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -annotators tokenize -outputFormat text -file text.txt
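To go from CoreNLP's output back into Python, the `-outputFormat text` result can be parsed with a small regex. This sketch assumes the text format lists each token on a line like `[Text=Hello CharacterOffsetBegin=0 CharacterOffsetEnd=5]` (the exact attribute set depends on the annotators enabled):

```python
import re

# Match the Text=... attribute of each token line in CoreNLP's
# "-outputFormat text" output; non-greedy so it stops at the next attribute.
TOKEN_RE = re.compile(r"\[Text=(.*?) CharacterOffsetBegin=")

def extract_tokens(corenlp_output):
    """Return the token strings, in order, from a CoreNLP text-format dump."""
    return TOKEN_RE.findall(corenlp_output)
```

The resulting token list is what gets numbered and written to CSV in the next step.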
  3. Loaded the results into the database
    • Transformed the tokenization output into CSV files
    • Imported the CSV files into a MySQL database
    • Set the token number as the primary key
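The CSV step above can be sketched as follows: write each token with a running token number in the first column, so that column can serve as the primary key after import. The function name and column names here are illustrative, not the actual script:

```python
import csv

def tokens_to_csv(tokens, path):
    """Write tokens to a CSV with a running token number as the
    first column, intended to become the primary key in MySQL."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["token_id", "token"])
        for i, tok in enumerate(tokens, start=1):
            writer.writerow([i, tok])
```

On the MySQL side, a file like this can be imported with `LOAD DATA INFILE` (or a GUI import) into a table whose `token_id` column is declared `PRIMARY KEY`.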