Downloaded subtitles for three movies: one English subtitle file and two Chinese subtitle files per movie. Merged the text lines that share the same ID into a single line, then converted the subtitles into Excel files. The script, LineMerge.py, is available on GitHub.
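A minimal sketch of the merge step, not the actual LineMerge.py. It assumes the subtitles are in SRT format, where each entry has a numeric ID, a timestamp line, and one or more text lines. The function names merge_subtitle_lines and write_rows are hypothetical. For a dependency-free example it writes a CSV file that Excel can open, rather than a native Excel workbook.

```python
import csv
import re

# Matches an SRT timestamp line such as "00:00:01,000 --> 00:00:03,000".
TIMESTAMP = re.compile(
    r"\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)


def merge_subtitle_lines(srt_text, sep=" "):
    """Return a list of (id, text) pairs, one per subtitle entry.

    All text lines that belong to the same ID are joined with `sep`.
    """
    rows = []
    # Drop a leading byte-order mark, then split on blank lines,
    # which separate entries in an SRT file.
    cleaned = srt_text.lstrip("\ufeff").strip()
    for block in re.split(r"\n\s*\n", cleaned):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        # A usable entry needs an ID, a timestamp, and at least one text line.
        if len(lines) < 3 or not lines[0].isdigit():
            continue
        if not TIMESTAMP.match(lines[1]):
            continue
        rows.append((int(lines[0]), sep.join(lines[2:])))
    return rows


def write_rows(rows, path):
    """Write (id, text) pairs to a CSV file that Excel can open."""
    # "utf-8-sig" adds a BOM so Excel detects UTF-8 (needed for Chinese text).
    with open(path, "w", newline="", encoding="utf-8-sig") as fh:
        writer = csv.writer(fh)
        writer.writerow(["id", "text"])
        writer.writerows(rows)
```

For Chinese subtitles, passing sep="" avoids inserting a space between the merged lines.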
Preprocessing text (tokenization for English; tokenization + word segmentation for Chinese)
Used Stanford CoreNLP to tokenize the English text and to tokenize and segment the Simplified Chinese text from the subtitle files
Command used for English tokenization: java -cp "*" -Xmx500m edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize -outputFormat text -file text.txt
Command used for Simplified Chinese tokenization: java -cp "*" -mx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -props StanfordCoreNLP-chinese.properties -annotators tokenize -outputFormat text -file text.txt
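The two commands above could be wrapped in a small Python helper so that every subtitle file is processed the same way. This is only a sketch: build_corenlp_command and run_corenlp are hypothetical names, and it assumes that java is on the PATH and that corenlp_dir is the folder containing the CoreNLP jars (and the Chinese models jar for Chinese input). The flags are copied unchanged from the commands above.

```python
import subprocess

PIPELINE_CLASS = "edu.stanford.nlp.pipeline.StanfordCoreNLP"


def build_corenlp_command(language, input_file):
    """Build the java argument list for one of the two commands above."""
    common_tail = ["-annotators", "tokenize",
                   "-outputFormat", "text",
                   "-file", input_file]
    if language == "english":
        return ["java", "-cp", "*", "-Xmx500m", PIPELINE_CLASS] + common_tail
    if language == "chinese":
        return (["java", "-cp", "*", "-mx3g", PIPELINE_CLASS,
                 "-props", "StanfordCoreNLP-chinese.properties"]
                + common_tail)
    raise ValueError("language must be 'english' or 'chinese'")


def run_corenlp(language, input_file, corenlp_dir):
    """Run CoreNLP from the folder that holds its jar files.

    The "*" classpath is passed to java literally (no shell is involved),
    and java resolves it against the working directory, so cwd must be
    the CoreNLP folder. input_file is resolved relative to that folder too.
    """
    command = build_corenlp_command(language, input_file)
    subprocess.run(command, cwd=corenlp_dir, check=True)
```

Passing the arguments as a list avoids shell quoting problems with the "*" classpath.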