Classical vs. Modern

We found that sentences in classical documents are usually shorter than those in modern documents, which gives a simple criterion for categorizing writing style. In detail, we split each document into sentences and compute the proportion of sentences whose length is greater than 10. Because classical documents contain fewer long sentences, they tend to have a lower proportion. The histograms for the different sets are shown in the following figures:

The histograms of the validation and test sets. The blue area shows the histogram of classical documents, while the orange area shows the histogram of modern documents. A proportion of 0.2 is a rough boundary between classical and modern.

The histogram of the training set. These documents are not annotated as classical or modern. Since the histogram is similar to those of the validation and test sets, we use the same boundary (0.2) to distinguish the document style when necessary.
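
The criterion described above can be illustrated with a minimal Python sketch. It is not the original implementation. The sentence splitter (a split on common Chinese and Western end punctuation), the choice to measure length in characters, and the function names are all assumptions made for illustration; only the length cutoff (10) and the boundary (0.2) come from the text. The sketch also assumes that a proportion below the boundary indicates a classical document, which follows from classical sentences being shorter.

```python
import re

# Assumption: sentences end at these punctuation marks. The source does not
# say which splitter was used, and this one also splits on decimal points.
_SENTENCE_END = re.compile(r"[。！？；!?;.]+")

LENGTH_CUTOFF = 10    # from the text: sentences with a length greater than 10
STYLE_BOUNDARY = 0.2  # from the text: rough boundary between the two styles


def split_sentences(document):
    """Split a document into non-empty, stripped sentences."""
    parts = _SENTENCE_END.split(document)
    return [p.strip() for p in parts if p.strip()]


def long_sentence_ratio(document, cutoff=LENGTH_CUTOFF):
    """Proportion of sentences longer than `cutoff` characters (assumed unit)."""
    sentences = split_sentences(document)
    if not sentences:
        # Convention chosen for this sketch: an empty document has ratio 0.0.
        return 0.0
    long_count = sum(1 for s in sentences if len(s) > cutoff)
    return long_count / len(sentences)


def guess_style(document, boundary=STYLE_BOUNDARY):
    """Label a document 'classical' if its ratio is below the boundary."""
    if long_sentence_ratio(document) < boundary:
        return "classical"
    return "modern"
```

For example, guess_style("学而时习之。不亦说乎？") yields "classical" under these assumptions, because neither sentence exceeds 10 characters. Since 0.2 is only a rough boundary, documents with a ratio close to it may be mislabeled.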