Classical vs. Modern
We found that sentences in classical documents are usually shorter than those in modern documents, which gives a simple criterion for categorizing writing style. Specifically, we split each document into sentences and compute the proportion of sentences whose length is greater than 10. The histograms of this proportion for the different sets are shown in the following figures:
Histogram for the validation and test sets. The blue area shows classical documents, while the orange area shows modern documents. A proportion of 0.2 is a rough boundary between classical and modern.
Histogram for the training set. These documents are not annotated as classical or modern. Since the histogram is similar to that of the validation/test sets, we use the same boundary (0.2) to distinguish the document style when necessary.
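The criterion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the exact script used to produce the histograms: the sentence delimiters, the use of character count as the sentence length, and the function names `long_sentence_ratio` and `guess_style` are all hypothetical. The direction of the rule (below 0.2 means classical) is inferred from the observation that classical documents tend to have shorter sentences.

```python
import re

# Assumption: sentences end at common ASCII or CJK end marks.
# The original text does not specify how sentences are split.
SENTENCE_END = re.compile(r"[.!?;。！？；]+")


def long_sentence_ratio(document: str, min_length: int = 10) -> float:
    """Return the proportion of sentences longer than min_length.

    Length is measured in characters here, which is an assumption.
    """
    sentences = [s.strip() for s in SENTENCE_END.split(document) if s.strip()]
    if not sentences:
        return 0.0
    long_count = sum(1 for s in sentences if len(s) > min_length)
    return long_count / len(sentences)


def guess_style(document: str, boundary: float = 0.2) -> str:
    """Label a document with the rough 0.2 boundary.

    Assumption: a ratio below the boundary indicates classical style,
    since classical documents tend to have shorter sentences.
    """
    return "classical" if long_sentence_ratio(document) < boundary else "modern"
```

Because 0.2 is only a rough boundary, a label produced this way is best treated as a heuristic guess, especially for documents whose ratio is close to the boundary.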