The Impact of Preprocessing on the Classification of Mental Disorders

Yaakov HaCohen-Kerner (Lev Academic Center), Yair Yigal, Daniel Miller

Hundreds of millions of people worldwide suffer from various mental disorders. Recent studies have shown that some of these disorders can be identified with text classification (TC) models through intelligent analysis of the texts written by affected people. A related task is the automatic classification of documents according to their authors' mental state, using supervised mental medical datasets. It is well known that many TC applications perform various types of preprocessing, e.g., conversion of uppercase letters into lowercase letters, HTML object removal, stopword removal, punctuation mark removal, lemmatization, correction of commonly misspelled words, and reduction of replicated characters. We hypothesize that applying certain specific combinations of preprocessing methods can improve TC results. In this study, we explore the impact of all possible combinations of six basic preprocessing types on the TC of mental disorders. We evaluated these combinations on three supervised mental medical datasets. In the largest dataset, the best result, obtained using all six preprocessing methods, showed a significant improvement of about 28% over the baseline. In the other two datasets, several combinations of preprocessing methods yielded only minimal improvements over the baseline. Possible explanations for the small improvements in these two datasets are that they are too small to enable meaningful learning, and that the largest dataset contains a much higher number of spelling mistakes, which increases the benefit of the preprocessing method that corrects commonly misspelled words.
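
The sketch below is a minimal, illustrative example (not the authors' code) of how six basic preprocessing steps can be toggled on and off and enumerated as all 2^6 = 64 combinations. The step implementations, their names (e.g., `reduce_repeated_chars`), and the application order are assumptions for illustration only; a real pipeline would use proper lemmatization and spelling-correction tools.

```python
import itertools
import re
import string

# Tiny illustrative stopword list; a real pipeline would use a full list.
STOPWORDS = {"a", "an", "the", "is", "are", "and", "or", "of", "to", "in"}

def lowercase(text):              # L: convert uppercase letters to lowercase
    return text.lower()

def remove_html(text):            # H: strip HTML tags/objects
    return re.sub(r"<[^>]+>", " ", text)

def remove_punctuation(text):     # P: drop punctuation marks
    return text.translate(str.maketrans("", "", string.punctuation))

def remove_stopwords(text):       # S: drop very common function words
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

def reduce_repeated_chars(text):  # R: e.g., "soooo" -> "soo" (cap runs at 2)
    return re.sub(r"(.)\1{2,}", r"\1\1", text)

def correct_spelling(text):       # C: placeholder; a real pipeline would call a spell checker
    return text

STEPS = {"L": lowercase, "H": remove_html, "P": remove_punctuation,
         "S": remove_stopwords, "R": reduce_repeated_chars, "C": correct_spelling}

def preprocess(text, active):
    """Apply the selected preprocessing steps, in a fixed (assumed) order."""
    for key in "HLRCPS":
        if key in active:
            text = STEPS[key](text)
    return text

# Enumerate all 64 subsets of the six steps (the empty set is the baseline).
all_combinations = [frozenset(c)
                    for r in range(len(STEPS) + 1)
                    for c in itertools.combinations(STEPS, r)]

if __name__ == "__main__":
    doc = "<p>I feel soooo TIRED, and nothing helps...</p>"
    for combo in (frozenset(), frozenset("L"), frozenset(STEPS)):
        print(sorted(combo), "->", preprocess(doc, combo))
```

In a study of this kind, each of the 64 preprocessed variants of a corpus would then be fed to the same classifier and compared against the no-preprocessing baseline.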