Further preprocessing would likely improve the model's performance and reliability (a combined sketch of these steps follows this list):
Empty words - there are many underscore-only feature names such as [__, ____, ...] which contribute no clear meaning, so these should be filtered out
Lemmatisation - distinct surface forms such as [abbreviate, abbreviated, abbreviation, abbreviations] are unlikely to be meaningful for this classification task, so these should be collapsed to a single form
Stop words - applying a stop-word list could also help remove low-information features
Recurring patterns - removing other recurring text groups could prove useful, such as (1) the header of each article, (2) potentially the author's information, (3) dates, and (4) names
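A minimal sketch of how these four cleanup steps might be combined, using NLTK and scikit-learn; the regexes for empty tokens and dates are assumptions about this corpus, and stripping headers, author information, and names would need corpus-specific patterns (or named-entity recognition) not shown here:

```python
import re

from nltk.stem import WordNetLemmatizer  # one-time setup: nltk.download("wordnet")
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

lemmatizer = WordNetLemmatizer()
EMPTY_TOKEN_RE = re.compile(r"^_+$")                                  # underscore-only tokens
DATE_RE = re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b|\b\d{4}\b")  # rough date shapes

def preprocess(text):
    """Drop date-like strings, empty tokens, and stop words,
    then collapse inflected forms to a shared lemma."""
    text = DATE_RE.sub(" ", text.lower())
    tokens = re.findall(r"[a-z_]+", text)
    return [
        lemmatizer.lemmatize(tok)
        for tok in tokens
        if tok not in ENGLISH_STOP_WORDS and not EMPTY_TOKEN_RE.match(tok)
    ]
```

This `preprocess` function could then be handed to the vectoriser, e.g. `CountVectorizer(tokenizer=preprocess)`. Note that plain WordNet lemmatisation is noun-biased, so a POS-aware lemmatiser (or a stemmer) would collapse verb forms such as "abbreviated" more aggressively.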
However, despite its early state, it is apparent that the classifier - as currently set up - is already beginning to identify some relevant features for recognising articles that discuss ethnolinguistic groups; below, you can review some of our initial sorted positive and negative feature weights for each category, extracted along the lines of the following sketch.
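This assumes a OneVsRestClassifier wrapping a linear model (e.g. LogisticRegression) and a fitted vectoriser; the variable names are illustrative, not the project's actual code:

```python
import numpy as np

def show_top_features(ovr_clf, vectorizer, k=10):
    """Print the k most positive and k most negative feature weights
    for each category of a fitted OneVsRestClassifier over linear models."""
    names = np.asarray(vectorizer.get_feature_names_out())
    for label, est in zip(ovr_clf.classes_, ovr_clf.estimators_):
        order = np.argsort(est.coef_.ravel())  # ascending by weight
        print(f"\n{label}")
        print("  positive:", ", ".join(names[order[-k:]][::-1]))
        print("  negative:", ", ".join(names[order[:k]]))
```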
The initial classification does appear potentially useful for speeding up annotation efforts (by helping identify articles to look at vs. ones to likely skip over), and the model is grasping features that make its annotation assistance better than chance. However, considerably more improvement is needed before the model's predictions are reliable enough to be used for annotation without review, deployed in journal search systems, etc.
Fascinatingly, the model may already be beginning to detect concerning patterns present in the data - for example, "gay" acting in opposition to (as a negative indicator of) "African American" language/identity. To explain a bit further, this weighting means that where the word "gay" is present, the model is less likely to think the article has to do with African American language / AAVL. It is important to note that this apparent prejudice may be strongly influenced by the OneVsRest approach, in which the model tries to choose *one* category in opposition to the others; future versions of this model will likely use a multi-label approach instead, where multiple categories can be true at once, as is the case in the actual data (a sketch of that setup follows the questions below).
This weighting suggests that - across the data - "gay" rarely co-occurs with articles labeled African American language / AAVL; this can be checked directly, as sketched below.
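A quick way to verify that co-occurrence, assuming the articles live in a DataFrame - the file name, column names, and label string here are all hypothetical:

```python
import pandas as pd

df = pd.read_csv("articles.csv")  # hypothetical file with "text" and "labels" columns
has_gay = df["text"].str.contains(r"\bgay\b", case=False)
is_aavl = df["labels"].str.contains("AAVL", case=False)

# 2x2 contingency table: how often "gay" appears in AAVL-labeled
# articles vs. everywhere else
print(pd.crosstab(has_gay, is_aavl))
```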
Also note that some less apparently-racialised features, such as "gay" and "women", appear here as negative features for AAL / AAVL. This is interesting given that the strongest negative features for AAL / AAVL are more understandably in opposition, since they are themselves racialised / ethnolinguistic terms: [japanese, spanish, indian, french]. That opposition makes (questionably) more sense, though even there people can belong to more than one of these groups at a time. Is the model learning to place "gay" and "women" in opposition to AAL / AAVL:
(1) because we have asked the model to choose a single category via OneVsRest?
(2) because we have asked the model to choose among mostly racialised / ethnolinguistic categories, with "LGBTQ" and "women's language" being primary exceptions?
(3) or another reason?
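On question (1), here is a minimal sketch of the multi-label alternative mentioned above, in which each article can carry several categories at once so that "gay"-related and AAL/AAVL labels no longer compete for a single slot; the corpus and labels are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

texts = ["...article text...", "...another article..."]  # placeholder corpus
labels = [["AAVL", "LGBTQ"], ["women's language"]]       # multiple labels per article

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)  # one binary indicator column per category

X = TfidfVectorizer().fit_transform(texts)
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
# clf.predict(X) now yields an independent yes/no per category, rather
# than forcing a single category in opposition to all the others.
```

With indicator-matrix targets like this, an article labeled both AAVL and LGBTQ counts as a positive example for both binary problems at training time, so the presence of one label no longer pushes against the other.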
It seems curious that these features (AAL/AAVL and "gay") would be structured in opposition - after all, it is very evident in the world that African-American/Black and gay/queer/non-heteronormative language frequently overlap. It is an interesting parallel to see emerging: a prejudice / inverse correlation similar to those often expressed in popular narratives within US cultures (e.g. the idea that Black cultures are generally homophobic). It will be very insightful to see what further sociolinguistic patterns emerge in the weights as the model learns from larger amounts of data.
The rest of the categories also reveal interesting perceived correlations - please enjoy reviewing these initial results below!
Additionally, the features which are clearly or likely unhelpful are good indications of the next candidates for removal - below, I have highlighted in yellow some examples that seemed clearly insignificant in my initial review.
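Beyond eyeballing, near-zero-weight features can also be surfaced programmatically as removal candidates; a heuristic sketch, reusing the fitted objects assumed in the earlier weight-inspection sketch:

```python
import numpy as np

def removal_candidates(ovr_clf, vectorizer, k=25):
    """Return the k features whose weights stay closest to zero across
    every category - little influence anywhere, so likely safe to prune."""
    names = np.asarray(vectorizer.get_feature_names_out())
    coefs = np.vstack([est.coef_.ravel() for est in ovr_clf.estimators_])
    importance = np.abs(coefs).max(axis=0)  # each feature's largest |weight|
    return names[np.argsort(importance)[:k]]
```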