Limitations

Data Distribution of Categories

One central limitation was that, as the chart above shows, the articles were not evenly distributed across the categories:
- The largest category by far was CAT_1 (AAL / AAVL), which suggests that an initial classifier could target this category alone, or could combine CAT_1 and CAT_2
- The next largest categories were CAT_2 (African Americans) and CAT_4 (Native Americans), which makes them strong candidates for the next labels to integrate into the classifier
- Then, there are categories which have almost no instances:
  - UNK and CAT_0 (negative categories): this makes sense, because we excluded most articles that were unclear or did not represent another category, and included only positive instances
  - CAT_3_1 (Latinx / Hispanic Diaspora) had 1 article
  - CAT_4_1 (Indigenous Peoples - World) had 2 articles
- The remaining categories had only a small number of articles each, too few to support reliable predictions
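
The two options mentioned above (a classifier for CAT_1 alone, or one that combines CAT_1 and CAT_2) could be sketched roughly as follows. This is only an illustration and not the code used in the project: the helper names `sparse_categories` and `to_binary` are hypothetical, and the label counts are invented toy values, apart from CAT_3_1 (1 article) and CAT_4_1 (2 articles).

```python
from collections import Counter

def sparse_categories(labels, min_count):
    """Return the categories with fewer than min_count articles, sorted by name."""
    counts = Counter(labels)
    return sorted(cat for cat, n in counts.items() if n < min_count)

def to_binary(labels, positive):
    """Map each label to 1 if it belongs to the `positive` set, else 0."""
    return [1 if lab in positive else 0 for lab in labels]

# Toy labels: counts are invented for illustration,
# except CAT_3_1 (1 article) and CAT_4_1 (2 articles).
labels = ["CAT_1"] * 6 + ["CAT_2"] * 3 + ["CAT_4_1"] * 2 + ["CAT_3_1"]

# Categories that are too small to learn from (the threshold is an assumption).
sparse = sparse_categories(labels, min_count=3)

# Option 1: classify CAT_1 against everything else.
cat1_only = to_binary(labels, {"CAT_1"})

# Option 2: treat CAT_1 and CAT_2 together as the positive class.
cat1_or_2 = to_binary(labels, {"CAT_1", "CAT_2"})

print(sparse, sum(cat1_only), sum(cat1_or_2))
```

Collapsing to a binary target sidesteps the sparse categories entirely, at the cost of not predicting them at all.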

Overall Data Quantity

There were very few articles overall, and some of them were quite short.
This is a major challenge. Annotating the current articles already took an immense amount of labor, spread over years of manual annotation by Dr. Lanehart and Ayesha. Including more annotated articles would certainly improve the quality of the results, but it would also require a large amount of further labor, unless clustering techniques were used to automate part of the process (and their output would likely still need review).

Time

Another limitation was that the time I could spend on the project changed dramatically. As a result, recurring challenges (such as setting up a stratified k-fold for cross-validation) did not receive enough time to be resolved and became a major blocker.
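
For reference, the idea behind a stratified k-fold split can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the setup attempted in the project: the function names `stratified_folds` and `train_test_splits` and the toy labels "A", "B", "C" are hypothetical. In practice a library implementation such as scikit-learn's `StratifiedKFold` would normally be used instead.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign each sample index to one of k folds, spreading every class
    as evenly as possible across the folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0  # carry the position over between classes to balance fold sizes
    for lab in sorted(by_class):
        indices = by_class[lab]
        rng.shuffle(indices)
        for i, idx in enumerate(indices):
            folds[(offset + i) % k].append(idx)
        offset += len(indices)
    return folds

def train_test_splits(folds):
    """Yield (train_indices, test_indices), holding each fold out once."""
    for held_out, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != held_out for idx in fold]
        yield sorted(train), sorted(test)

# Toy labels (invented): class "C" has a single sample, fewer than k.
labels = ["A"] * 6 + ["B"] * 3 + ["C"]
folds = stratified_folds(labels, k=3)

for train, test in train_test_splits(folds):
    print(len(train), len(test))
```

Note that a class with fewer samples than `k` (like "C" here) cannot appear in every test fold, which is the same difficulty the sparse categories described earlier would cause.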
