Description

An important development related to the Comparative Agendas Project is the introduction of automated text classification tools, also known as supervised learning systems. The goal of this research agenda is to substantially lower the cost of topic-coding large numbers of events, including data updates, while maintaining high levels of accuracy and intercoder reliability. Supervised learning systems mimic the decisions of trained human coders. Research to date indicates that this approach can reduce the number of cases that must be manually coded by 70-80%, although results vary with the nature of the data and the size of the training sample.

Our methodology draws on well-established algorithms and stemming techniques from the information sciences. The main cost reductions derive from ensemble learning and active learning. When multiple algorithms (an ensemble) make the same topic prediction, human coders can have high confidence that the event has been properly classified. Active learning refers to a human-centered process of identifying cases where the system is performing poorly and intervening with additional training to reduce or eliminate similar mistakes in future rounds. The related readings listed on this page describe this subject and our approach in more detail.

Text Tools Software (**updated**)

Professor Paul Wolfgang (Temple University, wolfgang@temple.edu) has developed software based on the work of Hillard, Purpura and Wilkerson (article below) that enables CAP researchers to apply state-of-the-art supervised learning methods. As of February 2009, the Text Tools include stemmers for many languages. The latest version, which corrects an error in the previous language-sensitive version (7.0), can be downloaded from http://www.cis.temple.edu/~wolfgang/TextTools_v0.8.zip

The language-specific stemming algorithms and stop word lists were developed by Porter (see http://snowball.tartarus.org) and cover the following languages:
> danish
> dutch
> english
> finnish
> french
> german
> hungarian
> italian
> norwegian
> portuguese
> russian
> spanish
> swedish
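
To make the preprocessing step concrete, here is a minimal Python sketch of stop word removal followed by stemming. It is an illustration only: the stop word list, the suffix list, and the function names `toy_stem` and `preprocess` are hypothetical, and the crude suffix stripping shown here is much simpler than the Snowball algorithms that the Text Tools rely on.

```python
import re

# Hypothetical, tiny stop word list; the real Snowball lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "for", "on"}

# Hypothetical suffixes, checked in this order (longest first).
# These are NOT the Porter/Snowball rules, only a stand-in for them.
SUFFIXES = ("ations", "ation", "ing", "es", "ed", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize on ASCII letters, drop stop words, stem the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [toy_stem(token) for token in tokens if token not in STOP_WORDS]


features = preprocess("Funding for the regulation of banks")
print(features)
```

With these toy rules, "regulation" and "regulations" reduce to the same stem, which is the point of stemming: related word forms are counted as one feature before a classifier is trained. Note that the ASCII-only tokenizer would need to change for most of the languages listed above.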
See also the User Manual and Introduction to Automated Classification.

Related Readings

Hillard, Purpura and Wilkerson, "Automated Text Classification for Mixed Methods Social Science Research," Journal of Information Technology and Politics, June 2008.

Cardie and Wilkerson, "Text Annotation for Political Science (Editor's Introduction)," Journal of Information Technology and Politics, August 2008.
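
As a rough illustration of the ensemble agreement rule from the Description, the Python sketch below accepts a topic code automatically when every classifier agrees and sends the remaining cases to human coders. The function name `route_cases`, the data layout, and the example case ids and topic codes are all assumptions made for this example; this is not the Text Tools implementation.

```python
def route_cases(predictions):
    """Split cases into auto-accepted labels and cases needing human review.

    `predictions` maps a case id to the list of topic codes predicted by
    each classifier in the ensemble (a hypothetical data layout).
    """
    accepted = {}
    needs_review = []
    for case_id, labels in predictions.items():
        # Unanimous agreement: accept the shared prediction.
        if len(set(labels)) == 1:
            accepted[case_id] = labels[0]
        else:
            # Any disagreement (or no predictions): route to a human coder.
            needs_review.append(case_id)
    return accepted, needs_review


# Hypothetical predictions from three classifiers for three cases.
example = {
    "bill_1": [3, 3, 3],
    "bill_2": [3, 12, 3],
    "bill_3": [20, 20, 20],
}
accepted, needs_review = route_cases(example)
print(accepted)
print(needs_review)
```

In an active learning round, the human-coded versions of the cases in `needs_review` would be added to the training sample before the classifiers are retrained. Requiring unanimity is one possible choice; a majority threshold would accept more cases automatically at some cost in accuracy.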