NER on the Bible: don't do it!

As I mentioned in my comments on Melanie Walsh's textbook, she encourages students to try the techniques she discusses on another text that she provides. I decided to use spaCy's Named Entity Recognition pipeline on some biblical texts. It did remarkably poorly, frankly!

I used it to analyze the ASV edition of Exodus, 1 Samuel, Esther, Deuteronomy, Leviticus, and Luke; I also tested the NIV edition of Luke.
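For readers who want to try something similar, here is a minimal sketch of how such a run might look. It is not the exact code I used: the file name `exodus_asv.txt`, the model name `en_core_web_sm`, and the helper names are assumptions for illustration. It tallies how often each name shows up under spaCy's people (PERSON), nationalities/groups (NORP), place (GPE, LOC), and organization (ORG) labels.

```python
from collections import Counter, defaultdict

# Entity labels of interest in spaCy's English models.
LABELS = ("PERSON", "NORP", "GPE", "LOC", "ORG")


def tally_entities(doc, labels=LABELS):
    """Count entity surface forms per label for a spaCy Doc (or anything
    exposing .ents whose items have .text and .label_)."""
    counts = defaultdict(Counter)
    for ent in doc.ents:
        if ent.label_ in labels:
            counts[ent.label_][ent.text] += 1
    return counts


def run_ner(path="exodus_asv.txt", model="en_core_web_sm"):
    """Load a plain-text book and tally its entities.

    Both default arguments are hypothetical; the model must be installed
    separately (python -m spacy download en_core_web_sm).
    """
    import spacy  # imported here so tally_entities works without spaCy

    nlp = spacy.load(model)
    with open(path, encoding="utf-8") as f:
        doc = nlp(f.read())
    return tally_entities(doc)


def print_top(counts, n=15):
    """Print the n most frequent names under each label."""
    for label, counter in counts.items():
        print(label, counter.most_common(n))
```

Calling `print_top(run_ner())` prints the most frequent names under each label, which is enough to see at a glance whether a name such as Moses ever appears under PERSON.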

Not surprisingly, it failed to identify the tribal names as anything (people, nationalities/political groups, or geographic regions). Most weirdly, in Exodus and Deuteronomy, it failed to identify Moses or Miriam as people even once, although it did identify Aaron, Jethro (but not Reuel), Bezalel, and Zipporah.

It did much better on nations, picking up Egypt/Egyptians, Reuben, Midian, and Amalek. Locations were OK, too, although I think "Israel" in Exodus generally refers to the people, not a region. Similar results were found for Deuteronomy and Leviticus.

spaCy did much better with 1 Samuel, identifying David, Hannah, Jonathan, and Dagon, but failing to identify Saul as a person. It identified LORD as an organization(!). (I didn't rerun the analysis to see if it did this for Exodus, but it probably did.)

It did even better with Esther, identifying all the major characters (Mordecai, Haman, Esther, King Ahasuerus, and Queen Vashti) and locations (Persia, India, Ethiopia, Media). But it also listed Ahasuerus as a location.

When analyzing Luke, the translation didn't make much difference. It identified some people (Elizabeth, Joseph, John, Martha, and Peter), but had some misidentifications as well (Ye, Lo, doth, Sabbath). Place names were generally OK (Jerusalem, Israel, Jordan, Syria, Bethany) but with some mistakes (Gentiles, Beelzebub, Caiaphas).

Most oddly, while it noted (incorrectly) that LORD was something, it didn't identify "God" as a person or anything else!

I suspect that the materials spaCy has been trained on use "God" mostly as an expletive or exclamation, rather than an actor. Names that are used today (Israel, Syria; Joseph, Martha) are more likely to be correctly identified (but this doesn't explain why Moses and Miriam were never identified as people).

It is also notable that spaCy did much better on stories / histories (1 Samuel, Esther, Luke) than it did on legal texts (Exodus, Leviticus, Deuteronomy).

It's pretty clear that folks who want to do NER analysis on the Bible should create different training datasets to ensure more accurate results.
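As a rough illustration of what a first step toward such a dataset might look like, the sketch below turns a hand-made name list into character-offset annotations, in the `(text, {"entities": [(start, end, label)]})` shape that spaCy's training examples can be built from. The name list, its labels, and the function name are my own assumptions, and simple string matching like this is only a starting point: it cannot tell "Israel" the people from "Israel" the place, so the output would still need checking by hand.

```python
import re

# Hypothetical hand-made name list; labels follow spaCy's English scheme.
GAZETTEER = {
    "Moses": "PERSON",
    "Miriam": "PERSON",
    "Saul": "PERSON",
    "Egypt": "GPE",
}


def annotate(sentence, gazetteer=GAZETTEER):
    """Return (sentence, {"entities": [(start, end, label), ...]}).

    Matches whole words only, so "Egypt" does not fire inside "Egyptians".
    """
    spans = []
    for name, label in gazetteer.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, sentence):
            spans.append((match.start(), match.end(), label))
    return sentence, {"entities": sorted(spans)}
```

Running `annotate` over each verse of a book would give a list of candidate training examples to review and correct before any training is attempted.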

=======

PS I also ran Conan Doyle's Hound of the Baskervilles through the spaCy NER system. It had a much higher success rate: of the top 15 people it identified, only one (Baskerville Hall) was not a person. For locations, though, it made several mistakes (Stapleton, Barrymore, Selden, Holmes). Again, I think it works better on narratives written in a modern style.