Building LampLight

It starts with an idea...

Have you ever heard a preacher say, "Your Bible says ______, but in the original <insert language here> that really means ______"? I have experienced this countless times. I think to myself: what else is hiding in the original meaning? Am I missing the point of what I'm reading? Am I studying the Bible incorrectly? An experienced Bible scholar once said,

"To know what something means, we first need to know what it meant"

So to find all the context I need, do I need to learn Hebrew and Greek? Perhaps live in the regions where the text was originally written to pick up cultural subtext? Or can I leverage other resources to gain some insight? Bible translators have done some of the hard work already by taking a concept and translating it multiple times throughout the text. For example, the English word "ark" actually represents 2 different Hebrew words in the Old Testament. Both mean "ark", but the original scrolls used different Hebrew constructions to be more specific. One of the Hebrew words means a rigid receptacle, often with a cover (think Ark of the Covenant.) The other Hebrew word means a floating container for holding items (think Noah's Ark.) Here's where it gets interesting. The basket that Moses floated in as a baby (Exodus 2:3) uses the exact same Hebrew word that was used for Noah's Ark (Genesis 6:14.) Noah's family was being rescued from the flood, and Moses' life would be used to rescue God's people from Egyptian slavery. The similarities were not an accident. The Holy Spirit has built scripture in an amazing way, full of codes, patterns, symbolisms, and treasures!

We can notice that the same Hebrew word is translated to different English words, and the opposite is also true. So what I need to do is cross-reference English translations from my favorite Bible version (NASB) with the original Greek/Hebrew, so I can jump from native words to my own language and vice-versa. That would expose places where the original text was meant to connect ideas, but translators decoupled them to make the Bible more readable and the language more natural.

A concordance is a resource used to find patterns and similarities in Bible verses. You can look up a word, say "rock", and find all the verses that use that same word. I would like to do a similar thing, but also across languages.

the English word "ark" actually represents 2 different Hebrew words with slightly different meanings

Decisions, decisions...

MySQL

PROs

  • Usually comes standard with hosting
  • Industry standard
  • Lots of docs

CONs

  • Missing features (until recently)
  • Some configuration requires ROOT

Postgres

PROs

  • Advanced features
  • I wanted to learn this for personal development

CONs

  • Usually more expensive with a hosting package
  • Some configuration requires ROOT

SQLite

PROs

  • Highly configurable
  • In-memory features

CONs

  • Smaller community
  • Less documentation
  • Fewer features

GoDaddy

PROs

  • I already had a site with them
  • Good support
  • Reliable uptime

CONs

  • Old Python version
  • Limited DB tech
  • Expensive after 1 year

PythonAnywhere

PROs

  • Super cheap
  • New Python versions
  • Active developer community
  • Great dev portal

CONs

  • Security
  • Postgres costs more
  • Limited storage
  • Performance

DigitalOcean

PROs

  • Droplet/Kubernetes architecture
  • Postgres
  • Trusted performance/scale

CONs

  • Expensive
  • More complicated setup

Database technology and hosting choices go hand-in-hand because they need to integrate well. This meant I was stuck with MySQL unless I wanted to pay more. Over time, I realized that the advanced Postgres features were doable in MySQL with a bit more effort, so this didn't really create any barriers for me.

The low cost of PythonAnywhere and access to great dev tools made it an attractive option

In the beginning, there was Data...

Bible.org API Token

If the YouVersion public API had been available I would have used that (it was released a few months later.) Instead, I found a public API on bible.org that gave me free access to NASB content. I called this API for every verse and stored the results in my first database table. The point of storing them is to reduce reliance on another ecosystem that could introduce latency. Once the verses are in a table, I can join against it and build some complicated queries to access the data at runtime.

import json
import os

import requests
from lxml import html

# fetch one chapter of NASB text from the bibles.org API
req = requests.get("https://bibles.org/chapters/eng-NASB:" +
                   short_bookname + "." + str(chapter) + ".js",
                   auth=(os.environ['BIBLES_ORG_API_KEY'],
                         os.environ['BIBLES_ORG_API_PASS']))
j = json.loads(req.text)

# the chapter text comes back as HTML, so parse it into a tree
tree = html.fromstring(j['chapters']['0']['chapter']['text'])

# each verse is a span whose class starts with "v"
verses = tree.xpath('//div[@class="chapter"]'
                    '//span[starts-with(@class, "v")]')

for v in verses:
    verse = v.text_content()  # plain verse text, ready to store
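
For completeness, here's a rough sketch of the storage step, continuing from the snippet above. The PyMySQL client, the connection details, and the book_id variable are my assumptions; the column names match the ones that show up in the queries later in this post.

import os
import pymysql  # assumed MySQL client; any DB-API driver looks about the same

conn = pymysql.connect(host="localhost", user="lamplight",
                       password=os.environ['DB_PASS'], database="lamplight")

with conn.cursor() as cur:
    # verse_num from enumerate is a simplification; the real number could
    # also be read from the span markup
    for verse_num, v in enumerate(verses, start=1):
        cur.execute("INSERT INTO Verse (book_id, chapter, verse_num, text) "
                    "VALUES (%s, %s, %s, %s)",
                    (book_id, chapter, verse_num, v.text_content()))
conn.commit()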

Concordance Scraping

I was unable to find a public API to access concordance and interlinear data, so I decided to web-scrape the info I needed from biblehub.com. XPath syntax is great for parsing tree-oriented content like HTML and gleaning structured info. This functionality is available in Python with lxml. Here is what that looks like:


from lxml import html
import requests

page = requests.get(f"http://biblehub.com/{lang}/{concord_id}.htm")
tree = html.fromstring(page.content)

translation = tree.xpath('//div[@id="leftbox"]'
                         '/div[@class="padleft"]/span[@class="hdg"]'
                         '/text()[starts-with(., "NASB Translation")]'
                         '/ancestor::span'
                         '/following-sibling::text()[1]')

Interlinear Scraping

This is the part that connects English to Greek/Hebrew and also allows us to connect to a concordance ID (along the top of the image above.) Strong's concordance has a universal ID numbering system that makes correlation really easy. I can use the English text here to populate a Word table.


from lxml import html
import requests

page = requests.get(f"http://biblehub.com/interlinear/"
                            f"{book_name}/"
                            f"{row['chapter']}-{row['verse_num']}.htm")
page.encoding = 'utf-8'
tree = html.fromstring(page.content)

word_root = tree.xpath('//table[@class="tablefloatheb"]')
for elem in word_root:
      concord = elem.xpath('.//span[@class="strongs"]/a/text()')
      eng_text = elem.xpath('.//span[@class="eng"]/text()')
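
Continuing from that loop, a rough sketch of turning the scraped values into rows for the Word table might look like this. The exact format of the strongs link text, and the digit-stripping cleanup applied to it, are assumptions on my part.

import re

word_rows = []
for elem in word_root:
    concord = elem.xpath('.//span[@class="strongs"]/a/text()')
    eng_text = elem.xpath('.//span[@class="eng"]/text()')
    if not concord or not eng_text:
        continue

    # keep only the digits of the Strong's ID in case the link text
    # carries extra characters (an assumption about the markup)
    strongs_id = re.sub(r'\D', '', concord[0])
    english = eng_text[0].strip()
    if strongs_id and english:
        word_rows.append((int(strongs_id), english))

# word_rows now holds (concordance_id, english_text) pairs that can feed
# the Word table and its bridge tables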

Phrase Algorithm

I want the ability to search by phrases as well as words. There is a lot of great info to be found by searching things like "the day of the Lord", "Holy Spirit", etc. So I want to loop through every sentence in the Bible and find every phrase. When I track a phrase, I keep track of which chapter/verse it came from so I can see where the phrase occurs. By tracking the chapter/verse I can also keep a count of how frequent phrases are. The challenges of doing this are the following:

  1. There are a LOT of phrases (I end up finding 3.1 million)
  2. Doing this all in memory requires a LOT of memory
  3. Some phrases are pointless (e.g. "In the")
  4. Some verses end in the middle of a sentence (so the phrase can continue between verses)

The way I went about this was to keep appending verses until I reached the end of a sentence (which solves #4.) At the end of a sentence, I chop the sentence into fragments at commas, parentheses, and things like that. I generated fragments in an iterative way. For example, "the quick brown fox" turns into: "the quick brown", "the quick", "quick brown fox", "quick brown", "brown fox." Note that a fragment must contain a minimum of 2 words. I adapted the algorithm to throw out overly simple fragments like "the quick": this is not useful since "quick" is already stored as a word. This required a function that determines whether a word is meaningless and therefore shouldn't count toward the 2-word minimum (excluding words like "a", "in", "the", "at", etc.)

I kept a giant hash map and pickled it to a file; whenever the in-memory map grew beyond 2500 entries I merged it into the master file. I kept repeating this process until I got through the whole Bible. Then I had a frequency map of all common phrases! See the actual code below.

def merge_into(src_dict, file_name):
    # first write: just dump the current in-memory map
    if not os.path.exists(file_name) or os.path.getsize(file_name) <= 0:
        with open(file_name, "wb") as file_object:
            pickle.dump(src_dict.copy(), file_object)
        return

    # otherwise load the master file and merge the new phrase locations into it
    with open(file_name, 'rb') as file_object:
        content = pickle.load(file_object)

    for phrase, vrs_tree in src_dict.items():
        for book, chap in vrs_tree.items():
            for chp_num, verses in chap.items():
                if book not in content[phrase]:
                    content[phrase][book] = {}
                book_dict = content[phrase][book]
                if chp_num not in book_dict:
                    book_dict[chp_num] = verses
                else:
                    book_dict[chp_num] = list(set(book_dict[chp_num] + verses))

    with open(file_name, "wb") as file_object:
        pickle.dump(content, file_object)


def find_fragments(db, db_row, primary_dict):
    table = str.maketrans({key: None for key in string.punctuation})
    fixed_str = unicodedata.normalize('NFKD', db_row['text']).encode(
        'ascii', 'ignore').decode()

    num_found = 0
    for part in fixed_str.split('"'):
        for subpart in re.split('[!0-9]*,[!0-9]*', part):
            spacedparts = [str(e.translate(table)) for e in subpart.split(" ") if e]
            num_spacedparts = len(spacedparts)
            if num_spacedparts >= 2:
                for idx in range(num_spacedparts-1):
                    for num_words in range(2, num_spacedparts - idx):
                        
                        # make sure the word is meaningful and not overly simple
                        if is_meaningful(spacedparts[idx:idx+num_words]):
                            phrase = (" ".join(spacedparts[idx:idx+num_words])).strip()
                            
                            # lowercase so we don't care about case-sensitivity
                            ltr = phrase[0].lower()

                            # this is the O(1) insertion to the hash_map
                            if ltr not in primary_dict:
                                primary_dict[ltr] = defaultdict(dict)

                            # track where the phrase came from
                            str_book = str(db_row['book_id'])
                            str_chap = str(db_row['chapter'])
                            if str_book not in primary_dict[ltr][phrase]:
                                primary_dict[ltr][phrase][str_book] = {}
                            book_dict = primary_dict[ltr][phrase][str_book]
                            if str_chap not in book_dict:
                                book_dict[str_chap] = []
                            combined = book_dict[str_chap] + [db_row['verse_num']]
                            book_dict[str_chap] = list(set(combined))

                            # too big!  write it to a file!
                            if len(primary_dict[ltr]) > 2500:
                                merge_into(primary_dict[ltr], 'phrases/'+ltr+'.txt')
                                primary_dict[ltr].clear()

                            num_found += 1

    return num_found
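
The is_meaningful function isn't shown above, but here's a minimal sketch of the idea, assuming a small stop-word set: a fragment only counts if at least 2 of its words aren't filler.

# a tiny stop-word set for illustration; the real list would be longer
STOP_WORDS = {"a", "an", "in", "the", "at", "of", "to", "and", "or"}

def is_meaningful(words):
    # filler words don't count toward the 2-word minimum
    meaningful = [w for w in words if w.lower() not in STOP_WORDS]
    return len(meaningful) >= 2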

Bridging Tables

As mentioned earlier, MySQL was missing some features that would have been nice to have (like storing and querying JSON datatypes.) That functionality hasn't been around long in the database world, but lately we rely on it like a crutch. There is a way around this: bridging tables.

Without Bridging

WordSchema
{
    varchar(255) word
    int word_id
    json verses
}

With Bridging

WordSchema
{
    varchar(255) word
    int word_id
}

WordVerseBridgeSchema
{
    int word_id // not a primary key, repeated
    int verse_hash // book << 24 | chap << 16 | verse
}

We need bridges between Word <-> Verse, Word <-> Concordance, Concordance <-> Verse, and Phrase <-> Verse. One of the downsides is that when any of the bridged tables changes, we need to regenerate the bridging rows. So there is a maintenance cost, but the table structure ends up being a bit simpler this way.
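
To make the verse_hash comment in the schema above concrete, here's a small sketch of how that packing could be done (the function name is mine):

def verse_hash(book_id, chapter, verse_num):
    # pack book/chapter/verse into one integer, per the
    # "book << 24 | chap << 16 | verse" comment in the schema above
    return (book_id << 24) | (chapter << 16) | verse_num

# a Word <-> Verse bridge row is then just (word_id, verse_hash(book, chap, verse))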

Query time!

You thought we'd never get here! Here is an example of searching for verses that contain the words "hope" and "fear." The Verse table is joined to itself so each word can be matched independently within the same verse, and note how the bridging tables are used in the joins to create links from words to verses.

SELECT DISTINCT Book.title, A.text, A.chapter, A.verse_num
FROM Verse A 
INNER JOIN Verse B ON A.hash = B.hash 
JOIN Book ON A.book_id = Book.book_id 
JOIN WordVerseBridge Avb ON A.hash = Avb.verse_hash 
JOIN Word Aw ON Aw.word_id = Avb.word_id 
JOIN WordVerseBridge Bvb ON B.hash = Bvb.verse_hash 
JOIN Word Bw ON Bw.word_id = Bvb.word_id 
WHERE Aw.word_ci = "hope" AND (Avb.is_match=1) AND Bw.word_ci = "fear" AND (Bvb.is_match=1) 
ORDER BY A.hash ASC 
LIMIT 51 
OFFSET 0
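
In the app, a query like this would be built with placeholders rather than hard-coded words. Here's a sketch of running it from Python, assuming a PyMySQL connection and the same schema as above:

import os
import pymysql  # assumed client library

conn = pymysql.connect(host="localhost", user="lamplight",
                       password=os.environ['DB_PASS'], database="lamplight")

search_sql = """
SELECT DISTINCT Book.title, A.text, A.chapter, A.verse_num
FROM Verse A
INNER JOIN Verse B ON A.hash = B.hash
JOIN Book ON A.book_id = Book.book_id
JOIN WordVerseBridge Avb ON A.hash = Avb.verse_hash
JOIN Word Aw ON Aw.word_id = Avb.word_id
JOIN WordVerseBridge Bvb ON B.hash = Bvb.verse_hash
JOIN Word Bw ON Bw.word_id = Bvb.word_id
WHERE Aw.word_ci = %s AND Avb.is_match = 1
  AND Bw.word_ci = %s AND Bvb.is_match = 1
ORDER BY A.hash ASC
LIMIT 51 OFFSET 0
"""

with conn.cursor() as cur:
    cur.execute(search_sql, ("hope", "fear"))  # parameterized search terms
    results = cur.fetchall()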

Visualization

All of this data living in tables doesn't help very much on its own. I needed a way to display it and interactively sort and filter it. Here is the design I ended up with. The page allows you to click on any word in gray and see the native form of that word. Some stats are included about which author(s) most frequently used that native word. Search terms can be added and removed on the fly to generate new searches.

Results

As I have been improving the tool, I have had a lot of success looking up various words and finding linkages in places I did not expect. Here are some gems I was surprised to find.

  • Ark: Same word for Noah's ark and Moses' basket
  • Eleazar/Eliezer: Holy Spirit references (Abraham, Moses, Jesus)
  • Cloud: Presence of God
  • I am: Gen 1:3 same word as Ex 3:14
  • Jesus/Joshua: Luke 3:29 shows how they are the same natively
  • Red Sea: "Red" word definition looks a lot more like "Reed"
  • Love: First mention is Gen 22:2 (Abraham & Isaac)
  • Holy: First mention is Ex 3:6 (Burning Bush)
  • Satan: Same word is used as "adversary" in other places in Hebrew