Master Address File

Trading time for space

A nice feature of B-TREE-P is that only item identifiers (and some pointers) are actually stored in the B-tree file, regardless of the length of the logical keys specified in the BTPKEY routine. As a result, very little disk space is used up by the B-trees. Indexing 100,000 records by a five-digit ZIP code from one attribute takes the same amount of disk space as indexing those records by some 40-character string from another field.

The penalty for using minimum disk space in B-TREE-P comes in the form of the additional time required to access the logical data file and construct the logical key whenever a B-tree is traversed. For example, calling BTPINS to index a new customer record by ZIP code might require comparing the new key with ten or twenty logical keys stored in the B-tree. BTPINS accomplishes that by extracting item identifiers from the B-tree, using the identifiers to read the actual customer records from the customer file, and calling BTPKEY to convert the item identifiers into logical ZIP code keys. Conversion of item identifiers to logical keys, and all the time necessary for the accompanying disk reads, occurs with every call to BTPINS, BTPDEL, and BTPFIND.

Installations that have plenty of disk space may instead prefer to keep complete, actual keys in B-trees instead of just item identifiers. That requires more disk space, but avoids any disk reads of the logical data file, and allows faster execution. Fortunately, it is easy to modify B-TREE-P subroutines to do just that.

To create a version of BTPINS that stores complete keys in the B-tree file and avoids reading the logical data file, change the variable IN to KEY in lines 18 and 19, and change the variable UID to FK in lines 19, 35, 59, and 78. Then delete lines 23 and 24. Note that the DFILE file variable parameter is no longer used anywhere in BTPINS, since accessing the data file is no longer required.

As a benchmark on a small Pick system, the BUILD program was used to create ZIP, COMP, and LNAME B-trees for 1,000 NAMES items. With the original BTPINS and a node size of 5, BUILD required just under 22 minutes. With the new BTPINS, BUILD required just under 12 minutes, or a little more than half as much time, because reads of the NAMES file were no longer being done by BTPINS. However, the B-TREE file grew from 23,729 bytes under the original BTPINS to 124,461 bytes with the new BTPINS, because long logical keys are now stored there, instead of short item identifiers.

As a second test, BUILD was run again, with a node size of 50. With the original BTPINS, total time required for three B-trees was 20 minutes 39 seconds, a slight improvement over a node size of 5. The B-TREE file also dropped to 16,015 bytes, because larger nodes mean fewer total nodes and pointers. With the new BTPINS and a node size of 50, BUILD took over 17 minutes, so increasing the node size can backfire and slow down B-tree manipulation when long logical keys are stored in the node. A better node size for our file and the new BTPINS is 10, which causes essentially no increase in BUILD time, but yields a smaller B-TREE file because of fewer nodes.

If you decide to modify BTPINS and use more disk space in exchange for faster B-tree inserts, you'll also have to change BTPFIND so that it too works with actual keys instead of only item identifiers in the B-tree nodes. To create a new version of BTPFIND that works with the new BTPINS, change UID to KEY in line 9, change line 10 to IF KEY = FK THEN ID = KEY ; RETURN, and then delete lines 11 and 12.

Like the modified version of BTPINS, the DFILE file variable parameter is no longer used anywhere in the new BTPFIND, since accessing the data file is no longer required. As a result, BTPFIND runs much faster. For example, one benchmark using the original BTPFIND to find 1,000 items in a B-tree built with the original BTPINS required 6 minutes 40 seconds. With the new BTPINS and BTPFIND, finding 1,000 items in the B-tree took only 3 minutes 15 seconds.

Once BTPINS and BTPFIND are modified, the keys retrieved from a B-tree by BTPFIND and BTPSEQ are not just item identifiers, but complete logical keys as constructed by BTPKEY. For example, after a call to BTPFIND to search the LNAME B-tree, the ID variable will return a string like "SMITHnJOHNn376", which is a last name, first name, and item identifier as actually concatenated and stored in some B-tree node (the n indicates the null character). Similarly, BTPSEQ will return the same kind of complete key in its ADJKEY variable.

To take a complete logical key returned by BTPSEQ or the new BTPFIND and truncate it to just the item identifier (which should always be the last null-delimited field in the key), use a routine that is the opposite operation of BTPKEY:

SUBROUTINE BTPUNKEY(ROOT, KEY)

EQU nul TO CHAR(0)

BEGIN CASE

CASE ROOT = "ZIP"

  KEY = TRIM(FIELD(KEY,nul,6))

CASE ROOT = "COMP"

  KEY = TRIM(FIELD(KEY,nul,4))

CASE ROOT = "LNAME"

  KEY = TRIM(FIELD(KEY,nul,3))

CASE 1 ; STOP "BTP7"

END CASE

RETURN

END

Note that BROWSE.NAMES won't work correctly after BTPINS and BTPFIND have been modified, because BROWSE.NAMES expects BTPFIND and BTPSEQ to return simple item identifiers, but now those routines return complete logical keys.

One way to fix BROWSE.NAMES and make it compatible with the new BTPINS and BTPFIND is to call BTPUNKEY after each call to BTPSEQ and BTPFIND, to convert a long logical key to an item identifier. For example, insert the statement CALL BTPUNKEY(ROOT, CURR.ID) after the call to BTPFIND in line 40, and CALL BTPUNKEY(ROOT, DUMMY.ID) after the call to BTPFIND in line 174. Similarly, insert the statement CALL BTPUNKEY(ROOT, PREV.ID) after the calls to BTPSEQ in lines 85 and 124, and insert CALL BTPUNKEY(ROOT, NEXT.ID) after the calls to BTPSEQ in lines 90, 107, 132, and 153. Also insert CALL BTPUNKEY(ROOT, CURR.ID) after line 34 of BROWSE.NAMES, since that statement extracts an initial key directly from a B-tree node.

Note that by calling BTPUNKEY only from BROWSE.NAMES, the BTPSEQ subroutine never has to be modified. BTPSEQ could be modified to automatically call BTPUNKEY and truncate ADJKEY before each RETURN in lines 16, 21, 26, and 41, so that BROWSE.NAMES wouldn't have to follow each call to BTPSEQ with a call to BTPUNKEY, but then BROWSE.NAMES would still have to somehow pass the ROOT parameter to BTPSEQ.

Unlike BTPSEQ, BTPFIND already has access to the ROOT parameter, so BTPFIND could easily be modified to call BTPUNKEY and truncate ID before each RETURN in lines 10 and 25, so BROWSE.NAMES wouldn't have to follow each call to BTPFIND with a call to BTPUNKEY. However, this discussion will leave BTPSEQ unmodified for simplicity, and BTPFIND will also not call BTPUNKEY, so that BTPFIND returns the same complete key as BTPSEQ will. That leaves all BTPUNKEY calls in BROWSE.NAMES. Even if BTPSEQ and BTPFIND were to automatically call BTPUNKEY before they return, BROWSE.NAMES still has to have at least one BTPUNKEY call after the initial key extraction in line 34.

Modifying the original BTPINS, BTPFIND, and BROWSE.NAMES source code as described above creates a demonstration system that lets you build and browse B-trees containing complete keys instead of just item identifiers. Although more disk space is used for the longer keys, the insert and find routines run much faster. A final modification to BTPDEL makes it compatible with the new routines and increases the speed of B-tree deletes: change SID to KEY in lines 10 and 11, and change ID to FK in lines 11 and 32. Then delete lines 16 and 17.