City Directories
I am currently working on the digitization of German city directories (Adressbücher) using Optical Character Recognition (OCR) to speed up this cumbersome process. See Albers and Kappner (2022) for a discussion. City directories provide personal information about the household head (name, occupation, exact address, telephone number, and, for a few directories, floor number), as well as business/amenity locations within the city, over 200 years at a nearly annual frequency. Occupations can be mapped to income via HISCAM scores (Lambert et al. 2013). You can find the Excel file for Düsseldorf 1891 here (0.9 MB, done in May 2025); beyond the current OCR/clean-up pipeline, it took about 30 minutes of manual correction of flagged errors, not counting the roughly 20 minutes of OCR that can run in the background. Currently, the character error rate (CER) achieved for Fraktur is approx. 1 error per page (approx. 0.03%), with post-processing able to reach 0.01%. This is lower than some reliable, not overexaggerated estimates of 0.13% for manual entry of non-Fraktur sources (see the great work/slides by Lin, Moulton, Rand and Smith, p. 48). Note, however, that OCR error rates for Fraktur typefaces are about an order of magnitude higher than for other typefaces, due to the difficulty OCR has distinguishing visually similar glyphs (u vs. n, f vs. s, c vs. e), which can be partially corrected via post-processing.
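For illustration, here is a minimal sketch of what dictionary-based post-processing of these confusions could look like. The confusion pairs are those listed above; the lexicon, function names, and the two-swap limit are hypothetical choices for exposition, not the actual pipeline:

```python
from itertools import combinations

# Common Fraktur OCR confusions mentioned above (symmetric pairs).
CONFUSABLE = {"u": "n", "n": "u", "f": "s", "s": "f", "c": "e", "e": "c"}

def candidate_corrections(token: str, max_swaps: int = 2):
    """Yield variants of `token` with up to `max_swaps` confusable
    characters swapped (u<->n, f<->s, c<->e)."""
    positions = [i for i, ch in enumerate(token) if ch in CONFUSABLE]
    for k in range(1, min(max_swaps, len(positions)) + 1):
        for subset in combinations(positions, k):
            chars = list(token)
            for i in subset:
                chars[i] = CONFUSABLE[chars[i]]
            yield "".join(chars)

def correct_token(token: str, lexicon: set) -> str:
    """Replace an out-of-lexicon token by a confusion-swapped variant
    if exactly one such variant is in the lexicon; else leave it."""
    if token in lexicon:
        return token
    hits = [c for c in candidate_corrections(token) if c in lexicon]
    return hits[0] if len(hits) == 1 else token  # ambiguous: keep as-is

# Example: OCR read 'u' as 'n' in 'Hauptstraße'.
lexicon = {"Hauptstraße", "Kaufmann", "Schneider"}
print(correct_token("Hanptstraße", lexicon))  # -> Hauptstraße
```

Requiring a *unique* in-lexicon variant is deliberately conservative: an ambiguous token is left alone and flagged for manual review rather than silently "corrected".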
I am currently squeezing out every bit of optimization left, with the actual digitization work (scanning + OCR + geocoding) expected to begin at the end of 2025.
We can then track historical segregation patterns in income/socio-economic background through industrialization and rapid urbanization until today, or create an alternative to the recent efforts in measuring historical urban built-up area using historical maps (see the MapHis project or Combes et al. 2020), but at a much higher frequency.
With my current process, the already scanned Düsseldorf 1891 city directory (33k entries) takes about 20 minutes of OCR + post-processing (about 10 minutes of which are single-core, but directly scalable across multiple CPU cores) plus 30 minutes of manual clean-up of flagged entries, and the automated steps run in the background. The result has very few errors, making it possible to mass-geocode it for empirical work. However, scanning this directory of a rather large city would take another 40 minutes. At this point, the remaining errors occur almost always due to non-optimal scans of the directories available online, mostly 'shadows' from the curvature of the book and inappropriate lighting. Hence, manual labor amounts to about 1 hour, depending on the size of the city and the quality of the pre-existing scans, and less if the scans were made with OCR in mind. In comparison, it took about 7 volunteers of the Verein für Computergenealogie 3 months to manually input the Bautzen 1938 city directory with 22k entries.
We can also exploit the panel structure to reduce errors further, particularly in the highly varied last names: cross-verifying a household's last name across years helps catch remaining 'typos', as sketched below.
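A minimal sketch of such cross-year verification, assuming entries have already been linked across years. The (address, occupation) panel key, the edit-distance threshold, and the example data are hypothetical choices for illustration, not the actual method:

```python
from collections import Counter, defaultdict

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def flag_surname_typos(entries, max_dist=1):
    """entries: list of (year, address, occupation, surname) tuples.
    Group by (address, occupation) as a crude panel key and flag
    surnames that deviate only slightly from the modal spelling."""
    panels = defaultdict(list)
    for year, addr, occ, name in entries:
        panels[(addr, occ)].append((year, name))
    flags = []
    for key, obs in panels.items():
        modal, _ = Counter(n for _, n in obs).most_common(1)[0]
        for year, name in obs:
            if name != modal and levenshtein(name, modal) <= max_dist:
                flags.append((key, year, name, modal))  # likely OCR typo
    return flags

entries = [
    (1891, "Hauptstraße 12", "Schneider", "Müller"),
    (1892, "Hauptstraße 12", "Schneider", "Müller"),
    (1893, "Hauptstraße 12", "Schneider", "Mnller"),  # u read as n
]
print(flag_surname_typos(entries))
```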
Currently (June 2025), this pipeline is being optimized for lower character error rates through post-processing of very common errors (impossible character combinations, common s/f or n/u confusions in frequently found words), targeting a CER of 0.01% (currently about 0.03%). The OCR pipeline has also been optimized for the rare case of city directories (seemingly between 1900 and 1945, and only in 'province capitals' as far as I can tell) reporting the exact floor of the household with Roman I's, which makes it hard for the OCR to distinguish 1's from I's (the floor 'number') and may lead to geocoding errors afterwards. So far, post-processing has identified an estimated 20 entries in the Düsseldorf 1898 directory with a wrong address number caused by this kind of error (out of around 43k entries, so around 0.046%). This rate has been achieved through a custom-made training set plus a dictionary: for example, we do not include implausibly high street numbers such as 831 in the dictionary, since such a token is more likely to be 83I, i.e. street number 83, floor I. That is, for around 0.046% of entries, we may miss the correct location by a few meters to a few hundred meters, and only for (seemingly) province-capital city directories between 1900 and 1945 that use I's to signal the floor number (city directories frequently switch to other markers for no apparent reason, but those do not cause this kind of ambiguity). These errors can be further reduced by exploiting the panel nature of city directories and cross-matching across time. (July 2025) It turns out this error rate can be reduced to almost 0% by selecting the correct DPI.
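A rough sketch of the kind of dictionary-based disambiguation described above. The regex heuristic and the house-number cap are my own assumptions for illustration (in practice a cap would come from the directory's own street register), not the actual pipeline's rules:

```python
import re

# Hypothetical cap on plausible house numbers per street.
MAX_HOUSE_NUMBER = 400

def split_number_floor(token: str):
    """Disambiguate OCR tokens like '831': if the full number is
    implausibly high, reinterpret trailing 1's as Roman-numeral
    floor I's (e.g. '831' -> house 83, floor 'I')."""
    m = re.fullmatch(r"(\d+?)(1*)", token)
    if m is None:
        return token, None
    head, ones = m.groups()
    if int(token) <= MAX_HOUSE_NUMBER or not ones:
        return int(token), None          # plausible house number
    return int(head), "I" * len(ones)    # e.g. house 83, floor I

print(split_number_floor("831"))  # -> (83, 'I')
print(split_number_floor("12"))   # -> (12, None)
```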