Credits

plenum.be • The Memory of Parliamentary Democracy

Creating a Digital Archive for the Proceedings of the Belgian Chamber of Representatives, 1870-1940

The development of plenum.be was made possible through a research grant from the University of Antwerp. The project is based on the existing digital archive of the Belgian Parliament's plenary sessions of 1844-1999, which can be found on the Chamber's website (dekamer.be / lachambre.be).

The plenary sessions comprise an immense amount of text: in most years, at least 2,000 pages of parliamentary speeches and debates were published. For the first phase of our project, which covers only the period 1870-1940, the corpus totals over 137,000 pages, or 183 million words (see the plenum.be statistics).

Project Director: Marnix Beyen
Research & Testing: Kaspar Beelen
Web Development: Thomas Crombez



Project Background


The digital archive of 1844-1999 maintained by the Chamber's documentation centre consists of scanned pages from the original annals, which were published in book form. In order to make this immense archive of parliamentary discourse searchable, we first had to process all PDF scans with an OCR application (Abbyy FineReader). Optical Character Recognition (OCR) technology makes it possible to 'read' an image of a printed page and detect blocks of text, which are then converted into digital text files.
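As an illustration of this step, the sketch below shows how a batch of scanned volumes could be OCR'ed with open-source tools (pdf2image and Tesseract) rather than Abbyy FineReader, which is the application we actually used. The folder names and the French language setting are assumptions made for the example.

```python
# Illustrative OCR batch sketch using open-source tools (pdf2image + pytesseract),
# standing in for the Abbyy FineReader workflow used by the project.
# The folder names and the 'fra' language model are assumptions for this example.
import os
from pdf2image import convert_from_path   # renders PDF pages as images
import pytesseract                        # wrapper around the Tesseract OCR engine

SCANS_DIR = 'annales_pdf'    # hypothetical folder holding the scanned PDF volumes
OUTPUT_DIR = 'annales_txt'   # hypothetical folder for the recognized text

os.makedirs(OUTPUT_DIR, exist_ok=True)

for filename in sorted(os.listdir(SCANS_DIR)):
    if not filename.endswith('.pdf'):
        continue
    pages = convert_from_path(os.path.join(SCANS_DIR, filename), dpi=300)
    for number, image in enumerate(pages, start=1):
        # Recognize the page image; the proceedings are largely in French.
        text = pytesseract.image_to_string(image, lang='fra')
        outname = '%s_p%04d.txt' % (os.path.splitext(filename)[0], number)
        with open(os.path.join(OUTPUT_DIR, outname), 'w', encoding='utf-8') as f:
            f.write(text)
```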

The quality of the OCR transcriptions is generally good, but far from perfect. Minor errors occur on every page, and they mostly involve names. In the original publications of the parliamentary proceedings, names are often capitalized or set in bold type, which makes the letters more difficult to recognize. We aim to improve the quality of the text through manual and automatic corrections as the project progresses.
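A minimal sketch of what such an automatic correction pass could look like, assuming a hand-curated table of frequent misreadings of deputies' names (the entries below are hypothetical examples, not actual corrections from the project):

```python
import re

# Hypothetical table of frequent OCR misreadings of deputies' names,
# mapping the erroneous form to the intended spelling.
NAME_CORRECTIONS = {
    'Vander velde': 'Vandervelde',
    'W0este': 'Woeste',
    'HYIUANS': 'HYMANS',
}

def correct_names(text):
    """Apply simple substitutions for known name misreadings."""
    for wrong, right in NAME_CORRECTIONS.items():
        # \b keeps the replacement from firing in the middle of a longer word.
        text = re.sub(r'\b%s' % re.escape(wrong), right, text)
    return text
```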

Once we have PDF facsimiles and text files of every page, the 'site construction' phase begins. This website was not built from scratch: it uses Google's web-building tool, Google Sites.

Every page on this website is part of a Google Site. The pages show both the OCR'ed text and the PDF facsimile of the original page. The PDF files are presented using another useful tool, namely the PDF viewer that has recently been integrated into Google Docs.
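A simplified sketch of how the body of one such page could be assembled, assuming the scanned PDFs are publicly reachable and are embedded through the Google Docs viewer; the base URL and file layout are placeholders, not the project's actual locations.

```python
# Sketch of assembling the HTML body for a single archive page:
# the OCR'ed text followed by an embedded PDF facsimile in the Google Docs viewer.
# The base URL and file layout are assumptions for this example.
from urllib.parse import quote

PDF_BASE_URL = 'http://example.org/scans/'   # hypothetical location of the facsimiles

def build_page_html(ocr_text, pdf_filename):
    """Combine the OCR'ed text and an embedded facsimile into one HTML body."""
    # The Google Docs viewer takes the (URL-encoded) address of a public PDF.
    viewer_url = ('https://docs.google.com/viewer?embedded=true&url='
                  + quote(PDF_BASE_URL + pdf_filename, safe=''))
    return ('<div class="ocr-text"><pre>%s</pre></div>\n'
            '<iframe src="%s" width="100%%" height="800"></iframe>'
            % (ocr_text, viewer_url))
```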

Why do we use Google Sites? Despite the severe limitations that Google Sites places on design and scripting, the application offered a set of features that were particularly attractive for this kind of project. First, there is no limit on the number of HTML pages that can be created in a single Google Site, which meant we could post this massive amount of text online without worrying about the disk limits of our own (or hosted) servers. Second, every Google Site is fully indexed and searchable (using the search box at the top of the page), which freed us from having to develop and host the enormous search index ourselves.

Still, it would have been impossible to post thousands of pages without a way of automating the process. This key feature became available when the Google Data APIs were released, enabling programmatic access to many of Google's applications. We chose the gdata Python client library (code.google.com/p/gdata-python-client) to upload all of our HTML pages to the Google Site.
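The core of that upload step looks roughly like the sketch below, which follows the standard examples for the gdata.sites client. The account credentials, site name, page title, and HTML body are placeholders for this illustration, and the library dates from the Python 2 era, so details may differ from current tooling.

```python
# Sketch of uploading one generated HTML page to a Google Site with the
# gdata Python client library. Credentials, site name, and content are
# placeholders; the CreatePage() call follows the Sites Data API examples.
import gdata.sites.client

client = gdata.sites.client.SitesClient(
    source='plenum-uploader',        # arbitrary application identifier
    site='plenum')                   # name of the target Google Site
client.ClientLogin('user@example.com', 'password', client.source)

# Create a new web page inside the site, with a title and an HTML body.
entry = client.CreatePage(
    'webpage',
    'Annales 1894-95, p. 12',
    html='<div>OCR text and PDF facsimile go here.</div>')

print('Created page at: ' + entry.GetAlternateLink().href)
```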