Old OCRopus Wiki

German

German Language Support for OCRopus

Existing Open Source OCR Systems that Can Handle German

The example screenshots for GOCR show umlauts, so it can at least recognize German characters.

Existing Commercial OCR Systems that Can Handle German

Abbyy http://www.abbyy.com/

Nuance TextBridge http://www.nuance.com/textbridge/

Nuance Omnipage http://www.nuance.com/omnipage/

Ground-Truthed German OCR Data (Scans + Transcription)

The German Project Gutenberg is probably a good source, but it only contains old texts (more than 70 years old), and the language has changed quite a bit since then. It is therefore probably only useful for improving the image-to-character recognition step. It is not a good source for the language model and grammar.

Other sources could be the German Wikisource or Zeno.org.

German Dictionaries and Text Corpora (for Statistical Language Modeling)

You can try yearly archive DVDs of German magazines (for example, http://www.emedia.de/ from the Heise Verlag).

Wikipedia has an article about the Institute for the German Language (Institut für Deutsche Sprache, IDS). Under "Archive", you'll find this:

"Außerdem stellt das IDS das weltweit größte Angebot an deutschsprachigen Textkorpora/ Textsammlungen geschriebener Sprache (umgerechnet fast fünf Millionen Buchseiten) zur Verfügung. Es gibt für diese Sammlung mehrere tausend registrierte Internet-Benutzer im In- und Ausland." (Furthermore, the IDS provides the world's largest collection of German-language text corpora / collections of written language (the equivalent of almost five million book pages). Several thousand registered Internet users in Germany and abroad use this collection.)

Because this collection offers contemporary texts, I think it is a much better basis for a language model than Project Gutenberg.

The service is here: https://cosmas2.ids-mannheim.de/cosmas2-web/ (web interface; there is also a Windows app). It is free of charge for non-commercial use.
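
To illustrate why the age of the training text matters, here is a minimal sketch of a character bigram language model with add-one smoothing. It is not part of OCRopus. The function names and the tiny training string are made up for this example, and a real model would be trained on a large corpus rather than one sentence. The sketch only shows that a model trained on contemporary spelling scores "dass" higher than the older spelling "daß", so training text that does not match the documents being recognized could push the recognizer toward the wrong candidate.

```python
import math
from collections import Counter

def train_bigram_model(text):
    """Count character contexts and character bigrams in the training text."""
    contexts = Counter(text[:-1])           # every character that has a successor
    bigrams = Counter(zip(text, text[1:]))  # adjacent character pairs
    return contexts, bigrams

def avg_log_prob(text, model, vocab_size):
    """Average add-one-smoothed log probability per character bigram."""
    contexts, bigrams = model
    pairs = list(zip(text, text[1:]))
    total = 0.0
    for a, b in pairs:
        p = (bigrams[(a, b)] + 1) / (contexts[a] + vocab_size)
        total += math.log(p)
    return total / len(pairs)

# Toy stand-in for a contemporary corpus (assumption: made up for this example).
training_text = "ich denke dass es regnet und dass er kommt"

# The vocabulary must also cover characters that only occur in the candidates.
vocab_size = len(set(training_text + "daß"))

model = train_bigram_model(training_text)
score_new = avg_log_prob("dass", model, vocab_size)
score_old = avg_log_prob("daß", model, vocab_size)
print("dass:", score_new)
print("daß: ", score_old)
```

An OCR system could use such scores to choose between visually similar candidate readings of a word. Character bigrams are only the simplest choice here; word-level or higher-order models are common alternatives.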

Other Issues 

[other resources? other issues?]