German Language Support for OCRopus
Existing Open Source OCR Systems that Can Handle German
The GOCR example screenshots show correctly recognized umlauts, so GOCR can at least handle the character recognition part for German.
Existing Commercial OCR Systems that Can Handle German
Abbyy http://www.abbyy.com/
Nuance TextBridge http://www.nuance.com/textbridge/
Nuance Omnipage http://www.nuance.com/omnipage/
Ground-Truthed German OCR Data (Scans + Transcription)
The German Project Gutenberg is probably a good source, but it contains only old texts (more than 70 years old), and the language has changed quite a bit since then. It is therefore probably useful only for improving the image-to-character recognition step; for language modeling and grammar, it is not a good source. Other sources could be the German Wikisource or Zeno.org.
German Dictionaries and Text Corpora (for Statistical Language Modeling)
You can try the yearly archive DVDs of German magazines (for example, http://www.emedia.de/ from the Heise Verlag).
Wikipedia has an article about the Institute for German Language (IDS). Under "Archive", you'll find this: "Außerdem stellt das IDS das weltweit größte Angebot an deutschsprachigen Textkorpora/ Textsammlungen geschriebener Sprache (umgerechnet fast fünf Millionen Buchseiten) zur Verfügung. Es gibt für diese Sammlung mehrere tausend registrierte Internet-Benutzer im In- und Ausland." (Furthermore, the IDS provides the world's largest offering of German-language text corpora / collections of written language, equivalent to almost five million book pages. This collection has several thousand registered Internet users in Germany and abroad.)
Since this offers contemporary texts, I think it is much better suited for creating a language model. The service is here: https://cosmas2.ids-mannheim.de/cosmas2-web/ (web interface; there is also a Windows app). It is free of charge for non-commercial use.
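To illustrate what "creating a language model" from such a corpus could look like, here is a minimal sketch of a character bigram model in Python. It is not OCRopus code and not necessarily the kind of model OCRopus uses; the function names (`normalize`, `char_bigram_counts`, `bigram_probability`) are made up for this example, and the sample string stands in for real corpus text. The Unicode normalization step is included on the assumption that corpus text may encode umlauts either as single characters or as a base letter plus a combining diaeresis.

```python
import unicodedata
from collections import Counter


def normalize(text):
    # NFC composes "a" + combining diaeresis into the single character "ä",
    # so both encodings of an umlaut are counted as the same symbol.
    return unicodedata.normalize("NFC", text)


def char_bigram_counts(text):
    # Count every pair of adjacent characters in the normalized text.
    text = normalize(text)
    return Counter(zip(text, text[1:]))


def bigram_probability(counts, first, second):
    # Maximum-likelihood estimate of P(second | first), without smoothing.
    total = sum(n for (a, _), n in counts.items() if a == first)
    if total == 0:
        return 0.0
    return counts[(first, second)] / total


# Hypothetical stand-in for text taken from a real corpus.
sample = "Bücher über Bäume"
counts = char_bigram_counts(sample)
print(bigram_probability(counts, "ü", "c"))
```

A recognizer could use such probabilities to prefer character sequences that are plausible in German when the image evidence is ambiguous. A usable model would need far more text, smoothing for unseen pairs, and probably longer contexts than two characters.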
Other Issues
[other resources? other issues?]