CCExtractor is an open-source project, led by Carlos Fernandez, that collaborates closely with Red Hen. CCExtractor extracts closed captioning, teletext, and other metadata from television transport streams. See their project page at http://ccextractor.com.
To use the new OCR capabilities, see Abhinav Shukla's GSoC2016 Report.
We should use the latest version, which is on GitHub and often not yet on SourceForge or http://ccextractor.com. It typically has new features we want.
To download it, issue this command in Linux (or Mac):
This command will download the software in a zipped (compressed) format in a file called master.zip. To unzip (decompress) the file, issue
The files will be unzipped into a directory (folder) called ccextractor-master. Rename it to include the current version number, which changes with each release:
mv ccextractor-master ccextractor_0.84
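The download, unzip, and rename steps above can be collected into one shell function. This is a minimal sketch, not the project's official procedure: the function name fetch_ccextractor is hypothetical, the use of wget is an assumption (curl -L -o master.zip would serve equally), and the archive URL is an assumption based on GitHub's usual archive/master.zip pattern for the CCExtractor repository, so check it against the GitHub page before relying on it.

```shell
# Hypothetical helper: download the master branch, unpack it, and rename
# the unpacked directory after the version being built.
# ASSUMPTION: the URL follows GitHub's standard archive pattern.
CCX_ZIP_URL="https://github.com/CCExtractor/ccextractor/archive/master.zip"

fetch_ccextractor() {
    version=$1
    if [ -z "$version" ]; then
        echo "usage: fetch_ccextractor VERSION" >&2
        return 1
    fi
    wget -O master.zip "$CCX_ZIP_URL" || return 1   # saves the archive as master.zip
    unzip -q master.zip || return 1                 # unpacks into ccextractor-master
    mv ccextractor-master "ccextractor_$version"    # e.g. ccextractor_0.84
}
```

Calling fetch_ccextractor 0.84 in an empty working directory should leave a ccextractor_0.84 directory ready for the build step.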
Walk into the directory:
You'll see the file raspberrypi.md -- read it for the simple instructions to build ccextractor on Raspberry Pi devices. Typically, you'll need these packages:
sudo apt-get install libleptonica-dev libtesseract-dev libcurl4-gnutls-dev tesseract-ocr
You'll also see several subdirectories, including one called "linux" and one called "mac". Walk into the appropriate subdirectory:
You'll see a file called "build". Run it like this:
This compiles (builds) the CCExtractor program; it can take anywhere from a few seconds to a couple of minutes, depending on how fast your computer is.
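The choice of subdirectory and the build invocation can be scripted. In this sketch the helper names platform_dir and build_here are hypothetical; the mapping from uname -s output to the "linux" and "mac" subdirectories, and running the script as ./build, are assumptions drawn from the directory and file names mentioned above.

```shell
# Map the kernel name reported by `uname -s` to the source subdirectory.
platform_dir() {
    case "${1:-$(uname -s)}" in
        Linux)  echo linux ;;
        Darwin) echo mac ;;
        *)      echo "unsupported platform: ${1:-$(uname -s)}" >&2; return 1 ;;
    esac
}

# Run from the top of the unpacked source tree: enter the platform
# subdirectory and run its build script in a subshell, so the caller's
# working directory is unchanged.
build_here() {
    dir=$(platform_dir "$1") || return 1
    ( cd "$dir" && ./build )
}
```

Running build_here with no argument from inside the unpacked source directory picks the subdirectory for the current machine; the resulting ccextractor binary is left in that subdirectory.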
The build command creates a file that's always called 'ccextractor'. Rename it to track which version you just built (the examples below use 0.78; substitute the version you built):
mv ccextractor ccextractor-0.78
Copy that file into your program directory:
sudo cp ccextractor-0.78 /usr/local/bin
Now test the new ccextractor version (e.g. ccextractor-0.78) for both previous and new functionality. When you are satisfied, continue with the next step.
Walk into your program directory and create a symbolic link to the new version:
sudo ln -sf ccextractor-0.78 ccextractor
In the list of files (ls -l), you should see something like this:
lrwxrwxrwx 1 root staff 16 Oct 2 05:48 ccextractor -> ccextractor-0.78
-rwxr-xr-x 1 root staff 1687840 Oct 2 05:47 ccextractor-0.78
The program is now fully installed.
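Because the symbolic link is what switches every caller over to the new build, it can help to guard that step. The following is a minimal sketch with a hypothetical helper name, switch_version; it refuses to relink unless the versioned binary is already present and executable in the program directory. Run it with enough privileges for that directory (the commands above use sudo).

```shell
# Hypothetical helper: repoint the `ccextractor` symlink at a tested version.
switch_version() {
    version=$1
    bindir=${2:-/usr/local/bin}        # program directory used in the text
    target="ccextractor-$version"
    if [ ! -x "$bindir/$target" ]; then
        echo "no executable $bindir/$target" >&2
        return 1
    fi
    # A relative link target, created from inside the directory, matches
    # the `ccextractor -> ccextractor-0.78` listing shown above.
    ( cd "$bindir" && ln -sf "$target" ccextractor )
}
```

Keeping older versioned binaries in the directory makes a rollback a single call with the previous version number.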
Locate the teletext
$CX -debug -ts -noru -out=ttxt -utf8 -o $F.ccx.out $F.mpg
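The command above relies on two shell variables that the page does not define. A plausible reading, stated here as an assumption, is that $CX holds the path to the ccextractor binary and $F holds the recording's file name without its .mpg extension. The wrapper below (hypothetical name teletext_debug) makes that explicit and reuses exactly the flags from the command above.

```shell
# Hypothetical wrapper around the teletext debug command.
teletext_debug() {
    cx=$1      # path to the ccextractor binary, e.g. /usr/local/bin/ccextractor
    base=$2    # recording name without the .mpg extension
    "$cx" -debug -ts -noru -out=ttxt -utf8 -o "$base.ccx.out" "$base.mpg"
}
```

With these assumptions, teletext_debug /usr/local/bin/ccextractor recording reads recording.mpg and writes recording.ccx.out.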
Improvement for DVB bitmap captions against a semi-transparent textbox
Abhinav Shukla changed CCExtractor 0.85 to improve its recognition of such captions, which are used, for example, in the evening news on FR2.
$ export TESSDATA_PREFIX=/usr/share/tesseract-ocr
$ $CX $FIL.mpg -datets -ttxt -UCLA -noru -utf8 -unixts 0 -delay 1500055200000
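The -delay value in this command is large enough to be a Unix timestamp in milliseconds: 1500055200000 ms is 2017-07-14 18:00:00 UTC. Assuming the intent is to shift caption timestamps to the recording's start time (an inference from the number, not something this page states), the value can be derived from a date rather than typed by hand. The helper name start_ms is hypothetical, and date -d is specific to GNU date, so this sketch will not work with the stock macOS date.

```shell
# Hypothetical helper: convert a UTC date string to milliseconds since
# the Unix epoch. ASSUMPTION: GNU date (the -d option) is available.
start_ms() {
    secs=$(date -u -d "$1" +%s) || return 1
    echo $(( secs * 1000 ))
}
```

For example, start_ms '2017-07-14 18:00:00' prints 1500055200000, the value used above.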