PDF on Linux tips
[Jan. 2020]
Occasionally I find that I need to massage PDF files in some way. Since I use Ubuntu as my desktop operating system, I need to find software and methods that will work with Linux (and are free!). There is actually a large selection of Linux tools for this, both command line and GUI type (although the command line tools tend to be the most useful). Since some tools work better than others, or work better on particular PDF types, I plan to keep a record here of the tools I find most useful. As the need arises, I will include entries on these topics:
Bursting - converting a multi-page PDF to single PDF pages
Compression - reducing PDF file size
Enhancement - improving contrast, removing speckles, etc.
Image to PDF
Merge - Multiple PDF files to a multi-page PDF
OCR
Password removal
PDF to Image
Splitting - converting 2-up pages to 1-up pages
Watermarks - removing and adding
Bursting
Compression
There are a variety of Linux tools available to reduce the file size of a PDF:
ps2pdf :
[command line] ps2pdf LARGE.pdf SMALL.pdf
[command line] pdf2ps large.pdf - | ps2pdf - small.pdf
qpdf : [command line] qpdf --linearize input.pdf output.pdf
Ghostscript : [command line] gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=out.pdf in.pdf
see also : the same command with -dPDFSETTINGS=/ebook in place of /screen, which keeps more image detail at the cost of a larger file
ImageMagick : [command line] convert -density 200 -compress jpeg -quality 20 test.pdf test2.pdf
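The Ghostscript command above can be wrapped in a small shell function so that different presets are easy to try and the resulting sizes easy to compare. This is only a sketch: the function name compress_pdf and its argument order are made up for illustration, while the gs flags are the ones listed above and the preset names are Ghostscript's standard -dPDFSETTINGS values.

```shell
# compress_pdf IN.pdf OUT.pdf [PRESET]
# Hypothetical helper (not part of any package): rewrites IN.pdf with
# Ghostscript's pdfwrite device. PRESET is one of screen, ebook, printer,
# prepress (default: ebook).
compress_pdf() {
    if [ "$#" -lt 2 ]; then
        echo "usage: compress_pdf IN.pdf OUT.pdf [PRESET]" >&2
        return 2
    fi
    src=$1
    dst=$2
    preset=${3:-ebook}

    if [ ! -f "$src" ]; then
        echo "compress_pdf: no such file: $src" >&2
        return 1
    fi
    case $preset in
        screen|ebook|printer|prepress) ;;
        *) echo "compress_pdf: unknown preset: $preset" >&2; return 2 ;;
    esac

    gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
       -dPDFSETTINGS="/$preset" -dNOPAUSE -dQUIET -dBATCH \
       -sOutputFile="$dst" "$src" || return 1

    # Print both sizes in bytes so the saving is easy to judge.
    echo "$src: $(wc -c < "$src") bytes"
    echo "$dst: $(wc -c < "$dst") bytes"
}
```

Example use: compress_pdf LARGE.pdf SMALL.pdf screen. If the output is not smaller than the input (this can happen with PDFs that are already well compressed), it is probably best to keep the original.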
Enhancement
There are a variety of options for enhancement:
GIMP image editor
LibreOffice Draw
Command line
GIMP image editor
This is really an option of last resort as it is very time consuming, but it is a very powerful tool for major issues. Explaining the full scope of using GIMP would require a book, but the general process is to convert the PDF page (or pages) that need repair to images, and then edit the images. Depending upon the level of effort put in, this can include straightening skewed pages, removing scribbles, blots, etc., filling in obvious gaps in drawings and images, improving contrast, and so on. Once an image has been edited, it is converted back into a PDF and re-integrated with the full document.
LibreOffice Draw
PDF files can be opened in LibreOffice Draw, with mixed results. It is worth trying, as in some cases it works very well. However, with some PDF files the result is scrambled pages, missing images, etc.
Command Line
[to be added later]
Image to PDF
https://askubuntu.com/questions/493584/convert-images-to-pdf
Command line : convert "*.{png,jpeg}" -quality 100 outfile.pdf
Merge
See this link : https://stackoverflow.com/questions/2507766/merge-convert-multiple-pdf-files-into-one-pdf#11280219
pdfunite in-1.pdf in-2.pdf in-n.pdf out.pdf
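Because pdfunite takes the output file as its last argument, a stray wildcard can end up treating an existing PDF as the output and overwriting it. One way to guard against that is a small wrapper that takes the output name first. This is a sketch only; merge_pdfs is a made-up name, and the only real tool it calls is pdfunite.

```shell
# merge_pdfs OUT.pdf IN1.pdf IN2.pdf [...]
# Hypothetical wrapper around pdfunite: output name comes first, and an
# existing output file is never overwritten.
merge_pdfs() {
    if [ "$#" -lt 3 ]; then
        echo "usage: merge_pdfs OUT.pdf IN1.pdf IN2.pdf [...]" >&2
        return 2
    fi
    dst=$1
    shift
    if [ -e "$dst" ]; then
        echo "merge_pdfs: $dst already exists, not overwriting" >&2
        return 1
    fi
    # pdfunite expects the inputs first and the output last.
    pdfunite "$@" "$dst"
}
```

Example use: merge_pdfs book.pdf chapter-*.pdf. The shell expands the wildcard in alphabetical order, so chapter-10.pdf sorts before chapter-2.pdf unless the numbers are zero-padded (chapter-02.pdf).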
OCR
It's very common to find PDF files which are scanned images of pages; since the text is an image, it cannot be searched or copied and pasted. Fortunately there is OCR software which can create an "image over text" copy of the original PDF file. I have found three useful tools for this purpose. Note that all of these tools require the tesseract OCR engine; this can be installed from the Ubuntu software repository if you have not already done so. Here are some OCR options:
pdfocr : https://raw.githubusercontent.com/gkovacs/pdfocr/master/pdfocr.rb
More explanation here : https://ubuntuforums.org/showthread.php?t=1456756
This is a script which you run from the command line; I have found this produces relatively fast and clean copies.
Change one line in the script from
use_tesseract = false
to
use_tesseract = true
Run the script from the command line as:
pdfocr -i input.pdf -o output.pdf
gscan2pdf : This is GUI based. It works, but I found it slow, and in some cases it produces PDFs with washed out or missing images.
Install this from the Ubuntu software repository
pdfsandwich : http://www.tobias-elze.de/pdfsandwich/
This works, but also sometimes produces results with poor image pages.
This suggestion to use the -nopreproc option helped me get better results : https://www.onetransistor.eu/2015/12/ocr-searchable-pdf-linux.html
e.g. [command line] pdfsandwich -nopreproc in.pdf
(this generates in_ocr.pdf)
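Before running OCR on a batch of files it can save time to check which ones already contain text. The sketch below does this with pdftotext (from poppler-utils, the same package that provides pdfunite and pdftoppm). The function names has_text_layer and ocr_if_needed are made up for illustration, and the test is crude: a scanned PDF carrying even a small text stamp will be treated as already searchable.

```shell
# has_text_layer FILE.pdf
# Hypothetical helper: succeeds if pdftotext can extract any
# non-whitespace text from the file.
has_text_layer() {
    # pdftotext writes to stdout when the output name is "-".
    text=$(pdftotext "$1" - 2>/dev/null | tr -d '[:space:]')
    [ -n "$text" ]
}

# ocr_if_needed FILE.pdf
# Hypothetical helper: runs pdfsandwich only on files without extractable text.
ocr_if_needed() {
    if has_text_layer "$1"; then
        echo "skip (already searchable): $1"
    else
        pdfsandwich -nopreproc "$1"
    fi
}
```

Example use: for f in *.pdf; do ocr_if_needed "$f"; done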
Password Removal
Occasionally I will run across a PDF file which permits reading, but is otherwise password protected (why!? annoying and pointless!). In such cases I have found that simply opening up the PDF in a PDF reader and then printing it to a file creates a copy with no password. Printing to a PDF file is built into Ubuntu Linux printing.
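A command line alternative to printing to a file is qpdf's --decrypt option. This is a different technique from the one described above, shown only as a sketch: unlock_pdf is a made-up name, and whether a password is needed depends on how the file was protected. A file that opens for reading without a password can normally be decrypted without one.

```shell
# unlock_pdf IN.pdf OUT.pdf [PASSWORD]
# Hypothetical helper: writes an unencrypted copy of IN.pdf using qpdf.
# PASSWORD is only needed if the file asks for one when opened.
unlock_pdf() {
    if [ "$#" -lt 2 ]; then
        echo "usage: unlock_pdf IN.pdf OUT.pdf [PASSWORD]" >&2
        return 2
    fi
    src=$1
    dst=$2
    if [ "$#" -ge 3 ]; then
        qpdf --password="$3" --decrypt "$src" "$dst"
    else
        qpdf --decrypt "$src" "$dst"
    fi
}
```

Example use: unlock_pdf locked.pdf unlocked.pdf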
PDF to Image
For a single page PDF use GIMP. For a scanned PDF this also provides an opportunity to clean up or enhance the image before exporting to an image file.
For a multi-page PDF :
pdftoppm -png {input.pdf} {output-prefix}
(this writes one PNG per page, each named from the output prefix plus the page number)
See this link : https://www.cyberciti.biz/faq/how-to-convert-pdf-to-image-on-linux-command-line/
For a page range :
convert -density 150 input.pdf[119-121] -scene 10 -quality 100 output-%d.png
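In the example above, ImageMagick counts pages from 0, so [119-121] selects the 120th to 122nd pages of the PDF. Because of -scene 10, they will be output as :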
output-10.png
output-11.png
output-12.png
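A page range can also be extracted with pdftoppm, using its -f (first page), -l (last page) and -r (resolution in DPI) options. Unlike ImageMagick, pdftoppm numbers pages from 1. The wrapper below is a sketch; pdf_range_to_png is a made-up name and the 150 DPI default is just an assumption carried over from the convert example.

```shell
# pdf_range_to_png IN.pdf PREFIX FIRST LAST [DPI]
# Hypothetical helper: renders pages FIRST..LAST (1-based) of IN.pdf to
# PNG files whose names start with PREFIX.
pdf_range_to_png() {
    if [ "$#" -lt 4 ]; then
        echo "usage: pdf_range_to_png IN.pdf PREFIX FIRST LAST [DPI]" >&2
        return 2
    fi
    src=$1
    prefix=$2
    first=$3
    last=$4
    dpi=${5:-150}

    # Reject anything that is not a plain whole number.
    for n in "$first" "$last" "$dpi"; do
        case $n in
            ''|*[!0-9]*)
                echo "pdf_range_to_png: not a whole number: $n" >&2
                return 2 ;;
        esac
    done
    if [ "$first" -gt "$last" ]; then
        echo "pdf_range_to_png: FIRST must not exceed LAST" >&2
        return 2
    fi

    pdftoppm -f "$first" -l "$last" -r "$dpi" -png "$src" "$prefix"
}
```

Example use: pdf_range_to_png input.pdf output 119 121. The output files are named from the prefix and the original page numbers, so no renumbering option is needed.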