| #!/bin/sh | |
| # Take a PDF, OCR it, and add OCR Text as background layer to original PDF to make it searchable. | |
| # Hacked together using tips from these websites: | |
| # http://www.jlaundry.com/2012/ocr-a-scanned-pdf-with-tesseract/ | |
| # http://askubuntu.com/questions/27097/how-to-print-a-regular-file-to-pdf-from-command-line | |
| # Dependencies: pdftk, tesseract, imagemagick, enscript, ps2pdf | |
| # Would be nice to use hocr2pdf instead so that the text lines up with the PDF image. | |
| # http://www.exactcode.com/site/open_source/exactimage/hocr2pdf/ | |
| cp "$1" "$1.bak" | |
| pdftk "$1" burst output tesspage_%02d.pdf | |
| for file in tesspage_*.pdf | |
| do | |
| PAGE=$(basename "$file" .pdf) | |
| # Convert the PDF page into a TIFF file | |
| convert -monochrome -density 600 "$file" "$PAGE".tif | |
| # OCR the TIFF file and save text to output.txt | |
| tesseract "$PAGE".tif output | |
| # Turn the text file output by tesseract into a PDF, then put it in the background of the original page | |
| enscript output.txt -B -o - | ps2pdf - output.pdf && pdftk "$file" background output.pdf output new-"$file" | |
| # Clean up | |
| rm output* | |
| rm "$file" | |
| rm *.tif | |
| done | |
| pdftk new-tesspage_*.pdf cat output "$1" |
tesseract can now produce a PDF with embedded text directly using the pdf config option. It's used something like this:
    tesseract input.tif outputbase pdf
which would create outputbase.pdf
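For anyone adapting the gist, here is a rough sketch of how that one-step command could replace the enscript/ps2pdf/pdftk background steps inside the page loop. It assumes the page images are already named tesspage_NN.tif as in the script; the `ocr_page` helper is a made-up name for illustration, not part of the original gist.

```shell
#!/bin/sh
# Sketch: OCR each page image straight to a searchable PDF using
# tesseract's "pdf" config instead of enscript + ps2pdf + pdftk background.

# ocr_page IMAGE: writes <image-basename>.pdf in the current directory.
# (ocr_page is a hypothetical helper name.)
ocr_page() {
    base=$(basename "$1" .tif)
    tesseract "$1" "$base" pdf
}

# Process every page image produced by the convert step, if any exist.
for tif in tesspage_*.tif; do
    [ -e "$tif" ] || continue   # skip the literal pattern when nothing matches
    ocr_page "$tif"
done
```

The per-page PDFs can then be joined with the same `pdftk ... cat output` step the script already ends with.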
scruss,
Thank you for stating that! That simplifies the process significantly! Plus, I now have all the packages on our server needed to convert PDFs to embedded-text PDFs. I do not have to go through our IT approval process to get ocrmypdf installed; tesseract can do it.
Thanks!
I would say that the most modern variant is ocrmypdf, which is a nice wrapper around tesseract and adds some extra features. It's available natively in Linux repos.
That's what I mostly use now. But this gist served me well for years.
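For reference, a minimal sketch of what the ocrmypdf route might look like. The file names scan.pdf and scan-ocr.pdf are placeholders, and `make_searchable` is a hypothetical helper name; `--skip-text` is an ocrmypdf option that leaves pages which already have a text layer alone.

```shell
#!/bin/sh
# Sketch: make a scanned PDF searchable with ocrmypdf.
# File names below are placeholders, not from the original gist.

# make_searchable INPUT OUTPUT (hypothetical helper name)
make_searchable() {
    # --skip-text: do not OCR pages that already contain text
    ocrmypdf --skip-text "$1" "$2"
}

# Only run when ocrmypdf is installed and the example input exists.
if command -v ocrmypdf >/dev/null 2>&1 && [ -f scan.pdf ]; then
    make_searchable scan.pdf scan-ocr.pdf
fi
```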
> I would say that the most modern variant is ocrmypdf, which is a nice wrapper around tesseract and adds some extra features. It's available natively in Linux repos.
Available... yes and no. ocrmypdf isn't in all corporate repos, while tesseract is more widely available. I ran into this at a former workplace that did a lot of DoD-type work and had a pretty restrictive Linux VM. ocrmypdf wasn't readily available; tesseract was.
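If a script has to run on both kinds of machines, one option is to check at runtime which tool is installed. This is only a sketch; `pick_ocr_tool` is a made-up function name, and what you do with the answer is up to you.

```shell
#!/bin/sh
# Sketch: prefer ocrmypdf when it is installed, otherwise fall back to tesseract.

# pick_ocr_tool: prints "ocrmypdf", "tesseract", or "none" (hypothetical helper).
pick_ocr_tool() {
    if command -v ocrmypdf >/dev/null 2>&1; then
        echo ocrmypdf
    elif command -v tesseract >/dev/null 2>&1; then
        echo tesseract
    else
        echo none
    fi
}

echo "OCR tool: $(pick_ocr_tool)"
```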
> tesseract can now produce a PDF with embedded text directly using the pdf config option. It's used something like this: tesseract input.tif outputbase pdf, which would create outputbase.pdf