Pdf ocr text layer


(Scott Smith) #1

With Linux Mint and a few other Ubuntu derivatives, there are the tools pdfsandwich and ocrmypdf. These take an image-based PDF, OCR the text from it, and then place the text as a layer under the image. End result: searchable PDFs.

Does anyone know of a port, with dependencies, of either of these to CentOS 6 or 7?

I’m currently running NS6, but if necessary in order to get this capability (a requirement for a new project) then I can ignore my reservations about CentOS 7 and upgrade. If I can’t get a CentOS solution, I may have to stand this site up with an Ubuntu server… ugh!


(Markus Neuberger) #2

ocrmypdf can be installed using the python-scl module. I tested on NethServer 7.6.1810, but it may work for 6 too (if the module doesn’t work, you can install rh-python36-pip from the SCLo repo).

yum install http://mirror.de-labrusse.fr/NethServer/7/x86_64/nethserver-stephdl-1.0.7-1.ns7.sdl.noarch.rpm
yum install nethserver-rh-python36 --enablerepo=stephdl
yum install ghostscript qpdf tesseract tesseract-osd

List and install tesseract language(s) for OCR:

yum list tesseract-langpack-*

Example: yum install tesseract-langpack-deu

Install ocrmypdf

python36 pip install ocrmypdf

To use it:

python36 ocrmypdf input.pdf output.pdf

qpdf prints a security warning but it works:

You are using qpdf version 5.0.1 which has known issues including security vulnerabilities with certain malformed PDFs. Consider upgrading to version 7.0.0 or newer.

https://ocrmypdf.readthedocs.io/en/latest/installation.html#installing-with-python-pip
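
For more than one file, the single-file call above can be wrapped in a loop. This is only a sketch, not part of the tested steps: the function names `ocr_output_name` and `ocr_dir`, the `OCRMYPDF` variable and the `_ocr.pdf` suffix are made up for illustration. `-l` selects the tesseract language and `--skip-text` leaves pages that already contain text untouched; both are ocrmypdf options.

```shell
# How ocrmypdf is launched; an assumption, override it to match your install.
OCRMYPDF="${OCRMYPDF:-ocrmypdf}"

# Derive the output name: scan.pdf -> scan_ocr.pdf
ocr_output_name() {
    printf '%s_ocr.pdf\n' "${1%.pdf}"
}

# OCR every PDF in a directory. $1 = directory, $2 = tesseract language (default eng).
ocr_dir() {
    dir="$1"
    lang="${2:-eng}"
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue                    # no PDFs matched the glob
        case "$f" in *_ocr.pdf) continue ;; esac   # skip our own output files
        out=$(ocr_output_name "$f")
        [ -e "$out" ] && continue                  # already converted earlier
        # $OCRMYPDF is left unquoted on purpose so "python36 ocrmypdf" splits into words
        $OCRMYPDF -l "$lang" --skip-text "$f" "$out" || echo "failed: $f" >&2
    done
}
```

With the module from this post, something like `OCRMYPDF="python36 ocrmypdf"` followed by `ocr_dir /path/to/scans deu` should do it (the path is a placeholder).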

pdfsandwich can easily be compiled from source:

yum install subversion unpaper ocaml
svn checkout svn://svn.code.sf.net/p/pdfsandwich/code/trunk/src pdfsandwich
cd pdfsandwich
./configure
make
make install
cd ~

Usage:

pdfsandwich test.pdf

http://www.tobias-elze.de/pdfsandwich/
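
One way to confirm that either tool really produced a searchable file is to try extracting its text. A minimal sketch, assuming `pdftotext` from poppler-utils is installed (it isn't part of the steps above); the function name `has_text_layer` is made up for illustration.

```shell
# Succeeds if text extraction yields at least one letter or digit.
has_text_layer() {
    pdftotext "$1" - 2>/dev/null | grep -q '[[:alnum:]]'
}
```

For example, `has_text_layer output.pdf && echo searchable`. A purely image-based PDF should make the function fail, since there is nothing to extract.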


(Scott Smith) #3

Thanks for the prompt assistance!

Stuck on this step.

yum install nethserver-rh-python36 --enablerepo=stephdl

Loaded plugins: changelog, fastestmirror, nethserver_events, presto
Setting up Install Process
Loading mirror speeds from cached hostfile
No package nethserver-rh-python36 available.
Error: Nothing to do

I assume it’s an error in the package name, right?


(Markus Neuberger) #4

EDIT:

For NS6 you need to install the EPEL repo (for the other needed packages) and the SCL repo (for rh-python36):

yum --enablerepo=extras install centos-release-scl
yum --enablerepo=extras install epel-release
yum install rh-python36-python-pip

Usage like:

/opt/rh/rh-python36/root/usr/bin/pip3.6 install ocrmypdf

/opt/rh/rh-python36/root/usr/bin/ocrmypdf input.pdf output.pdf
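
To avoid typing the full path each time, a small helper can look for the binary in the SCL location first and fall back to `PATH`. This is just a sketch: `find_ocrmypdf` and the `SCL_BIN` override are hypothetical names, and the default path is the one from the usage lines above.

```shell
# Print the path of the ocrmypdf executable, preferring the rh-python36 SCL location.
find_ocrmypdf() {
    scl_bin="${SCL_BIN:-/opt/rh/rh-python36/root/usr/bin/ocrmypdf}"
    if [ -x "$scl_bin" ]; then
        printf '%s\n' "$scl_bin"
    else
        # Fall back to whatever is on PATH; returns non-zero if nothing is found
        command -v ocrmypdf
    fi
}
```

Usage could look like `"$(find_ocrmypdf)" input.pdf output.pdf`.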


(Scott Smith) #5

Okay, I’m still hitting walls. It seems primarily due to NS6.

I think – since I’m no longer designing and supporting 100K+ core HPC systems (where systemd is worse than the wrong solution) – I will have to set aside my dislike of all things systemd and build this one system out with NS7. That will provide easier access to newer packages. These NS-based systems I’ll be dealing with are mostly at the app level rather than the admin/engineer level, so I’ll just pretend I don’t see the systemd crud :slight_smile:

I’ll rebuild this latest system with NS7, reload the various apps, and then circle back to this last piece of creating pdf sandwiches.

Thanks for the help. Hopefully, with what you’ve given me so far, I shouldn’t need to get back to you, but if I’m still stuck I will certainly give you a shout out.

PS:

I had a bit of an episode in the old noggin about three years ago, so I’m having to relearn a tremendous amount of this stuff. My aggravation is that it now takes me hours to suss out what I used to just know, and any new bits take me days or weeks to get a grip on. Frustrating. Even more frustrating is that, unless I work with the exact same bits literally daily, I will forget most of it within a week or two and have to start over. This is why I’m having to circle back to NS (I was an early e-smith adopter a couple of decades ago) as it does most of what I need with very little fiddling under the covers.


(Stéphane de Labrusse) #6

NS6 will die next year, so it is time to move. Moreover, you are currently missing a lot of good things; NS6 is in maintenance mode. Systemd is a really nice feature for developers and, of course, in the end for sysadmins.


(Scott Smith) #7

Nice? Perhaps it depends on perspective.

To me, systemd is designed with devices in mind such as desktops, laptops, tablets and so forth. Maybe it’s even good for servers that are rebooted often or have a highly variable application suite. I can see some features of systemd being a real blessing for those environments. But for me, coming from an environment where we had thousands of servers, but only a handful of server configurations, and where one job/app would run weeks or months (or years in a few cases), most of the “benefits” of systemd are not actually beneficial and it represents change for the sake of change, not progress.

But, as I said, that was then.

Now, in my new post-supercomputing life?

Meh.

If systemd is what it takes to get a kitchen sink platform such as NS to work (or Koozali and similar, not to mention the desktop and portable device distros), then I’ll “forget” everything I used to know (okay, that was humor, since the brain malfunction has done that for me already) and just be an application installer and configuration manager. Whatever I can do via the server manager interface, that is essentially what I’ll do. I don’t see myself ever doing actual system administration or engineering or development again, so it doesn’t really matter what’s under the hood. So long as it works.

Just finished installing NS7 on a virtual machine. Now on to that installation and configuration part :slight_smile:


(Stéphane de Labrusse) #8

Yep this is what I need all the time, something robust, reliable and open minded to changes. I am not so OLD, I hope, but I already saw a lot of changes in the IT world, and I must say I never regretted my ZX-81 with 4 kB RAM and the tape recorder.

I am kidding, please do not throw potatoes :smiley:


(Stéphane de Labrusse) #9

Just one example of how to modify a service to change the group ownership and to set the start order relative to other services. Of course we do not control the rspamd service, it comes from upstream, but we can manage it.

The modularity and flexibility of systemd are incredible, SysVinit is so difficult :slight_smile:


(Scott Smith) #10

All to the good - for a consumer solution or a wild and undisciplined enterprise server farm. Totally unnecessary in the HPC environments I used to work in. All this golly gee whizzardry was neither needed nor useful there.

But, I’m an afternoon into NS7. I don’t find it a step up from NS6 except it’s compatible with newer packages. Otherwise, I think it’s lost the plot as a drop dead simple SME solution. I can see that the devs have taken control of the product :slight_smile:


(Scott Smith) #11

Back to the original issue, since some may be curious if the problem was solved.

The PDF files were being generated by Win10 systems, and being stored/managed/searched on NS. As I don’t have the will or time to fight through learning NS7, nor to coerce the OCR bits to run under NS6, it was easier/quicker to install PDF OCR X on one of the Windows systems, then create a Windows service to run the conversions. The conversions run in a shared folder so NS6 can manage the results. There’s an hour lag between PDF creation and when the service converts it, but most searches happen 1+ days later so the lag isn’t much of an issue. From a purely technical point of view, it would have been better for this to run on Linux. From a real-world point of view, it doesn’t matter so long as the problem is solved.