OCR of image PDF's from command line - any ideas?
Ed
Guest
Posted: Wed Oct 05, 2005 8:03 pm    Post subject: OCR of image PDF's from command line - any ideas?

Hello wizards,

I need to accomplish the following task:

- Iterate through a large directory structure of files
- For each file found that is an image-only PDF (no text)
I need to OCR the file and save it in the same folder it
was found as origfilename_OCRed (PDF text format).
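
A minimal Python sketch of the directory walk and output naming described in the list above. The language, the helper names `find_pdfs` and `ocr_output_path`, and the choice to skip files that already carry the `_OCRed` suffix are illustrative assumptions, not something from the original post. The OCR step itself is left out.

```python
import os

def ocr_output_path(pdf_path, suffix="_OCRed"):
    """Return the sibling path 'origfilename_OCRed.pdf' for a given PDF."""
    root, ext = os.path.splitext(pdf_path)
    return root + suffix + ext

def find_pdfs(top, suffix="_OCRed"):
    """Yield every PDF under `top`, skipping files that already look like OCR output."""
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in sorted(filenames):
            stem, ext = os.path.splitext(name)
            # Match .pdf case-insensitively; ignore earlier *_OCRed results (assumption).
            if ext.lower() == ".pdf" and not stem.endswith(suffix):
                yield os.path.join(dirpath, name)
```

For each path yielded by `find_pdfs(top)`, `ocr_output_path(path)` gives the name to save the OCRed copy under, in the same folder as the source file.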

Despite a lot of searching and trying several OCR programs,
I have not been able to find a solution for OCRing from the
command line and converting multiple image PDF's into
multiple text PDF documents. I'd be happy with a solution
on either Windows or Linux, that doesn't cost huge $ and is
reasonably accurate. Neither OmniPage nor FineReader for
instance appear to have command-line options.

As a bonus, I'd love any ideas on how to recognize from the
command line whether a PDF file is image-only or text, since
I only want to OCR the image PDF files.

Thanks in advance!
--Ed Rozenberg

Dirk Thierbach
Guest
Posted: Wed Oct 05, 2005 10:21 pm    Post subject: Re: OCR of image PDF's from command line - any ideas?

Ed <edrozenberg@hotmail.com> wrote:
Quote:
Despite a lot of searching and trying several OCR programs,
I have not been able to find a solution for OCRing from the
command line and converting multiple image PDF's into
multiple text PDF documents. I'd be happy with a solution
on either Windows or Linux, that doesn't cost huge $ and is
reasonably accurate. Neither OmniPage nor FineReader for
instance appear to have command-line options.

As a bonus, I'd love any ideas on how to recognize from the
command line whether a PDF file is image-only or text, since
I only want to OCR the image PDF files.

The utilities pdftotext and pdfimages, respectively, convert a PDF file
to text and extract the images from a PDF file. Both work under Linux and
are open source, so it should be possible to get them to work under
Windows, too. You can find them, for example, in the xpdf-utils package in
Debian Linux.
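
A rough Python sketch of how pdftotext could answer the "image-only or text?" question: run it on the first few pages and see whether any real text comes out. The function names, the 20-character threshold, and the 5-page limit are assumptions chosen for illustration. A scanned PDF with a small stamped text header would still count as "text" here.

```python
import subprocess

def has_text(extracted, min_chars=20):
    """True if the extracted text holds at least min_chars non-whitespace characters."""
    # str.split() also drops the form feeds pdftotext emits between pages.
    return len("".join(extracted.split())) >= min_chars

def is_image_only(pdf_path, max_pages=5):
    """Run pdftotext on the first pages and report whether no real text came out."""
    result = subprocess.run(
        # "-l N" stops after page N; "-" sends the text to standard output.
        ["pdftotext", "-l", str(max_pages), pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return not has_text(result.stdout.decode("utf-8", errors="replace"))
```

Limiting the page range keeps the check cheap on large files, at the cost of misjudging a PDF whose text only starts later.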

I guess it should also be possible to feed the extracted images
into some OCR program, and have the OCR program then create a text
pdf file for you. Someone else might know more about that than I do.
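
A small Python sketch of the extraction half of that idea, assuming pdfimages is on the PATH. The helper names are hypothetical, and the OCR program that would consume the images is deliberately not shown, since no specific one is named here. By default pdfimages writes PBM/PPM files named after the given root.

```python
import os
import subprocess

def pdfimages_command(pdf_path, out_dir):
    """Build the pdfimages call: pdfimages <PDF-file> <image-root>."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return ["pdfimages", pdf_path, os.path.join(out_dir, stem)]

def extract_images(pdf_path, out_dir):
    """Extract all images of one PDF into out_dir and return the files found there."""
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(pdfimages_command(pdf_path, out_dir), check=True)
    return sorted(os.path.join(out_dir, n) for n in os.listdir(out_dir))
```

Using one output directory per PDF keeps the page images of different documents apart before they are handed to an OCR program.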

- Dirk

Fred Toewe
Guest
Posted: Thu Oct 06, 2005 5:28 am    Post subject: Re: OCR of image PDF's from command line - any ideas?

Ed,

Have you looked at "OmniPage Agent" within the ScanSoft Omnipage Pro
features? Seems like it might cover your needs.

Fred
===============
"Ed" <edrozenberg@hotmail.com> wrote in message
news:1128524587.224174.127080@f14g2000cwb.googlegroups.com...
Quote:
Hello wizards,

I need to accomplish the following task:

- Iterate through a large directory structure of files
- For each file found that is an image-only PDF (no text)
I need to OCR the file and save it in the same folder it
was found as origfilename_OCRed (PDF text format).

Despite a lot of searching and trying several OCR programs,
I have not been able to find a solution for OCRing from the
command line and converting multiple image PDF's into
multiple text PDF documents. I'd be happy with a solution
on either Windows or Linux, that doesn't cost huge $ and is
reasonably accurate. Neither OmniPage nor FineReader for
instance appear to have command-line options.

As a bonus, I'd love any ideas on how to recognize from the
command line whether a PDF file is image-only or text, since
I only want to OCR the image PDF files.

Thanks in advance!
--Ed Rozenberg

Dances With Crows
Guest
Posted: Thu Oct 06, 2005 7:12 pm    Post subject: Re: OCR of image PDF's from command line - any ideas?

["Followup-To:" header set to comp.periphs.scanners.]
On Wed, 5 Oct 2005 19:21:29 +0200, Dirk Thierbach staggered into the
Black Sun and said:
Quote:
Ed <edrozenberg@hotmail.com> wrote:
Despite a lot of searching and trying several OCR programs, I have
not been able to find a solution for OCRing from the command line and
converting multiple image PDF's into multiple text PDF documents.
I'd be happy with a solution on either Windows or Linux, that doesn't
cost huge $ and is reasonably accurate. Neither OmniPage nor
FineReader for instance appear to have command-line options.

This is par for the course when dealing with Windows programs, sadly
enough. The vendors sell SDKs for these commercial OCR engines, with
bindings for C, Visual Baysick, and possibly Java, but that's
probably more money than you want to spend.

It's possible to control the TypeReader commercial OCR application with
DDE, but that more or less requires writing C/C++ code. I've done this;
holler at my e-mail (mind the SPAM TRAP) for some more information.

Quote:
As a bonus, I'd love any ideas on how to recognize from the command
line whether a PDF file is image-only or text
The utilities pdftotext [and] pdfimages convert a pdf file to text
[and] extract the images of a pdf file.

This is one possibility. The thing is, these utilities take some time
to run, particularly on large PDFs. There should be a reasonably simple
way to look at the raw PDF and determine whether it's full of images or
full of text, but I don't have time to gin up a utility to do that just
now.
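
One crude way such a raw-PDF check might look, sketched in Python. This is only a heuristic under stated assumptions: it searches the file bytes for `/Font` entries and `/Subtype /Image` entries, and it will miss both when the PDF stores its dictionaries in compressed object streams. The function names and the three result labels are made up for illustration.

```python
import re

# \b keeps "/Font" from matching longer names such as "/FontDescriptor".
FONT_RE = re.compile(rb"/Font\b")
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")

def classify_pdf_bytes(data):
    """Guess 'text', 'image-only' or 'unknown' from raw PDF bytes."""
    if FONT_RE.search(data):
        return "text"        # some font is referenced, so text is likely present
    if IMAGE_RE.search(data):
        return "image-only"  # image XObjects but no fonts
    return "unknown"         # nothing visible, e.g. compressed object streams

def classify_pdf_file(path):
    """Read the whole file and classify it; no external tools are started."""
    with open(path, "rb") as handle:
        return classify_pdf_bytes(handle.read())
```

Files that come back as "unknown" could be passed to the slower pdftotext route instead.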

Quote:
I guess it should also be possible to feed the extracted images into
some OCR program, and have the OCR program then create a text pdf file

Yes. The main problem with it is that the Free OCR engines that I've
seen are not really very good. If you have a commercial OCR engine
produce text, you can then feed that text into enscript, then into
ps2pdf. This will pretty much kill the layout, but it'll produce a text
PDF, no problem.
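
The enscript-into-ps2pdf step can be sketched in Python as a two-process pipeline. This assumes both tools are installed and that the OCR engine has already written plain text to a file. The function names are hypothetical, and `-B` (no page header) is just one reasonable choice.

```python
import subprocess

def text_to_pdf_commands(txt_path, pdf_path):
    """Build the two halves of: enscript -B -p - in.txt | ps2pdf - out.pdf"""
    enscript_cmd = ["enscript", "-B", "-p", "-", txt_path]  # PostScript to stdout
    ps2pdf_cmd = ["ps2pdf", "-", pdf_path]                  # PostScript from stdin
    return enscript_cmd, ps2pdf_cmd

def text_to_pdf(txt_path, pdf_path):
    """Pipe enscript's PostScript output straight into ps2pdf."""
    enscript_cmd, ps2pdf_cmd = text_to_pdf_commands(txt_path, pdf_path)
    producer = subprocess.Popen(enscript_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.run(ps2pdf_cmd, stdin=producer.stdout)
    producer.stdout.close()
    producer.wait()
    # Simplification: only ps2pdf's exit status is checked here.
    if consumer.returncode != 0:
        raise RuntimeError("ps2pdf failed for " + txt_path)
```

As noted above, the result is a text PDF with none of the original page layout.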

--
Matt G|There is no Darkness in eternity/But only Light too dim for us to see
Yesterday upon the stair, I met a man who wasn't there.
He wasn't there again today -- I think he's from the CIA.

Ed
Guest
Posted: Fri Oct 07, 2005 3:59 am    Post subject: Re: OCR of image PDF's from command line - any ideas?

Thanks for your ideas everyone - it looks like there are few if any
options for command line use other than purchasing and developing
against SDK's. So I gave the built-in GUI automation options a go again:

- I tried the Omniscan Batch Agent again with no luck - it "choked"
when I tried to feed it as few as 3 documents to be automatically
converted.

- I was successful with the ABBYY FineReader Automation Manager. I
set up a workflow including the steps Read -> Process -> Save.
Gave it an input directory "OCR" and an output directory "OCR_OUT".
Put 150 image PDF files in the OCR directory and ran the automation
agent on it. Several hours later it produced 150 text PDF's as a
result and they look good. One thing that I find funny is that it
loaded all the pages for all the documents first (1000's of pages)
then OCR'd them one at a time. It then saved them as PDF files with
the original source file names, which is what I wanted. There were a
number of warnings regarding some pages that were too rotated and
some other problems with a few of the pages, but otherwise it looks
great. I haven't found a way to easily jump to the few error pages
out of the 1000's of pages, but I'm happy enough for now with the
results.
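
A small Python sketch for double-checking a batch run like this one: it lists input PDFs that have no same-named PDF in the output directory. The function name is hypothetical, and the directory names in the usage note are simply the ones mentioned above. It only compares file names, so it cannot find the individual pages that triggered warnings.

```python
import os

def missing_outputs(in_dir, out_dir):
    """Return the stems of PDFs in in_dir that have no matching PDF in out_dir."""
    def pdf_stems(directory):
        # Compare names case-insensitively and without the .pdf extension.
        return {
            os.path.splitext(name)[0].lower()
            for name in os.listdir(directory)
            if name.lower().endswith(".pdf")
        }
    return sorted(pdf_stems(in_dir) - pdf_stems(out_dir))
```

Calling `missing_outputs("OCR", "OCR_OUT")` should return an empty list when every source file produced a result.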

Regards,
--Ed

Homer J Simpson
Guest
Posted: Sat Oct 08, 2005 3:42 am    Post subject: Re: OCR of image PDF's from command line - any ideas?

"Ed" <edrozenberg@hotmail.com> wrote in message
news:1128639544.436274.126510@z14g2000cwz.googlegroups.com...

Quote:
Thanks for your ideas everyone - it looks like there are few if any
options for command line use other than purchasing and developing
against SDK's. So I gave the built-in GUI automation options a go again:

ABBYY FineReader Pro does this like a charm. You can feed it a whole book
in PDF form and it will spit out a new version at the end that is still a
PDF, but with the page images laid over the recognized text, and much
smaller. Or you can output whatever you want.

Milind Joshi
Guest
Posted: Wed Oct 12, 2005 8:41 pm    Post subject: Re: OCR of image PDF's from command line - any ideas?

Hi Ed,

We'd be happy to build exactly such a program for you. In fact, we have
several such programs, and have deployed them in scripting
environments.

We're able to get you a high accuracy rate by using multiple engines if
necessary.

Alternatively, you could send us the PDF files, we would convert them
and send them back to you. This is a good option if you have a one-time
requirement.

Contact us at info@ideatechnosoft.com with your volumes, etc., and we
can work something out.

Regards,
Milind Joshi

IDEA TECHNOSOFT INC.
http://www.ideatechnosoft.com