How to scan text pages for smallest-size pdf?
Al
Guest

Posted: Fri Oct 14, 2005 5:03 am    Post subject: How to scan text pages for smallest-size pdf?

Our office has a number of documents that are up to 10 pages long,
and we need to scan them and save them as pdfs. We want to make
these pdfs as small in file-size as possible, so that we don't
clog the email boxes of people we need to send these files to.

What are the most efficient settings to use when scanning with this
goal in mind? These pages are photocopies of abstracts from
scientific journals, so most pages are all text; a few have charts
or graphs.

We've tried scanning some pages at 72dpi but they're not readable
on screen. When we scan them at 100+ dpi the resulting pdf is
pretty large (a 7-page doc turned into a 2MB pdf, which seems
too big).

Meanwhile, someone sent us a 75pp document scan and the pdf
was only 1MB! Unfortunately they didn't create the pdf, so they
don't know why it has such a small file-size.

Any tips are appreciated.

Andre Majorel
Guest

Posted: Fri Oct 14, 2005 4:02 pm    Post subject: Re: How to scan text pages for smallest-size pdf?

On 2005-10-14, Al <acunniff@advancedbionutrition.com> wrote:

Quote:
Our office has a number of documents that are up to 10 pages long,
and we need to scan them and save them as pdfs. We want to make
these pdfs as small in file-size as possible, so that we don't
clog the email boxes of people we need to send these files to.

What are the most efficient settings to use when scanning with this
goal in mind? These pages are photocopies of abstracts from
scientific journals, so most pages are all text; a few have charts
or graphs.

We've tried scanning some pages at 72dpi but they're not readable
on screen. When we scan them at 100+ dpi the resulting pdf is
pretty large (a 7-page doc turned into a 2MB pdf, which seems
too big).

The most size-efficient way I know is to scan in black and white
and compress with the CCITT Group 4 or LZW algorithm. I store
text scans in TIFF/Group 4 format (scanimage --mode lineart |
pnmtotiff -g4 >foo.tif) and the size is on the order of 100 kB
per A4 page at 600 DPI (50 kB at 300 DPI). If PDF supports Group
4 compression, and I think it does, you'll get similar figures.
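
To put those figures in perspective, here is a rough sketch of the uncompressed size of a 1-bit A4 page and the compression ratio the quoted file sizes would imply. The helper names are made up for illustration, rows are assumed to be padded to whole bytes, and 1 kB is taken as 1000 bytes; none of this comes from the tools mentioned above.

```python
# A4 paper is 210 mm x 297 mm; there are 25.4 mm to the inch.
A4_MM = (210.0, 297.0)
MM_PER_INCH = 25.4


def raw_page_bytes(dpi, bits_per_pixel=1, size_mm=A4_MM):
    """Uncompressed bitmap size in bytes, with each row padded to whole bytes."""
    width_px = round(size_mm[0] / MM_PER_INCH * dpi)
    height_px = round(size_mm[1] / MM_PER_INCH * dpi)
    row_bytes = (width_px * bits_per_pixel + 7) // 8
    return row_bytes * height_px


def implied_ratio(dpi, compressed_kb):
    """Raw 1-bit size divided by the quoted compressed size (1 kB = 1000 bytes)."""
    return raw_page_bytes(dpi) / (compressed_kb * 1000)


# Figures quoted above: about 100 kB at 600 DPI and about 50 kB at 300 DPI.
for dpi, kb in ((600, 100), (300, 50)):
    raw_mb = raw_page_bytes(dpi) / 1e6
    print(f"{dpi} DPI: {raw_mb:.2f} MB raw, about {implied_ratio(dpi, kb):.0f}:1 at {kb} kB")
```

Under these assumptions a 600 DPI line-art page is a little over 4 MB before compression, so a 100 kB file corresponds to a ratio somewhere around 40:1.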

For on-screen reading, it's often preferable to scan at lower
resolutions (around 70-100 DPI) but in greyscale, and either (a)
quantise to somewhere between 3 and 8 grey levels (pnmdepth 2)
and use a lossless compression algorithm like PNG (scanimage
--mode greyscale | pnmdepth 3 | pnmtopng >foo.png) or (b) use a
lossy algorithm like JPG.
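
As an illustration of the quantising step in option (a), the sketch below rescales 8-bit grey samples to a smaller maxval. As far as I know, pnmdepth takes the new maxval as its argument, so N grey levels correspond to a maxval of N-1. This is a simplified stand-in, not pnmdepth's actual code, and the function name is hypothetical.

```python
def quantise(pixels, new_maxval, old_maxval=255):
    """Rescale samples from 0..old_maxval to 0..new_maxval, rounding to nearest."""
    return [(p * new_maxval + old_maxval // 2) // old_maxval for p in pixels]


# One short row of 8-bit grey samples, reduced to maxval 3 (four grey levels).
row = [0, 30, 100, 128, 200, 255]
print(quantise(row, 3))

# Count how many distinct grey levels survive for a few maxvals.
for maxval in (2, 3, 7):
    levels = len(set(quantise(range(256), maxval)))
    print(f"maxval {maxval}: {levels} grey levels")
```

Fewer grey levels means longer runs of identical pixels, which is what lets a lossless format such as PNG compress the result well.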

--
André Majorel <URL:http://www.teaser.fr/~amajorel/>
It's a good life, bowing to a tyrant.

Dave Plumpe
Guest

Posted: Fri Oct 14, 2005 5:26 pm    Post subject: Re: How to scan text pages for smallest-size pdf?

Your scan was encoded in the PDF as a graphic, not as text. You need to use the
original document files, or run optical character recognition (OCR) on your scans to
recover the text portions as text, not graphics, and then encode to PDF.
-Dave
--
http://plumpe.home.mindspring.com
email: lastname@mindspring.com
ANTI-SPAM: To email, replace "lastname" with "plumpe"

lostinspace
Guest

Posted: Fri Oct 14, 2005 6:31 pm    Post subject: Re: How to scan text pages for smallest-size pdf?

----- Original Message -----
From: "Al" <>
Newsgroups: comp.periphs.scanners
Sent: Thursday, October 13, 2005 8:03 PM
Subject: How to scan text pages for smallest-size pdf?


Quote:
Our office has a number of documents that are up to 10 pages long,
and we need to scan them and save them as pdfs. We want to make
these pdfs as small in file-size as possible, so that we don't
clog the email boxes of people we need to send these files to.

What are the most efficient settings to use when scanning with this
goal in mind? These pages are photocopies of abstracts from
scientific journals, so most pages are all text; a few have charts
or graphs.

We've tried scanning some pages at 72dpi but they're not readable
on screen. When we scan them at 100+ dpi the resulting pdf is
pretty large (a 7-page doc turned into a 2MB pdf, which seems
too big).

Meanwhile, someone sent us a 75pp document scan and the pdf
was only 1MB! Unfortunately they didn't create the pdf, so they
don't know why it has such a small file-size.

Any tips are appreciated.


You have two options.
One, which produces the smallest files, was explained by Dave: OCR the scans.

The only other "reasonable" option is to scan from within Acrobat with
"Black and White/Line Art" (or whatever your software calls it) selected,
at 150 DPI. Then save as PDF.

I frequently scan small-font, two-column text at 400 DPI, up to 10 pages,
in "Black and White/Line Art", and the file size rarely exceeds 600k.

I found 150 DPI to be the lowest setting that is still recognizable and
printable (your 72 is ineffective). However, even this may depend on the
quality of the printed materials you are scanning from.

And if you're attempting to scan in color? Forget about it! It's just not
possible to get the file size down to anything reasonable.

Lorenzo J. Lucchini
Guest

Posted: Fri Oct 14, 2005 6:50 pm    Post subject: Re: How to scan text pages for smallest-size pdf?

lostinspace wrote:
Quote:
----- Original Message -----
From: "Al"
Newsgroups: comp.periphs.scanners
Sent: Thursday, October 13, 2005 8:03 PM
Subject: How to scan text pages for smallest-size pdf?



Our office has a number of documents that are up to 10 pages long,
and we need to scan them and save them as pdfs. We want to make
these pdfs as small in file-size as possible, so that we don't
clog the email boxes of people we need to send these files to.

[snip]

Any tips are appreciated.



You have two options.

[snip]

A third option I can think of... don't know if something like this is
available inside Acrobat, but it's certainly possible in theory.

Convert the scan to vector graphics. That won't have the accuracy
problems of OCR, and it should still gain a good size advantage.

But anyway, what options does Acrobat offer for image compression? We're
talking black and white text: I thought standard (lossless or lossy)
compression methods could shrink such data to good extents.


by LjL
ljlbox@tiscali.it

Dances With Crows
Guest

Posted: Fri Oct 14, 2005 8:13 pm    Post subject: Re: How to scan text pages for smallest-size pdf?

On Fri, 14 Oct 2005 15:50:49 +0200, Lorenzo J. Lucchini staggered into
the Black Sun and said:
Quote:
lostinspace wrote:
Our office has a number of documents that are up to 10 pages long,
and we need to scan them and save them as pdfs. We want to make these
pdfs as small in file-size as possible,
You have two options.
A third option I can think of... Convert the scan to vector graphics.
That won't have the accuracy problems of OCR, and it should still gain
a good size advantage.

What? Scanners produce raster images, and I'd think that in the general
case, raster->vector would *add* size rather than subtract it.

Quote:
But anyway, what options does Acrobat offer for image compression?
We're talking black and white text: I thought standard (lossless or
lossy) compression methods could shrink such data to good extents.

Group4 TIFF is lossless and extremely efficient at compressing things.
An 8.5x11" page scanned at 300DPI in Group4 will be about 50-100K
depending on image complexity and how many black pixels you have. It'd
be smaller if it were scanned at 150DPI, of course. I don't know
whether Acrobat uses Group4 automagically for black-n-white source
images, but it might. It might also do something stupid. Try it and
see.
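
For a feel of why mostly-white 1-bit pages compress so well, here is a toy run-length sketch. Real Group 4 coding is considerably more elaborate (it codes each row relative to the row above it), so treat this only as an illustration of the basic idea that a bilevel row collapses into a handful of runs. The function name and the sample row are invented for the example.

```python
from itertools import groupby


def run_lengths(row):
    """Return (pixel_value, run_length) pairs for one row of 0/1 pixels."""
    return [(value, len(list(group))) for value, group in groupby(row)]


# A toy 64-pixel row: white margin, two short black strokes, white margin.
row = [0] * 20 + [1] * 3 + [0] * 5 + [1] * 4 + [0] * 32
runs = run_lengths(row)
print(runs)
print(len(row), "pixels ->", len(runs), "runs")
```

Greyscale or colour scans of the same page contain scanner noise in the "white" areas, which breaks up those long runs and is one reason they end up so much larger.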

Of course, for text pages, OCRed ASCII/ISO-8859-15 is smaller than any
image format and you can grep it. OCR accuracy depends a lot on how
clean the source image is. Something that was printed on a decent
printer, scanned straight, and didn't have any dirt on it should give
pretty high accuracy with a recent commercial OCR engine. If you need
100% accuracy, though, you'll have to have a human proofread it and
correct it. This takes forever and is boring as hell.

--
Matt G|There is no Darkness in eternity/But only Light too dim for us to see
Frustration is annoying, but the *real* disasters in life begin when you
get exactly what you want.

Per Larsen
Guest

Posted: Fri Oct 14, 2005 8:27 pm    Post subject: Re: How to scan text pages for smallest-size pdf?

Al wrote:
Quote:
Our office has a number of documents that are up to 10 pages long,
and we need to scan them and save them as pdfs. We want to make
these pdfs as small in file-size as possible, so that we don't
clog the email boxes of people we need to send these files to.

What are the most efficient settings to use when scanning with this
goal in mind? These pages are photocopies of abstracts from
scientific journals, so most pages are all text; a few have charts
or graphs.

We've tried scanning some pages at 72dpi but they're not readable
on screen. When we scan them at 100+ dpi the resulting pdf is
pretty large (a 7-page doc turned into a 2MB pdf, which seems
too big).

Meanwhile, someone sent us a 75pp document scan and the pdf
was only 1MB! Unfortunately they didn't create the pdf, so they
don't know why it has such a small file-size.

Any tips are appreciated.

I usually scan text documents at 300 dpi in black & white (CanoScan 5200F). Each A4 page then comes to roughly 70-80 kB. I think 300 dpi B&W is perfectly readable; even 200 dpi scans (about 50 kB per A4 page) are readable, though probably not great when printed.

Grayscale at 100 dpi (160-170 kB) is readable and 150 dpi (300-350 kB) is 'good enough', but then the files are obviously very much larger.

If I OCR the document and then convert it to PDF, it gets down to about 25 kB, but the process takes longer and is much harder to automate (using the OCR software that came with the scanner, ScanSoft OmniPage 2.0 SE).
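
As a quick sanity check, the two PDFs from the original question can be compared against the per-page figures quoted in this post. This is only back-of-the-envelope arithmetic: it assumes 1 MB = 1024 kB, uses the midpoint of each quoted range, and the helper names are hypothetical.

```python
# Per-page sizes in kB as quoted in this post, stored as (low, high) ranges.
QUOTED_KB = {
    "B&W 200 dpi": (50, 50),
    "B&W 300 dpi": (70, 80),
    "grayscale 100 dpi": (160, 170),
    "grayscale 150 dpi": (300, 350),
    "OCR then PDF": (25, 25),
}


def kb_per_page(total_mb, pages):
    """Average kB per page, assuming 1 MB = 1024 kB."""
    return total_mb * 1024 / pages


def closest_mode(kb):
    """Name of the quoted mode whose range midpoint is nearest to kb."""
    return min(QUOTED_KB, key=lambda mode: abs(sum(QUOTED_KB[mode]) / 2 - kb))


for label, mb, pages in (("7-page, 2 MB PDF", 2, 7), ("75-page, 1 MB PDF", 1, 75)):
    kb = kb_per_page(mb, pages)
    print(f"{label}: {kb:.1f} kB/page, closest quoted figure: {closest_mode(kb)}")
```

The 7-page file works out to nearly 300 kB per page, in the range quoted for grayscale at 150 dpi. The 75-page file averages under 15 kB per page, below every image figure quoted here, which would be consistent with (though not proof of) a PDF that stores real text or uses a stronger bilevel compression.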

PerL

Peter D
Guest

Posted: Fri Oct 14, 2005 8:39 pm    Post subject: Re: How to scan text pages for smallest-size pdf?

First of all, 2M is not big by today's attachment standards. It's only a
performance hit for dial-up users and people whose mail server admins
haven't caught up yet -- even Hotmail allows a 250M mailbox and 5M attachments!

That said, why not examine the 1M pdf and see what you can figure out? (You
can send it to me at pdf at dolman period ca -- ca, not com -- if you
want.) Check the properties, e-mail Adobe, test various settings (including
those suggested here). You could also ditch pdf and scan to jpg (very
compatible and compressible). Does it have to be pdf? How about ZIPping the
pdf file before attaching it?
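
One crude way to examine a PDF is to count the stream filter names that appear in its raw bytes. The filter names below are the standard PDF ones (CCITTFaxDecode is Group 3/4 fax, DCTDecode is JPEG, JBIG2Decode is JBIG2, FlateDecode is zlib/deflate). The function name and the file name are hypothetical, and this is a rough sketch rather than a PDF parser: anything stored inside compressed object streams will not show up.

```python
# Standard PDF stream filter names that are relevant to scanned pages.
FILTERS = (
    b"CCITTFaxDecode",
    b"JBIG2Decode",
    b"DCTDecode",
    b"JPXDecode",
    b"FlateDecode",
    b"LZWDecode",
)


def count_filters(pdf_bytes):
    """Count literal occurrences of each filter name in the raw PDF bytes."""
    return {name.decode(): pdf_bytes.count(b"/" + name) for name in FILTERS}


# A hand-written fragment resembling a 1-bit image dictionary, for demonstration.
sample = b"<< /Type /XObject /Subtype /Image /BitsPerComponent 1 /Filter /CCITTFaxDecode >>"
print(count_filters(sample))

# With a real file (hypothetical name):
#     with open("received.pdf", "rb") as f:
#         print(count_filters(f.read()))
```

A file dominated by CCITTFaxDecode or JBIG2Decode was most likely scanned as 1-bit line art, while one dominated by DCTDecode holds JPEG-compressed greyscale or colour images.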

Lots of solutions, lots of options.

Good luck. :-)

Andre Majorel
Guest

Posted: Fri Oct 14, 2005 11:22 pm    Post subject: Re: How to scan text pages for smallest-size pdf?

On 2005-10-14, Lorenzo J. Lucchini <ljlbox@tiscali.it> wrote:
Quote:
Andre Majorel wrote:
On 2005-10-14, Al <acunniff@advancedbionutrition.com> wrote:

[snip]

[...]
quantise to somewhere between 3 and 8 grey levels (pnmdepth 2)
and use a lossless compression algorithm like PNG (scanimage
--mode greyscale | pnmdepth 3 | pnmtopng >foo.png) or (b) use a
lossy algorithm like JPG.

Heeey, someone using Unix, SANE and NetPBM! I almost thought I was alone
here :-)

That's two of us, then. :-) I don't use GUIs unless I have to.

Quote:
What scanner are you using with SANE?

A small HP ScanJet C7670A with the automatic sheet feeder. The
colours are way off and it's rather more expensive than the
competition but it was the only sheet feeder I could find at the
time. A couple models from Epson et al. purported to have a
sheet feeder option but it turned out to be vapourware.

I'm happy with it but I'm thinking about a second, bigger,
scanner (A3 perhaps). Just testing the waters, you know. :-)

--
André Majorel <URL:http://www.teaser.fr/~amajorel/>
It's a good life, bowing to a tyrant.

Lorenzo J. Lucchini
Guest

Posted: Sat Oct 15, 2005 12:52 am    Post subject: Re: How to scan text pages for smallest-size pdf?

Andre Majorel wrote:
Quote:
On 2005-10-14, Al <acunniff@advancedbionutrition.com> wrote:

[snip]

[...]
quantise to somewhere between 3 and 8 grey levels (pnmdepth 2)
and use a lossless compression algorithm like PNG (scanimage
--mode greyscale | pnmdepth 3 | pnmtopng >foo.png) or (b) use a
lossy algorithm like JPG.

Heeey, someone using Unix, SANE and NetPBM! I almost thought I was alone
here :-)

What scanner are you using with SANE?


by LjL
ljlbox@tiscali.it

catfish@hotmall.com
Guest

Posted: Sat Oct 15, 2005 4:00 am    Post subject: Re: How to scan text pages for smallest-size pdf?

"Al" <acunniff@advancedbionutrition.com> wrote:
Quote:
Our office has a number of documents that are up to 10 pages long,
and we need to scan them and save them as pdfs. We want to make
these pdfs as small in file-size as possible, so that we don't
clog the email boxes of people we need to send these files to.
snip
Any tips are appreciated.

scan as BW, or Line Art, or one bit

do NOT scan as grey scale, half tone, or any of the color options.

catfish@hotmall.com
Guest

Posted: Sat Oct 15, 2005 4:01 am    Post subject: Re: How to scan text pages for smallest-size pdf?

"catfish@hotmall.com" <catfish@hotmall.com> wrote:
Quote:
"Al" <acunniff@advancedbionutrition.com> wrote:
Our office has a number of documents that are up to 10 pages long,
and we need to scan them and save them as pdfs. We want to make
these pdfs as small in file-size as possible, so that we don't
clog the email boxes of people we need to send these files to.
snip
Any tips are appreciated.

scan as BW, or Line Art, or one bit

do NOT scan as grey scale, half tone, or any of the color options.

i hate replying to myself -

scan at either 150 or 300 dpi. 72 is not enough, and 600 or more is
overkill.

Lorenzo J. Lucchini
Guest

Posted: Sat Oct 15, 2005 5:16 am    Post subject: Re: How to scan text pages for smallest-size pdf?

Dances With Crows wrote:
Quote:
On Fri, 14 Oct 2005 15:50:49 +0200, Lorenzo J. Lucchini staggered into
the Black Sun and said:

lostinspace wrote:

Our office has a number of documents that are up to 10 pages long,
and we need to scan them and save them as pdfs. We want to make these
pdfs as small in file-size as possible,

You have two options.

A third option I can think of... Convert the scan to vector graphics.
That won't have the accuracy problems of OCR, and it should still gain
a good size advantage.


What? Scanners produce raster images, and I'd think that in the general
case, raster->vector would *add* size rather than subtract it.

Uh? In the case of "block graphics" (I mean black and white graphics
with large areas of black and large areas of white) vector will
definitely be smaller...

I'm not really sure about text: if you scan at a low resolution and then
compress decently (such as the way you've mentioned below, that I
snipped), I think you could possibly be better off with raster.

On the other hand, you could scan at a higher resolution and then
convert to vector; this would have the advantage of being "infinite
resolution" -- not really, but you can print it at any size and not see
jaggies.

Quote:
[snip]

by LjL
ljlbox@tiscali.it

Don
Guest

Posted: Sat Oct 15, 2005 8:53 pm    Post subject: Re: How to scan text pages for smallest-size pdf?

On Fri, 14 Oct 2005 23:22:27 +0000 (UTC), Andre Majorel
<amajorel@teezer.fr> wrote:

Quote:
On 2005-10-14, Lorenzo J. Lucchini <ljlbox@tiscali.it> wrote:

Heeey, someone using Unix, SANE and NetPBM! I almost thought I was alone
here :-)

That's two of us, then. :-)

Make that two-and-a-half... ;o)

Quote:
I don't use GUIs unless I have to.

Hear! Hear!

I only use Linux occasionally (not for scanning, though) but never use
a GUI with it. Indeed, I always log in as root which drives all my
Linux acquaintances nuts! ;o)

Don.

Don
Guest

Posted: Sat Oct 15, 2005 8:53 pm    Post subject: Re: How to scan text pages for smallest-size pdf?

On Fri, 14 Oct 2005 19:01:37 -0400, "catfish@hotmall.com"
<catfish@hotmall.com> wrote:

Quote:
i hate replying to myself -

As someone once said:

I talk to myself because I like intelligent conversation. ;o)

Don.

All times are GMT