
Re: OCR - 100% not even close





Je3 at aol_com wrote:

> Wright
>
> I work with document processing every day. I have yet to see an OCR program that would even come close to 75-80% accuracy. The only way current OCR programs even approach this is with a freshly typed or printed page with a standard fixed-width font.

Jim, you must be working with *both* a poor, low-res scanner *and* a weak
OCR program. I agree that both are out there and are all too common in the
corporate environment. Sounds a lot like law-office systems. ;-)

I used to design scanners, and have lots of experience with them. They have
come a long way in the last couple of years.

For good OCR, you do need a high-resolution (1200-2400 dpi) scanner with
well-chosen thresholding settings. [That means you also need a
screamingly fast modern PC.] The scanner software, its settings, and the OCR
program need to be properly matched to get consistently satisfactory
results. IME that *never* seems to happen in the office environment. The
results are then as you point out. [Known as GIGO: garbage in, garbage out.]
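
To make the thresholding point concrete, here is a minimal Python sketch of
what a threshold setting does to a scan before the OCR program ever sees it.
The pixel values, the function name, and both threshold values are made up
for illustration; real scanner software exposes this setting in its own way.

```python
def binarize(gray_rows, threshold):
    """Turn 8-bit grayscale rows into a 1-bit image: 1 = ink, 0 = paper."""
    return [[1 if pixel < threshold else 0 for pixel in row]
            for row in gray_rows]

# Hypothetical samples across a faded character stroke
# (0 = black, 255 = white). The stroke edges are only mid-gray.
scan = [
    [250, 140, 60, 145, 248],
    [252, 150, 55, 138, 251],
]

too_low = binarize(scan, 100)  # only the darkest core counts as ink
better = binarize(scan, 180)   # the faded stroke edges are kept as ink

for row in too_low:
    print(row)
for row in better:
    print(row)
```

With the lower threshold the stroke shrinks to a single pixel column, which
is the kind of damaged input that makes an otherwise capable OCR program
look bad.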

A good system will drop below 95% only on old mimeographs or typing done
with a very worn-out ribbon. Otherwise it does a wonderful job on point
sizes down to 6 and even 4 points, on any reasonable printing in almost any
normal font. You would only need to enlarge the originals for an ancient
300-600 dpi scanner.
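
The arithmetic behind the resolution point is easy to check. A rough Python
sketch, assuming the usual 72 points per inch and treating the point size as
the nominal height of the type (actual letters are somewhat smaller); the
function name is just for illustration:

```python
POINTS_PER_INCH = 72  # 1 typographic point = 1/72 inch

def nominal_height_px(point_size, dpi):
    """Nominal height of type, in scanner pixels, at a given resolution."""
    return point_size * dpi / POINTS_PER_INCH

for dpi in (300, 600, 1200, 2400):
    six = nominal_height_px(6, dpi)
    four = nominal_height_px(4, dpi)
    print(f"{dpi:>4} dpi: 6 pt = {six:.0f} px, 4 pt = {four:.0f} px")
```

Under these assumptions, 4-point type is only about 17 pixels tall at 300 dpi
but about 67 pixels tall at 1200 dpi, which is why enlarging the original
only matters for the older scanners.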

I don't have any really old JAKAs but I bet they were done on competent
offset presses from the beginning, from carbon-ribbon typed masters. If so,
the OCR process shouldn't be very tough.

I considered the earlier suggestion to use Adobe Acrobat, and came down
strongly opposed for a variety of reasons. The business of the AKA is not
concealing information or making it hard to handle -- it's the fish and the
hobby. The costs should be recoverable, of course, but I would download one
or two articles for my own use from the web site. If I wanted more, I would
buy the stuff on CDs to help defray expenses. That would be the only
practical way to access such a large amount of data, anyway. If available, I
would buy it online, for download, and burn my own CDs. The incremental cost
to AKA would be the server time (not a lot).

I haven't been able to do keyword searches in Adobe Acrobat docs, but that
may just be me. Making the data searchable would be 90% of the value of
converting to electronic form. Let's at least use one that makes it easy.
OK?

We now return you to your regularly scheduled fish discussions. ;-)

Wright


-- 
Wright Huntley, Fremont CA, USA, 510 494-8679  huntleyone at home dot com

         "DEMOCRACY" is two wolves and a lamb voting on lunch.
     "LIBERTY" is a well-armed lamb denying enforcement of the vote.
             *** http://www.self-gov.org/index.html ***
