OCR under Linux

Lennart Sorensen lsorense-1wCw9BSqJbv44Nm34jS7GywD8/FfD2ys at public.gmane.org
Tue Sep 13 13:18:24 UTC 2005


On Tue, Sep 06, 2005 at 08:51:26PM +0000, Rob Sutherland wrote:
> I've got a client that I'm helping to escape the claws of the Great
> Beast - he's swinging towards Linux for his next desktop. The only
> sticking point is that he wants to scan newspaper articles and OCR
> them. I don't do much scanning so I'm not sure about the state of the
> art in Linux OCR. Anyone have any recommendations, tips etc?

I haven't tried any of these myself, but here is a first place to look:
athlon:~# apt-cache search ocr
clara - Free OCR program for Unix Systems
gocr - A command line OCR
gocr-doc - gocr documentation
gocr-gtk - A GTK wrapper around gocr
gocr-tk - A tcl/tk wrapper around gocr
gstreamer0.8-misc - Collection of various GStreamer plugins
kaudiocreator - CD ripper and audio encoder frontend for KDE
kooka - scanner program for KDE
libgocr-dev - API set to write your own OCR engine - development files
libgocr-doc - API set to write your own OCR engine - documentation
libgocr0 - API set to write your own OCR engine - runtime libs
nec - NEC2 Antenna Modelling System
ocrad - Optical Character Recognition program
quiteinsane - A Qt based X11 frontend for SANE (Scanner Access Now Easy)
ksocrat - English/Russian and Russian/English Dictionary
ksocrat-data - English and Russian KSocrat data files
ksubtitleripper - GUI for KDE to rip DVD subtitles
pymusique - iTMS client
gstreamer-misc - Collection of various GStreamer plugins

There are a few false hits in there too: kaudiocreator and ksocrat, for
example, only match because "ocr" appears inside their names.
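
If the list needs trimming, something like the following might do as a
rough first pass. It is only a sketch: mentions_ocr and
likely_ocr_packages are made-up helper names, not part of apt, and the
pattern is a heuristic that looks for a word starting or ending in "ocr"
(or the phrase "optical character") in each package's full record.

```shell
# Heuristic filter: succeeds if the text on stdin contains a word that
# starts or ends with "ocr" (OCR, gocr, ocrad, libgocr0, ...) or the
# phrase "optical character". Names such as kaudiocreator or ksocrat,
# where "ocr" sits in the middle of a word, do not count.
mentions_ocr() {
    grep -Eqi '(^|[^[:alpha:]])ocr|ocr([^[:alpha:]]|$)|optical character'
}

# Hypothetical wrapper: print only those search hits whose full
# "apt-cache show" record passes the filter above.
likely_ocr_packages() {
    apt-cache search ocr | cut -d' ' -f1 | while read -r pkg; do
        if apt-cache show "$pkg" | mentions_ocr; then
            echo "$pkg"
        fi
    done
}
```

Running likely_ocr_packages should then list just the package names that
survive the filter. A package that really is unrelated but happens to
mention OCR in its long description would still get through, so the
result is worth a quick look by eye.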

Details on the more obvious ones above:

Package: clara
Priority: optional
Section: text
Installed-Size: 838
Maintainer: Eduardo Marcel Macan <macan-8fiUuRrzOP0dnm+yROfE0A at public.gmane.org>
Architecture: i386
Version: 0.9.9-1.1
Depends: libc6 (>= 2.3.1-1), xlibs (>> 4.1.0)
Suggests: perl
Filename: pool/main/c/clara/clara_0.9.9-1.1_i386.deb
Size: 328666
MD5sum: a1a171738d30aa17ad9aa5b73d2b9285
Description: Free OCR program for Unix Systems
 Clara OCR is a free (GPL) OCR for systems that support
 the C library and the X window system (e.g. most
 flavours of Unix).
 .
 Clara OCR is intended for large scale digitalization
 projects. It features a powerful GUI and a web interface
 for cooperative digitalization of books.
Tag: accessibility::ocr, interface::web, interface::x11, role::sw:application, use::converting, works-with::image:raster, x11::application

Package: ocrad
Priority: optional
Section: graphics
Installed-Size: 380
Maintainer: Miguel Gea Milvaques <debian-D+u8wzPBKT2B+jHODAdFcQ at public.gmane.org>
Architecture: i386
Version: 0.12-2
Depends: libc6 (>= 2.3.5-1), libgcc1 (>= 1:4.0.1), libstdc++6 (>= 4.0.1)
Filename: pool/main/o/ocrad/ocrad_0.12-2_i386.deb
Size: 134628
MD5sum: 32187a0bb8a6cb561d4f528b540e0a53
Description: Optical Character Recognition program
 GNU Ocrad is an OCR (Optical Character Recognition) program based on a
 feature extraction method. It reads a bitmap image in pbm format and
 produces text in byte (8-bit) or UTF-8 formats.
 .
 Ocrad includes a layout analyzer able to separate the columns or blocks
 of text normally found on printed pages.
 Homepage: http://www.gnu.org/software/ocrad/ocrad.html
Tag: interface::commandline, role::sw:utility, use::converting, works-with::image:raster

Package: quiteinsane
Priority: optional
Section: graphics
Installed-Size: 1904
Maintainer: Aurelien Jarno <aurel32-8fiUuRrzOP0dnm+yROfE0A at public.gmane.org>
Architecture: i386
Version: 0.10-8
Depends: libaudio2, libc6 (>= 2.3.2.ds1-4), libfontconfig1 (>= 2.2.1), libfreetype6 (>= 2.1.5-1), libgcc1 (>= 1:3.4.1-3), libice6 | xlibs (>> 4.1.0), libieee1284-3, libjpeg62, libpng12-0 (>= 1.2.8rel), libqt3c102-mt (>= 3:3.3.3), libsane (>= 1.0.11-3), libsm6 | xlibs (>> 4.1.0), libstdc++5 (>= 1:3.3.4-1), libtiff4, libusb-0.1-4 (>= 1:0.1.10a), libx11-6 | xlibs (>> 4.1.0), libxcursor1 (>> 1.1.2), libxext6 | xlibs (>> 4.1.0), libxft2 (>> 2.1.1), libxrandr2 | xlibs (>> 4.3.0), libxrender1, libxt6 | xlibs (>> 4.1.0), zlib1g (>= 1:1.2.1), gocr
Filename: pool/main/q/quiteinsane/quiteinsane_0.10-8_i386.deb
Size: 846488
MD5sum: 3c05979c1a2b640c9f78ce8c99a11ecf
Description: A Qt based X11 frontend for SANE (Scanner Access Now Easy)
 QuiteInsane is a graphical frontend for SANE (Scanner Access Now Easy). It
 can save an image to a file in a variety of image formats, send an image to
 a printer or do OCR (Optical Character Recognition) using gocr.
 .
 SANE stands for "Scanner Access Now Easy" and is an application programming
 interface (API) that provides standardized access to any raster image scanner
 hardware (flatbed scanner, hand-held scanner, video- and still-cameras,
 frame-grabbers, etc.).
 .
  Author:   Michael Herder <crapsite-hi6Y0CQ0nG0 at public.gmane.org>
  Homepage: http://quiteinsane.sourceforge.net
Tag: interface::x11, uitoolkit::qt, use::downloading, works-with::image:raster, x11::application

Package: gocr
Priority: optional
Section: graphics
Installed-Size: 620
Maintainer: Cosimo Alfarano <kalfa-8fiUuRrzOP0dnm+yROfE0A at public.gmane.org>
Architecture: i386
Version: 0.39-5
Depends: libc6 (>= 2.3.2.ds1-4), libnetpbm10
Recommends: libjpeg-progs, bzip2, netpbm, transfig
Suggests: gocr-doc
Filename: pool/main/g/gocr/gocr_0.39-5_i386.deb
Size: 306308
MD5sum: 3e11884d80d06716ce92726513724010
Description: A command line OCR
 gocr is a multi-platform OCR (Optical Character Recognition) program.
 .
 It can read pnm, pbm, pgm, ppm, some pcx and tga image files.
 .
 Currently the program should be able to handle well scans that have their text
 in one column and do not have tables. Font sizes of 20 to 60
 pixels are supported.
 .
 If you want to write your own OCR, libgocr is provided in a separate
 package. Documentation and graphical wrapper are provided in separated
 packages, too.
Tag: accessibility::ocr, interface::commandline, made-of::lang:c, role::sw:application, use::converting, works-with::image:raster

Maybe something there works.
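
Going only by the descriptions above (so untested), running either
command-line engine on an already-scanned page would look roughly like
this. The file names are placeholders and the helper functions are made
up for illustration; gocr's -i/-o and ocrad's -o are the input and
output file options as I understand them from their documentation.

```shell
# Set DRY_RUN=1 to print each command instead of running it, which is
# handy for checking what would happen before the packages are installed.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# gocr: reads pnm/pbm/pgm/ppm; per its description it is best suited to
# single-column text with glyphs roughly 20 to 60 pixels high.
#   $1 = scanned image, $2 = text file to write
ocr_with_gocr() {
    run gocr -i "$1" -o "$2"
}

# ocrad: reads pbm and includes a layout analyzer for columns, which
# may suit newspaper pages better.
#   $1 = scanned image, $2 = text file to write
ocr_with_ocrad() {
    run ocrad -o "$2" "$1"
}
```

For example, "ocr_with_ocrad article.pbm article.txt" after scanning the
page to article.pbm. The scan itself could come from any SANE frontend
(quiteinsane above, or scanimage on the command line); which resolution
and mode options are available depends on the scanner driver.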

Lennart Sorensen
--
The Toronto Linux Users Group.      Meetings: http://tlug.ss.org
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://tlug.ss.org/subscribe.shtml