Problems with pdftotext
Walter Dnes
waltdnes-SLHPyeZ9y/tg9hUCZPvPmw at public.gmane.org
Sun Oct 22 14:53:57 UTC 2006
On Sat, Oct 21, 2006 at 10:27:49PM -0400, John Vetterli wrote:
> Long ago, some typesetter decided that the letter sequence "fi" looks
> better if you print the two letters slightly overlapping. Then later the
> people who came up with Unicode decided that the overlapping fi needs its
> own character and assigned code point U+FB01 to "LATIN SMALL LIGATURE FI".
Interesting. To quote the pdftotext manpage...
    BUGS
        Some PDF files contain fonts whose encodings have been mangled
        beyond recognition. There is no way (short of OCR) to extract
        text from these files.
> The solution, of course, is as Zbigniew suggested -- use a quickie script
> to do a search-and-replace.
That's what I ended up doing. I copied the non-printable characters
into a sed script using vim block mode. A bash script pipes pdftotext's
output directly to sed, avoiding an intermediate text file.
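
For anyone wanting to try the same thing, here is a minimal sketch of such
a filter. It assumes the offending characters are the Unicode f-ligatures
U+FB00 through U+FB04; the characters in my own script may differ, so
adjust the list to whatever shows up in your output. The function names
fix_ligatures and pdf_to_clean_text are made up for illustration.

```shell
#!/bin/bash
# Replace the Unicode f-ligatures with their plain ASCII spellings.
# Assumption: the problem characters are U+FB00..U+FB04; edit the list
# to match the characters that actually appear in your pdftotext output.
fix_ligatures() {
    sed -e 's/ﬀ/ff/g' \
        -e 's/ﬁ/fi/g' \
        -e 's/ﬂ/fl/g' \
        -e 's/ﬃ/ffi/g' \
        -e 's/ﬄ/ffl/g'
}

# Pipe pdftotext straight into the filter. Giving "-" as the output file
# name makes pdftotext write to stdout, so no intermediate file is needed.
pdf_to_clean_text() {
    pdftotext "$1" - | fix_ligatures
}
```

Usage would be along the lines of "pdf_to_clean_text report.pdf > report.txt",
where report.pdf stands in for whatever file you are converting.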
--
Walter Dnes <waltdnes-SLHPyeZ9y/tg9hUCZPvPmw at public.gmane.org> In linux /sbin/init is Job #1
My musings on technology and security at http://techsec.blog.ca