Problems with pdftotext

Sun Oct 22 14:53:57 UTC 2006

On Sat, Oct 21, 2006 at 10:27:49PM -0400, John Vetterli wrote:

> Long ago, some typesetter decided that the letter sequence "fi" looks
> better if you print the two letters slightly overlapping.  Then later the
> people who came up with Unicode decided that the overlapping fi needs its
> own character and assigned code point FB01 to "LATIN SMALL LIGATURE FI".

  Interesting.  To quote the pdftotext manpage...

BUGS
     Some  PDF  files contain fonts whose encodings have been mangled beyond
     recognition.  There is no way (short of OCR) to extract text from these
     files.

> The solution, of course, is as Zbigniew suggested -- use a quickie script 
> to do a search-and-replace.

  That's what I ended up doing.  I copied the non-printable characters
into a sed script using Vim's visual block mode.  A bash script pipes
pdftotext's output directly to sed, avoiding an intermediate text file.

-- 
Walter Dnes <waltdnes-SLHPyeZ9y/tg9hUCZPvPmw at public.gmane.org>
In linux /sbin/init is Job #1
My musings on technology and security at http://techsec.blog.ca
--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/