Problems with pdftotext

Walter Dnes waltdnes-SLHPyeZ9y/tg9hUCZPvPmw at public.gmane.org
Sun Oct 22 01:00:01 UTC 2006


  I intend to write a blog entry, and possibly send a letter about Ms.
Cavoukian's support of "The 7 Laws".  I want to include a point-by-point
refutation of some of her more outrageous whoppers.  I downloaded the
PDF file from the Ontario government website, and ran it through
pdftotext, so I could include quotes from it.  Let's just say that the
translation didn't come out 100% perfect.  The problems include...

  1) You will *NOT* find the string "fi" anywhere in the output.  Where
you would expect "fi" in a word, you get a sequence of 4 weird
non-printable characters.

  2) Opening and closing quotes are sequences of 5 non-printable
characters.  Ditto for apostrophes.
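
  To pin down exactly which bytes make up these sequences, the output
can be scanned for runs of non-printable bytes and each run dumped in
hex.  What follows is only a minimal sketch in Python; the function name
find_non_ascii_runs is made up for illustration, and it assumes the
pdftotext output has been read into memory as raw bytes.

```python
import re

# Anything that is not tab, LF, CR, or printable 7-bit ASCII.
NON_PRINTABLE_RUN = re.compile(rb"[^\x09\x0a\x0d\x20-\x7e]+")


def find_non_ascii_runs(data: bytes) -> dict:
    """Count each distinct run of non-printable bytes, keyed by its hex dump."""
    counts = {}
    for match in NON_PRINTABLE_RUN.finditer(data):
        key = " ".join(f"{byte:02x}" for byte in match.group())
        counts[key] = counts.get(key, 0) + 1
    return counts
```

  Calling it on the contents of the converted file (for example
open(path, "rb").read(), where path is wherever the output was saved)
gives a short table of the distinct sequences and how often each one
occurs, which is a reasonable starting point for any search-and-replace.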

  Item 2 might be an issue with weird fonts, but item 1 looks like a
bug.  A couple of questions...
  1) Can I override this behaviour with a config file?
  2) If not, how do I enter control characters and high-bit ASCII into
sed?  I've copied the weird characters with a vim block copy, and it
works, but I'd prefer \0nnn format, if possible.
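
  For comparison, the same kind of substitution can be written with
explicit escapes rather than pasted characters.  This is a sketch in
Python, not the sed command file mentioned here.  It assumes the weird
sequences are the UTF-8 encodings of the "fi" ligature (U+FB01) and of
typographic quotes (U+2018, U+2019, U+201C, U+201D); that is a guess,
and the 4- and 5-character counts above suggest the real bytes may
differ, so the mapping should be checked against the actual file.  The
names REPLACEMENTS and asciify are made up for illustration.

```python
# Assumed mapping from typographic characters to plain ASCII.
# Adjust it to whatever the file really contains.
REPLACEMENTS = {
    "\ufb01": "fi",  # "fi" ligature
    "\ufb02": "fl",  # "fl" ligature
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote / apostrophe
}

_TABLE = str.maketrans(REPLACEMENTS)


def asciify(text: str) -> str:
    """Replace the mapped typographic characters with ASCII equivalents."""
    return text.translate(_TABLE)
```

  Writing every character as a \uXXXX escape keeps the source file
itself pure ASCII, which is the property the \0nnn form would give.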

  I do have a working sed command file, but I don't like the kludgy look
of it.

-- 
Walter Dnes <waltdnes-SLHPyeZ9y/tg9hUCZPvPmw at public.gmane.org> In linux /sbin/init is Job #1
My musings on technology and security at http://techsec.blog.ca
--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists




