Problems with pdftotext

Zbigniew Koziol softquake-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org
Sun Oct 22 01:22:24 UTC 2006


This reminds me of the problems of converting between HTML and PDF. It does
not fully work, and it never will: the two formats do not fully agree on how
things are supposed to be displayed.

Some time ago a fellow asked me if I could convert any HTML to PDF. No,
absolutely not. What he meant was in fact not "HTML" but a web page. And web
pages are now often so complex that they are not only HTML anymore: they are
JavaScript using DOM programming, perhaps even Flash.

To solve your problem 1), I would use an additional Perl script that converts
"a sequence of 4 weird non-printable characters" into "fi". The same approach
could possibly be used to solve problem 2). I would not trust pdf2text.
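
The idea above can be sketched roughly as follows. This is only a sketch, in
Python rather than Perl, and the byte sequences in the table are assumptions:
they are the UTF-8 encodings of the Unicode "fi"/"fl" ligatures and curly
quotes, which may or may not be the sequences your converter really emits.
Check the actual bytes with a hex dump first and adjust the table. The names
REPLACEMENTS, clean and leftover_sequences are made up for illustration.

```python
import re

# ASSUMPTION: these keys are the UTF-8 encodings of the ligature and
# curly-quote code points. The real "weird" sequences in the converter
# output may differ; inspect the file and edit this table accordingly.
REPLACEMENTS = {
    "\ufb01".encode("utf-8"): b"fi",  # U+FB01 LATIN SMALL LIGATURE FI
    "\ufb02".encode("utf-8"): b"fl",  # U+FB02 LATIN SMALL LIGATURE FL
    "\u201c".encode("utf-8"): b'"',   # U+201C left double quotation mark
    "\u201d".encode("utf-8"): b'"',   # U+201D right double quotation mark
    "\u2018".encode("utf-8"): b"'",   # U+2018 left single quotation mark
    "\u2019".encode("utf-8"): b"'",   # U+2019 right single quote/apostrophe
}


def clean(data: bytes, table=REPLACEMENTS) -> bytes:
    """Replace every known byte sequence with its plain-ASCII equivalent."""
    for weird, plain in table.items():
        data = data.replace(weird, plain)
    return data


def leftover_sequences(data: bytes) -> set:
    """Return the distinct runs of bytes that are neither printable ASCII
    nor tab/newline/carriage return, so unknown sequences can be added
    to the table."""
    return set(re.findall(rb"[^\x09\x0a\x0d\x20-\x7e]+", data))
```

To use it, read the converted text in binary mode, pass it through clean(),
and print leftover_sequences() on the result to see what is still unhandled.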

zb.

On Saturday 21 October 2006 21:00, Walter Dnes wrote:
>   I intend to write a blog entry, and possibly send a letter about Ms.
> Cavoukian's support of "The 7 Laws".  I want to include a point-by-point
> refutation of some of her more outrageous whoppers.  I downloaded the
> PDF file from the Ontario government website, and ran it through
> pdf2text, so I could include quotes from it.  Let's just say that the
> translation didn't come out 100% perfect.  The problems include...
>
>   1) You will *NOT* find the string "fi" anywhere in the output.  Where
> you would expect "fi" in a word, you get a sequence of 4 weird
> non-printable characters.
>
>   2) Opening and closing quotes are sequences of 5 non-printable
> characters.  Ditto for apostrophes.
>
>   Item 2 might be an issue with weird fonts, but item 1 looks like a
> bug.  A couple of questions...
>   1) Can I over-ride this behaviour with a config file?
>   2) If not, how do I enter control characters and high-bit ascii into
> sed?  I've copied the weird characters with a vim block copy, and it
> works, but I'd prefer \0nnn format, if possible.
>
>   I do have a working sed command file, but I don't like the kludgy look
> of it.
--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists




