Problems with pdftotext
John Vetterli
jvetterli-zC6tqtfhjqE at public.gmane.org
Sun Oct 22 02:27:49 UTC 2006
On Sat, 21 Oct 2006, Zbigniew Koziol wrote:
> To solve your problem 1) I would use an additional perl script that converts
> "a sequence of 4 weird non-printable characters" into fi. That way could
> possibly be used to solve problem 2). I would not trust pdf2text .
> On Saturday 21 October 2006 21:00, Walter Dnes wrote:
>> 1) You will *NOT* find the string "fi" anywhere in the output. Where
>> you would expect "fi" in a word, you get a sequence of 4 weird
>> non-printable characters.
>> 2) Opening and closing quotes are sequences of 5 non-printable
>> characters. Ditto for apostrophes.
Long ago, some typesetter decided that the letter sequence "fi" looks better
if you print the two letters slightly overlapping. Later, the people who
came up with Unicode decided that the overlapping fi needs its own character
and assigned code point U+FB01 to "LATIN SMALL LIGATURE FI". I would guess
that your 4 weird characters are some sort of non-ASCII encoding of U+FB01.
Likewise, I would guess that the weird codes for your quotes and apostrophes
are for fancy opening or closing single or double quotes instead of the boring
both-opening-and-closing " and '.
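
To check that guess, a minimal Python sketch can list the likely code points
with their Unicode names and UTF-8 bytes. It assumes the output is UTF-8,
which the thread does not confirm; describe() and CANDIDATES are names made
up for this illustration. In UTF-8 each of these characters is three bytes,
so the counts of 4 and 5 reported above may come from a different encoding
or from how the viewer displayed them.

```python
import unicodedata

# Code points guessed at in this message (an assumption, not confirmed
# against the actual pdftotext output): the fi ligature and curly quotes.
CANDIDATES = ["\ufb01", "\u2018", "\u2019", "\u201c", "\u201d"]

def describe(ch):
    """Return (code point, Unicode name, UTF-8 bytes as hex) for one character."""
    utf8_hex = " ".join("%02x" % b for b in ch.encode("utf-8"))
    return ("U+%04X" % ord(ch), unicodedata.name(ch), utf8_hex)

for ch in CANDIDATES:
    print("  ".join(describe(ch)))
```

Comparing these byte sequences against a hex dump of the converted text
should show whether the weird characters really are these code points.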
The solution, of course, is as Zbigniew suggested -- use a quickie script to
do a search-and-replace.
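
Such a script could look like the sketch below. It is in Python rather than
perl, and it assumes the text file is UTF-8 and that the weird characters
really are the ligature and curly quotes guessed at above. The names
REPLACEMENTS and asciify() are hypothetical, chosen for this example.

```python
import sys

# Map each "fancy" character to a plain ASCII equivalent. The list is an
# assumption based on the guesses in this thread; extend it as needed.
REPLACEMENTS = {
    "\ufb01": "fi",   # LATIN SMALL LIGATURE FI
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
}

_TABLE = str.maketrans(REPLACEMENTS)

def asciify(text):
    """Replace the ligature and curly quotes with their ASCII spellings."""
    return text.translate(_TABLE)

# Only runs when a file name is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8") as f:
        sys.stdout.write(asciify(f.read()))
```

Saved under a name of your choosing, it would be run on the converted text
with the result redirected to a new file. If the file turns out not to be
UTF-8, the encoding argument to open() needs to change to match.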
This bit of useless info brought to you by:
JV
--
The Toronto Linux Users Group. Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists