Problems with pdftotext

John Vetterli jvetterli-zC6tqtfhjqE at public.gmane.org
Sun Oct 22 02:27:49 UTC 2006


On Sat, 21 Oct 2006, Zbigniew Koziol wrote:
> To solve your problem 1) I would use an additional perl script that converts
> "a sequence of 4 weird non-printable characters" into fi. That way could
> possibly be used to solve problem 2). I would not trust pdf2text .
> On Saturday 21 October 2006 21:00, Walter Dnes wrote:
>>   1) You will *NOT* find the string "fi" anywhere in the output.  Where
>> you would expect "fi" in a word, you get a sequence of 4 weird
>> non-printable characters.
>>   2) Opening and closing quotes are sequences of 5 non-printable
>> characters.  Ditto for apostrophes.

Long ago, some typesetter decided that the letter sequence "fi" looks better
if you print the two letters slightly overlapping.  Later, the people who
came up with Unicode decided that the overlapping fi needs its own character
and assigned code point U+FB01 to "LATIN SMALL LIGATURE FI".  I would guess
that your 4 weird characters are some sort of non-ASCII encoding of U+FB01.
Likewise, I would guess that the weird codes for your quotes and apostrophes
are for fancy opening or closing single or double quotes instead of the
boring both-opening-and-closing " and '.

The solution, of course, is as Zbigniew suggested -- use a quickie script to 
do a search-and-replace.
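
For what it's worth, here is a rough sketch of such a script in Python
rather than perl.  It assumes the pdftotext output has already been decoded
into proper Unicode text (e.g. it was UTF-8), which I have not verified for
your file; if the weird characters are in some other encoding, the table
would need byte sequences instead.  The name asciify and the choice of which
characters to map are mine, not anything pdftotext provides.

```python
# Map typographic characters to plain ASCII equivalents.
# Assumption: the input text is already decoded Unicode (str), not raw bytes.
REPLACEMENTS = {
    "\ufb01": "fi",  # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",  # LATIN SMALL LIGATURE FL
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK (also used as apostrophe)
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
}

# str.maketrans accepts single-character keys mapped to replacement strings.
_TABLE = str.maketrans(REPLACEMENTS)


def asciify(text):
    """Return text with ligatures and curly quotes replaced by ASCII."""
    return text.translate(_TABLE)
```

To use it, read the pdftotext output with open(path, encoding="utf-8"),
pass the contents through asciify(), and write the result back out.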

This bit of useless info brought to you by:
JV
--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists




