strange characters when cleaning HTML with tidy?

Matt Price moptop99-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org
Mon Aug 20 11:10:01 UTC 2012


I think that typo might have derailed tidy -- fixing it, and making
the switch to char-encoding utf8 seems to have solved the problem.
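
For reference, here is a sketch of what the corrected config presumably
looks like. It assumes the single char-encoding option stands in for the
separate input-encoding and output-encoding lines (Tidy documents
char-encoding as applying to both input and output); the other options
are carried over unchanged from the config quoted below.

```
// Sketch only: the exact file that fixed the problem was not posted.
clean: yes
drop-proprietary-attributes: yes
drop-empty-paras: yes
output-html: yes
join-classes: yes
join-styles: yes
show-body-only: yes
force-output: yes
// typo fixed: was "preerve-entities"
preserve-entities: yes
// assumption: replaces the input-encoding/output-encoding pair
char-encoding: utf8
```

Something like "tidy -config tidy.conf exported.html > cleaned.html"
should apply it; both file names there are placeholders.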

On Sun, Aug 19, 2012 at 4:31 PM, Bob Jonkman <bjonkman-w5ExpX8uLjYAvxtiuMwx3w at public.gmane.org> wrote:
> Is this a typo or a transcription error? It looks to be missing an "s":
>
>> preerve-entities: yes
>
>
> Without having checked the Tidy docs, I suspect it should be
> "preserve-entities: yes". Perhaps when entities are preserved some of the
> odd characters will be turned into &mdash; and the like.
>
> --Bob.
>
>
> On 19/08/2012 1:02 PM, Matt Price wrote:
>>
>> hi,
>>
>> I am trying to use "tidy" to clean up the html generated by
>> libreoffice from an odt document.
>>
>> Since most of my stuff now moves through the web, I usually just work
>> in emacs and export to my blog.  But for highly structured documents
>> I'm still using libreoffice, which is fine till I try to export that
>> work to HTML and paste it into Wordpress.  When I try that, the
>> formatting ends up pretty terrible.
>>
>> so I tried using tidy as per this suggestion:
>>
>>
>> http://www.techrepublic.com/blog/opensource/how-to-convert-doc-and-odf-files-to-clean-and-lean-html/3708
>>
>> The HTML is much much cleaner, but something weird is happening with
>> non-ascii characters like "curly quotes."  You can see a minimal
>> example here:
>>
>> 'History and its Publics in a Digital Age' should be surrounded in
>> curly quotes, but instead I'm seeing
>>
>> “
>>
>> and
>>
>> â€
>>
>> The original libreoffice odt and the libreoffice-generated HTML are
>> both in Unicode (UTF-8), so I imagine there's some translation issue I
>> don't understand.  You can see the original export here:
>> http://sandbox.hackinghistory.ca/syllabus-original-exported.html
>>
>> Anyone able to replicate this problem and/or overcome it? I
>> wonder if the issue might be in my tidy config file, but it's pretty
>> straightforward:
>>
>> clean: yes
>> drop-proprietary-attributes: yes
>> drop-empty-paras: yes
>> output-html: yes
>> join-classes: yes
>> join-styles: yes
>> show-body-only: yes
>> force-output: yes
>> preerve-entities: yes
>> input-encoding: utf8
>> output-encoding: utf8
>>
>> Any help is much appreciated!  Thanks,
>> Matt
>> --
>> The Toronto Linux Users Group.      Meetings: http://gtalug.org/
>> TLUG requests: Linux topics, No HTML, wrap text below 80 columns
>> How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists
>>




