strange characters when cleaning HTML with tidy?

Bob Jonkman bjonkman-w5ExpX8uLjYAvxtiuMwx3w at public.gmane.org
Sun Aug 19 20:31:01 UTC 2012


Is this a typo or a transcription error? It looks to be missing an "s":

> preerve-entities: yes

Without having checked the Tidy docs, I suspect it should be 
"preserve-entities: yes". Perhaps when entities are preserved some of 
the odd characters will be turned into &mdash; and the like.
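
One possible explanation for the odd characters, not confirmed anywhere
in this thread, is that the UTF-8 bytes are being read back with a
single-byte codec somewhere along the way. The sketch below assumes
that codec is Windows-1252 (cp1252); the helper names misdecode and
repair are made up for illustration and have nothing to do with Tidy.

```python
# Sketch: reproduce the garbled curly quotes by decoding UTF-8 bytes
# with the wrong codec. Windows-1252 is an assumption, not a diagnosis.

def misdecode(text: str, wrong_codec: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode those bytes with the wrong codec."""
    # errors="replace" substitutes U+FFFD for bytes the codec cannot map.
    return text.encode("utf-8").decode(wrong_codec, errors="replace")

def repair(garbled: str, wrong_codec: str = "cp1252") -> str:
    """Undo the mistake; only works if every byte survived the round trip."""
    return garbled.encode(wrong_codec).decode("utf-8")

left = misdecode("\u201c")   # left curly quote, UTF-8 bytes E2 80 9C
right = misdecode("\u201d")  # right curly quote, UTF-8 bytes E2 80 9D

# The left quote turns into three visible characters.
print(left == "\u00e2\u20ac\u0153")    # True
# Byte 9D has no cp1252 mapping in Python, so the third character is lost.
print(right == "\u00e2\u20ac\ufffd")   # True
# The left quote can still be recovered because no byte was dropped.
print(repair(left) == "\u201c")        # True
```

If this is what is happening, the closing quote would show up as only
two visible characters, because its third byte has no printable form in
that codec.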

--Bob.

On 19/08/2012 1:02 PM, Matt Price wrote:
> hi,
>
> I am trying to use "tidy" to clean up the html generated by
> libreoffice from an odt document.
>
> Since most of my stuff now moves through the web, I usually just work
> in emacs and export to my blog.  But for highly structured documents
> I'm still using libreoffice, which is fine till I try to export that
> work to HTML and paste it into Wordpress.  When I try that, the
> formatting ends up pretty terrible.
>
> so I tried using tidy as per this suggestion:
>
> http://www.techrepublic.com/blog/opensource/how-to-convert-doc-and-odf-files-to-clean-and-lean-html/3708
>
> The HTML is much much cleaner, but something weird is happening with
> non-ascii characters like "curly quotes."  You can see a minimal
> example here:
>
> 'History and its Publics in a Digital Age' should be surrounded in
> curly quotes, but instead I'm seeing
>
> “
>
> and
>
> â€
>
> The original libreoffice odt and the libreoffice-generated HTML are
> both in Unicode (UTF-8), so I imagine there's some translation issue I
> don't understand.  You can see the original export here:
> http://sandbox.hackinghistory.ca/syllabus-original-exported.html
>
> Anyone able to replicate this problem and/or overcome it? I
> wonder if the issue might be in my tidy config file, but it's pretty
> straightforward:
>
> clean: yes
> drop-proprietary-attributes: yes
> drop-empty-paras: yes
> output-html: yes
> join-classes: yes
> join-styles: yes
> show-body-only: yes
> force-output: yes
> preerve-entities: yes
> input-encoding: utf8
> output-encoding: utf8
>
> Any help is much appreciated!  Thanks,
> Matt
> --
> The Toronto Linux Users Group.      Meetings: http://gtalug.org/
> TLUG requests: Linux topics, No HTML, wrap text below 80 columns
> How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists
>