strange characters when cleaning HTML with tidy?

Matt Price moptop99-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org
Sun Aug 19 17:02:46 UTC 2012


hi,

I am trying to use "tidy" to clean up the html generated by
libreoffice from an odt document.

Since most of my stuff now moves through the web, I usually just work
in emacs and export to my blog.  But for highly structured documents
I'm still using libreoffice, which is fine until I try to export that
work to HTML and paste it into Wordpress.  When I try that, the
formatting ends up pretty terrible.

So I tried using tidy as per this suggestion:

http://www.techrepublic.com/blog/opensource/how-to-convert-doc-and-odf-files-to-clean-and-lean-html/3708

The HTML is much, much cleaner, but something weird is happening with
non-ASCII characters like "curly quotes."  You can see a minimal
example here:

'History and its Publics in a Digital Age' should be surrounded by
curly quotes, but instead I'm seeing

“

and

â€

The original libreoffice odt and the libreoffice-generated HTML are
both in Unicode (UTF-8), so I imagine there's some translation issue I
don't understand.  You can see the original export here:
http://sandbox.hackinghistory.ca/syllabus-original-exported.html
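
One possible reading of the symptom, sketched in Python: the garbage is
what UTF-8 curly quotes look like when something decodes the bytes as
Windows-1252.  This is an assumption about the cause, not a confirmed
diagnosis, and the helper name misdecode is made up for illustration.

```python
# Assumption: the text is stored as UTF-8 but read back as Windows-1252.
LEFT_QUOTE = "\u201c"   # left curly double quote
RIGHT_QUOTE = "\u201d"  # right curly double quote


def misdecode(text: str) -> str:
    """Encode text as UTF-8, then (wrongly) decode it as Windows-1252.

    Bytes with no Windows-1252 mapping (such as 0x9D) become U+FFFD,
    which many viewers render as nothing or as a box.
    """
    return text.encode("utf-8").decode("cp1252", errors="replace")


if __name__ == "__main__":
    # repr() keeps the output readable even on a non-UTF-8 terminal.
    print(repr(misdecode(LEFT_QUOTE)))
    print(repr(misdecode(RIGHT_QUOTE)))
```

The left quote becomes three visible characters, while the last byte of
the right quote has no Windows-1252 mapping, which would explain why the
second sequence looks one character shorter.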

Is anyone able to replicate and/or overcome this problem? I
wonder if the issue might be in my tidy config file, but it's pretty
straightforward:

clean: yes
drop-proprietary-attributes: yes
drop-empty-paras: yes
output-html: yes
join-classes: yes
join-styles: yes
show-body-only: yes
force-output: yes
preserve-entities: yes
input-encoding: utf8
output-encoding: utf8
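
One way to narrow down whether the config is to blame is to look at the
raw bytes tidy writes rather than at how a browser renders them.  The
Python sketch below is only an illustration: the function name classify
and its return strings are hypothetical, and it checks just the left
curly quote.

```python
# Byte patterns for a left curly double quote (U+201C).
GOOD = "\u201c".encode("utf-8")                  # correct UTF-8 bytes
DOUBLE = GOOD.decode("cp1252").encode("utf-8")   # mojibake re-saved as UTF-8


def classify(data: bytes) -> str:
    """Report how a left curly quote is stored in raw file bytes."""
    if DOUBLE in data:
        # The file itself contains the garbage: the conversion step broke it.
        return "double-encoded"
    if GOOD in data:
        # The bytes are fine: whatever displays the file is mis-decoding it.
        return "utf-8 ok"
    return "no curly quote found"


if __name__ == "__main__":
    sample = "\u201cHistory and its Publics in a Digital Age\u201d"
    print(classify(sample.encode("utf-8")))
```

To use it on real output, read the file in binary mode (for example
open(path, "rb").read()) and pass the bytes to classify.  A result of
"utf-8 ok" would point at the viewer or a missing charset declaration
rather than at tidy.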

Any help is much appreciated!  Thanks,
Matt
--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists




