strange characters when cleaning HTML with tidy?
Matt Price
moptop99-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org
Sun Aug 19 17:02:46 UTC 2012
hi,
I am trying to use "tidy" to clean up the html generated by
libreoffice from an odt document.
Since most of my stuff now moves through the web, I usually just work
in emacs and export to my blog. But for highly structured documents
I'm still using libreoffice, which is fine until I try to export that
work to HTML and paste it into Wordpress. When I try that, the
formatting ends up pretty terrible.
So I tried using tidy as per this suggestion:
http://www.techrepublic.com/blog/opensource/how-to-convert-doc-and-odf-files-to-clean-and-lean-html/3708
The HTML is much, much cleaner, but something weird is happening with
non-ascii characters like "curly quotes." You can see a minimal
example here:
'History and its Publics in a Digital Age' should be surrounded by
curly quotes, but instead I'm seeing
â€œ
and
â€
The original libreoffice odt and the libreoffice-generated HTML are
both in Unicode (UTF-8), so I imagine there's some translation issue I
don't understand. You can see the original export here:
http://sandbox.hackinghistory.ca/syllabus-original-exported.html
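For what it's worth, sequences like these are what you get when UTF-8
bytes are decoded as Windows-1252. Here is a minimal Python sketch of
that, assuming this is the mismatch involved (I have not confirmed it
is what happens in the tidy output; the function name is made up for
illustration):

```python
# Assumption: the curly quotes are stored as UTF-8 bytes but are being
# interpreted as Windows-1252 (cp1252) somewhere along the way.

def misread_utf8_as_cp1252(text):
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    # errors="replace" is needed because byte 0x9D (the last byte of the
    # right curly quote) has no character assigned in Python's cp1252.
    return text.encode("utf-8").decode("cp1252", errors="replace")

left_quote = "\u201c"    # left curly double quote, UTF-8 bytes E2 80 9C
right_quote = "\u201d"   # right curly double quote, UTF-8 bytes E2 80 9D

print(repr(misread_utf8_as_cp1252(left_quote)))
print(repr(misread_utf8_as_cp1252(right_quote)))
```

If the same three-character sequences show up, the bytes in the file
are probably fine and whatever is displaying them is guessing the
wrong encoding.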
Is anyone able to replicate this problem and/or overcome it? I
wonder if the issue might be in my tidy config file, but it's pretty
straightforward:
clean: yes
drop-proprietary-attributes: yes
drop-empty-paras: yes
output-html: yes
join-classes: yes
join-styles: yes
show-body-only: yes
force-output: yes
preserve-entities: yes
input-encoding: utf8
output-encoding: utf8
Any help is much appreciated! Thanks,
Matt