UTF8 encoding & Bash

Fri Jul 31 07:31:34 UTC 2009

On Thu, Jul 30, 2009 at 10:18:05AM -0400, Lennart Sorensen wrote:
> On Wed, Jul 29, 2009 at 09:48:53PM -0400, Madison Kelly wrote:
> > Hi all,
> >
> >   I've got a problem that I *thought* was perl, but turns out to be more 
> > fundamental. I switched to bash and the problem remains...
> >
> >   There is a program called 'fantasdic' which takes a Japanese character 
> > and returns information on it. When it outputs to STDOUT it's fine, but 
> > when the output is redirected to a file, 0 bytes are written. This only 
> > happens with certain characters, no less.
> >
> >   For example;
> >
> > -----------------------------------------------
> > #!/bin/bash
> >
> > echo "### Start 右 ###"
> > /usr/bin/fantasdic -o Japanese 右
> >
> > echo "### Start 林 ###"
> > /usr/bin/fantasdic -o Japanese 林
> > -----------------------------------------------
> >
> >   Run this (as a script or as individual commands) and the output is  
> > clearly there and complete.
> >
> >   However, if you redirect the output to a file, the first character  
> > will write out to the file where the second one will not. Actually, most  
> > of the output from the first character will write... it seems to  
> > double-encode at one point and bail, but I digress as I suspect solving  
> > the second command's problem will solve the first. :)
> >
> >   Any idea what's going on here?
> 
> The program stupidly requires access to X to run.   If it has a command
> line interface in addition to the X one, it should at least run without
> access to X when used on the command line.
> 
> Seems someone wrote that without realizing what libs they were pulling in.

Seems more like this is made to be a GTK+ app. It's hosted on GNOME:

    http://projects.gnome.org/fantasdic/

The console output is probably an afterthought. It's supposed to support
Win32, and there is code in there to detect a Win32 console too... but
the STDOUT sections all hard-code '\n' for line breaks instead of using
'\n' for unix and '\r\n' for a Windows console.

Also, once I installed it from SVN it would immediately crash with
backtraces (when using the '-o' stdout option). Apparently, it's set up
by default to look for dictionary files for things like English and
Japanese... only it doesn't define any default files, causing the thing
to crash. Even the GUI crashes if you try to search before fooling
around in the preferences. I had to switch it over to using DICT servers
and then it started working (because there are default URLs). There are
a lot of flaky parts in this software.

Their site even says "Fantasdic comes with pre-configured dictionaries
but dictionaries can be changed or added in the settings at any time."
Heh. If you call looking at a null filepath for definitions
'pre-configured' then yeah, it is.

> It also generates ^M at the end of lines, which is not unix compatible.
> Except the last line which contains two pieces of non UTF8 garbage.
> Looks like a bug for sure.

That non UTF8 garbage might be from something else, see below.

> Of course it may also be missing the output and not using it in the
> normal buffered way which might explain why the output is truncated.
> 
> Even running just the first fantasdic command gets truncated, never mind
> the rest.  Seems like it is just in general a very buggy program.
> It clearly doesn't use normal buffered output, nor does it seem to
> understand proper line endings.
> 

The no-output-to-file issue
---------------------------

 * If I replace all of the 'puts' calls with '$stderr.puts' in
   'command_line.rb' then run:

    fantasdic -o Japanese 林 2> tmp

   It works.

 * Seems that it _is_ just a buffering problem. In fantasdic.rb replace:

        define(ARGV[0], ARGV[1])

   With this:

        define(ARGV[0], ARGV[1])
        $stdout.flush

   Seems to work now. You could also do:

        $stdout.sync = true
        define(ARGV[0], ARGV[1])

   instead, which turns off Ruby's output buffering on STDOUT so every
   'puts' is written straight through.
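
To see the buffering effect in isolation, here is a minimal sketch. It
runs three tiny Ruby snippets as child processes with STDOUT attached to
a pipe, the same situation as redirecting to a file. Note the
assumptions: 'capture_child_stdout' and the three constants are names I
made up for the demo, and 'exit!' is only a stand-in for whatever abrupt
exit path fantasdic takes. I haven't confirmed that it actually exits
that way.

```ruby
require "rbconfig"

# Run a Ruby snippet in a child process whose STDOUT is a pipe (not a
# terminal) and return everything the child managed to write.
# (Hypothetical helper, not part of fantasdic.)
def capture_child_stdout(snippet)
  IO.popen([RbConfig.ruby, "-e", snippet], &:read)
end

# exit! skips Ruby's normal shutdown, so buffered output is never
# flushed. It stands in for fantasdic's real exit path (an assumption).
UNFLUSHED = 'puts "definition"; exit!'
FLUSHED   = 'puts "definition"; $stdout.flush; exit!'
SYNCED    = '$stdout.sync = true; puts "definition"; exit!'

{ "no flush" => UNFLUSHED, "flush" => FLUSHED, "sync" => SYNCED }.each do |label, snippet|
  puts "#{label}: #{capture_child_stdout(snippet).inspect}"
end
```

The first child should come back empty while the other two return the
line, which matches what I saw with and without the flush.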

The '\r' issue
--------------

 * Looks like the carriage returns are coming from the definition
   source. I didn't delve too far into the code that pings the DICT
   server, though. It seems the definitions are just printed out
   wholesale in the format that they were fetched in.

 * If you replace this line:

        puts d.body;

   with this line:

        puts d.body.gsub(/\r\n?/, "\n")

   This converts every '\r\n' (and any stray lone '\r') into a plain
   '\n', so the carriage returns from Windows-style 'newlines' are gone.

 * My suspicion is that either the DICT server is returning the
   definition with the '\r\n' in there (most likely) or the '\r\n' is
   part of the DICT protocol (IIRC, some protocols use '\r\n'; NNTP
   comes to mind, but I was reading a C# implementation of it) and the
   author of fantasdic doesn't realize that it needs to be stripped
   out.
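
The substitution is easy to try on its own. A small sketch, where
'normalize_newlines' and 'sample_body' are made-up names and the sample
text is invented rather than real DICT server output:

```ruby
# Normalize line endings: "\r\n" and any lone "\r" both become "\n".
# (Hypothetical helper wrapping the same gsub as above.)
def normalize_newlines(text)
  text.gsub(/\r\n?/, "\n")
end

# Invented sample, shaped like a definition fetched with CRLF endings.
sample_body = "migi\r\n  right (direction)\r\n"

# inspect makes any leftover "\r" visible instead of moving the cursor.
puts normalize_newlines(sample_body).inspect
```

Lines that already end in a bare '\n' pass through untouched, so it
should be safe to run on output that is already unix-style.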

Note: I'm operating off of the SVN version here, so there aren't any
upstream changes that I'm missing. There might be some differences,
though, between what I'm seeing and what's in a package management
system (apt, yum, etc.).

The 'non UTF8' Garbage
----------------------

 * It's probably from the DICT server output. The last thing that it
   prints to STDOUT is a 'puts d.body' call. See the '\r' issue.
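
If someone wants to check whether those trailing bytes really are
invalid UTF-8, something like the sketch below might help. It assumes a
much newer Ruby than fantasdic targets ('String#scrub' needs 2.1+), and
'utf8_report' plus the sample byte strings are made up for illustration;
they are not taken from fantasdic or from a real server reply.

```ruby
# Interpret raw bytes as UTF-8 and report whether they are well formed.
# The method name and the :valid/:cleaned keys are hypothetical.
def utf8_report(raw)
  text = raw.dup.force_encoding(Encoding::UTF_8)
  { valid: text.valid_encoding?, cleaned: text.scrub("?") }
end

whole     = "林\r\n".b               # complete character, as raw bytes
truncated = "林".b.byteslice(0, 2)   # same character cut off mid-sequence

[whole, truncated].each do |raw|
  report = utf8_report(raw)
  puts "valid=#{report[:valid]} cleaned=#{report[:cleaned].inspect}"
end
```

A body that reports 'valid=false' would point at the fetched data (or a
chopped-off write) rather than at the terminal.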

-- 
Brandon Sandrowicz

--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists