version-control/collaboration on odf documents?

Christopher Browne cbbrowne-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org
Mon Jun 15 18:37:02 UTC 2009


On Mon, Jun 15, 2009 at 1:52 PM, Evan Leibovitch<evan-ieNeDk6JonTYtjvyW6yDsg at public.gmane.org> wrote:
> Rajinder Yadav wrote:
>> I use Subversion (SVN) at work and home; two front-end UI clients on Linux are eSVN and RapidSVN.
>>
> There are many version control systems out there. But generally when
> serving binary files, they lose much of their benefit (such as diffs).
> Working with binary, most version control systems are little more than
> glorified archivers and checksum comparers.
>
> Now, the ODF file is a binary blob... but it's really just a zipped
> collection of a bunch of files, many of which are XML content and which
> *should* be able to benefit from text-based version control.

Actually, let me take a bit of a contrary position there...

It is not evident that XML content benefits from text-based version
control, because text-based diffs are not something that an XML-based
tool can make any use of.

Indeed, I just did a small test, taking a document I had lying around, and then:

a) Saving it in ODT form (to provide a baseline)
b) Modifying it further
c) Saving the modified version as a new ODT document.

I then extracted the two zip files and did comparisons.

chris@dba2:/tmp/v1> find -type f | xargs wc
     1   3018 137156 ./styles.xml
     0      1     39 ./mimetype
    21     64   1988 ./META-INF/manifest.xml
     1    162   7997 ./settings.xml
     0     10    121 ./layout-cache
     1     18    841 ./meta.xml
    15    106   5703 ./Thumbnails/thumbnail.png
     0      0      0 ./Configurations2/accelerator/current.xml
     1   4249  78149 ./content.xml
    40   7628 231994 total
chris@dba2:/tmp/v1> cd ../v2
chris@dba2:/tmp/v2> find -type f | xargs wc
     1   3018 137156 ./styles.xml
     0      1     39 ./mimetype
    21     64   1988 ./META-INF/manifest.xml
     1    162   8000 ./settings.xml
     0     10    125 ./layout-cache
     1     19    920 ./meta.xml
    12    105   5716 ./Thumbnails/thumbnail.png
     0      0      0 ./Configurations2/accelerator/current.xml
     1   4278  78742 ./content.xml
    37   7657 232686 total

Note that the relevant file is "content.xml", which is a fairly large XML file.

Using wc tells us that content.xml has just 1 line.

Thus, from a textual point of view, any change at all means that a
line-based tool will report the file as having "completely changed,"
because its one and only line now differs.
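
To illustrate the effect, here is a small sketch using Python's standard
difflib module.  The XML strings are made-up toy data, not taken from my
document, and changed_line_fraction is a helper name invented here.  The
same one-word edit is applied to a single-line serialization and to a
one-element-per-line serialization:

```python
import difflib


def changed_line_fraction(old_text, new_text):
    """Fraction of the old lines that a line-based diff marks as removed."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    removed = sum(
        1
        for line in difflib.ndiff(old_lines, new_lines)
        if line.startswith("- ")
    )
    return removed / len(old_lines)


# Same logical edit ("second" -> "2nd") in two serializations.
one_line_old = "<doc><p>first</p><p>second</p><p>third</p></doc>"
one_line_new = "<doc><p>first</p><p>2nd</p><p>third</p></doc>"
multi_old = "<doc>\n<p>first</p>\n<p>second</p>\n<p>third</p>\n</doc>"
multi_new = "<doc>\n<p>first</p>\n<p>2nd</p>\n<p>third</p>\n</doc>"

print(changed_line_fraction(one_line_old, one_line_new))  # 1.0
print(changed_line_fraction(multi_old, multi_new))        # 0.2
```

With everything on one line, the whole file counts as replaced no matter
how small the edit was.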

> So _my_ refinement of the original question is: Are any of the existing
> version control systems out there able to work with binary files that
> unzip into a bunch of files that can be stored in version control as text?
>
> (there are other file formats that could benefit from such a facility)

The above discovery suggests that this sort of analysis requires a
much deeper and more "penetrating" sort of toolset; I would contend
that textual differences between XML files are no more useful than
binary differences.

There does exist a Python tool for figuring out differences between
similar XML files...   <http://www.logilab.org/859>

When I run xmldiff against the two versions of content.xml, I do
happen to get a smaller difference than the one given by Unix diff
:-).

Unfortunately, while it's small, that doesn't necessarily mean that the
result is useful.  It's not something that OpenOffice.org knows how to
interpret, and a visual interpretation (from reading the "XML
primitives") is pretty opaque.

I made a couple of textual changes, which *were* easy to pick out.

But I made a couple of font changes (marked some words as <bold>), and
that was pretty much opaque.
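
A rough way to see why is to compare the two element trees directly.
The sketch below is NOT xmldiff's algorithm; it is a deliberately naive
walk that pairs children by position, and tree_diff is a name invented
here.  The toy markup (including <b>) is a stand-in, not real ODF:

```python
import xml.etree.ElementTree as ET


def tree_diff(old, new, path=""):
    """Yield (path, description) pairs for differences between elements.

    Children are paired by position only, so an inserted element shifts
    everything after it.  Real tools use much smarter matching.
    """
    here = f"{path}/{old.tag}"
    if old.tag != new.tag:
        yield here, f"tag {old.tag!r} -> {new.tag!r}"
        return
    if old.attrib != new.attrib:
        yield here, f"attributes {old.attrib} -> {new.attrib}"
    if (old.text or "") != (new.text or ""):
        yield here, f"text {old.text!r} -> {new.text!r}"
    if (old.tail or "") != (new.tail or ""):
        yield here, f"tail {old.tail!r} -> {new.tail!r}"
    old_kids, new_kids = list(old), list(new)
    if len(old_kids) != len(new_kids):
        yield here, f"child count {len(old_kids)} -> {len(new_kids)}"
    for index, (a, b) in enumerate(zip(old_kids, new_kids)):
        yield from tree_diff(a, b, f"{here}[{index}]")


old_doc = ET.fromstring("<doc><p>plain words</p><p>second</p></doc>")
new_doc = ET.fromstring("<doc><p>plain <b>words</b></p><p>2nd</p></doc>")

for where, what in tree_diff(old_doc, new_doc):
    print(where, what)
```

The plain wording change shows up as one readable "text" entry, whereas
the formatting change shows up as a truncated text node plus a new child
element, which says little about intent unless you know the format.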

At this point, I'd tend to think that if I were going to apply an SCM
to this, I'd want to store the versions of the document in binary
form, and any tools that are to do analysis on it would need to be
smart enough to know that they need to extract data out of the .zip
files.

Only a Rather Intelligent examination of the files offers any way to
save space, and it needs to be *quite* cognizant of the format,
definitely to such a degree that an "ODT repository" could be expected
to be pretty useless for any other sort of document, and vice-versa.
-- 
http://linuxfinances.info/info/linuxdistributions.html
W. C. Fields  - "If I had to live my life over, I'd live over a
saloon." - http://www.brainyquote.com/quotes/authors/w/w_c_fields.html
--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists




