Royal Pain

Tue Jun 15 03:12:40 UTC 2004

On Mon, 14 Jun 2004, Robert Brockway wrote:
> > This happened on the Martian surface, actually, not en route.  And it was
> > rather more subtle than a full filesystem.
> 
> Do you have a source for this?  I've just reviewed several articles on the
> problem and it sure reads like a full flash based filesystem to me.

An account published in sci.space.policy early this year:

----------
The long and short of the Spirit debacle: Test what you intend to
do, and Do what you tested for. In the lab, they tested for 9 days
at a time, then cleaned things up, and ran again.

On Mars, they ran for over 18 days without doing any housecleaning.
You can get away with that on earth, you just have to vacuum twice...

As has been noted, Spirit has relatively little RAM for applications
and the OS - very little margin of error. The FLASH directory structure
and file structure required enough RAM for the dosFs cache that one last
malloc() was all it took to go over the top (malloc() puts the fat in the
FAT file system). The last malloc() failed, the task is suspended, the
dosvd semaphore never released, a high-prio task waiting on an "open"
causes a prio-inversion, another high priority task needs to write to a
health monitor board (the RTI clock) but is blocked from running in time.

The RTI clock resets the system.

The system then re-initializes, rebuilds the dosFs cache... and the whole
thing happens all over again.
----------

(Henry again)

Note that the problem had nothing to do with the filesystem itself; the
problem was running out of program memory due to an excessively-complex
filesystem requiring a large in-memory cache.  The fix, after getting
control of the spacecraft back, was to delete a bunch of small files to
get the size of the cache down, and then (with stable operation now
possible) to clean house more carefully.
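
A toy model of that reset loop, as a minimal sketch only.  The class name,
the byte counts, and the per-file cache cost are all made up for
illustration; none of them come from the flight software.  The model also
collapses the whole malloc-failure / priority-inversion / watchdog chain
into a single "does the cache fit in RAM" check at boot.

```python
from dataclasses import dataclass


@dataclass
class ToyRover:
    """Hypothetical stand-in for the rover; all sizes are invented."""
    ram_bytes: int             # RAM available for the filesystem cache
    cache_bytes_per_file: int  # in-memory cache cost of one file entry
    file_count: int            # files currently stored in flash

    def cache_size(self) -> int:
        # The cache grows with the number of files, not with their size.
        return self.file_count * self.cache_bytes_per_file

    def boot(self) -> bool:
        """Rebuild the cache from flash; True if it fits in RAM."""
        return self.cache_size() <= self.ram_bytes


def count_resets(rover: ToyRover, max_boots: int = 5) -> int:
    """Count watchdog resets before a boot succeeds (or we give up)."""
    resets = 0
    for _ in range(max_boots):
        if rover.boot():
            break
        # Cache did not fit: the watchdog resets the system, and the
        # next boot rebuilds exactly the same cache from the same files.
        resets += 1
    return resets


rover = ToyRover(ram_bytes=1000, cache_bytes_per_file=10, file_count=101)
stuck = count_resets(rover)   # every boot fails the same way

rover.file_count = 60         # delete a bunch of small files
recovered = count_resets(rover)
```

Note that flash capacity does not appear anywhere in the model: the loop
is driven entirely by the number of files and the RAM budget, and only
reducing the file count breaks it.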

Yes, the Sunday Supplement version of this is "it got full", but it wasn't
the filesystem that was full.

The NASA press releases of the time are consistent with this, if read
carefully, although less detailed. 

> You're right about it occurring on the surface though.  What I was thinking
> about was a remark by one of the scientists that part of the problem had
> been files collected during flight.  I didn't realise the lander saw
> those, but there you go.

Yep.  This mission didn't have separate orbiter and lander; *all* the
brains were in the lander.  So yes, when files from the cruise phase were
left around, they were left around in the lander.

> > The issue with that, and some other past spacecraft problems, is that the
> > environment is complex and it's a judgement call as to where testing
> 
> There is no doubt it is complex, but I keep considering how the engineers
> of the 60s and 70s did so much more with so much less.  Pioneer 10 never
> filled a filesystem :)

Pioneer 10 didn't have an onboard computer at all.  And it suffered from
the lack, but at the time, the available space-rated computers were not
only far too heavy, but far too costly, Pioneer 10 being quite a low-cost
mission.  (Quoth the Pioneer 10 project manager:  "Of course Voyager 1 got
better pictures than we did -- their camera cost more than our whole
spacecraft.")

> Part of the problem, IMHO, are the serious funding cuts NASA has suffered
> since the 80s...

Perhaps, and perhaps not.  Note the above re Pioneer 10.  And NASA has had
embarrassing failures in the cost-is-no-object missions too.  Mars Observer
was rather bigger and much more expensive than any of the current Mars
missions, and the reason you probably don't even remember the name is that
it was a complete failure, lost before arrival.

> There have been quite a few in the last decade - and not just those
> launched by the Americans.

The easiest way not to have embarrassing failures is not to launch
anything.  The US did that pretty successfully in the *previous* decade.
It's not really surprising that there are more failures in more recent
times -- there are a lot more probes flying! 

NASA, since its formation, has fairly consistently had about a 25% mission
failure rate overall.  People who claim things have suddenly gotten worse
in recent years are "remembering" a mythical golden age, forgetting all
the things that went wrong back then. 

                                                          Henry Spencer
                                                       henry-lqW1N6Cllo0sV2N9l4h3zg at public.gmane.org

--
The Toronto Linux Users Group.      Meetings: http://tlug.ss.org
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://tlug.ss.org/subscribe.shtml