Consensus on using Amazon EC2 for high volume sites

Christopher Browne cbbrowne-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org
Thu Jun 9 15:18:57 UTC 2011


On Thu, Jun 9, 2011 at 3:01 PM, D. Hugh Redelmeier <hugh-pmF8o41NoarQT0dZR+AlfA at public.gmane.org> wrote:
> | From: Scott Elcomb <psema4-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org>
>
> | On Thu, Jun 9, 2011 at 9:30 AM, Myles Braithwaite
> | <me-qIX3qoPyADtH8hdXm2+x1laTQe2KTcn/@public.gmane.org> wrote:
> | > Also don't be like this guy[3] who put critical patients' ECG machines on EC2.
> | >
> | > [3]: <https://forums.aws.amazon.com/thread.jspa?threadID=65649&tstart=0>
> |
> | Score one for cloud stupidity. I'd like to say that I can't believe a
> | business doing internet-based cardiac monitoring would be so careless
> | & negligent. I wonder what their insurance package looks like.
>
> In my mind, there are so many moving parts in today's systems that it
> is hard to imagine them being reliable.  Or even understanding all the
> risks.
>
> Single points of failure probably litter this guy's system.  It's easy
> to point at the Amazon system as being one.  Especially after the
> fact.  But that might be well down on the list of risks based on
> probability.
>
> Number one: last-mile internet failure.
> Number two: power failure.
> Number zero: some bug in the code.
>
> It is amazing how much health care isn't even up to "best effort"
> standards.

The Reddit outage that was induced by AWS/EC2 problems showed a
somewhat more nuanced aspect of this: the Amazon folks "tried harder"
than was apparent, and the Reddit folks discovered that this made
things worse.

Basically, Amazon restored some filesystems to the "latest copy
available", which, as it happens, was *worse* than dropping the
filesystems and letting Reddit do the recovery themselves.

Reddit was using the Londiste replication system (developed by folks
around Skype), with the expectation that if the "master" failed, they
could readily pass the baton over to a replica.

Unfortunately, the "master" failed in an unexpected way: it "time
travelled" backwards to a previous version of the data.  They would
have been happy enough dropping the master and failing over to the
replica, but instead they had a generally-functioning master node
that was far behind the replica.  They had to stop the whole system
HARD, for a while, to figure out what to do next, with the debate
centring on whether or not to try to keep some of the "new" work done
on the master.

The Reddit folks were prepared for an *utter* failure of the master,
but this odder, partial degradation, where Amazon made a "best effort"
at recovery, turned out to be worse than that.

The Slony team observed with interest; we could see that Reddit would
have had much the same problem had they been using Slony rather than
Londiste.

Understanding failures is a tough thing, particularly if the systems
attempt to recover automatically.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"
--
The Toronto Linux Users Group.      Meetings: http://gtalug.org/
TLUG requests: Linux topics, No HTML, wrap text below 80 columns
How to UNSUBSCRIBE: http://gtalug.org/wiki/Mailing_lists