trunk process resilience

trunk process resilience

Chris Muller-4
Lately we've had some problems with the SqueakSource server that supports our vital trunk process.  Ken and I burned several hours on it this week.  The experience has caused me to consider an idea for improved continuity of our trunk repository.

Very simply, it's a second running copy of trunk (and inbox, et al).  Each instance keeps itself up to date from the other.  If one goes down, the other can be pointed to for updates AND commits to minimize disruption.

Right now, we actually already have two trunks.  I'm pleased to announce that new-trunk, running on box4.squeak.org, is now a *full copy* of old-trunk on box2.  (Before, it held only trunk; now it includes Inbox, Etoys, etc.)  Using newer and better code and VM, plus Magma, this copy of trunk was originally brought up simply to provide MC method history directly in the IDE.  Now I can see its role being to improve trunk process stability, so that community development can continue uninterrupted until it eventually becomes the de facto trunk (e.g., running source.squeak.org).

There are other side benefits too, like the ability to move or upgrade the trunk without a service interruption.  We would always be ready to move to a different server at a moment's notice, e.g., to break the link with Hetzner.

So, I guess I'm proposing that we have some elements in the image "aware" of a second trunk.  But before wrangling out exactly what form that awareness would take, what do you think so far?
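
One possible form that "awareness" could take, sketched in Python purely for illustration: the image holds an ordered list of trunk URLs and uses the first one that responds. The post deliberately leaves the design open, so everything here is an assumption. The function name, the URL paths, and the reachability check are all hypothetical, and the check is simulated rather than doing real network I/O.

```python
# Hypothetical sketch: pick the first trunk that answers.
# The URL paths below are illustrative, not confirmed endpoints.
TRUNKS = [
    "http://source.squeak.org/trunk",   # old-trunk (primary)
    "http://box4.squeak.org/trunk",     # new-trunk (secondary), path assumed
]

def first_available(repositories, is_reachable):
    """Return the first URL for which is_reachable(url) is true, else None."""
    for url in repositories:
        if is_reachable(url):
            return url
    return None

# Simulate an outage of the primary instead of touching the network.
down = {TRUNKS[0]}
chosen = first_available(TRUNKS, lambda url: url not in down)

# With every server down there is nothing to fall back to.
nothing = first_available(TRUNKS, lambda url: False)
```

A real version would need a timeout-based reachability probe and a policy for switching back once the primary recovers; neither is addressed here.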



Re: trunk process resilience

Frank Shearar-3
On 7 November 2013 21:07, Chris Muller <[hidden email]> wrote:

> Lately we've had some problems with the SqueakSource server that supports
> our vital trunk process.  Ken and I burned several hours on it this week.
> The experience has caused me to consider an idea for improved continuity of
> our trunk repository.
>
> Very simply, it's a second running copy of trunk (and inbox, et al).  Each
> instance keeps itself up to date from the other.  If one goes down, the
> other can be pointed to for updates AND commits to minimize disruption.
>
> Right now, we actually already have two trunks.  Now, I'm pleased to
> announce that new-trunk running on box4.squeak.org is now a *full-copy* of
> old-trunk on box2.  (Before it was only trunk, now it includes Inbox, Etoys,
> etc.).  Using newer and better code and VM and also Magma, this copy of
> trunk was originally brought up simply to provide MC method history directly
> into the IDE, but now I can see its role being to improve trunk process
> stability so that community development can be continuous until it
> eventually becomes the de facto trunk (e.g., running source.squeak.org).
>
> There are other side-benefits too, like the ability to move or upgrade the
> trunk without a service interruption.  We are assured to be ready to move to
> a different server on a moment's notice, e.g., break the link with Hetzner.
>
> So, I guess I'm proposing that we have some elements in the image "aware" of
> a second trunk.  But before wrangling out exactly what form that awareness
> would take, what do you think so far?

I think before any person pitches in with any suggestion, that person
should go read up on handling state in a distributed system. (Because
having a second copy in a kind of active-active replication thing is
exactly a distributed system.) It is _not_easy_. (And "Never go to sea
with two chronometers; take one or three".) Here's a good starting
point: http://aphyr.com/posts/281-call-me-maybe-carly-rae-jepsen-and-the-perils-of-network-partitions

frank

Re: trunk process resilience

Chris Muller-3
It's nice to read someone thinking about the issues faced by networked
programs.  In fact, the little experiment he does (under the heading
"A simple distributed system") is exactly what one of Magma's HA
test-cases performs -- multiple clients perform rapid-fire commits as
fast as they can, counting upward, while the servers in the HA cluster
undergo various role changes due to HA events like arbitrarily killing
one of the servers with quitPrimitive.  It's quite a piece [1].

Thankfully, none of that applies to what's being proposed here.  The
operations needed to achieve a mutual backup are idempotent -- simply
a package copy from remote to local using the existing
MCRepository>>#copyAllFrom:.  It uses the existing error handling too,
so what could go wrong?
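
The idempotence claim can be illustrated with a toy model in Python. This is not Monticello code and says nothing about how MCRepository>>#copyAllFrom: is implemented. It assumes, for illustration only, that a repository behaves like a set of versions keyed by UUID and that a copy adds only the versions the target lacks. The package file names are made up.

```python
# Toy model: a repository is a dict mapping version UUID -> file name.
# Assumption (not Monticello's API): copying skips versions already present.

def copy_all_from(local, remote):
    """Add to local every version it lacks; return how many were copied."""
    missing = {uuid: name for uuid, name in remote.items() if uuid not in local}
    local.update(missing)
    return len(missing)

old_trunk = {"uuid-1": "Kernel-ab.1.mcz", "uuid-2": "Kernel-cd.2.mcz"}
new_trunk = {"uuid-1": "Kernel-ab.1.mcz", "uuid-3": "Morphic-ef.7.mcz"}

first = copy_all_from(new_trunk, old_trunk)    # brings over uuid-2
second = copy_all_from(new_trunk, old_trunk)   # repeat run copies nothing
back = copy_all_from(old_trunk, new_trunk)     # other direction: uuid-3
in_sync = old_trunk == new_trunk
```

Under that assumption, repeating a copy is harmless, and one pass in each direction leaves both repositories with the same set of versions.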

Under normal usage, the same person would not commit two versions of a
package with the exact same name but different UUIDs, one to each
repository.  But, even if they did, it's no different than when that
happens today between projects which, themselves, are simply different
repositories.
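
The edge case described above (same version name, different UUID, one in each repository) is easy to detect in a toy model. The Python sketch below is an illustration under assumptions: each repository is modeled as a dict from version file name to UUID, and the names and UUIDs are invented. It is not part of Monticello or SqueakSource.

```python
# Toy model: each repository maps a version's file name to its UUID.
# All names and UUIDs here are hypothetical.

def name_conflicts(repo_a, repo_b):
    """Return the sorted names that both repositories hold under different UUIDs."""
    return sorted(
        name for name, uuid in repo_a.items()
        if name in repo_b and repo_b[name] != uuid
    )

box2 = {"Kernel-xyz.800.mcz": "uuid-a", "Morphic-xyz.12.mcz": "uuid-m"}
box4 = {"Kernel-xyz.800.mcz": "uuid-b", "Morphic-xyz.12.mcz": "uuid-m"}

conflicts = name_conflicts(box2, box4)
```

A sync job could log such names for a human to look at instead of silently overwriting either side.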

I've seen how fragile and unsustainable our source.squeak.org server
is.  I want to inform the community of what I've done and solicit
pragmatic discussion on how we can get more out of it.

Thanks.

[1] -- http://wiki.squeak.org/squeak/6101

Re: trunk process resilience

David T. Lewis
In reply to this post by Chris Muller-4
On Thu, Nov 07, 2013 at 03:07:59PM -0600, Chris Muller wrote:

> Lately we've had some problems with the SqueakSource server that supports
> our vital trunk process.  Ken and I burned several hours on it this week.
>  The experience has caused me to consider an idea for improved continuity
> of our trunk repository.
>
> Very simply, it's a second running copy of trunk (and inbox, et al).  Each
> instance keeps itself up to date from the other.  If one goes down, the
> other can be pointed to for updates AND commits to minimize disruption.
>
> Right now, we actually already have two trunks.  Now, I'm pleased to
> announce that new-trunk running on box4.squeak.org is now a *full-copy* of
> old-trunk on box2.  (Before it was only trunk, now it includes Inbox,
> Etoys, etc.).  Using newer and better code and VM and also Magma, this copy
> of trunk was originally brought up simply to provide MC method history
> directly into the IDE, but now I can see its role being to improve trunk
> process stability so that community development can be continuous until it
> eventually becomes the de facto trunk (e.g., running source.squeak.org).
>
> There are other side-benefits too, like the ability to move or upgrade the
> trunk without a service interruption.  We are assured to be ready to move
> to a different server on a moment's notice, e.g., break the link with
> Hetzner.
>

I like the idea of building some resilience into the SqueakSource servers.
I also like the idea of using Magma to support this, because I know that
Magma has been used to address similar issues on much larger scale systems.

I do have some concerns of a non-technical nature:

1) From an operational point of view, we need to keep our systems as simple
as possible. There are very few people supporting the servers, and their
availability comes and goes over time, so we need to keep things simple
enough that any box-admins person can always figure out how to get things
running even if the expert is not available.

2) We need to be careful not to add more failure modes than we remove. This
is a painfully common mistake, in which people add high availability features
to an existing system with the result that new failure modes are introduced
that turn out to be worse than the failure modes that they were attempting
to mitigate.

As an example, I would point to the recent downtime on SmalltalkHub
(see the excellent recap provided by Philippe Marschall at
https://github.com/blog/1346-network-problems-last-friday). The system
had availability problems for an extended period of time, and the cause
was a (human error induced) failure in some redundant networking gear.
The high availability networking introduced additional failure modes, and
the combination of human error and system complexity reduced the resilience
of the system as a whole.

This is meant only as a cautionary note. I really *do* like the idea of
building in some redundancy, and I think that the work you (Chris) have
done with box4.squeak.org might be a good way to do it.

>
> So, I guess I'm proposing that we have some elements in the image "aware"
> of a second trunk.  But before wrangling out exactly what form that
> awareness would take, what do you think so far?
>

We should keep any changes in the image to a minimum, but the general idea
sounds good to me.

Dave


Re: trunk process resilience

Chris Muller-3
Thanks for the great discussion, Dave.

> I like the idea of building some resilience into the SqueakSource servers.
> I also like the idea of using Magma to support this, because I know that
> Magma has been used to address similar issues on much larger scale systems.

We don't need to use Magma at all to accomplish the redundancy I'm proposing.

The work I did to use a Magma backend for SqueakSource on box4 is
solely for reliable persistence and to support the history function in
the IDE.  Nothing more.  Its HA function is not being used at all; in
fact, Magma is being used by the webserver in "local" (direct-connect,
single-user) mode.

> I do have some concerns of a non-technical nature:
>
> 1) From an operational point of view, we need to keep our systems as simple
> as possible. There are very few people supporting the servers, and their
> availability comes and goes over time, so we need to keep things simple
> enough that any box-admins person can always figure out how to get things
> running even if the expert is not available.

Agreed.

> 2) We need to be careful not to add more failure modes than we remove. This
> is a painfully common mistake, in which people add high availability features
> to an existing system with the result that new failure modes are introduced
> that turn out to be worse than the failure modes that they were attempting
> to mitigate.

Agreed.

> As an example, I would point to the recent downtime on SmalltalkHub
> (see the excellent recap provided by Philippe Marschall at
> https://github.com/blog/1346-network-problems-last-friday). The system
> had availability problems for an extended period of time, and the cause
> was a (human error induced) failure in some redundant networking gear.
> The high availability networking introduced additional failure modes, and
> the combination of human error and system complexity reduced the resilience
> of the system as a whole.

You said SmalltalkHub but the link was about GitHub (an interesting
story, nonetheless).  Is Philippe Marschall working at GitHub?

> This is meant only as a cautionary note. I really *do* like the idea of
> building in some redundancy, and I think that the work you (Chris) have
> done with box4.squeak.org might be a good way to do it.
>
>>
>> So, I guess I'm proposing that we have some elements in the image "aware"
>> of a second trunk.  But before wrangling out exactly what form that
>> awareness would take, what do you think so far?
>>
>
> We should keep any changes in the image to a minimum, but the general idea
> sounds good to me.

I'll submit a proposal to the Inbox which will clarify exactly what
I'm proposing.

Thanks.