Lately we've had some problems with the SqueakSource server that supports our vital trunk process. Ken and I burned several hours on it this week. The experience has caused me to consider an idea for improved continuity of our trunk repository.
Very simply, it's a second running copy of trunk (and inbox, et al). Each instance keeps itself up to date from the other. If one goes down, the other can be pointed to for updates AND commits to minimize disruption.
Right now, we actually already have two trunks. I'm pleased to announce that new-trunk, running on box4.squeak.org, is now a *full copy* of old-trunk on box2. (Before, it was only trunk; now it includes Inbox, Etoys, etc.) Using newer and better code and VM, and also Magma, this copy of trunk was originally brought up simply to provide MC method history directly in the IDE. Now I can see its role being to improve trunk process stability, so that community development can continue uninterrupted until it eventually becomes the de facto trunk (e.g., running source.squeak.org).
There are other side benefits too, like the ability to move or upgrade the trunk without a service interruption. We would always be ready to move to a different server at a moment's notice, e.g., to break the link with Hetzner.
So, I guess I'm proposing that we have some elements in the image "aware" of a second trunk. But before wrangling out exactly what form that awareness would take, what do you think so far?
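For concreteness only, and not as a decision about what form that awareness should take: one minimal shape is an ordered list of trunk URLs that the client tries in turn. The sketch below is Python rather than Squeak code, and every name in it (`first_available`, `RepositoryUnavailable`, the `example` host names) is hypothetical.

```python
from typing import Callable, Sequence


class RepositoryUnavailable(Exception):
    """Raised by a repository operation when its server cannot be reached."""


def first_available(urls: Sequence[str], operation: Callable[[str], str]) -> str:
    """Run `operation` against each URL in order; return the first success."""
    last_error = None
    for url in urls:
        try:
            return operation(url)
        except RepositoryUnavailable as error:
            last_error = error  # remember why, then try the next trunk
    raise RepositoryUnavailable(f"no repository reachable: {last_error}")


def fake_update(url: str) -> str:
    """Stand-in for an update or commit; pretends that box2 is down."""
    if "box2" in url:
        raise RepositoryUnavailable(url)
    return f"updated from {url}"


# Hypothetical placeholder hosts, not the real server addresses.
TRUNKS = ["http://box2.example/trunk", "http://box4.example/trunk"]
result = first_available(TRUNKS, fake_update)
print(result)
```

The same wrapper would serve for commits as well as updates, which is the "updates AND commits" case above; it says nothing about how the two servers later reconcile with each other.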
On 7 November 2013 21:07, Chris Muller <[hidden email]> wrote:
> Lately we've had some problems with the SqueakSource server that supports
> our vital trunk process. Ken and I burned several hours on it this week.
> The experience has caused me to consider an idea for improved continuity of
> our trunk repository.
>
> Very simply, it's a second running copy of trunk (and inbox, et al). Each
> instance keeps itself up to date from the other. If one goes down, the
> other can be pointed to for updates AND commits to minimize disruption.
>
> Right now, we actually already have two trunks. Now, I'm pleased to
> announce that new-trunk running on box4.squeak.org is now a *full-copy* of
> old-trunk on box2. (Before it was only trunk, now it includes Inbox, Etoys,
> etc.). Using newer and better code and VM and also Magma, this copy of
> trunk was originally brought up simply to provide MC method history directly
> into the IDE, but now I can see its role being to improve trunk process
> stability so that community development can be continuous until it
> eventually becomes the de facto trunk (e.g., running source.squeak.org).
>
> There are other side-benefits too, like the ability to move or upgrade the
> trunk without a service interruption. We are assured to be ready to move to
> a different server on a moment's notice, e.g., break the link with Hetzner.
>
> So, I guess I'm proposing that we have some elements in the image "aware" of
> a second trunk. But before wrangling out exactly what form that awareness
> would take, what do you think so far?

I think before any person pitches in with any suggestion, that person should go read up on handling state in a distributed system. (Because having a second copy in a kind of active-active replication thing is exactly a distributed system.) It is _not_easy_. (And "Never go to sea with two chronometers; take one or three".) Here's a good starting point: http://aphyr.com/posts/281-call-me-maybe-carly-rae-jepsen-and-the-perils-of-network-partitions

frank
It's nice to read someone thinking about the issues faced by networked programs. In fact, the little experiment he does (under the heading "A simple distributed system") is exactly what one of Magma's HA test cases performs: multiple clients perform rapid-fire commits as fast as they can, counting upward, while the servers in the HA cluster undergo various role changes due to HA events like arbitrarily killing one of the servers with quitPrimitive. It's quite a piece [1].

Thankfully, none of that applies to what's being proposed here. The operations needed to achieve a mutual backup are idempotent: simply a package copy from remote to local using the existing MCRepository>>#copyAllFrom:. So it uses the existing error handling too. What could go wrong?

Under normal usage, the same person would not commit two different versions of a package (different UUIDs) under the exact same name, one to each repository. But even if they did, it's no different from when that happens today between projects, which are themselves simply different repositories.

I've seen how fragile and unsustainable our source.squeak.org server is. I want to inform the community of what I've done and solicit pragmatic discussion on how we can get more out of it.

Thanks.

[1] -- http://wiki.squeak.org/squeak/6101

On Thu, Nov 7, 2013 at 3:33 PM, Frank Shearar <[hidden email]> wrote:
> On 7 November 2013 21:07, Chris Muller <[hidden email]> wrote:
>> Lately we've had some problems with the SqueakSource server that supports
>> our vital trunk process. Ken and I burned several hours on it this week.
>> The experience has caused me to consider an idea for improved continuity of
>> our trunk repository.
>>
>> Very simply, it's a second running copy of trunk (and inbox, et al). Each
>> instance keeps itself up to date from the other. If one goes down, the
>> other can be pointed to for updates AND commits to minimize disruption.
>>
>> Right now, we actually already have two trunks. Now, I'm pleased to
>> announce that new-trunk running on box4.squeak.org is now a *full-copy* of
>> old-trunk on box2. (Before it was only trunk, now it includes Inbox, Etoys,
>> etc.). Using newer and better code and VM and also Magma, this copy of
>> trunk was originally brought up simply to provide MC method history directly
>> into the IDE, but now I can see its role being to improve trunk process
>> stability so that community development can be continuous until it
>> eventually becomes the de facto trunk (e.g., running source.squeak.org).
>>
>> There are other side-benefits too, like the ability to move or upgrade the
>> trunk without a service interruption. We are assured to be ready to move to
>> a different server on a moment's notice, e.g., break the link with Hetzner.
>>
>> So, I guess I'm proposing that we have some elements in the image "aware" of
>> a second trunk. But before wrangling out exactly what form that awareness
>> would take, what do you think so far?
>
> I think before any person pitches in with any suggestion, that person
> should go read up on handling state in a distributed system. (Because
> having a second copy in a kind of active-active replication thing is
> exactly a distributed system.) It is _not_easy_. (And "Never go to sea
> with two chronometers; take one or three".) Here's a good starting
> point: http://aphyr.com/posts/281-call-me-maybe-carly-rae-jepsen-and-the-perils-of-network-partitions
>
> frank
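The idempotence argument in the reply above can be illustrated with a toy model. This is a Python sketch, not Monticello code: `Repository` and `copy_all_from` are hypothetical stand-ins that only mimic the idea behind MCRepository>>#copyAllFrom:, and the package names and UUID strings are invented. It assumes a repository can be treated as a map from version name to UUID, that copying only ever adds missing versions, and that a same-name, different-UUID clash is reported rather than overwritten.

```python
from dataclasses import dataclass, field


@dataclass
class Repository:
    """Toy stand-in for a package repository: version name -> version UUID."""
    name: str
    versions: dict = field(default_factory=dict)


def copy_all_from(local: Repository, remote: Repository) -> list:
    """Copy versions missing from `local`; return the names that clash.

    A clash is the same version name stored with a different UUID on each
    side. Clashing versions are reported and left untouched.
    """
    conflicts = []
    for version_name, uuid in remote.versions.items():
        existing = local.versions.get(version_name)
        if existing is None:
            local.versions[version_name] = uuid  # new to us: copy it
        elif existing != uuid:
            conflicts.append(version_name)  # same name, different content
        # otherwise already present and identical: nothing to do
    return conflicts


# Invented package versions for illustration.
box2 = Repository("old-trunk", {"Kernel-abc.1": "uuid-a", "Kernel-abc.2": "uuid-b"})
box4 = Repository("new-trunk", {"Kernel-abc.1": "uuid-a", "Morphic-xyz.7": "uuid-c"})

# Each side pulls from the other; afterwards both hold the same versions.
copy_all_from(box4, box2)
copy_all_from(box2, box4)
assert box2.versions == box4.versions

# Running the copy again changes nothing: the operation is idempotent.
snapshot = dict(box4.versions)
assert copy_all_from(box4, box2) == []
assert box4.versions == snapshot
```

In this model the only case that does not converge is the name clash mentioned above, and it surfaces as a returned list instead of silently replacing either copy.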
On Thu, Nov 07, 2013 at 03:07:59PM -0600, Chris Muller wrote:
> Lately we've had some problems with the SqueakSource server that supports
> our vital trunk process. Ken and I burned several hours on it this week.
> The experience has caused me to consider an idea for improved continuity
> of our trunk repository.
>
> Very simply, it's a second running copy of trunk (and inbox, et al). Each
> instance keeps itself up to date from the other. If one goes down, the
> other can be pointed to for updates AND commits to minimize disruption.
>
> Right now, we actually already have two trunks. Now, I'm pleased to
> announce that new-trunk running on box4.squeak.org is now a *full-copy* of
> old-trunk on box2. (Before it was only trunk, now it includes Inbox,
> Etoys, etc.). Using newer and better code and VM and also Magma, this copy
> of trunk was originally brought up simply to provide MC method history
> directly into the IDE, but now I can see its role being to improve trunk
> process stability so that community development can be continuous until it
> eventually becomes the de facto trunk (e.g., running source.squeak.org).
>
> There are other side-benefits too, like the ability to move or upgrade the
> trunk without a service interruption. We are assured to be ready to move
> to a different server on a moment's notice, e.g., break the link with
> Hetzner.
>

I like the idea of building some resilience into the SqueakSource servers. I also like the idea of using Magma to support this, because I know that Magma has been used to address similar issues on much larger scale systems.

I do have some concerns of a non-technical nature:

1) From an operational point of view, we need to keep our systems as simple as possible. There are very few people supporting the servers, and their availability comes and goes over time, so we need to keep things simple enough that any box-admins person can always figure out how to get things running even if the expert is not available.

2) We need to be careful not to add more failure modes than we remove. This is a painfully common mistake, in which people add high availability features to an existing system with the result that new failure modes are introduced that turn out to be worse than the failure modes that they were attempting to mitigate.

As an example, I would point to the recent downtime on SmalltalkHub (see the excellent recap provided by Philippe Marschall at https://github.com/blog/1346-network-problems-last-friday). The system had availability problems for an extended period of time, and the cause was a (human error induced) failure in some redundant networking gear. The high availability networking introduced additional failure modes, and the combination of human error and system complexity reduced the resilience of the system as a whole.

This is meant only as a cautionary note. I really *do* like the idea of building in some redundancy, and I think that the work you (Chris) have done with box4.squeak.org might be a good way to do it.

>
> So, I guess I'm proposing that we have some elements in the image "aware"
> of a second trunk. But before wrangling out exactly what form that
> awareness would take, what do you think so far?
>

We should keep any changes in the image to a minimum, but the general idea sounds good to me.

Dave
Thanks for the great discussion, Dave.

> I like the idea of building some resilience into the SqueakSource servers.
> I also like the idea of using Magma to support this, because I know that
> Magma has been used to address similar issues on much larger scale systems.

We don't need to use Magma at all to accomplish the redundancy I'm proposing. The work I did to use a Magma backend for SqueakSource on box4 is solely for reliable persistence and to support the history function in the IDE. Nothing more. Its HA function is not being used at all; in fact, Magma is being used by the webserver in "local" (direct-connect, single-user) mode.

> I do have some concerns of a non-technical nature:
>
> 1) From an operational point of view, we need to keep our systems as simple
> as possible. There are very few people supporting the servers, and their
> availability comes and goes over time, so we need to keep things simple
> enough that any box-admins person can always figure out how to get things
> running even if the expert is not available.

Agreed.

> 2) We need to be careful not to add more failure modes than we remove. This
> is a painfully common mistake, in which people add high availability features
> to an existing system with the result that new failure modes are introduced
> that turn out to be worse than the failure modes that they were attempting
> to mitigate.

Agreed.

> As an example, I would point to the recent downtime on SmalltalkHub
> (see the excellent recap provided by Philippe Marschall at
> https://github.com/blog/1346-network-problems-last-friday). The system
> had availability problems for an extended period of time, and the cause
> was a (human error induced) failure in some redundant networking gear.
> The high availability networking introduced additional failure modes, and
> the combination of human error and system complexity reduced the resilience
> of the system as a whole.

You said SmalltalkHub, but the link was about GitHub (an interesting story, nonetheless). Is Philippe Marschall working at GitHub?

> This is meant only as a cautionary note. I really *do* like the idea of
> building in some redundancy, and I think that the work you (Chris) have
> done with box4.squeak.org might be a good way to do it.
>
>>
>> So, I guess I'm proposing that we have some elements in the image "aware"
>> of a second trunk. But before wrangling out exactly what form that
>> awareness would take, what do you think so far?
>>
>
> We should keep any changes in the image to a minimum, but the general idea
> sounds good to me.

I'll submit a proposal to the Inbox which will clarify exactly what I'm proposing.

Thanks.