Re: [ci-announces] [CI] July 28, 2017 Incident Report

David T. Lewis
Kudos for providing a good incident report and explanation of the cause of the problem.

I know little about this actual incident, but that is not important. It is really good
to see constructive responses to problems like this, with a focus on improvement and
on preventing problems in the future.

Dave

On Thu, Sep 21, 2017 at 03:35:19PM +0200, Christophe Demarey wrote:

> July 28, 2017 Incident Report
>
> This summer, we experienced a serious incident on the Continuous Integration service infrastructure. Today we're providing an incident report that details the nature of the incident and our response.
> We understand that this service issue affected all Inria developers using the CI service and cost them valuable time, and we apologize to everyone who was affected.
>
> Issue summary
> Due to an operator error, many virtual machines (VMs) hosted on CloudStack were destroyed and could not be recovered.
> The Jenkins servers were not affected by this issue, so all Jenkins jobs and history are safe.
>
> Root cause
> After the successful migration of the Jenkins servers and the security fixes (July 18-19, 2017), a few projects were not able to reach their slaves hosted on the CI build farm. This was caused by a synchronization problem between the CI database and CloudStack (which powers the CI build farm), which has its own way of managing projects and users (through domains).
> An attempt to reproduce and debug this problem on the qualification infrastructure failed, so we added some logging on the production infrastructure. To limit the risk to the production infrastructure, we restricted the synchronization to one project. That was the mistake.
> Synchronizing a single project led to the deletion of all other CloudStack domains (i.e. projects). The synchronization code expects to receive the full environment (all CI projects), and if it finds a domain that is not bound to a CI project, it deletes it.
> The synchronization process was aborted before it completed, but it was too late. Some user virtual machines stayed alive for a few hours but were eventually purged by CloudStack.
> This means we lost most of the VMs and templates hosted on the CI build farm.
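
The failure mode described above can be sketched in a few lines. This is only an illustration under assumptions: the function names, the `full_sync` flag and the project names are hypothetical and are not taken from the CI team's code or from the CloudStack API. It shows why passing a single project to a "delete whatever is not bound" reconciliation marks every other domain for deletion, and one possible guard.

```python
from typing import Iterable, Set


def domains_to_delete(ci_projects: Iterable[str], cloud_domains: Iterable[str]) -> Set[str]:
    """Naive reconciliation: any domain without a matching CI project is an orphan."""
    return set(cloud_domains) - set(ci_projects)


def guarded_domains_to_delete(
    ci_projects: Iterable[str], cloud_domains: Iterable[str], full_sync: bool
) -> Set[str]:
    """Same rule, but never propose deletions when only part of the project list was given."""
    if not full_sync:
        # A partial project list cannot prove that a domain is orphaned.
        return set()
    return set(cloud_domains) - set(ci_projects)


# Hypothetical data: three domains exist, the sync is limited to one project.
cloud = ["alpha", "beta", "gamma"]
partial = ["alpha"]

print(sorted(domains_to_delete(partial, cloud)))
print(sorted(guarded_domains_to_delete(partial, cloud, full_sync=False)))
```

With the naive rule the two untouched projects end up in the deletion set; with the guard a limited run proposes nothing.
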
>
> Resolution and recovery
> It was impossible to recover the destroyed virtual machines. CloudStack is configured to keep VM data for 24 hours before actually destroying it, but this delay does not apply when the domain hosting the VM is itself destroyed.
> The primary storage hosting running VMs is high-performance and very expensive. That's why the CI team chose (when the CI service was set up) not to back up VMs but rather to rely on both the expunge delay and the snapshot / template mechanism to save VM state. Neither mechanism helped in this incident.
> Templates and snapshots are hosted on the secondary storage, which is replicated across two different buildings to ensure data reliability and recovery. The incident led CloudStack to perform a "clean" deletion of all the domain data, including templates. That's why they also became unavailable.
>
> We were able to rebuild all domains from the CI database, but CI service users had to create new VMs to replace the destroyed ones.
>
> Corrective and preventative measures
> All members of the CI team (DSI, SED) have worked, and are still working, together to find the best way to mitigate the incident and prevent similar situations in the future.
> The synchronization code responsible for deleting CloudStack domains has been deactivated. CloudStack domain deletion is a critical action and will no longer be automated. Deletions will be reviewed and approved by the CI team before being carried out.
> This incident showed us that the backup mechanisms in place are not strong enough, and we are now evaluating the cost of backing up, with history, either:
> - all VMs, snapshots and templates, or
> - all snapshots and templates.
> We are also working on providing a way to download templates created on CloudStack so that you can easily get a copy of them. We encourage you to create templates for virtual machines that are time-consuming to set up from scratch.
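
As a rough sketch of what "reviewed and approved before being carried out" could look like in code: automated jobs may only request a deletion, and nothing is released until a named reviewer approves it. The class and method names below are hypothetical. They do not describe the CI team's tooling, and no real deletion is performed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DeletionQueue:
    """Holds requested domain deletions until a human reviewer approves them."""

    pending: Dict[str, str] = field(default_factory=dict)          # domain -> reason
    approved: List[Tuple[str, str]] = field(default_factory=list)  # (domain, reviewer)

    def request(self, domain: str, reason: str) -> None:
        # Automated code may only record a request; it never deletes anything.
        self.pending[domain] = reason

    def approve(self, domain: str, reviewer: str) -> str:
        # Only a previously requested deletion can be approved.
        if domain not in self.pending:
            raise KeyError(f"no pending deletion request for {domain!r}")
        del self.pending[domain]
        self.approved.append((domain, reviewer))
        return domain  # the caller performs the actual deletion manually


queue = DeletionQueue()
queue.request("beta", "no matching CI project")
print(queue.pending)
```

The point of the design is that the destructive step sits behind an explicit human action, so a faulty synchronization run can at worst fill the queue.
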
>
>
> Sincerely,
> The CI Team

Stephane Ducasse-3
+1

