[CI] Cause of the random failures in the CI


CyrilFerlicot
Hi,

For months now there have been a lot of random failures on the CI,
making it hard to work.

There are different kinds of failures:
- Network problems
- Failing tests
- Incomprehensible problems

I no longer see many failures due to the network. I suppose the Inria
infrastructure improved.

Failing tests were fixed over the past months and we see fewer and
fewer of them.

Now the big problem is the incomprehensible crashes such as "The
workspace was not found" or "FileDoesNotExistException" or "pharo-vm/ is
already present".

We just found the problem :)

During the validation of the Bootstrap, multiple tests are launched on
OSX/Windows/Linux in parallel. Each task runs on a different Jenkins
slave. However, we discovered that two slaves can share the same disk.
Usually this causes no trouble, since a job is run by only one slave.
But in this particular case, two slaves can be used by the same job and
interfere with each other's resources.

We exposed the problem by adding logs to the CI. Now, when we launch
tests, we create a file named after the task.

Today we got a crash, and in the log we see that the same workspace has
two of those files, proving that the tasks are executed on the same
disk, in the same folder:

[…]
-rw-rw-r-- 1 ci ci    0 Jun 19 16:01 Kernel-tests-unix-32
[…]
-rw-rw-r-- 1 ci ci    0 Jun 19 16:01 Tests-unix-32
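
For readers who want to reproduce this kind of check, here is a minimal
sketch of the marker-file idea in Python. It is not the actual CI script:
the function names and the ".task-marker" suffix are assumptions made for
illustration (the real markers are simply named after the task), and a
temporary directory stands in for the Jenkins workspace.

```python
import tempfile
from pathlib import Path

# Hypothetical suffix so markers can be told apart from other workspace files.
MARKER_SUFFIX = ".task-marker"


def mark_task(workspace: Path, task_name: str) -> Path:
    """Create an empty marker file named after the task in the workspace."""
    marker = workspace / f"{task_name}{MARKER_SUFFIX}"
    marker.touch()
    return marker


def tasks_sharing(workspace: Path) -> list[str]:
    """Return the sorted names of all tasks that left a marker here."""
    return sorted(
        p.name[: -len(MARKER_SUFFIX)]
        for p in workspace.glob(f"*{MARKER_SUFFIX}")
    )


if __name__ == "__main__":
    # A temporary directory stands in for the shared workspace.
    with tempfile.TemporaryDirectory() as tmp:
        ws = Path(tmp)
        mark_task(ws, "Kernel-tests-unix-32")
        mark_task(ws, "Tests-unix-32")
        found = tasks_sharing(ws)
        if len(found) > 1:
            print("workspace shared by:", ", ".join(found))
```

More than one marker in the same folder means that several tasks wrote
to the same place, which is what the listing above shows.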

As a solution, we will execute the tests inside a subfolder named after
the task, which should greatly reduce the number of problems.
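
The subfolder approach can be sketched roughly as follows. This is only
an illustration in Python, not the change made to the CI scripts: the
helper names are hypothetical, and a small Python one-liner stands in for
the real test command.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def task_directory(workspace: Path, task_name: str) -> Path:
    """Create (if needed) and return a per-task subfolder of the workspace."""
    task_dir = workspace / task_name
    task_dir.mkdir(parents=True, exist_ok=True)
    return task_dir


def run_in_task_directory(workspace: Path, task_name: str, command: list[str]) -> int:
    """Run the command with the per-task subfolder as its working directory."""
    result = subprocess.run(command, cwd=task_directory(workspace, task_name))
    return result.returncode


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for a test run: write a file relative to the current directory.
        cmd = [sys.executable, "-c", "open('result.xml', 'w').close()"]
        for task in ("Kernel-tests-unix-32", "Tests-unix-32"):
            run_in_task_directory(Path(tmp), task, cmd)
        # Each task's output lands in its own subfolder.
        print(sorted(str(p.relative_to(tmp)) for p in Path(tmp).rglob("result.xml")))
```

Because each task only writes below its own folder, two tasks that end
up on the same disk no longer overwrite each other's files.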

Have a nice day :)

--
Cyril Ferlicot
https://ferlicot.fr


Re: [CI] Cause of the random failures in the CI

Sven Van Caekenberghe-2
Well done. Great work.



Re: [CI] Cause of the random failures in the CI

alistairgrant
In reply to this post by CyrilFerlicot
Hi Cyril,

On Tue, 19 Jun 2018 at 16:55, Cyril Ferlicot D.
<[hidden email]> wrote:

> As a solution we will execute the tests inside a subfolder with the name
> of the task and it should reduce a lot the number of problems.

Great work!

It would be nice if all the temporary files were in a single folder.
Cleaning the environment then becomes a matter of deleting that one
folder.  (At the moment, there is a VM installed into the root
directory, a vmtarget directory (I'm guilty here) and
bootstrap-cache).
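
A rough sketch of the single-folder idea, in Python. The folder name
"ci-scratch" and the helper names are assumptions made for illustration
only; they are not part of the existing build scripts.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical name of the one folder that holds every temporary artifact.
SCRATCH_DIR_NAME = "ci-scratch"


def scratch_dir(workspace: Path, name: str) -> Path:
    """Create (if needed) and return a named folder below the scratch root."""
    path = workspace / SCRATCH_DIR_NAME / name
    path.mkdir(parents=True, exist_ok=True)
    return path


def clean_scratch(workspace: Path) -> None:
    """Remove every temporary artifact by deleting the scratch root."""
    shutil.rmtree(workspace / SCRATCH_DIR_NAME, ignore_errors=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ws = Path(tmp)
        # The three locations mentioned above, regrouped under one root.
        for name in ("vm", "vmtarget", "bootstrap-cache"):
            scratch_dir(ws, name)
        clean_scratch(ws)
        print("scratch removed:", not (ws / SCRATCH_DIR_NAME).exists())
```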

Slightly off-topic, but I'm also wondering why the test scripts
download the VM again?  Why not just use the one that is already in
vmtarget?


> Have a nice day :)

This definitely helps :-)

> --
> Cyril Ferlicot
> https://ferlicot.fr

Thanks,
Alistair


Re: [CI] Cause of the random failures in the CI

CyrilFerlicot
On 19/06/2018 17:02, Alistair Grant wrote:
> Hi Cyril,

> Great work!
>

Thank you. Thanks also to Guille, Vincent and Marcus, who helped find
the problem.

> It would be nice if all the temporary files were in a single folder.
> Cleaning the environment then becomes a matter of deleting that one
> folder.  (At the moment, there is a vm installed in to the root
> directory, a vmtarget directory (I'm guilty here) and
> bootstrap-cache).
>

Maybe yes :)

> Slightly off-topic, but I'm also wondering why the test scripts
> download the VM again?  Why not just use the one that is already in
> vmtarget?
>

Because it is a Linux VM, and tests are executed on OSX and Windows too.


--
Cyril Ferlicot
https://ferlicot.fr


Re: [CI] Cause of the random failures in the CI

Guillermo Polito
Thanks Cyril, thanks for pushing. I've only given you permissions to re-run the build, so you and Vincent keep all the credit!

> Slightly off-topic, but I'm also wondering why the test scripts
> download the VM again?  Why not just use the one that is already in
> vmtarget?

Yes. On the one hand there is what Cyril says: we need to download a new VM on each new slave, because
 - we build the image on a single slave (unix)
 - we then transfer it to different slaves (mac, windows) to run the tests.

It is set up this way because running the bootstrap was originally about three times as expensive.
Right now it takes 7 minutes to create the minimal image and a couple more to load the rest on top of it.

There is another point: we always assumed that slaves do not share a disk, so we either re-downloaded a new Unix VM or shared it.
Apparently our assumption was wrong?


--
Guille Polito
Research Engineer
Centre de Recherche en Informatique, Signal et Automatique de Lille
CRIStAL - UMR 9189
French National Center for Scientific Research - http://www.cnrs.fr

Web: http://guillep.github.io
Phone: +33 06 52 70 66 13


Re: [CI] Cause of the random failures in the CI

alistairgrant
Hi Cyril & Guille,

> On Tue, Jun 19, 2018 at 5:06 PM Cyril Ferlicot D. <[hidden email]> wrote:
>>
>> > Slightly off-topic, but I'm also wondering why the test scripts
>> > download the VM again?  Why not just use the one that is already in
>> > vmtarget?
>> >
>>
>> Because it is a vm linux and tests are executed on OSX and Windows too.

On Tue, 19 Jun 2018 at 18:49, Guillermo Polito
<[hidden email]> wrote:

>
> Thanks Cyril, thanks for pushing. I've only given you permissions to re-run the build, so keep yourself and Vincent all the credit!
>
> > Slightly off-topic, but I'm also wondering why the test scripts
> > download the VM again?  Why not just use the one that is already in
> > vmtarget?
>
> Yes, on one side there is what Cyril says, we need to download a new VM on each new slave, because
>  - we build the image on a single slave (unix)
>  - we then transfer it to different slaves (mac, windows) to run the tests.
>
> This is like that because originally running the bootstrap was about three times as expensive.
> Right now we are on 7minutes to create the minimal image and a couple more to load the rest on top of it.

Thanks for your explanations, I hadn't considered these requirements.

Apart from avoiding the duplication and wasted resources of re-running
the bootstrap, testing a single image also serves as a sanity check that
we haven't broken cross-platform compatibility in some way.

Thanks again,
Alistair




Re: [CI] Cause of the random failures in the CI

Guillermo Polito
I've integrated Cyril's PR

https://github.com/pharo-project/pharo/pull/1575

Let's hope this gets the CI in good shape :)

Thanks again!



Re: [CI] Cause of the random failures in the CI

Ben Coman
In reply to this post by CyrilFerlicot


On 19 June 2018 at 22:55, Cyril Ferlicot D. <[hidden email]> wrote:
During the validation of the Bootstrap multiple tests are launched on
OSX/Windows/linux in parallel. Each task is on a different slave of the
Jenkins. But, apparently we discovered that two slaves could have the
same disk. Usually it does not cause any trouble since a job is only run
by one slave. But in this particular case, two slaves can be used by the
same job and mess with the resources of each other.

That sort of outside-the-box confounding factor is difficult and frustrating
to track down. Great work, guys.

cheers -ben