Hi,

For months now there have been a lot of random failures on the CI, making it hard to work.

There are different kinds of failures:
- Network problems
- Failing tests
- Incomprehensible problems

I no longer see many failures due to the network. I suppose the Inria infrastructure improved.

Failing tests have been corrected over the past months and we see fewer and fewer of them.

Now the big problem is the incomprehensible crashes such as "The workspace was not found", "FileDoesNotExistException" or "pharo-vm/ is already present".

We just found the problem :)

During the validation of the Bootstrap, multiple tests are launched on OSX/Windows/Linux in parallel. Each task runs on a different Jenkins slave. But we discovered that two slaves can share the same disk. Usually this does not cause any trouble, since a job is only run by one slave. But in this particular case, two slaves can be used by the same job and interfere with each other's resources.

We highlighted the problem by adding logs to the CI. Now, when we launch tests, we create a file with the name of the task.

Today we got a crash, and in the log we see that the same workspace has two of those files, proving that the tasks are executed on the same disk, in the same folder:

[…]
-rw-rw-r-- 1 ci ci 0 Jun 19 16:01 Kernel-tests-unix-32
[…]
-rw-rw-r-- 1 ci ci 0 Jun 19 16:01 Tests-unix-32

As a solution we will execute the tests inside a subfolder with the name of the task, which should greatly reduce the number of problems.

Have a nice day :)

--
Cyril Ferlicot
https://ferlicot.fr
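The diagnosis and the proposed fix described in the message above can be illustrated with a small Python sketch. This is not the actual CI script or Cyril's PR, which are not shown in this thread. The function names (`drop_marker`, `tasks_sharing`, `run_in_task_folder`) and the `.marker` suffix are hypothetical; the suffix is added here only so that a marker file and a task subfolder of the same name can coexist in one workspace, whereas the real CI used the bare task name.

```python
import subprocess
from pathlib import Path
from typing import List, Sequence

# Assumption: the real CI named the marker file exactly after the task.
MARKER_SUFFIX = ".marker"


def drop_marker(workspace: Path, task_name: str) -> Path:
    """Create an empty file identifying the task that uses this workspace."""
    marker = workspace / (task_name + MARKER_SUFFIX)
    marker.touch()
    return marker


def tasks_sharing(workspace: Path) -> List[str]:
    """Return the task names whose markers are present in the workspace.

    More than one name means several tasks ran in the same folder.
    """
    return sorted(p.stem for p in workspace.glob("*" + MARKER_SUFFIX))


def run_in_task_folder(workspace: Path, task_name: str,
                       command: Sequence[str]) -> int:
    """Run a command inside workspace/<task_name> instead of the workspace root."""
    task_dir = workspace / task_name
    task_dir.mkdir(parents=True, exist_ok=True)
    completed = subprocess.run(list(command), cwd=task_dir)
    return completed.returncode
```

With this layout, two tasks that land on the same disk still write their files under different subfolders, so they should no longer overwrite or delete each other's files.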
Well done. Great work.
In reply to this post by CyrilFerlicot
Hi Cyril,
On Tue, 19 Jun 2018 at 16:55, Cyril Ferlicot D. <[hidden email]> wrote:
> As a solution we will execute the tests inside a subfolder with the name
> of the task and it should reduce a lot the number of problems.

Great work!

It would be nice if all the temporary files were in a single folder. Cleaning the environment then becomes a matter of deleting that one folder. (At the moment, there is a VM installed into the root directory, a vmtarget directory (I'm guilty here) and bootstrap-cache.)

Slightly off-topic, but I'm also wondering why the test scripts download the VM again? Why not just use the one that is already in vmtarget?

> Have a nice day :)

This definitely helps :-)

Thanks,
Alistair
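The single-folder suggestion above could look roughly like the following Python sketch. The folder name `ci-scratch` and the helpers `scratch_dir` and `clean` are hypothetical and not part of the Pharo build scripts. Only the three subfolder names in the usage note (a VM folder, `vmtarget`, `bootstrap-cache`) come from the message.

```python
import shutil
from pathlib import Path

# Hypothetical name for the one folder that holds every temporary file.
SCRATCH_NAME = "ci-scratch"


def scratch_dir(workspace: Path, *parts: str) -> Path:
    """Return (and create) a subfolder of the single scratch folder."""
    path = (workspace / SCRATCH_NAME).joinpath(*parts)
    path.mkdir(parents=True, exist_ok=True)
    return path


def clean(workspace: Path) -> None:
    """Delete all temporary files by removing the one scratch folder."""
    shutil.rmtree(workspace / SCRATCH_NAME, ignore_errors=True)
```

For example, `scratch_dir(ws, "vm")`, `scratch_dir(ws, "vmtarget")` and `scratch_dir(ws, "bootstrap-cache")` would all end up under `ws/ci-scratch`, and `clean(ws)` removes them in one call while leaving the rest of the workspace untouched.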
On 19/06/2018 17:02, Alistair Grant wrote:
> Hi Cyril,
> Great work!

Thank you. Also thanks to Guille, Vincent and Marcus, who helped to find the problem.

> It would be nice if all the temporary files were in a single folder.
> Cleaning the environment then becomes a matter of deleting that one
> folder. (At the moment, there is a vm installed in to the root
> directory, a vmtarget directory (I'm guilty here) and
> bootstrap-cache).

Maybe yes :)

> Slightly off-topic, but I'm also wondering why the test scripts
> download the VM again? Why not just use the one that is already in
> vmtarget?

Because it is a Linux VM, and the tests are executed on OSX and Windows too.

--
Cyril Ferlicot
https://ferlicot.fr
Thanks Cyril, thanks for pushing. I've only given you permissions to re-run the build, so keep all the credit for yourself and Vincent!

> Slightly off-topic, but I'm also wondering why the test scripts
> download the VM again? Why not just use the one that is already in
> vmtarget?

Yes, on one side there is what Cyril says: we need to download a new VM on each new slave, because
- we build the image on a single slave (unix)
- we then transfer it to different slaves (mac, windows) to run the tests.

It is done this way because originally running the bootstrap was about three times as expensive. Right now it takes 7 minutes to create the minimal image and a couple more to load the rest on top of it.

There is another thing also: we always assumed that slaves do not share disks, so either we re-downloaded a new unix VM or we shared it. Apparently our assumptions weren't right?
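The point made above, that each test slave needs a VM matching its own operating system even though the image is built once on unix, can be sketched in Python as follows. Everything here is illustrative: the labels `linux`, `mac` and `win`, the `vm/<label>` layout, and the helper names are assumptions, not the actual conventions of the Pharo CI scripts.

```python
import platform
from pathlib import Path

# Values returned by platform.system() mapped to hypothetical VM labels.
_OS_LABELS = {"Linux": "linux", "Darwin": "mac", "Windows": "win"}


def os_label(system: str = "") -> str:
    """Return the VM label for the given (or the current) operating system."""
    system = system or platform.system()
    if system not in _OS_LABELS:
        raise ValueError(f"unsupported platform: {system!r}")
    return _OS_LABELS[system]


def vm_dir(workspace: Path, system: str = "") -> Path:
    """Folder where the VM for this operating system would be stored."""
    return workspace / "vm" / os_label(system)


def needs_download(workspace: Path, system: str = "") -> bool:
    """True when no VM for this operating system is present yet."""
    folder = vm_dir(workspace, system)
    return not folder.is_dir() or not any(folder.iterdir())
```

Keying the folder on the operating system means that a Linux VM left behind by the build step is never mistaken for a usable VM on a macOS or Windows slave, even if the slaves turn out to share a disk.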
Hi Cyril & Guille,
On Tue, 19 Jun 2018 at 18:49, Guillermo Polito <[hidden email]> wrote:
> Yes, on one side there is what Cyril says, we need to download a new VM on each new slave, because
> - we build the image on a single slave (unix)
> - we then transfer it to different slaves (mac, windows) to run the tests.
>
> This is like that because originally running the bootstrap was about three times as expensive.
> Right now we are on 7minutes to create the minimal image and a couple more to load the rest on top of it.

Thanks for your explanations, I hadn't considered these requirements.

Apart from avoiding the duplication and resource waste of re-running the bootstrap, testing a single image also serves as a sanity check that we haven't broken cross-platform compatibility in some way.

Thanks again,
Alistair
I've integrated Cyril's PR.

Let's hope this gets the CI in good shape :)

Thanks again!
In reply to this post by CyrilFerlicot
On 19 June 2018 at 22:55, Cyril Ferlicot D. <[hidden email]> wrote:
> Hi,

That sort of outside-the-box confounding factor is difficult and frustrating to track down. Great work guys.

cheers -ben