What to do with our failing CI?


Denis Kudriashov
Hello.

Our CI has been failing for quite a long time (maybe a few months). On top of that, new flaky tests keep failing our PRs from time to time. We did fix some of them, but I do not think it is possible to avoid them entirely.

I propose a simple convention to quickly recover CI from such intermittent problems. When we detect a flaky test, let's open two simple PRs:

1. A [Disable Flaky Test] PR skips the problematic tests and flags them so they are easy to find in the image.
testWithIntermittentIssue
    self flag: #flakyTest.
    self skip.
    "the rest of test code"
This PR is meant to be merged quickly so that new PRs stop failing.
  
2. An [Enable Flaky Test] PR re-enables the tests.
It records the issue and tracks the current "flaky state".
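
To illustrate the "easily found in the image" part, here is a minimal sketch of how the flagged tests could be listed from a playground. It assumes the tests were marked with the #flakyTest symbol as in step 1, and that SystemNavigation>>allReferencesTo: behaves as in recent Pharo versions (the exact kind of objects it returns may differ between versions, so the sketch only prints them).

```smalltalk
"Sketch only: list every method that refers to the #flakyTest symbol."
| flakyTests |
flakyTests := SystemNavigation default allReferencesTo: #flakyTest.
flakyTests do: [ :each |
    Transcript show: each printString; cr ].
```

Browsing the senders of #flag: would work too, but it also returns methods flagged for unrelated reasons.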

For example, I created two PRs for the Zinc tests: 
The enable PR will stay red until we integrate a fix.

Of course, some issues are trivial to fix, for example by increasing the time allowed for the test, and in those cases we should just push the fix. But when it is not clear what to do, it is better to remove the test from the overall CI and isolate the issue in a dedicated PR. That way expert devs can look at the problem without blocking other people's contributions.
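
As an example of the trivial kind of fix, a test that is merely slow can be given more time instead of being skipped. This is a sketch under assumptions: the test name is hypothetical, the 30 second value is arbitrary, and it relies on TestCase>>timeLimit: as found in Pharo's SUnit, so check that your image provides it.

```smalltalk
testSlowOperation
    "Hypothetical test. Raise the time limit for this test only."
    self timeLimit: 30 seconds.
    "the rest of test code"
```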

I think this approach is easy for anyone to follow, and it could even be automated. 
That's my idea.
Best regards,
Denis


Re: What to do with our failing CI?

Marcus Denker
Hi,

Yes, I like the idea.
 
The downside is that it might take the pressure off fixing the underlying issues. 

For the current problem, Pablo found out that it is caused by a network issue that the admins will fix.
So here it makes sense to skip the test for now, since we know the reason for the failure (and that it will be fixed).
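
When the cause is known, as it is here, the skip can carry the reason with it. The following is a sketch that assumes the TestCase>>skip: variant from Pharo's SUnit, which takes a comment string; the test name and the wording of the comment are illustrative.

```smalltalk
testWithIntermittentIssue
    self flag: #flakyTest.
    "Assumption: skip: takes a comment explaining why the test is not run."
    self skip: 'Network issue on the CI infrastructure, re-enable once it is fixed'.
    "the rest of test code"
```

Keeping the reason next to the skip may help with the downside mentioned above, since anyone who finds the flagged test sees what has to happen before it can be re-enabled.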

Marcus

In reply to Denis Kudriashov's message of 7 Jun 2020, 17:23 (above).