
What to do with our failing CI?


Denis Kudriashov


Hello.

Our CI has been failing for quite a long time (maybe a few months). On top of that, new flaky tests keep failing our PRs from time to time. We did fix some of them, but I do not think it is possible to avoid them completely.

I propose a simple convention to quickly recover the CI from such intermittent problems. When we detect a flaky test, let's open two simple PRs:

1. A [Disable Flaky Test] PR skips the problematic tests and flags them so they can be found easily in the image.

    testWithIntermitentIssue
        self flag: #flakyTest.
        self skip.
        "the rest of test code"

This PR should be merged quickly to avoid failures in new PRs.

2. An [Enable Flaky Test] PR re-enables the tests.

It records the issue and tracks the current "flaky state".
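The #flakyTest flag from step 1 is what makes the skipped tests easy to find again. A minimal sketch of listing them from a Playground, assuming SystemNavigation>>allSendersOf: is available in your Pharo version and, as the comment of Object>>flag: suggests, that a senders search also matches symbol literals:

```smalltalk
"List every method that references the #flakyTest symbol.
 Assumption: allSendersOf: also matches symbol literals, which is
 how Object>>flag: markers are meant to be retrieved."
| flagged |
flagged := SystemNavigation default allSendersOf: #flakyTest.
flagged do: [ :each |
    Transcript show: each printString; cr ]
```

Browsing the senders of #flakyTest from the tools should give the same list interactively.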

For example, I created two PRs for Zinc tests:

- [Disable Flaky Tests] Disable two Zinc flaky tests
- [Enable Flaky Tests] Enable two Zinc flaky tests

The enable PR will stay red until we integrate a fix.

Of course, some issues are trivial to fix, for example by increasing the allowed time for the test, and then we should just push the fix. But when it is not clear what to do, it is better to remove the case from the overall CI and localize the issue in a concrete PR, so that expert devs can look at the problem without blocking the contributions of other people.

I think this approach is very easy for anyone to follow, and it could even be automated.
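For the trivial case of a test that only needs more time, the fix might look like the sketch below. It assumes the SUnit in your Pharo version provides TestCase>>timeLimit:; the test name and the 30 second value are hypothetical.

```smalltalk
testSlowNetworkRoundTrip
    "Hypothetical test name. Raise the time limit for this test only,
     instead of skipping it."
    self timeLimit: 30 seconds.
    "the rest of test code"
```

A change like this can go straight into a normal fix PR, without the disable/enable pair.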

That's my idea.

Best regards,

Denis

Marcus Denker-4

Re: What to do with our failing CI?

Hi,

Yes, I like the idea.

The downside is that it might take away the pressure to fix the underlying issues.

For the current problem, Pablo found out that it has to do with some network issue that the admins will fix.

So here it makes sense to skip the test for now, as we know the reason for the failure (and that it will be fixed).

Marcus

In reply to Denis Kudriashov's message of 7 Jun 2020, 17:23.