Commit 7eeccd82 authored Nov 09, 2019 by intrigeri
Reboot isotesters after test completion and in a less racy way (#11295, #10601)

The previous implementation rebooted isotesters, via wrap_test_Tails_ISO_{name},
before starting a test_Tails_ISO_{name} job. It was inherently racy: between
"the time when reboot_job puts the node back online and wrap_test_* triggers
test_Tails_ISO_", and "the time when test_Tails_ISO_ actually starts", on my
local Jenkins for example, I see that 9 seconds have passed. So occasionally,
another job, that was already in queue, and that was allowed to run on the same
node, would be scheduled there in the meantime. Such a job can be, for example,
another wrap_test_*, which will itself reboot the node again. And then we got
two test_Tails_ISO_{name} jobs pinned to this node at the same time. One of them
got stuck in the queue until the one that won the race completed. Or, depending
on the exact timing, an isotester was left in offline mode because we had called
toggleOffline an odd number of times instead of an even one.

So, let's instead reboot isotesters at the end of a test_Tails_ISO_* build, in
a way that's not affected by the race condition described above. As a bonus,
this gets rid of the confusing wrap_test_Tails_ISO_* jobs, which will improve
the developer experience. Instead of one such job per branch, we get 2 new
jobs (reboot_node, keep_node_busy_during_cleanup) that are shared by all branches.

This new implementation is still inherently racy as well, in that it still
depends on a few arbitrary chosen sleep values. But as opposed to the previous
implementation:

 - If these sleep values are large enough, in principle this implementation
   should not be racy.

 - These sleep values can all safely be raised, up to the point when we always
   win the race conditions they're about. The only negative impact will be
   a minor increase in the test_Tails_ISO_* run time, which is negligible
   compared to the hours these jobs take anyway.

Ideally, we would not sleep, but instead somehow check if what we're waiting for
has happened; this might raise non-trivial issues about API access control and
corresponding security matters. I'm not spending time on this right now because
the long-term plan is to avoid rebooting isotesters between test runs (#17216),
either via more robust cleanup mechanisms, or as suggested by zen, by running
the test suite in containers.

Deployment notes: we need to ensure that the master has enough executors
configured so it can run as many reboot_job concurrent builds as necessary.
This job can't run in parallel for a given test_Tails_ISO_* build, so what
we strictly need is: as many executors on the master as we have nodes allowed to
run test_Tails_ISO_*, i.e. currently 5. When we get more nodes that can run
test_Tails_ISO_*, we'll need to bump this number.

This commit is based on unfinished work started by myself, and later improved by
bertagaz, on the feature/reboot-after_post_build_trigger branch.

This commit also does this:

 - Remove "DUMMY" parameter.

   I guess we used to need this in order to be able to use a parameterized
   trigger, because it needs at least one parameter. Given we now explicitly
   pass the UPSTREAMJOB_BUILD_NUMBER parameter, we can remove this
   DUMMY parameter.

 - Split sequences of shell commands into separate steps.

 - Document some build steps that are otherwise unaffected by this commit.
parent 012b9920
Show whitespace changes
Inline Side-by-side
Please register or to comment