Reboot isotesters after test completion and in a less racy way (#11295, #10601)
The previous implementation rebooted isotesters, via wrap_test_Tails_ISO_{name}, before starting a test_Tails_ISO_{name} job. It was inherently racy: between "the time when reboot_job puts the node back online and wrap_test_* triggers test_Tails_ISO_", and "the time when test_Tails_ISO_ actually starts", on my local Jenkins for example, I see that 9 seconds have passed. So occasionally, another job, that was already in queue, and that was allowed to run on the same node, would be scheduled there in the meantime. Such a job can be, for example, another wrap_test_*, which will itself reboot the node again. And then we got two test_Tails_ISO_{name} jobs pinned to this node at the same time. One of them got stuck in the queue until the one that won the race completed. Or, depending on the exact timing, an isotester was left in offline mode because we had called toggleOffline an odd number of times instead of an even one. So, let's instead reboot isotesters at the end of a test_Tails_ISO_* build, in a way that's not affected by the race condition described above. As a bonus, this gets rid of the confusing wrap_test_Tails_ISO_* jobs, which will improve the developer experience. Instead of one such job per branch, we get 2 new jobs (reboot_node, keep_node_busy_during_cleanup) that are shared by all branches. This new implementation is still inherently racy as well, in that it still depends on a few arbitrary chosen sleep values. But as opposed to the previous implementation: - If these sleep values are large enough, in principle this implementation should not be racy. - These sleep values can all safely be raised, up to the point when we always win the race conditions they're about. The only negative impact will be a minor increase in the test_Tails_ISO_* run time, which is negligible compared to the hours these jobs take anyway. Ideally, we would not sleep, but instead somehow check if what we're waiting for has happened; this might raise non-trivial issues about API access control and corresponding security matters. I'm not spending time on this right now because the long-term plan is to avoid rebooting isotesters between test runs (#17216), either via more robust cleanup mechanisms, or as suggested by zen, by running the test suite in containers. Deployment notes: we need to ensure that the master has enough executors configured so it can run as many reboot_job concurrent builds as necessary. This job can't run in parallel for a given test_Tails_ISO_* build, so what we strictly need is: as many executors on the master as we have nodes allowed to run test_Tails_ISO_*, i.e. currently 5. When we get more nodes that can run test_Tails_ISO_*, we'll need to bump this number. This commit is based on unfinished work started by myself, and later improved by bertagaz, on the feature/reboot-after_post_build_trigger branch. This commit also does this: - Remove "DUMMY" parameter. I guess we used to need this in order to be able to use a parameterized trigger, because it needs at least one parameter. Given we now explicitly pass the UPSTREAMJOB_BUILD_NUMBER parameter, we can remove this DUMMY parameter. - Split sequences of shell commands into separate steps. - Document some build steps that are otherwise unaffected by this commit.
Loading
Please register or sign in to comment