Test suite: isoworkers have intermittent issues with restoring snapshots
Here's the number of times a scenario failed due to "Failed to restore snapshot" by date:
150 2025-05-08
1 2025-05-29
19 2025-06-13
12 2025-06-14
136 2025-06-16
28 2025-06-17
Here's the number of affected automated test suite runs per isoworker:
4 NODE_NAME=isoworker1.dragon
5 NODE_NAME=isoworker2.dragon
4 NODE_NAME=isoworker3.dragon
4 NODE_NAME=isoworker4.dragon
4 NODE_NAME=isoworker5.dragon
1 NODE_NAME=isoworker6.iguana
The single iguana entry comes from the 2025-05-29 run, where "Failed to restore snapshot" happened only once. So this mostly affects dragon.
The 150 failures on 2025-05-08 are from a single run. Most other affected runs have far fewer such failures.
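As a quick sanity check on "mostly affects dragon", the per-isoworker counts above can be grouped by host. This is only a sketch: the counts are copied by hand from the list, and the helper name runs_per_host is made up for illustration.

```ruby
# Affected runs per isoworker, copied from the list above.
RUNS_PER_NODE = {
  'isoworker1.dragon' => 4,
  'isoworker2.dragon' => 5,
  'isoworker3.dragon' => 4,
  'isoworker4.dragon' => 4,
  'isoworker5.dragon' => 4,
  'isoworker6.iguana' => 1
}.freeze

# Hypothetical helper: sum the run counts per host, where the host is
# the part of the node name after the first dot.
def runs_per_host(runs_per_node)
  runs_per_node.each_with_object(Hash.new(0)) do |(node, count), totals|
    host = node.split('.', 2).last
    totals[host] += count
  end
end

totals = runs_per_host(RUNS_PER_NODE)
all_runs = totals.values.sum
totals.each do |host, count|
  share = 100.0 * count / all_runs
  puts format('%-8s %2d of %d affected runs (%.1f%%)', host, count, all_runs, share)
end
```

With these numbers, 21 of the 22 affected runs were on dragon.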
Let's look at a recent, representative run: https://jenkins.tails.boum.org/job/test_Tails_ISO_stable/5717/
Looking at the debug.log we see that the failure occurs in post_snapshot_restore_hook():
02:45:16.470401009: Screen: trying to find GnomeApplicationsMenu.png
02:45:17.500510498: Screen: found GnomeApplicationsMenu.png at (132, 16)
02:45:20.504079943: Screen: trying to find GnomeApplicationsMenu.png
Failed to restore snapshot, retrying...
This corresponds to the following part of the hook:
try_for(10, delay: 0) do
@screen.find(pattern)
# Sometimes the display becomes inactive 1 to 2 seconds after the
# snapshot was restored. To catch those cases, we wait a short time
# and make sure that we can still find the pattern.
# We don't want this to be longer than necessary, because this will
# slow down all tests which restore snapshots.
sleep 3
@screen.find(pattern)
[...]
And sometimes it even fails on the first @screen.find(pattern).
I suspect the problem is that dragon is under such heavy load when this happens that even a single @screen.find(pattern) takes longer than the 10 second timeout.
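To get a feel for the time budget, here is a rough sketch that computes the intervals from the three timestamped debug.log lines quoted above. It assumes that the 10 second try_for timeout covers the whole attempt, including the sleep 3, and that the attempt started at the first logged find; neither is confirmed by the log itself. The names LOG_EXCERPT, ATTEMPT_TIMEOUT, seconds_since_midnight and find_timings are made up for illustration.

```ruby
# The three timestamped lines from the debug.log excerpt above.
LOG_EXCERPT = <<~EOS
  02:45:16.470401009: Screen: trying to find GnomeApplicationsMenu.png
  02:45:17.500510498: Screen: found GnomeApplicationsMenu.png at (132, 16)
  02:45:20.504079943: Screen: trying to find GnomeApplicationsMenu.png
EOS

# Assumption: the 10 from try_for(10, delay: 0) bounds one whole attempt.
ATTEMPT_TIMEOUT = 10

# Convert a leading HH:MM:SS.fraction timestamp to seconds since midnight.
# Rational keeps the nanosecond digits exact.
def seconds_since_midnight(line)
  hours, minutes, seconds = line[/\A(\d+:\d+:\d+\.\d+)/, 1].split(':')
  hours.to_i * 3600 + minutes.to_i * 60 + Rational(seconds)
end

# Hypothetical helper: derive the intervals of one attempt from a
# "trying / found / trying" log triple.
def find_timings(log, timeout: ATTEMPT_TIMEOUT)
  first_start, first_found, second_start =
    log.lines.map { |line| seconds_since_midnight(line) }
  {
    first_find: (first_found - first_start).to_f,
    settle_wait: (second_start - first_found).to_f,
    left_for_second_find: (timeout - (second_start - first_start)).to_f
  }
end

find_timings(LOG_EXCERPT).each do |name, secs|
  puts format('%-22s %.3f s', name, secs)
end
```

For this excerpt the first find took about 1.03 s and the wait before the second find about 3.00 s, which under the assumptions above would leave just under 6 s for the second find.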