Test suite: isoworkers have intermittent issues with restoring snapshots

Here's the number of times a scenario failed due to "Failed to restore snapshot" by date:

    150 2025-05-08
      1 2025-05-29
     19 2025-06-13
     12 2025-06-14
    136 2025-06-16
     28 2025-06-17

Here's the number of affected automated test suite runs per isoworker:

    4 NODE_NAME=isoworker1.dragon
    5 NODE_NAME=isoworker2.dragon
    4 NODE_NAME=isoworker3.dragon
    4 NODE_NAME=isoworker4.dragon
    4 NODE_NAME=isoworker5.dragon
    1 NODE_NAME=isoworker6.iguana

The single iguana entry is from the 2025-05-29 run, where "Failed to restore snapshot" happened only once. So the problem mostly affects dragon.

The 150 failures on 2025-05-08 are from a single run. Most other affected runs have far fewer such failures.

Let's look at a recent, representative run: https://jenkins.tails.boum.org/job/test_Tails_ISO_stable/5717/

Looking at the debug.log we see that the failure occurs in post_snapshot_restore_hook():

    02:45:16.470401009: Screen: trying to find GnomeApplicationsMenu.png
    02:45:17.500510498: Screen: found GnomeApplicationsMenu.png at (132, 16)
    02:45:20.504079943: Screen: trying to find GnomeApplicationsMenu.png
    Failed to restore snapshot, retrying...

Which is this part:

    try_for(10, delay: 0) do
      @screen.find(pattern)
      # Sometimes the display becomes inactive 1 to 2 seconds after the
      # snapshot was restored. To catch those cases, we wait a short time
      # and make sure that we can still find the pattern.
      # We don't want this to be longer than necessary, because this will
      # slow down all tests which restore snapshots.
      sleep 3
      @screen.find(pattern)
      [...]

And sometimes it even fails on the first @screen.find(pattern).

I suspect the problem is that dragon is under very heavy load when this happens, to the point that even a single @screen.find(pattern) can take longer than the 10-second timeout.
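To make the suspected mechanism concrete, here is a minimal sketch, not the test suite's actual code. It assumes the 10-second budget of try_for covers the whole block, so the fixed sleep 3 leaves roughly 7 seconds for the two finds combined. The names restore_check and slow_find are hypothetical stand-ins, the real try_for is replaced by Ruby's stdlib Timeout, and all durations are scaled down by a factor of 20 so the sketch runs quickly.

```ruby
require 'timeout'

# Hypothetical stand-in for @screen.find(pattern): under load, the only
# thing that changes in this sketch is how long the call takes.
def slow_find(duration)
  sleep duration
  :found
end

# Hypothetical stand-in for the check in post_snapshot_restore_hook():
# find, wait for the display to settle, find again, all within one budget.
# Timeout.timeout is used here in place of the suite's try_for helper.
def restore_check(timeout:, find_duration:, settle:)
  Timeout.timeout(timeout) do
    slow_find(find_duration)
    sleep settle
    slow_find(find_duration)
  end
  :ok
rescue Timeout::Error
  :timed_out
end

# Scaled values: 0.5 stands for the 10 s budget, 0.15 for the 3 s sleep.
BUDGET = 0.5
SETTLE = 0.15
```

With these assumptions, restore_check(timeout: BUDGET, find_duration: 0.05, settle: SETTLE) should return :ok, while a find_duration of 0.3 (both finds plus the sleep exceed the budget) or 0.6 (the first find alone exceeds it) should return :timed_out. The last case corresponds to failing on the first @screen.find(pattern).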
