iguana became much slower ⇒ ISO builds take twice as long, test suite is very fragile
Impact
ISO builds
See https://jenkins.tails.boum.org/job/build_Tails_ISO_stable/buildTimeTrend
Since around June, builds on Jenkins started becoming much slower: comparable builds (with the website build pulled from the cache) used to take around 25 minutes, now they routinely take around 50 minutes. That is, iguana's performance for builds is now down to the level where lizard was a few years ago, when we decided to purchase faster hardware.
I tried to find a culprit in our build system that would explain this, but it seems that every part of the build is slower. For example, compressing the SquashFS used to take about 8 minutes; now it takes more than 20 minutes. Also, the impact on lizard seems much smaller at first glance, but I have not looked closely yet. This tends to confirm that the problem was caused by a change on iguana rather than in our build system.
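One cheap way to compare raw CPU throughput between the two hosts, independent of our build system, would be to micro-benchmark the compressor itself. This is a hypothetical sketch: it assumes xz (the compressor mksquashfs is typically configured with), and the sample size is arbitrary:

```shell
# Hypothetical micro-benchmark: time xz compression of a fixed-size sample
# on both iguana and lizard. If iguana is ~2x slower here too, the
# regression is host-level, not a build-system change.
dd if=/dev/urandom of=/tmp/xz-sample bs=1M count=64 status=none
time xz -9 -T0 -c /tmp/xz-sample > /dev/null
rm -f /tmp/xz-sample
```

Running this on both hosts and comparing wall-clock times would take a minute, versus ~1 hour for a full build comparison.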
Test suite
These days our test suite is failing in brand new ways on iguana, while its behavior remains the same on lizard.
I see operations (run on iguana isotesters) that are normally pretty fast (less than 2 seconds) fail because of timeouts, e.g. `Screen[match_screen]: taking screenshot` and `OpenCV: starting opencv_match_template.py`. I also saw other timeouts being hit that felt unusual to me.
Potential explanations & solutions
Invalid explanations
It cannot be caused by the upgrade of iguana or isobuilderN.iguana to Bullseye as they both happened months before the problem appeared.
Kernel mitigations for side channel attacks via speculative execution
So my next thought was that a kernel upgrade might have brought in another set of performance-killing mitigations for Meltdown/Spectre-style vulnerabilities. We lack historical data to confirm this hypothesis, but the data we do have shows it's a possibility:
- 2022-06-13: upgrade 5.10.106-1 → 5.10.120-1 to fix DSA-5161, then reboot
- We don't know what kernel we were actually running before June 13: it could be that we never rebooted on 5.10.106-1. I seem to remember that earlier this year, our servers had an impressive uptime.
- 2022-03-09: 5.10.103-1 fixed yet another side-channel attack via speculative execution (DSA-5095)
The cheapest way to validate or invalidate my hypothesis would be to try #17387 (closed), first in iguana's isobuilders, and if that's not enough, on iguana's host system. Or perhaps take advantage of the brand new 2nd CI server to test this :)
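For reference, the mitigations currently active on a host can be inspected from sysfs before and after any change (these paths are standard on recent Linux kernels). Disabling them for a benchmark roughly amounts to booting with `mitigations=off`:

```shell
# List the kernel's view of each speculative-execution vulnerability
# and which mitigation (if any) is active for it.
dir=/sys/devices/system/cpu/vulnerabilities
if [ -d "$dir" ]; then
  grep -r . "$dir"
else
  echo "no vulnerabilities directory: kernel too old or not Linux"
fi

# To disable all mitigations for a benchmark (trusted CI hosts only!),
# add "mitigations=off" to the kernel command line, e.g. via
# GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then update-grub + reboot.
```

Capturing this output on iguana and lizard would also tell us whether the two hosts are currently applying different sets of mitigations.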
Test suite now doing lots more I/O
In early June I reconfigured iguana isotesters (#17866 (comment 190567)):
- Instead of mounting a `tmpfs` on `/tmp/TailsToaster`, we're now writing directly to SSDs.
- They now have 16GB of RAM, compared to 27GB previously.
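The change above can be sketched as follows (the tmpfs size here is an assumption for illustration, not the exact value we used):

```shell
# Previous setup (approximate): scratch space lives in RAM, so test-suite
# temporary files never touch the disk.
mount -t tmpfs -o size=12g tmpfs /tmp/TailsToaster   # size= is a guess

# Current setup: /tmp/TailsToaster is a plain directory on the SSD,
# so the same writes now generate real disk I/O.
mkdir -p /tmp/TailsToaster
```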
This might explain the increased test suite fragility. I did not check whether this problem appeared at the time I made these changes.
I don't see how this can explain why builds suddenly became twice as slow, though.
Other ideas?
I did not check whether test suite run time was impacted. That would be much more difficult to assess, because our test suite evolves at a faster pace than our build system, so it's hard to compare apples to apples.