Sometimes fails to boot from USB on Jenkins with I/O errors
While working on tails#10720 (closed) I noticed a few I/O errors that blocked the boot. Let’s start compiling them here and we’ll see what can be done about this. So far I’ve seen such issues only when booting from USB. I’m curious if the same root cause can trigger more subtle issues, i.e. not blocking the boot but causing false positives later on (I’m thinking e.g. of all the scenarios in which the system under test seems frozen in Tails Greeter after clicking “log in”).
- The second (and last) “I start Tails from USB drive "isohybrid" with network unplugged and I login” step in “Cat:ing a Tails isohybrid to a USB drive and booting it, then trying to upgrading it but ending up having to do a fresh installation, which boots” fails: the test suite options are added to the kernel command line, and then, while the syslinux menu is still displayed and there’s no trace of Linux:
  - 7min30 later:
    CHS: Error 0c00 reading sector 2247939 (140/14/6) and
    EDD: Error 0c00 reading sector 2249987
  - another minute later:
    CHS: Error 0c00 reading sector 2251621 (140/72/34) and
    EDD: Error 0c00 reading sector 2253669
  - the test suite times out before anything else happens
- (seen 2 times) “I start Tails from USB drive "old" with network unplugged and I login” fails with very similar CHS/EDD errors as above, but at some point Linux starts spitting output and there’s a kernel panic (“Failed to execute /init”)
- I’ve seen at least two Tails cat:ed from ISO fail to boot with SquashFS errors.
- “I start Tails from USB drive "__internal" with network unplugged and I login with persistence enabled” in “Watching MP4 videos stored on the persistent volume should work as expected given our AppArmor confinement” fails with similar CHS/EDD errors as above; at some point Linux starts spitting output and there’s a kernel panic (“Failed to execute /init”)
- “I start Tails from USB drive "__internal" with network unplugged and I login with read-only persistence enabled” in “I start Tails from USB drive "__internal" with network unplugged and I login with read-only persistence enabled” fails with similar CHS/EDD errors as above
- “I start Tails from USB drive "old" with network unplugged and I login” in “Creating a persistent partition with the old Tails USB installation”: kernel panic
- “I start Tails from USB drive "old" with network unplugged and I login with persistence enabled” in “Writing files to a read/write-enabled persistent partition with the old Tails USB installation”: CHS/EDD errors
- “I start Tails from USB drive "to_upgrade" with network unplugged and I login with persistence enabled” in “Booting a USB drive upgraded from ISO with persistence enabled” is stuck at “syslinux 6.03 EDD” and never displays the bootloader menu (see 02_39_57_Booting_a_USB_drive_upgraded_from_ISO_with_persistence_enabled.mkv attached)
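A side note on the sector numbers in the CHS/EDD error pairs quoted above, in case it helps when reading such logs. The sketch below assumes the common 255 heads / 63 sectors-per-track BIOS geometry (an assumption, not something the logs state). Under that assumption, the (C/H/S) tuple printed on each CHS line converts to exactly the sector number printed on the matching EDD line, and the sector number on the CHS line is 2048 lower. That offset would be consistent with a partition starting at sector 2048, but that is only a guess. In other words, each pair probably describes the same sector, not two distinct failing sectors.

```python
# Assumed BIOS geometry; not stated anywhere in the logs.
HEADS = 255
SECTORS_PER_TRACK = 63


def chs_to_lba(cylinder, head, sector):
    """Convert a CHS tuple to a linear sector number (sectors are 1-based)."""
    return (cylinder * HEADS + head) * SECTORS_PER_TRACK + (sector - 1)


# (sector on the CHS line, CHS tuple, sector on the EDD line), copied from the errors above.
ERROR_PAIRS = [
    (2247939, (140, 14, 6), 2249987),
    (2251621, (140, 72, 34), 2253669),
]

for chs_sector, chs, edd_sector in ERROR_PAIRS:
    lba = chs_to_lba(*chs)
    print(
        f"CHS {chs} -> sector {lba}; "
        f"matches EDD sector: {lba == edd_sector}; "
        f"offset from CHS-reported sector: {edd_sector - chs_sector}"
    )
```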
I’ve never seen that outside of Jenkins, so I suspect a problem with the platform.
Random debugging ideas:
- upgrade isotesters’ kernel to Linux 4.6: done between 2016-07-23 10:31 UTC and 11:00 UTC
- upgrade isotesters’ QEMU to 2.5 from jessie-backports: done on 2016-07-27 around 08:00 UTC
- check if the isotesters’ Journal has anything interesting around the time of the failure: nothing special in there
- check if isotesters’ I/O load is as we expect it to be while running the test suite (including USB scenarios), i.e. most of our temporary data should stay in memory cache, and should never be flushed out to disk; the most recent work we’ve done in this area can serve as reference, see #11175 (closed): I/O load is as expected (most action happens on tmpfs so isotesters don’t do much disk I/O)
- check if there’s anything interesting on Munin around the time of the failures: WIP; nothing I could notice; only a potential correlation with check-mirrors runs might be worth looking closer into
- give the system under testing a USB3 (nec-xhci) controller: WIP (499c630, https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_test-11588-usb-on-jenkins/)
  - crashes during memory erasure on shutdown, but with tails#10733 (closed) merged on top it seems to be fine: https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_test-11588-usb-on-jenkins-10733/
  - not seen any I/O error on these branches yet, there’s hope!
- upgrade the host system’s QEMU to 2.5 from jessie-backports
- check virtual USB disk settings, e.g. the “cache” attribute
- check how we’re managing snapshots vs. disks in the scenarios that sometimes fail
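
For the “check virtual USB disk settings” idea, here is a minimal sketch of how the cache mode and the USB controller model could be read from a libvirt domain definition. The XML below is a made-up example for illustration, not copied from our test suite; on a real system the definition could come from `virsh dumpxml` for the domain under test. The helper names are hypothetical too.

```python
import xml.etree.ElementTree as ET

# Hypothetical domain XML, for illustration only.
SAMPLE_DOMAIN_XML = """
<domain type='kvm'>
  <devices>
    <controller type='usb' index='0' model='nec-xhci'/>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='writeback'/>
      <source file='/tmp/example/isohybrid.qcow2'/>
      <target dev='sda' bus='usb'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <target dev='hdc' bus='ide'/>
    </disk>
  </devices>
</domain>
"""


def usb_disk_cache_modes(domain_xml):
    """Map each USB disk's target device to its cache mode (None if unset)."""
    root = ET.fromstring(domain_xml)
    modes = {}
    for disk in root.findall("./devices/disk"):
        target = disk.find("target")
        if target is None or target.get("bus") != "usb":
            continue  # only USB-attached disks are of interest here
        driver = disk.find("driver")
        # A missing cache attribute means the hypervisor default applies.
        modes[target.get("dev")] = driver.get("cache") if driver is not None else None
    return modes


def usb_controller_models(domain_xml):
    """List the model attribute of every USB controller in the domain."""
    root = ET.fromstring(domain_xml)
    return [
        controller.get("model")
        for controller in root.findall("./devices/controller")
        if controller.get("type") == "usb"
    ]


print("USB disk cache modes:", usb_disk_cache_modes(SAMPLE_DOMAIN_XML))
print("USB controller models:", usb_controller_models(SAMPLE_DOMAIN_XML))
```

Comparing this output between a scenario that fails and one that never does might show whether the cache mode or the controller model differs between them.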
Let’s keep in mind that we have other options, such as finally giving up on nested KVM for running our test suite on Jenkins, and instead getting a dedicated machine. Infrastructure-wise, IMO we are now ready to handle more machines (we have the VPN & Puppet setup in place for that). The additional engineering effort (support running multiple instances of our test suite concurrently on the same system) is certainly non-trivial, but it may still be cheaper than fixing this very ticket and all other bugs we only see on Jenkins. So let’s not spend too much time on this here.
Feature Branch: test/11588-usb-on-jenkins+10733
Parent Task: tails#10288 (closed)