Sometimes fails to boot from USB on Jenkins with I/O errors
_Originally created by @intrigeri on [#11588 (Redmine)](https://public-redmine-archive.tails.boum.org/code/issues/11588)_
While working on tails/tails#10720 I noticed a few I/O errors that blocked the
boot. Let’s start compiling them here and we’ll see what can be done
about this. So far I’ve seen such issues only when booting from USB. I’m
curious if the same root cause can trigger more subtle issues, i.e. not
blocking the boot but causing false positives later on (I’m thinking
e.g. of all the scenarios in which the system under test seems frozen in
Tails Greeter after clicking “log in”).
- The second (and last) “I start Tails from USB drive "isohybrid" with
network unplugged and I login” step in “Cat:ing a Tails isohybrid to
a USB drive and booting it, then trying to upgrading it but ending
up having to do a fresh installation, which boots” fails: the test
suite options are added to the kernel command line, and then, while
the syslinux menu is still displayed and there’s no trace of Linux
booting:
  - 7min30 later: `CHS: Error 0c00 reading sector 2247939 (140/14/6)` and `EDD: Error 0c00 reading sector 2249987`
  - another minute later: `CHS: Error 0c00 reading sector 2251621 (140/72/34)` and `EDD: Error 0c00 reading sector 2253669`
  - the test suite times out before anything else happens
- (seen 2 times) “I start Tails from USB drive "old" with network
unplugged and I login” fails with CHS/EDD errors very similar to those
above, but at some point Linux starts spitting output and there’s a
kernel panic (“Failed to execute /init”)
- I’ve seen at least two Tails cat:ed from ISO fail to boot with
SquashFS errors.
- “I start Tails from USB drive "\_\_internal" with network unplugged
and I login with persistence enabled” in “Watching MP4 videos stored
on the persistent volume should work as expected given our AppArmor
confinement” fails with similar CHS/EDD errors as above; at some
point Linux starts spitting output and there’s a kernel panic
(“Failed to execute /init”)
- “I start Tails from USB drive "\_\_internal" with network unplugged
and I login with read-only persistence enabled” in “I start Tails
from USB drive "\_\_internal" with network unplugged and I login with
read-only persistence enabled” fails with similar CHS/EDD errors as
above
- “I start Tails from USB drive "old" with network unplugged and I
login” in “Creating a persistent partition with the old Tails USB
installation”: kernel panic
- “I start Tails from USB drive "old" with network unplugged and I
login with persistence enabled” in “Writing files to a
read/write-enabled persistent partition with the old Tails USB
installation”: CHS/EDD errors
- “I start Tails from USB drive "to\_upgrade" with network unplugged
and I login with persistence enabled” in “Booting a USB drive
upgraded from ISO with persistence enabled” is stuck at “syslinux
6.03 EDD” and never displays the bootloader menu (see
02\_39\_57\_Booting\_a\_USB\_drive\_upgraded\_from\_ISO\_with\_persistence\_enabled.mkv
attached)
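
To make it easier to compile these failures, here is a minimal sketch of a parser for the CHS/EDD messages quoted above. It assumes the messages have been transcribed from the screen captures into plain text. The function names and the 512-byte sector size are assumptions for illustration, not part of the test suite.

```python
import re
from collections import Counter

# Matches the two message shapes quoted in this ticket, e.g.
#   CHS: Error 0c00 reading sector 2247939 (140/14/6)
#   EDD: Error 0c00 reading sector 2249987
ERROR_RE = re.compile(
    r"(?P<mode>CHS|EDD): Error (?P<code>[0-9a-fA-F]{4}) "
    r"reading sector (?P<sector>\d+)"
    r"(?:\s*\((?P<chs>\d+/\d+/\d+)\))?"
)

SECTOR_SIZE = 512  # assumption: 512-byte sectors


def parse_boot_errors(text):
    """Return one dict per CHS/EDD read error found in `text`."""
    errors = []
    for match in ERROR_RE.finditer(text):
        sector = int(match.group("sector"))
        errors.append({
            "mode": match.group("mode"),
            "code": match.group("code").lower(),
            "sector": sector,
            "byte_offset": sector * SECTOR_SIZE,
            "chs": match.group("chs"),  # None when no C/H/S triple is shown
        })
    return errors


def summarize(errors):
    """Count errors per (mode, error code) pair."""
    return Counter((e["mode"], e["code"]) for e in errors)


if __name__ == "__main__":
    sample = (
        "CHS: Error 0c00 reading sector 2247939 (140/14/6)\n"
        "EDD: Error 0c00 reading sector 2249987\n"
        "CHS: Error 0c00 reading sector 2251621 (140/72/34)\n"
        "EDD: Error 0c00 reading sector 2253669\n"
    )
    for error in parse_boot_errors(sample):
        print(error)
    print(summarize(parse_boot_errors(sample)))
```

In both pairs quoted above, the EDD sector number is exactly 2048 higher than the CHS one. That might simply reflect a partition start offset, but this is a guess and has not been checked.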
I’ve never seen that outside of Jenkins, so I suspect a problem with the
platform.
Random debugging ideas:
- ~~upgrade isotesters’ kernel to Linux 4.6~~: done between 2016-07-23
10:31 UTC and 11:00 UTC
- ~~upgrade isotesters’ QEMU to 2.5 from jessie-backports~~: done on
2016-07-27 around 08:00 UTC
- ~~check if the isotesters’ Journal has anything interesting around
the time of the failure~~: nothing special in there
- ~~check if isotesters I/O load is as we expect it to be while
running the test suite (including USB scenarios), i.e. most of our
temporary data should stay in memory cache, and should never be
flushed out to disk; the most recent work we’ve done in this area
can serve as reference: tails/sysadmin#11175~~: I/O load is as expected (most
action happens on tmpfs so isotesters don’t do much disk I/O)
- check if there’s anything interesting on Munin around the time of
the failures: WIP; nothing I could notice; only a potential
correlation with check-mirrors runs might be worth looking closer
into
- give the system under test a USB3 (`nec-xhci`) controller: WIP
(499c630,
<https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_test-11588-usb-on-jenkins/>)
  - crashes during memory erasure on shutdown, but with tails/tails#10733
  merged on top it seems to be fine:
  <https://jenkins.tails.boum.org/view/Tails_ISO/job/test_Tails_ISO_test-11588-usb-on-jenkins-10733/>
  - not seen any I/O error on these branches yet, there’s hope!
- upgrade the host system’s QEMU to 2.5 from jessie-backports
- check virtual USB disk settings, e.g. the “cache” attribute
- check how we’re managing snapshots vs. disks in the scenarios that
sometimes fail
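
For the “cache” attribute idea above, a minimal sketch of how the virtual USB disk settings could be inspected. It assumes the system under test is defined as a libvirt domain and that its XML has been dumped to a string, for example with `virsh dumpxml`. The helper name, the sample XML and the path in it are hypothetical.

```python
import xml.etree.ElementTree as ET


def usb_disk_cache_modes(domain_xml):
    """Return (target_dev, cache) for each USB disk in a libvirt domain XML.

    `cache` is None when the <driver> element is missing or has no
    cache attribute, in which case the hypervisor default applies.
    """
    root = ET.fromstring(domain_xml)
    result = []
    for disk in root.findall("./devices/disk"):
        target = disk.find("target")
        if target is None or target.get("bus") != "usb":
            continue  # only USB-attached disks are of interest here
        driver = disk.find("driver")
        cache = driver.get("cache") if driver is not None else None
        result.append((target.get("dev"), cache))
    return result


# Hypothetical domain definition, for illustration only.
SAMPLE_DOMAIN_XML = """
<domain type='kvm'>
  <devices>
    <controller type='usb' model='nec-xhci'/>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='writeback'/>
      <source file='/hypothetical/path/old.qcow2'/>
      <target dev='sda' bus='usb'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <target dev='hdc' bus='ide'/>
    </disk>
  </devices>
</domain>
"""

if __name__ == "__main__":
    for dev, cache in usb_disk_cache_modes(SAMPLE_DOMAIN_XML):
        print(dev, cache or "(hypervisor default)")
```

Running this against the domain XML of a failing scenario and of a passing one would show whether the cache mode differs between them.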
Let’s keep in mind that we have other options, such as finally giving up
on nested KVM for running our test suite on Jenkins, and instead getting
a dedicated machine. Infrastructure-wise, IMO we are now ready to handle
more machines (we have the VPN & Puppet setup in place for that). The
additional engineering effort (support running multiple instances of our
test suite concurrently *on the same system*) is certainly non-trivial,
but it may still be cheaper than fixing this very ticket and all other
bugs we only see on Jenkins. So let’s not spend too much time on this
here.
Feature Branch: test/11588-usb-on-jenkins+10733
### Attachments
* [02_39_57_Booting_a_USB_drive_upgraded_from_ISO_with_persistence_enabled.mkv](https://redmine.tails.boum.org/code/attachments/download/1436/02_39_57_Booting_a_USB_drive_upgraded_from_ISO_with_persistence_enabled.mkv)
Parent Task: tails/tails#10288
### Related issues
- **Related to** tails/tails#12142
- **Blocks** tails/tails#11583
- [x] **Blocked by** tails/tails#11590
- [x] **Blocked by** tails/tails#10733