Rethink how we monitor and maintain our mirror pool
Rationale
The current approach, running check-mirrors.rb multiple times a day on our infra, was very specific to the needs of our previous setup. It reports over email. This is noisy, makes mirror pool maintenance tedious and costly in terms of context switches, and does not easily give a historical perspective.
Now that we have Mirrorbits (#18263 (closed)), perhaps we don't need exactly that anymore.
I think we should take a step back, build a list of our current needs, and figure out what's a good way to address them.
And then this will inform us wrt. what kind of maintenance is needed nowadays, which will surely impact Define the work of the Mirrors Team and reconsi... (#16930 - closed), and possibly who's responsible for said maintenance.
Needs
Users
Scope: ISO & USB images, and IUKs
- [U-reliability] I can download the correct data reliably.
- [U-speed] I can download quickly (my Internet connection and Tor circuit are the bottlenecks).
- [U-safety] I'm safe against a compromised mirror that serves incorrect data.
Release Managers
Our release process currently uses check-mirrors.rb in several ways. The way it's currently done is outdated (Adjust the release process: downloads are now d... (#19335 - closed)), but the underlying need is basically always the same:
- [RM-freshness] I can tell whether enough fast mirrors have the new images so I can send a call for testing to tails-testers@boum.org, ask manual testers to do the "Incremental upgrades" test, and release.
Mirror pool maintainers
- Ensure our mirror pool adequately serves the needs of the other stakeholders listed in this section.
- [Maint-health] We can detect issues about the general health of our mirror pool (e.g. not enough up-to-date, fast, and reliable mirrors).
Status quo
Here's what we currently have and how it addresses the needs listed above.
Safety
Downloads are checked on the client side (JS on our website, Upgrader), so [U-safety] is orthogonal to how we monitor our mirror pool.
check-mirrors.rb
- checks freshness of mirrors over HTTPS using the project/trace flag file → [RM-freshness], [Maint-health]
- downloads the latest ISO and USB images from mirrors and verifies OpenPGP signature → [U-reliability]
- verifies that only 1 version is available on mirrors
- reports slow mirrors → [U-speed], [Maint-health]
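
To illustrate the first check in the list above, here is a minimal Python sketch. It is not the actual check-mirrors.rb code (which is Ruby). It assumes that a mirror counts as fresh when the content of its project/trace file equals a reference copy of that file. The function names, the equality rule and the example URLs are assumptions made for this sketch.

```python
from typing import Callable, Dict, Iterable
from urllib.request import urlopen


def fetch_trace(base_url: str, timeout: float = 30.0) -> str:
    """Fetch <base_url>/project/trace over HTTPS and return its stripped content."""
    url = base_url.rstrip("/") + "/project/trace"
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace").strip()


def stale_mirrors(
    mirror_urls: Iterable[str],
    reference_trace: str,
    fetch: Callable[[str], str] = fetch_trace,
) -> Dict[str, str]:
    """Return {mirror URL: reason} for mirrors that are unreachable or outdated.

    Assumption: a mirror is fresh when its trace content equals the reference.
    """
    expected = reference_trace.strip()
    problems: Dict[str, str] = {}
    for url in mirror_urls:
        try:
            trace = fetch(url)
        except OSError as exc:  # covers URLError, TLS errors and timeouts
            problems[url] = f"unreachable: {exc}"
            continue
        if trace != expected:
            problems[url] = f"stale trace: {trace!r}"
    return problems
```

The `fetch` parameter is there only so the comparison logic can be exercised without network access, for example with a function that returns canned trace contents.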
Mirrorbits
- checks freshness of mirrors using rsync → [RM-freshness]
Proposal
Updated monitoring setup
- [U-reliability], partially [Maint-health]
  - Mirrorbits won't redirect users to a mirror that's completely down. But it does not check that downloading over HTTPS works.
  - Dropped idea: run check-mirrors.rb once a week to detect the (hopefully rare) case when the mirror's rsync server is up, but its web server is broken. This is not needed: Mirrorbits does regular HEAD HTTPS requests to mirrors, and verifies certificates, so we know their web server is not totally broken.
  - mirmon also probes over HTTPS so mirror pool maintainers can spot trouble on that dashboard too (with historical data).
  - Keep setting the bar relatively high, in terms of reliability, for existing and new mirrors.
- [U-speed], partially [Maint-health]
  - Run check-mirrors.rb once a week to detect mirrors that are consistently slow.
  - Keep setting the bar relatively high, in terms of speed, for existing and new mirrors.
- [Maint-health]: given that the partial solutions above only give a few data points and no historical perspective, additionally set up mirmon or a similar dashboard that exposes both current and historical data about the state of our mirror pool & of individual mirrors.
- [RM-freshness]: we can keep using check-mirrors.rb. But if Mirrorbits or mirmon readily exposes the information we need, the RMs can choose to switch to that.
  - The release process was updated to use a combination of Mirrorbits and check-mirrors.rb.
Incidentally, this will allow us to drop a bunch of code in check-mirrors.rb:
- saving and reading state
- probably more code about special cases that are now better handled with other tools
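
For the weekly run, "consistently slow" still needs a concrete definition. The Python sketch below shows one possible rule, not what check-mirrors.rb does today: a mirror is flagged only if each of its last few weekly speed measurements is below a threshold. The function name, the window size, the threshold and the unit (MB/s) are hypothetical. Note that keeping several weeks of measurements implies storing some history somewhere (for example in the dashboard), which this sketch takes as given.

```python
from typing import Dict, List, Sequence


def consistently_slow(
    samples_by_mirror: Dict[str, Sequence[float]],
    min_speed: float,
    window: int = 3,
) -> List[str]:
    """Return mirrors whose last `window` speed samples are all below `min_speed`.

    Samples are ordered oldest to newest. Mirrors with fewer than `window`
    samples are skipped: there is not enough data to call them consistently slow.
    """
    slow = []
    for url, samples in samples_by_mirror.items():
        recent = list(samples)[-window:]
        if len(recent) < window:
            continue
        if all(speed < min_speed for speed in recent):
            slow.append(url)
    return sorted(slow)
```

Requiring every recent sample to be below the threshold avoids flagging a mirror because of a single bad week, which matches the goal of making most of this work non-time-sensitive.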
Impact on mirror pool maintenance work
- Much less incoming email to process: only once a week.
- Most of the work is not time-sensitive anymore.
- A dashboard provides the big picture and historical perspective.
And to prepare Define the work of the Mirrors Team and reconsi... (#16930 - closed) (we don't have to reach a conclusion about this here):
- Primary responsibilities become:
  - Process offers of new mirrors (check compliance with our requirements, test, add to mirrors.json … or gently decline).
  - Identify broken web servers (via weekly email report), disable them in mirrors.json, reach out to mirror operator (this is the only time-sensitive part of the job); re-enable once fixed.
  - Identify slow mirrors (via weekly email report), remove them from mirrors.json, and ask mirror operator to stop pulling via rsync.
  - Identify general health problems (by looking at mirmon every month or so).
- Who could do this:
  - IMO (intrigeri) this needs to be done by folks with significant and reliable commitment, i.e. Core Workers.
  - Most current CWs have the technical skills to do this work.
  - It's so little work that it should not significantly impact any CW's capacity.
  - 1 person is enough:
    - Time-sensitive work is required only in rare cases, and we would be very unlucky if that happened precisely when the maintainer is AFK for an extended period.
    - They can of course ask other Tails folk for advice or support when needed :)
  - intrigeri would like to maintain the mirror pool using this new setup for a few months. This will allow him to fine-tune things if needed. After that, he would like this role to rotate.
  - A good understanding of how the mirror pool is used can help (currently that's FT and UX). OTOH this seems clearly infrastructure maintenance, which fits nicely into the sysadmin role definition.
Impact on Release Managers
- No need to remove outdated mirrors from the pool nor to notify mirror pool maintainers.
Impact on sysadmins
Initial setup
intrigeri could probably do all this.
- Adjust the check-mirrors.rb cronjobs to run only once a week, and drop the state file management.
- Set up mirmon, which boils down to:
  - systemd service + timer to export the pool of mirrors from Mirrorbits to mirmon (there's a Mirrorbits command for that)
  - install mirmon
  - if needed, configure mirmon probes
  - systemd service + timer to run mirmon → generates a static HTML page
  - vhost + reverse proxy to serve the mirmon HTML page
Maintenance
Keep the mirmon setup working. The last upstream release was 5 years ago so there'll be little churn.
Timeline
- We can switch to "run check-mirrors.rb once a week" whenever we want.
- We can set up mirmon whenever we want.
- A few months after Make the Upgrader use the mirror redirector (!983 - merged) is released, we can assume that most Tails systems will only use the Mirrorbits dispatcher for downloading upgrades. At that point we can stop monitoring the fallback DNS pool. Until then, monitoring it once a week should be enough: it contains only 2 mirrors, both of which are super-mega-reliable.