- Current design
- Improve UX and lower maintenance cost (2021)
- Initial research
See the corresponding design document.
Improve UX and lower maintenance cost (2021)
Tracking issue: tails#18262
Our current mirror pool, and how we use and maintain it, has several major problems.
Users face errors while downloading or upgrading Tails.
Some mirrors are regularly broken, sometimes for minutes, sometimes for days: they're maintained by volunteers in their spare time and have no high-availability setup.
Our software is not resilient to broken mirrors.
Some mirrors are probably slow enough that they're the bottleneck when downloading or upgrading Tails, even over Tor.
Delays in syncing new data from our rsync server add delay to the release process. These delays have two causes:
- Some mirrors are slow to sync.
- Our pool includes many mirrors that compete for the 1Gbps uplink of our rsync server.
Maintenance of the pool of mirrors
Maintaining the pool is:
- tedious and repetitive work: there's very often temporary mirror breakage to check and possibly handle
- time-sensitive, and thus stressful: due to the combination of "our software is not resilient to broken mirrors" and "there's often a broken mirror", any breakage impacts UX negatively
The most common sources of breakage are:
- mirror serves outdated data due to buggy rsync scheduling
- TLS certificate has expired (tails#17754 (closed))
- server is down for maintenance
- web server crashes and is not restarted automatically
Additionally, slow mirrors make our monitoring take vastly longer than it could. This makes it difficult for sysadmins to schedule it properly (sysadmin#17702 (closed)), and errors get reported later than they could be.
Make our software more resilient to broken mirrors
- UX problems caused by unreliable mirrors
- Maintenance becomes less stressful
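As a rough illustration of what resilience to broken mirrors could look like on the client side, the sketch below tries the mirrors in random order and falls back to the next one whenever a download fails. The names `download_with_fallback` and `AllMirrorsFailed`, the `fetch` callable, and the example URLs are hypothetical; this is not a description of how our software currently behaves.

```python
import random
from typing import Callable, Iterable, Optional


class AllMirrorsFailed(Exception):
    """Raised when no mirror could serve the requested file."""


def download_with_fallback(
    mirrors: Iterable[str],
    path: str,
    fetch: Callable[[str], bytes],
    rng: Optional[random.Random] = None,
) -> bytes:
    """Try each mirror in random order; return the first successful download.

    `fetch` is a hypothetical callable that takes a full URL and either
    returns the file content or raises an exception on any failure
    (timeout, TLS error, HTTP error, ...).
    """
    candidates = list(mirrors)
    (rng or random).shuffle(candidates)  # spread load across the pool

    errors = {}
    for base_url in candidates:
        url = base_url.rstrip("/") + "/" + path.lstrip("/")
        try:
            return fetch(url)
        except Exception as exc:  # one broken mirror must not abort the download
            errors[base_url] = exc
    raise AllMirrorsFailed(f"all {len(candidates)} mirrors failed: {errors}")
```

With this approach a single broken mirror costs the user one failed attempt instead of a failed download, which is also what would make temporary breakage less time-sensitive for whoever maintains the pool.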
Shrink the pool
- Release Management problems caused by many mirrors competing for bandwidth
- Maintenance workload
- Raise the bar for accepting and keeping mirrors in the pool (see below)
Raise the bar for mirrors: reliability
- Maintenance workload
- UX, until all our software is super robust against broken mirrors
Document how to incrementally remove existing unreliable mirrors from the pool:
- Permanently remove mirrors that had problems at least twice in the last 6 months.
- Remove mirrors that expose red flags such as:
  - We notice breakage on their side before the mirror operator does.
  - The mirror uses an expired TLS certificate.
  - The web server does not run under a supervisor that would restart it if it crashes.
  - Maintenance operations that take the server down are not announced in advance.
- Remove mirrors that we already know don't pass the above criteria
- Add to the requirements for new mirrors: operated by a professional team, or at least with a high-availability setup.
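The "at least twice in the last 6 months" rule above could be checked mechanically against a log of incidents. The following is a minimal sketch under stated assumptions: the incident log format (a mapping from mirror name to incident dates), the function name `mirrors_to_remove`, and the approximation of 6 months as 183 days are all hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List


def mirrors_to_remove(
    incidents: Dict[str, List[date]],
    today: date,
    window_days: int = 183,  # assumption: "6 months" approximated as 183 days
    threshold: int = 2,      # "at least twice"
) -> List[str]:
    """Return mirrors with at least `threshold` incidents in the window."""
    cutoff = today - timedelta(days=window_days)
    flagged = []
    for mirror, dates in incidents.items():
        # Only count incidents that fall inside the look-back window.
        recent = [d for d in dates if cutoff <= d <= today]
        if len(recent) >= threshold:
            flagged.append(mirror)
    return sorted(flagged)


# Example with made-up mirrors and dates:
log = {
    "a.example": [date(2021, 1, 10), date(2021, 3, 2)],
    "b.example": [date(2020, 5, 1), date(2021, 3, 2)],
    "c.example": [],
}
print(mirrors_to_remove(log, today=date(2021, 4, 1)))  # ['a.example']
```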
Raise the bar for mirrors: performance
- UX problems caused by slow mirrors
- Release Management problems caused by slow mirrors
- Add to the requirements for new mirrors: 1Gbps uplink
- Make our monitoring identify slow mirrors, regardless of their theoretical uplink speed
- Document how to incrementally remove from the pool mirrors that prove too slow in practice.
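One possible way for monitoring to judge mirrors by measured rather than advertised speed is to time a few test downloads per mirror and compare the median throughput against a threshold. This is only a sketch: the `Sample` type, the function name `find_slow_mirrors`, and the threshold value are hypothetical, and how the samples are collected is left out.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List


@dataclass
class Sample:
    """One timed download of a test file from a mirror."""
    bytes_received: int
    seconds: float  # assumed to be greater than zero

    @property
    def mbps(self) -> float:
        # Throughput in megabits per second.
        return self.bytes_received * 8 / self.seconds / 1_000_000


def find_slow_mirrors(
    samples: Dict[str, List[Sample]], min_mbps: float
) -> Dict[str, float]:
    """Return {mirror: median throughput} for mirrors below `min_mbps`."""
    slow = {}
    for mirror, runs in samples.items():
        if not runs:
            continue  # no measurements: cannot judge this mirror
        typical = median(s.mbps for s in runs)
        if typical < min_mbps:
            slow[mirror] = typical
    return slow
```

Using the median rather than a single measurement avoids flagging a mirror because of one unlucky run, which matters if the result feeds into a decision to remove the mirror from the pool.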
This probably won't happen in 2021, but it would be good in the longer term:
- Check again if non-NIH mirror pool management solutions could fit our current needs and requirements: tails#18263