Document main sources of breakage and add corresponding actions (authored by intrigeri)
@@ -53,6 +53,13 @@ Maintaining the pool is:
is not resilient to broken mirrors" and "there's often a broken mirror",
any breakage impacts UX negatively
The most common sources of breakage are:
- The mirror serves outdated data due to buggy rsync scheduling.
- The TLS certificate has expired (tails#17754).
- The server is down for maintenance.
- The web server crashes and is not restarted automatically.
Additionally, slow mirrors make our monitoring take vastly longer than it could.
This makes it difficult for sysadmins to schedule the monitoring properly
(sysadmin#17702), and errors are reported later than they could be.
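Two of the breakage sources listed above (outdated data and an expired TLS certificate) lend themselves to simple automated checks. The following is a minimal sketch, not the monitoring code actually in use: the function names `cert_is_expired` and `data_is_outdated`, and the `max_lag` tolerance, are hypothetical. It assumes the certificate expiry string has the format of the `notAfter` field returned by Python's `ssl.SSLSocket.getpeercert()`.

```python
import ssl
from datetime import datetime, timedelta, timezone


def cert_is_expired(not_after: str, now: datetime) -> bool:
    """Return True if a certificate's notAfter date is at or before `now`.

    `not_after` is assumed to use the format of the 'notAfter' field
    from ssl.SSLSocket.getpeercert(), e.g. "Jan  5 09:34:43 2018 GMT".
    `now` must be a timezone-aware datetime.
    """
    expiry = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return now >= expiry


def data_is_outdated(
    mirror_mtime: datetime, reference_mtime: datetime, max_lag: timedelta
) -> bool:
    """Return True if the mirror lags behind the reference by more than max_lag.

    Both timestamps are hypothetical inputs, e.g. the modification time of
    the same file on the mirror and on the reference server.
    """
    return reference_mtime - mirror_mtime > max_lag
```

Keeping both checks as pure functions of already-collected values means the slow part (fetching from each mirror) can be given its own timeout, separately from the decision logic.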
@@ -100,8 +107,13 @@ Improves:
Actions:
- Incrementally remove existing unreliable mirrors from the pool: permanently
remove mirrors that had problems at least twice in the last 6 months.
- Incrementally remove existing unreliable mirrors from the pool:
  - Permanently remove mirrors that had problems at least twice in the last 6 months.
  - Remove mirrors that expose red flags such as:
    - We notice breakage on the mirror before its operator does.
    - The mirror uses an expired TLS certificate.
    - The web server does not run under a supervisor that would restart it if it crashes.
    - Maintenance operations that take the server down are not announced in advance.
- Add to the requirements for new mirrors: operated by a professional team, or
at least with a high-availability setup.
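
The "at least twice in the last 6 months" removal rule above can be sketched as a small filter over an incident log. This is only an illustration under stated assumptions: the incident log structure, the function name `mirrors_to_remove`, the example hostnames, and the approximation of 6 months as 182 days are all hypothetical and not part of the original proposal.

```python
from datetime import date, timedelta

# Assumption: "6 months" is approximated as 182 days.
WINDOW = timedelta(days=182)
# Threshold taken from the rule: problems "at least twice".
THRESHOLD = 2


def mirrors_to_remove(incidents: dict[str, list[date]], today: date) -> list[str]:
    """Return the mirrors with at least THRESHOLD incidents within WINDOW.

    `incidents` maps a mirror hostname to the dates on which a problem
    was recorded for it (a hypothetical log format).
    """
    cutoff = today - WINDOW
    return sorted(
        mirror
        for mirror, dates in incidents.items()
        if sum(1 for d in dates if cutoff <= d <= today) >= THRESHOLD
    )


if __name__ == "__main__":
    log = {
        "a.example": [date(2020, 3, 1), date(2020, 5, 1)],
        "b.example": [date(2019, 1, 1), date(2020, 5, 1)],
        "c.example": [],
    }
    print(mirrors_to_remove(log, today=date(2020, 6, 1)))
```

Because the rule says "incrementally", such a list would presumably serve as input for a human decision rather than trigger automatic removal.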