Document main sources of breakage and add corresponding actions (authored by intrigeri)
@@ -53,6 +53,13 @@ Maintaining the pool is:
is not resilient to broken mirrors" and "there's often a broken mirror",
any breakage impacts UX negatively
The most common sources of breakage are:
- The mirror serves outdated data due to buggy rsync scheduling.
- The TLS certificate has expired (tails#17754).
- The server is down for maintenance.
- The web server crashes and is not restarted automatically.
Additionally, slow mirrors make our monitoring take vastly longer than it could.
This makes the monitoring difficult for sysadmins to schedule properly (sysadmin#17702),
and makes it report errors later than it could.
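As an illustration only, the breakage sources and the slowness problem described above could be turned into a per-mirror diagnosis along these lines. This is a minimal sketch, not the existing monitoring code: the `ProbeResult` record, the `diagnose` function, the problem labels, and both thresholds are hypothetical, and it assumes the mirror exposes some timestamp for its last sync.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class ProbeResult:
    """Hypothetical outcome of probing one mirror (not an existing data structure)."""
    reachable: bool                     # False if the connection failed or timed out
    cert_not_after: Optional[datetime]  # TLS certificate expiry, None if unknown
    last_sync: Optional[datetime]       # assumed: timestamp the mirror reports for its last sync
    elapsed: timedelta                  # how long the probe took


# Assumed thresholds, chosen only for illustration.
MAX_SYNC_AGE = timedelta(hours=24)
SLOW_PROBE = timedelta(seconds=10)


def diagnose(probe: ProbeResult, now: datetime) -> List[str]:
    """Return the problems detected for one mirror, in a fixed order."""
    if not probe.reachable:
        # From the outside, "down for maintenance" and "web server crashed"
        # are indistinguishable, so both are reported the same way.
        return ["unreachable"]
    problems = []
    if probe.cert_not_after is not None and probe.cert_not_after <= now:
        problems.append("tls-certificate-expired")
    if probe.last_sync is not None and now - probe.last_sync > MAX_SYNC_AGE:
        problems.append("outdated-data")
    if probe.elapsed > SLOW_PROBE:
        problems.append("slow")
    return problems
```

Keeping the diagnosis separate from the network probe means the rules can be adjusted and tested without contacting any mirror.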
@@ -100,8 +107,13 @@ Improves:
Actions:
- Incrementally remove existing unreliable mirrors from the pool:
  - Permanently remove mirrors that had problems at least twice in the last 6 months.
  - Remove mirrors that expose red flags such as:
    - We're faster than the mirror operator to notice breakage on their side.
    - The mirror uses an expired TLS certificate.
    - The web server does not run under a supervisor that would restart it if it crashes.
    - Maintenance operations that take the server down are not announced in advance.
- Add to the requirements for new mirrors: operated by a professional team, or
  at least with a high-availability setup.
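The "at least twice in the last 6 months" rule from the actions above could be applied mechanically to an incident log. The sketch below is only an illustration under stated assumptions: the `mirrors_to_remove` function, the shape of the incident log, and the approximation of 6 months as 183 days are all hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, Iterable, List

WINDOW = timedelta(days=183)  # assumption: "last 6 months" approximated in days
INCIDENT_THRESHOLD = 2        # "at least twice"


def mirrors_to_remove(incidents: Dict[str, Iterable[date]], today: date) -> List[str]:
    """Return, sorted by name, the mirrors with enough incidents inside the window."""
    cutoff = today - WINDOW
    return sorted(
        mirror
        for mirror, dates in incidents.items()
        # Count only incidents that fall between the cutoff and today.
        if sum(1 for d in dates if cutoff <= d <= today) >= INCIDENT_THRESHOLD
    )
```

For example, with a hypothetical log in which `mirror-a.example` had two incidents in the window and `mirror-b.example` had only one, only `mirror-a.example` would be listed for removal.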