[[!tag archived]]

Corresponding parent ticket: [[!tails_ticket 5734]]

[[!toc levels=3]]

# Introduction

## Why?

* Some pieces of our infrastructure are critical, e.g. to:
  - the development process (if the ISO build fails, developers cannot work)
  - the release process -- which may block us from putting out emergency security fixes
  - users (if the APT repository is down, the "additional software packages" persistence feature is broken)

* We want to avoid contributors getting used to ignoring alerts sent by our CI system. The more false positives there are, the more they will "learn" to do so. Here we want to diminish the rate of false positives caused by malfunctioning infrastructure.

* We want to shorten the dev/feedback loop for sysadmins when they deploy changes, and also when changes are applied automatically (e.g. Puppet agent runs, or automatic APT upgrades).

* We want to be notified when a service we run doesn't come back up properly post-reboot, without having to manually test every service.

* We want to minimize the rate of non-sysadmins discovering and reporting problems _first_, that is before we learn about them. This is highly subjective, but replying "we're aware of this problem and are working on it" inspires much more confidence than "really, it's broken?"

## Nomenclature

Here, we call:

* _machine_: a computer (be it bare metal or virtual) and its operating system
* _monitored machine_: a machine we monitor
* _monitoring machine_: the machine(s) that monitor the... _monitored machines_
* _monitoring system_, or _monitoring setup_: all the software components that we run so that the monitoring machine can monitor the monitored ones, and their configuration

Note that the monitoring machine may very well itself be monitored at the same time (be it by itself, or by another monitoring machine).

# Requirements

## Human interface

The monitoring system:

* MUST send email notifications to the sysadmin(s) in charge, to lower the downtime.
* MUST offer an overview of the status of our systems, via a web interface that works within Tor Browser with the security slider set to Medium-High.
* MAY additionally offer a read-only version of this overview, that we may want to make available to selected contributors, or anonymous users. Needless to say, this must be carefully balanced with the security implications of such a system (in other words, a set of exported static HTML pages is totally fine, but a huge dynamic web application is probably a non-starter).
* MUST support configuring, with per-check/per-service granularity, a threshold of N failures _in a row_ before an alert is raised. Still, it SHOULD support triggering alerts depending on the frequency of such failures, even when they never fail twice in a row (we don't want to miss the fact that `$service` is down for 5 minutes every day). Implementation details may vary, but you get the idea; a sketch of the intended semantics follows this list.
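
To make the intended semantics concrete, here is a minimal, hypothetical sketch of a wrapper that swallows a check's failures until N of them happen in a row; every name and path in it is made up for illustration. Nagios-family systems implement this natively (soft/hard states, `max_check_attempts`), so in practice this is something we would configure rather than maintain:

    #!/bin/sh
    # Hypothetical sketch: raise an alert (exit 2, "CRITICAL" in the
    # usual plugin convention) only once the wrapped check has failed
    # N times in a row. Usage: alert_after_n_failures N check [args...]
    set -u
    threshold="$1"; shift
    state_dir="/var/lib/monitoring-state"
    state_file="$state_dir/$(echo "$*" | tr -c 'A-Za-z0-9' '_')"
    mkdir -p "$state_dir"

    if "$@"; then
        rm -f "$state_file"   # success: reset the failure counter
        exit 0
    fi

    failures=$(( $(cat "$state_file" 2>/dev/null || echo 0) + 1 ))
    echo "$failures" > "$state_file"
    if [ "$failures" -ge "$threshold" ]; then
        exit 2   # N failures in a row: raise the alert
    fi
    exit 0       # soft failure: not alerting yet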

## Threat model

### Compromised monitored machine

* We do not try to prevent it from reporting wrong information (this includes missing information) about itself.
* It MUST NOT result in a compromise of the monitoring machine.
* It MUST NOT be able to DoS the sysadmin(s) in charge, e.g. by flooding them with alerts.
* It MUST NOT result in a compromise of the network traffic between other monitored machines and the monitoring machine (e.g. if that traffic is encrypted, the monitored machines MUST NOT use the same private key).
* It SHOULD NOT be able to alter the information about other monitored machines.

### Compromised monitoring machine

* We accept that it can DoS the sysadmin(s) in charge, e.g. by flooding them with alerts.
* We accept that it can report wrong information about the monitored machines.
* It MUST NOT be able to run arbitrary code as root on any of the monitored machines.
* It SHOULD NOT be able to run arbitrary code as a non-privileged user on any of the monitored machines.

### Network attacker

Here, we consider an attacker that may be active or passive, and can sit at any point they choose on the Internet.

We accept the risk that a network attacker:

* can enumerate the machines and services we monitor;
* can view the reports, test results, and any such information about monitored services that the monitoring system needs to learn; this of course implies that we should be careful about what kind of information flows this way: it MUST NOT be a big deal if it leaks into the hands of an adversary;
* can DoS our monitoring, e.g. by blocking network connections;
* can spoof the reports, test results and the like about monitored services that a client has no credible means to authenticate.

However, a network attacker:

* SHOULD NOT be able to spoof the reports, test results and the like that monitored machines send about themselves;
* MUST NOT be able to run arbitrary code on the monitored machines;
* MUST NOT be able to run arbitrary code on the monitoring machine.

## Availability, sustainability

Here, we assume that the entire monitoring system has both software components that run on the monitored machines (which we call the "agent"), and software components that run on the monitoring machine (which we call the "server"). Below, the _agent_ implicitly includes anything needed for basic usage (plugins, checks, and so on); similarly, the _server_ implicitly includes its web interface and anything needed for basic usage (plugins, checks, etc.).

* The agent MUST be usually available in all of Debian oldstable, stable, and testing -- possibly thanks to _pre-existing_ and well-maintained official backports. All these versions of the agent MUST be compatible with the chosen version of the server.

* The server MUST be usually available either in current Debian stable (Jessie), or in current Debian testing (Stretch). We are considering running the version from Debian testing mainly because it might avoid having to go through a costly upgrade process in a couple of years, e.g. to switch to the next major, incompatible version of the software.

* Both the agent and the server MUST be actively maintained in all the versions of Debian we care about (see above). Hint: this excludes Nagios 4.

* Both the agent and the server MUST be DFSG-free.

* For all involved software, the upstream project MUST be mature and active, and it MUST have a confidence-inspiring future: we can't afford having to migrate to a totally different monitoring setup in three years, to the extent that this can be foreseen. Hint: given that Nagios 4 is not an option (see above), this in turn excludes all older versions of Nagios.

* It SHOULD be realistically possible for external contributors to have patches merged into the upstream codebase of the involved software.

* All the involved software MUST have a not-too-scary security track record.

## Configuration

Here, we have two major desires. One is the ability for humans to easily review the monitoring system's configuration, or changes proposed to it, so that contributions are made easier. The other is the ability to include monitoring aspects within the description of the services we run, in a self-contained way, so that describing them in Puppet is easier. Note that a system that satisfies the second requirement has great chances to mostly satisfy the first one as well.

The chosen monitoring system:

* SHOULD allow encoding, in the description of a service (read: in the corresponding Puppet class), how it needs to be monitored.
  - Additionally, if this optional (but warmly welcomed) requirement is satisfied, then the "shared Puppet modules" we use SHOULD already support the chosen monitoring system (hint: in practice, this means something compatible with Nagios).
  - Note: this gives us for free the ability to review the monitoring configuration for service checks, but it is unrelated to our ability to review the global configuration of the server components that run on the monitoring machine.

* SHOULD allow humans to easily review the service checks configuration. Really, that's a *strong* SHOULD. A system that doesn't make this possible will need to have very serious advantages in other areas to be attractive to us.

* SHOULD allow humans to review the global configuration of the server components that run on the monitoring machine. This assumes that said configuration is mostly static, and is unaffected by adding or modifying service checks.

## Adequacy to our resources

Being able to operate the monitoring system for 20-50 monitored systems MUST NOT require Tails sysadmins to invest lots of time and become experts at hand-holding a complex software stack: the main focus of our system and automation engineers shall not become monitoring. For example, we would not like a monitoring system that is trivial to set up for monitoring 5-10 hosts, but requires adding more and more moving parts and complex optional components to scale up to 50 hosts.

## Miscellaneous

* We run Tor hidden services that we want to monitor, so the monitoring system MUST allow using a configured SOCKS proxy for specific checks (worst case, for _all_ checks, though that would prevent us from checking services directly, outside of Tor). Wrapping checks with `torsocks` might be an acceptable option, depending on how involved and hackish this would be; a rough sketch follows. The ability to retry, and to not notify on the first error, is interesting here.
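
As a rough illustration of the `torsocks` option, assuming the stock `check_http` plugin from Debian's monitoring-plugins package, a Tor SOCKS proxy listening locally, and a placeholder onion address:

    #!/bin/sh
    # Hypothetical sketch: run a standard HTTP check against a hidden
    # service by routing it through the local Tor SOCKS proxy.
    # The onion address below is a placeholder, not a real one.
    exec torsocks /usr/lib/nagios/plugins/check_http \
        --hostname exampleexample22.onion \
        --url / \
        --timeout 60

Since Tor circuits fail transiently, this is exactly the kind of check where the retry-before-alerting behaviour discussed above matters.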

## Hosting of the monitoring machine

* The monitoring machine MUST be a virtual machine.
* We MUST be able to administer the OS of the monitoring machine ourselves: we need to be root, we need to have a Puppet agent that talks to our own puppetmaster, and we want to do the initial OS installation.
* The monitoring machine MUST be hosted on infrastructure managed by people the Tails sysadmins trust quite a bit.
* The people who manage the underlying hardware and infrastructure MUST be reactive and easy to get in touch with.
* We MUST be given out-of-band access to the monitoring machine.
* The monitoring machine MUST have unfiltered access to the Internet, and SHOULD be assigned at least one public IPv4 address.
* Hosting MUST be affordable (say, max. 20€/month).
* The monitoring machine SHOULD allow at least some flexibility regarding future "hardware" upgrades (e.g. allocating more disk space, memory, or CPU cores).
* TODO: exact hardware specifications, depending on the chosen monitoring system. Let's keep in mind that collecting exported Puppet resources is expensive.

<a id="services"></a>

# Service and system checks

Below, CRITICAL, HIGH, MEDIUM and LOW are priority levels with respect to the implementation of such checks.

For a description of the individual services, see [[contribute/working_together/roles/sysadmins]].

## All systems

* HIGH: up and running!
* HIGH: disk space usage (bytes and inodes)
* HIGH: memory usage
* MEDIUM: Puppet agent last run
* MEDIUM: APT indices (i.e. `apt-get update` was successfully run recently; see the sketch after this list)
* MEDIUM: `systemctl is-system-running` (see [[!tails_ticket 8262]])
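
A minimal sketch of what the APT indices check could look like, under the assumption that the age of the files in `/var/lib/apt/lists` is an acceptable proxy for "`apt-get update` ran recently"; the 48-hour threshold is arbitrary:

    #!/bin/sh
    # Hypothetical sketch: CRITICAL (exit 2) if no APT package list
    # has been refreshed within the last 48 hours.
    max_age_hours=48
    recent=$(find /var/lib/apt/lists -maxdepth 1 -name '*Packages*' \
                 -mmin "-$((max_age_hours * 60))" | head -n 1)
    if [ -n "$recent" ]; then
        echo "OK: APT indices refreshed within the last $max_age_hours hours"
        exit 0
    fi
    echo "CRITICAL: APT indices are older than $max_age_hours hours"
    exit 2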

## APT repository

* CRITICAL: `stable` APT suite over HTTP (a possible check is sketched below)
* CRITICAL: freezable APT repository, once it exists
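
For instance, the `stable` suite check could fetch the suite's `Release` file and sanity-check it; the URL and the exact `Suite:` field value are assumptions about the repository layout:

    #!/bin/sh
    # Hypothetical sketch: fetch the Release file of the `stable` APT
    # suite over HTTP and make sure it names the expected suite.
    url="http://deb.tails.boum.org/dists/stable/Release"
    if curl --fail --silent --max-time 30 "$url" \
            | grep -q '^Suite: stable$'; then
        echo "OK: $url looks sane"
        exit 0
    fi
    echo "CRITICAL: cannot fetch a sane Release file from $url"
    exit 2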

## Bitcoind

* MEDIUM: compare `getblockcount` with what the Internet says it should be (probably requires exporting the output of `bitcoin-cli getblockcount` to a place that's readable by the monitoring agent; see the sketch below)
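
A sketch of such a comparison, assuming the block count has been exported to a file readable by the agent, and using one public block explorer API as "what the Internet says"; the file path, the explorer, and the 6-block tolerance are all assumptions:

    #!/bin/sh
    # Hypothetical sketch: compare our exported block count against a
    # public block explorer. Exit 3 ("UNKNOWN") when data is missing.
    local_count=$(cat /var/lib/monitoring/bitcoin-blockcount) || exit 3
    remote_count=$(curl --fail --silent --max-time 30 \
                       https://blockchain.info/q/getblockcount) || exit 3
    behind=$(( remote_count - local_count ))
    if [ "$behind" -gt 6 ]; then
        echo "CRITICAL: local bitcoind is $behind blocks behind"
        exit 2
    fi
    echo "OK: local block count $local_count (network: $remote_count)"
    exit 0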

## BitTorrent

* LOW: last Tails release is seeded

## Gitolite

* MEDIUM: `git pull` or `git clone` a test repository over all supported protocols (currently: `git://` and SSH); a sketch follows
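
A cheap way to exercise both protocols without a full clone, assuming a dedicated test repository exists (the repository and host names here are hypothetical):

    #!/bin/sh
    # Hypothetical sketch: `git ls-remote` exercises the server (and,
    # for SSH, authentication) while being much cheaper than a clone.
    repo="test-monitoring"   # hypothetical check repository
    git ls-remote "git://git.tails.boum.org/$repo" HEAD >/dev/null || exit 2
    git ls-remote "gitolite@git.tails.boum.org:$repo" HEAD >/dev/null || exit 2
    echo "OK: $repo reachable over git:// and SSH"
    exit 0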

## git-annex

* HIGH: our Tor Browser archive must be reachable over HTTP, and contain directories with tarballs

## Jenkins

* CRITICAL: the HTTP server must be up, and unauthenticated connections must be rejected (this may require installing its TLS certificate, skipping certificate validation, or something similar; see the sketch below)
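
A sketch of that "up, but closed" check; the URL is a placeholder, and `--insecure` stands in for the certificate question raised above:

    #!/bin/sh
    # Hypothetical sketch: the web interface must answer, and an
    # unauthenticated request must be rejected with 401 or 403.
    status=$(curl --insecure --silent --output /dev/null \
                  --write-out '%{http_code}' --max-time 30 \
                  https://jenkins.tails.example/)
    case "$status" in
        401|403) echo "OK: Jenkins is up and requires authentication"; exit 0 ;;
        000)     echo "CRITICAL: Jenkins is unreachable"; exit 2 ;;
        *)       echo "CRITICAL: unexpected HTTP status $status"; exit 2 ;;
    esac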

## Nightly builds

* CRITICAL: <http://nightly.tails.boum.org/> must have directories for the `stable` and `devel` branches, containing ISO images

## rsync

* CRITICAL: check, over `rsync://`, that the expected directories are there (sketched below)
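
A sketch: listing a module over `rsync://` is enough to prove the service works; the host, module, and directory names here are assumptions:

    #!/bin/sh
    # Hypothetical sketch: list the rsync module and check that an
    # expected directory shows up in the listing.
    if rsync --timeout=30 rsync://rsync.tails.boum.org/tails/ \
            | grep -q ' stable$'; then
        echo "OK: rsync module lists the expected directory"
        exit 0
    fi
    echo "CRITICAL: expected directory missing from rsync listing"
    exit 2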

## Test suite infrastructure

* HIGH: the (fake or limited) SSH and SFTP access used by core contributors and robots when running the test suite must be up (see the sketch below)
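
Since "up" here just means the daemon answers, a banner check is probably enough; the stock `check_ssh` plugin does essentially this, and the host below is a placeholder:

    #!/bin/sh
    # Hypothetical sketch: expect an SSH protocol banner on port 22.
    banner=$(echo | nc -w 10 sshtest.tails.example 22 | head -n 1)
    case "$banner" in
        SSH-2.0-*) echo "OK: $banner"; exit 0 ;;
        *)         echo "CRITICAL: no SSH banner received"; exit 2 ;;
    esac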

## Website

* CRITICAL: <https://tails.boum.org/> must be up and working

## WhisperBack relay

* HIGH: the SMTP server is up (see the sketch after this list)
* MEDIUM: email is actually relayed (this would be truly good to have, but is hard to implement, so the cost/benefit ratio is likely to be pretty bad)
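
For the "SMTP server is up" part, checking for the 220 greeting is the classic approach; the stock `check_smtp` plugin does this (and more), and the host name below is a placeholder:

    #!/bin/sh
    # Hypothetical sketch: expect a 220 greeting on the SMTP port.
    banner=$(echo QUIT | nc -w 10 whisperback.tails.example 25 | head -n 1)
    case "$banner" in
        220*) echo "OK: $banner"; exit 0 ;;
        *)    echo "CRITICAL: no SMTP greeting received"; exit 2 ;;
    esac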

## XMPP server

* MEDIUM: responds on the TCP port it listens on