[[!tag archived]]

Corresponding parent ticket: [[!tails_ticket 5734]]

[[!toc levels=3]]

# Introduction

## Why?

* Some pieces of our infrastructure are critical, e.g. to:
  - the development process (if the ISO build fails, developers cannot work)
  - the release process -- which may block us from putting out emergency security fixes
  - users (if the APT repository is down, the "additional software packages" persistence feature is broken)

* We want to avoid contributors getting used to ignoring alerts sent by our CI system. The more false positives there are, the more they will "learn" to do so. Here we want to diminish the rate of false positives caused by malfunctioning infrastructure.

* We want to shorten the dev/feedback loop for sysadmins when they deploy changes, and also when changes are applied automatically (e.g. Puppet agent runs, or automatic APT upgrades).

* We want to be notified when a service we run doesn't come back up properly post-reboot, without having to manually test every service.

* We want to minimize the rate of non-sysadmins discovering and reporting problems _first_, that is before we learn about them. This is highly subjective, but replying "we're aware of this problem and are working on it" inspires much more confidence than "really, it's broken?"

## Nomenclature

Here, we call:

* _machine_: a computer (be it bare metal or virtual) and its operating system
* _monitored machine_: a machine we monitor
* _monitoring machine_: the machine(s) that monitor the... _monitored machines_
* _monitoring system_, or _monitoring setup_: all the software components that we run so that the monitoring machine can monitor the monitored ones, and their configuration

Note that the monitoring machine may very well itself be monitored at the same time (be it by itself, or by another monitoring machine).

# Requirements

## Human interface

The monitoring system:

* MUST send email notifications to the sysadmin(s) in charge, to lower the downtime.
* MUST offer an overview of the status of our systems, via a web interface that works within Tor Browser with the security slider set to Medium-High.
* MAY additionally offer a read-only version of this overview, that we may want to make available to selected contributors, or anonymous users. Needless to say, this must be carefully balanced with the security implications of such a system (in other words, a set of exported static HTML pages is totally fine, but a huge dynamic web application is probably a non-starter).
* MUST support configuring, with per-check/per-service granularity, a threshold of N failures _in a row_ before an alert is raised. Still, it SHOULD support triggering alerts depending on the frequency of such failures, even when they never fail twice in a row (we don't want to miss the fact that `$service` is down for 5 minutes every day). Implementation details may vary, but you get the idea; a sketch of the intended semantics follows this list.
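
To make the intended semantics concrete, here is a minimal, hypothetical sketch of a wrapper that swallows a check's failures until N of them happen in a row; every name and path in it is made up for illustration. Nagios-family systems implement this natively (soft/hard states, `max_check_attempts`), so in practice this is something we would configure rather than maintain:

    #!/bin/sh
    # Hypothetical sketch: raise an alert (exit 2, "CRITICAL" in the
    # usual plugin convention) only once the wrapped check has failed
    # N times in a row. Usage: alert_after_n_failures N check [args...]
    set -u
    threshold="$1"; shift
    state_dir="/var/lib/monitoring-state"
    state_file="$state_dir/$(echo "$*" | tr -c 'A-Za-z0-9' '_')"
    mkdir -p "$state_dir"

    if "$@"; then
        rm -f "$state_file"   # success: reset the failure counter
        exit 0
    fi

    failures=$(( $(cat "$state_file" 2>/dev/null || echo 0) + 1 ))
    echo "$failures" > "$state_file"
    if [ "$failures" -ge "$threshold" ]; then
        exit 2   # N failures in a row: raise the alert
    fi
    exit 0       # soft failure: not alerting yet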

## Threat model

### Compromised monitored machine

* We do not try to prevent it from reporting wrong information (this includes missing information) about itself.
* It MUST NOT result in a compromise of the monitoring machine.
* It MUST NOT be able to DoS the sysadmin(s) in charge, e.g. by flooding them with alerts.
* It MUST NOT result in a compromise of the network traffic between other monitored machines and the monitoring machine (e.g. if that traffic is encrypted, the monitored machines MUST NOT use the same private key).
* It SHOULD NOT be able to alter the information about other monitored machines.

### Compromised monitoring machine

* We accept that it can DoS the sysadmin(s) in charge, e.g. by flooding them with alerts.
* We accept that it can report wrong information about the monitored machines.
* It MUST NOT be able to run arbitrary code as root on any of the monitored machines.
* It SHOULD NOT be able to run arbitrary code as a non-privileged user on any of the monitored machines.

### Network attacker

Here, we consider an attacker that may be active or passive, and can sit at any point they choose on the Internet.

We accept the risk that a network attacker:

* can enumerate the machines and services we monitor;
* can view the reports, test results, and any such information about monitored services that the monitoring system needs to learn; this of course implies that we should be careful about what kind of information flows this way: it MUST NOT be a big deal if it leaks into the hands of an adversary;
* can DoS our monitoring, e.g. by blocking network connections;
* can spoof the reports, test results and the like about monitored services that a client has no credible means to authenticate.

However, a network attacker:

* SHOULD NOT be able to spoof the reports, test results and the like that monitored machines send about themselves;
* MUST NOT be able to run arbitrary code on the monitored machines;
* MUST NOT be able to run arbitrary code on the monitoring machine.

## Availability, sustainability

Here, we assume that the entire monitoring system has both software components that run on the monitored machines (which we call the "agent"), and software components that run on the monitoring machine (which we call the "server"). Below, the _agent_ implicitly includes anything needed for basic usage (plugins, checks, and so on); similarly, the _server_ implicitly includes its web interface and anything needed for basic usage (plugins, checks, etc.).

* The agent MUST be usually available in all of Debian oldstable, stable, and testing -- possibly thanks to _pre-existing_ and well-maintained official backports. All these versions of the agent MUST be compatible with the chosen version of the server.

* The server MUST be usually available either in current Debian stable (Jessie), or in current Debian testing (Stretch). We are considering running the version from Debian testing mainly because it might avoid having to go through a costly upgrade process in a couple of years, e.g. to switch to the next major, incompatible version of the software.

* Both the agent and the server MUST be actively maintained in all the versions of Debian we care about (see above). Hint: this excludes Nagios 4.

* Both the agent and the server MUST be DFSG-free.

* For all involved software, the upstream project MUST be mature and active, and it MUST have a confidence-inspiring future: we can't afford having to migrate to a totally different monitoring setup in three years, to the extent that this can be foreseen. Hint: given that Nagios 4 is not an option (see above), this in turn excludes all older versions of Nagios.

* It SHOULD be realistically possible for external contributors to have patches merged into the upstream codebase of the involved software.

* All the involved software MUST have a not-too-scary security track record.

## Configuration

Here, we have two major desires. One is the ability for humans to easily review the monitoring system's configuration, or changes proposed to it, so that contributions are made easier. The other is the ability to include monitoring aspects within the description of the services we run, in a self-contained way, so that describing them in Puppet is easier. Note that a system that satisfies the second requirement has great chances to mostly satisfy the first one as well.

The chosen monitoring system:

* SHOULD allow encoding, in the description of a service (read: in the corresponding Puppet class), how it needs to be monitored.
  - Additionally, if this optional (but warmly welcomed) requirement is satisfied, then the "shared Puppet modules" we use SHOULD already support the chosen monitoring system (hint: in practice, this means something compatible with Nagios).
  - Note: this gives us for free the ability to review the monitoring configuration for service checks, but it is unrelated to our ability to review the global configuration of the server components that run on the monitoring machine.

* SHOULD allow humans to easily review the service checks configuration. Really, that's a *strong* SHOULD. A system that doesn't make this possible will need to have very serious advantages in other areas to be attractive to us.

* SHOULD allow humans to review the global configuration of the server components that run on the monitoring machine. This assumes that said configuration is mostly static, and is unaffected by adding or modifying service checks.

## Adequacy to our resources

Being able to operate the monitoring system for 20-50 monitored systems MUST NOT require Tails sysadmins to invest lots of time and become experts at hand-holding a complex software stack: the main focus of our system and automation engineers shall not become monitoring. For example, we would not like a monitoring system that is trivial to set up for monitoring 5-10 hosts, but requires adding more and more moving parts and complex optional components to scale up to 50 hosts.

## Miscellaneous

* We run Tor hidden services that we want to monitor, so the monitoring system MUST allow using a configured SOCKS proxy for specific checks (worst case, for _all_ checks, though that would prevent us from checking services directly, outside of Tor). Wrapping checks with `torsocks` might be an acceptable option, depending on how involved and hackish this would be; a rough sketch follows. The ability to retry, and to not notify on the first error, is interesting here.
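
As a rough illustration of the `torsocks` option, assuming the stock `check_http` plugin from Debian's monitoring-plugins package, a Tor SOCKS proxy listening locally, and a placeholder onion address:

    #!/bin/sh
    # Hypothetical sketch: run a standard HTTP check against a hidden
    # service by routing it through the local Tor SOCKS proxy.
    # The onion address below is a placeholder, not a real one.
    exec torsocks /usr/lib/nagios/plugins/check_http \
        --hostname exampleexample22.onion \
        --url / \
        --timeout 60

Since Tor circuits fail transiently, this is exactly the kind of check where the retry-before-alerting behaviour discussed above matters.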

## Hosting of the monitoring machine

* The monitoring machine MUST be a virtual machine.
* We MUST be able to administer the OS of the monitoring machine ourselves: we need to be root, we need to have a Puppet agent that talks to our own puppetmaster, and we want to do the initial OS installation.
* The monitoring machine MUST be hosted on infrastructure managed by people the Tails sysadmins trust quite a bit.
* The people who manage the underlying hardware and infrastructure MUST be reactive and easy to get in touch with.
* We MUST be given out-of-band access to the monitoring machine.
* The monitoring machine MUST have unfiltered access to the Internet, and SHOULD be assigned at least one public IPv4 address.
* Hosting MUST be affordable (say, max. 20€/month).
* The monitoring machine SHOULD allow at least some flexibility regarding future "hardware" upgrades (e.g. allocating more disk space, memory, or CPU cores).
* TODO: exact hardware specifications, depending on the chosen monitoring system. Let's keep in mind that collecting exported Puppet resources is expensive.

<a id="services"></a>

# Service and system checks

Below, CRITICAL, HIGH, MEDIUM and LOW are priority levels with respect to the implementation of such checks.

For a description of the individual services, see [[contribute/working_together/roles/sysadmins]].

## All systems

* HIGH: up and running!
* HIGH: disk space usage (bytes and inodes)
* HIGH: memory usage
* MEDIUM: Puppet agent last run
* MEDIUM: APT indices (i.e. `apt-get update` was successfully run recently; see the sketch after this list)
* MEDIUM: `systemctl is-system-running` (see [[!tails_ticket 8262]])
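
A minimal sketch of what the APT indices check could look like, under the assumption that the age of the files in `/var/lib/apt/lists` is an acceptable proxy for "`apt-get update` ran recently"; the 48-hour threshold is arbitrary:

    #!/bin/sh
    # Hypothetical sketch: CRITICAL (exit 2) if no APT package list
    # has been refreshed within the last 48 hours.
    max_age_hours=48
    recent=$(find /var/lib/apt/lists -maxdepth 1 -name '*Packages*' \
                 -mmin "-$((max_age_hours * 60))" | head -n 1)
    if [ -n "$recent" ]; then
        echo "OK: APT indices refreshed within the last $max_age_hours hours"
        exit 0
    fi
    echo "CRITICAL: APT indices are older than $max_age_hours hours"
    exit 2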

## APT repository

* CRITICAL: `stable` APT suite over HTTP (a possible check is sketched below)
* CRITICAL: freezable APT repository, once it exists
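
For instance, the `stable` suite check could fetch the suite's `Release` file and sanity-check it; the URL and the exact `Suite:` field value are assumptions about the repository layout:

    #!/bin/sh
    # Hypothetical sketch: fetch the Release file of the `stable` APT
    # suite over HTTP and make sure it names the expected suite.
    url="http://deb.tails.boum.org/dists/stable/Release"
    if curl --fail --silent --max-time 30 "$url" \
            | grep -q '^Suite: stable$'; then
        echo "OK: $url looks sane"
        exit 0
    fi
    echo "CRITICAL: cannot fetch a sane Release file from $url"
    exit 2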

## Bitcoind

* MEDIUM: compare `getblockcount` with what the Internet says it should be (probably requires exporting the output of `bitcoin-cli getblockcount` to a place that's readable by the monitoring agent; see the sketch below)
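
A sketch of such a comparison, assuming the block count has been exported to a file readable by the agent, and using one public block explorer API as "what the Internet says"; the file path, the explorer, and the 6-block tolerance are all assumptions:

    #!/bin/sh
    # Hypothetical sketch: compare our exported block count against a
    # public block explorer. Exit 3 ("UNKNOWN") when data is missing.
    local_count=$(cat /var/lib/monitoring/bitcoin-blockcount) || exit 3
    remote_count=$(curl --fail --silent --max-time 30 \
                       https://blockchain.info/q/getblockcount) || exit 3
    behind=$(( remote_count - local_count ))
    if [ "$behind" -gt 6 ]; then
        echo "CRITICAL: local bitcoind is $behind blocks behind"
        exit 2
    fi
    echo "OK: local block count $local_count (network: $remote_count)"
    exit 0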

## BitTorrent

* LOW: last Tails release is seeded

## Gitolite

* MEDIUM: `git pull` or `git clone` a test repository over all supported protocols (currently: `git://` and SSH); a sketch follows
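
A cheap way to exercise both protocols without a full clone, assuming a dedicated test repository exists (the repository and host names here are hypothetical):

    #!/bin/sh
    # Hypothetical sketch: `git ls-remote` exercises the server (and,
    # for SSH, authentication) while being much cheaper than a clone.
    repo="test-monitoring"   # hypothetical check repository
    git ls-remote "git://git.tails.boum.org/$repo" HEAD >/dev/null || exit 2
    git ls-remote "gitolite@git.tails.boum.org:$repo" HEAD >/dev/null || exit 2
    echo "OK: $repo reachable over git:// and SSH"
    exit 0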

## git-annex

* HIGH: our Tor Browser archive must be reachable over HTTP, and contain directories with tarballs

## Jenkins

* CRITICAL: the HTTP server must be up, and unauthenticated connections must be rejected (this may require installing its TLS certificate, skipping certificate validation, or something similar; see the sketch below)
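
A sketch of that "up, but closed" check; the URL is a placeholder, and `--insecure` stands in for the certificate question raised above:

    #!/bin/sh
    # Hypothetical sketch: the web interface must answer, and an
    # unauthenticated request must be rejected with 401 or 403.
    status=$(curl --insecure --silent --output /dev/null \
                  --write-out '%{http_code}' --max-time 30 \
                  https://jenkins.tails.example/)
    case "$status" in
        401|403) echo "OK: Jenkins is up and requires authentication"; exit 0 ;;
        000)     echo "CRITICAL: Jenkins is unreachable"; exit 2 ;;
        *)       echo "CRITICAL: unexpected HTTP status $status"; exit 2 ;;
    esac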

## Nightly builds

* CRITICAL: <http://nightly.tails.boum.org/> must have directories for the `stable` and `devel` branches, containing ISO images

## rsync

* CRITICAL: check, over `rsync://`, that the expected directories are there (sketched below)
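
A sketch: listing a module over `rsync://` is enough to prove the service works; the host, module, and directory names here are assumptions:

    #!/bin/sh
    # Hypothetical sketch: list the rsync module and check that an
    # expected directory shows up in the listing.
    if rsync --timeout=30 rsync://rsync.tails.boum.org/tails/ \
            | grep -q ' stable$'; then
        echo "OK: rsync module lists the expected directory"
        exit 0
    fi
    echo "CRITICAL: expected directory missing from rsync listing"
    exit 2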

## Test suite infrastructure

* HIGH: the (fake or limited) SSH and SFTP access used by core contributors and robots when running the test suite must be up (see the sketch below)
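
Since "up" here just means the daemon answers, a banner check is probably enough; the stock `check_ssh` plugin does essentially this, and the host below is a placeholder:

    #!/bin/sh
    # Hypothetical sketch: expect an SSH protocol banner on port 22.
    banner=$(echo | nc -w 10 sshtest.tails.example 22 | head -n 1)
    case "$banner" in
        SSH-2.0-*) echo "OK: $banner"; exit 0 ;;
        *)         echo "CRITICAL: no SSH banner received"; exit 2 ;;
    esac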

## Website

* CRITICAL: <https://tails.boum.org/> must be up and working

## WhisperBack relay

* HIGH: the SMTP server is up (see the sketch after this list)
* MEDIUM: email is actually relayed (this would be truly good to have, but is hard to implement, so the cost/benefit ratio is likely to be pretty bad)
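
For the "SMTP server is up" part, checking for the 220 greeting is the classic approach; the stock `check_smtp` plugin does this (and more), and the host name below is a placeholder:

    #!/bin/sh
    # Hypothetical sketch: expect a 220 greeting on the SMTP port.
    banner=$(echo QUIT | nc -w 10 whisperback.tails.example 25 | head -n 1)
    case "$banner" in
        220*) echo "OK: $banner"; exit 0 ;;
        *)    echo "CRITICAL: no SMTP greeting received"; exit 2 ;;
    esac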

## XMPP server

* MEDIUM: responds on the TCP port it listens on