This is about [[!tails_ticket 16960]] and related tickets.

[[!toc levels=3]]

# Rationale

In 2016 we gave our main server some more RAM, as a temporary solution
to cope with our workload, and as a way to learn about how to scale it.
See [[blueprint/hardware_for_automated_tests_take2]] for our reasoning,
lots of benchmark results, and conclusions.

It's working relatively well so far, but we may need to upgrade again
soonish, and improvements are always welcome in the contributor UX
area:

* Our release process suffers from a number of delays that are in the
  critical path and could be shortened if our Jenkins workers were
  faster: running the test suite after importing Tor Browser, building
  the final images to check for reproducibility, building IUKs.

* When Jenkins has built an ISO from one of our main branches or from
  a branch that _Needs Validation_, since October 2017 we rebuild it
  in a slightly different build environment to ensure it can be
  [[rebuilt reproducibly|blueprint/reproducible_builds]].
  This has substantially increased the number of ISO images we build,
  which sometimes creates congestion in our CI pipeline (see below
  for details).

* On our current setup, large numbers of automated test cases are
  brittle and had to be disabled on Jenkins (`@fragile`), which has
  quite a few problematic consequences: it decreases the value our CI
  system brings to our development process, it is demotivating for test
  suite developers, it decreases the confidence developers have in our
  test suite, and it forces developers to run the full test suite
  elsewhere when they really want to validate a branch.
  Interestingly, we don't see this much brittleness anywhere else, not
  even on a replica of our Jenkins setup that also uses nested
  virtualization.

* As we add more automated tests, and re-enable tests previously
  flagged as fragile, a full test run takes longer and longer.
  We're now up to 300-350 minutes per run (depending on how many
  concurrent jobs are running) without fragile tests.
  We can't make it faster anymore by adding RAM or by adding CPUs to
  the ISO testers, but faster CPU cores would. With a fast
  (Intel E-2134) CPU and a poor Internet connection, the same test
  suite (without fragile tests) only takes about half as much time.

* Building our website takes a long while (11-15 minutes on our ISO
  builders on lizard, i.e. 20% of the entire ISO build time, which is
  56-70 minutes on lizard depending on how many concurrent jobs are
  running), which makes ISO builds take longer than they could. This
  will get worse as new languages are added to our website. This is
  a single-threaded task, so adding more CPU cores or RAM would not
  help: only faster CPU cores would fix that. For example, with a fast
  (Intel E-2134) CPU and a poor Internet connection, the ISO build only
  takes:
  - 25 minutes (including 5 minutes for building the website) on bare metal
  - 25 minutes for 1 concurrent build in a VM, i.e. using nested
    virtualization
  - 33 minutes (including ~5 minutes for building the website)
    for 2 concurrent builds in VMs, in the worst case situation (builds
    started exactly at the same time ⇒ they both need all their vCPUs
    at the same time)

* Waiting time in queue for ISO build and test jobs is acceptable
  most of the time, but too high during peak load periods:

  - between 2017-06-17 and 2017-12-17:
    - 4% of the test jobs had to wait for more than 1 hour.
    - 1% of the test jobs had to wait for more than 2 hours.
    - 2% of the ISO build jobs had to wait more than 1 hour.

  - between 2018-05-01 and 2018-11-30:
    - We've run 3342 ISO test jobs; median duration: 195 minutes.
    - 7% of the test jobs had to wait for more than 1 hour.
    - 3% of the test jobs had to wait for more than 2 hours.
    - We've run 3355 successful ISO build jobs; median duration: 60 minutes.
    - 7.2% of the ISO build jobs had to wait more than 15 minutes.
    - 2% of the ISO build jobs had to wait more than 1 hour.
    - We've run 3355 `reproducibly_build_*` jobs; median duration: 70 minutes.
    - 10% of the `reproducibly_build_*` jobs had to wait more than 15 minutes.
    - 3.2% of the `reproducibly_build_*` jobs had to wait more than 1 hour.

  Not _that_ many jobs have to wait a long time, but this congestion
  happens precisely when we need results from our CI infra ASAP, be it
  because there's intense ongoing development or because we're reviewing
  and merging lots of branches close to a code freeze, so these delays
  hurt our development and release process.

* Our current main server was purchased at the end of 2014. The
  hardware will hopefully last quite a few more years, but since it's
  now 5 years old, we've budgeted its replacement. If our CI ran on
  another machine, replacing this main server would cost much less and
  the hardware would be chosen very differently.

* The Tails community keeps needing new services; some of them need
  to be hosted on hardware we control for security/privacy reasons
  (which is not the case for our CI system):
  - Added already:
    - self-host our website
    - Schleuder
    - Weblate is a serious CPU, memory, and I/O consumer; besides, part
      of its job is to rebuild the website, which is affected by the
      single-threaded task performance limitations described above.
    - survey platform
  - Under consideration:
    - [[!tails_ticket 14601 desc="Matomo"]] will require huge amounts
      of resources and put quite some load on the system where we run
      it; it needs to be hosted on hardware we control.
    - [[!tails_ticket 9960 desc="Request tracker for help desk"]]:
      unknown resource requirements; needs to be hosted on hardware
      we control.
  - WIP and will need more resources once they reach production
    status or are used more often:

* We're allowing more and more people to use our CI infrastructure.
  This may eventually increase its resource needs. From a development
  perspective, ideally every merge request would go through our CI,
  which probably requires a stronger separation between the CI workload
  and the more security-sensitive tasks that we run.

# Options

<a id="new-bare-metal-server"></a>

## Bare metal server dedicated to CI

It's hard to tell whether this would fix our test suite fragility
problems, and hard to specify what hardware we need. If we get it
wrong, we'll likely have to wait another 5 years before we try again
⇒ we need to rent essentially the exact hardware we're looking at, so
we can benchmark it before buying.

Pros:

* No initial development nor new skills to learn: we can run our test
  suite in exactly the same way as we currently do.
* Can provide hardware redundancy in case lizard suddenly dies.
* We control the hardware and have a good relationship with
  a friendly colocation provider.

Cons:

* High initial money investment… unless we get it sponsored or get
  a big discount from a vendor.
* On-going cost for hosting a second server.

Extra options:

* If we want to drop nested virtualization to get more performance,
  then we have non-negligible development costs and hard sysadmin
  problems to solve ([[!tails_ticket 9486]]):
  - We currently _reboot_ isotesters between test suite runs ⇒ if we
    go this way, we need to learn how to clean up after various kinds
    of test suite failure.
  - Our test suite currently assumes only one instance is running on
    a given system ⇒ if we go this way, we have to remove
    this limitation.

Specs:

* For simplicity's sake, the following assumes we have Jenkins
  workers that are each able to run all the kinds of jobs we have,
  which would be a good idea anyway.
* CPU: assuming 2 cores (4 hyperthreads) per Jenkins worker, for
  8 workers, we need 16 cores at ≥ 3.5 GHz base frequency, which is
  roughly equivalent to 4 × the Intel NUC mentioned below.
  Our options are:
  - 4 × [quad-core CPU](https://ark.intel.com/content/www/us/en/ark/search/featurefilter.html?productType=873&2_MarketSegment=Server&3_CoreCount-Min=4&3_CoreCount-Max=4&3_ClockSpeed-Min=3500&2_VTX=True&0_VTD=True&2_StatusCodeText=4&0_StatusCodeText1=3,4)
    (On 2020-02-04, 1 result with a not totally crazy price:
    [Xeon Gold 5222](https://ark.intel.com/content/www/us/en/ark/products/192445/intel-xeon-gold-5222-processor-16-5m-cache-3-80-ghz.html) → 4 × 105 W = 420 W)
  - 2 × [octo-core CPUs](https://ark.intel.com/content/www/us/en/ark/search/featurefilter.html?productType=873&2_MarketSegment=Server&3_CoreCount-Min=8&3_CoreCount-Max=8&3_ClockSpeed-Min=3500&2_VTX=True&0_VTD=True&2_StatusCodeText=4&0_StatusCodeText1=3,4)
    (On 2020-02-04, 1 result:
    [Xeon Gold 6244](https://ark.intel.com/content/www/us/en/ark/products/192442/intel-xeon-gold-6244-processor-24-75m-cache-3-60-ghz.html) → 2 × 150 W = 300 W)
  - 1 × [16-core CPU](https://ark.intel.com/content/www/us/en/ark/search/featurefilter.html?productType=873&2_MarketSegment=Server&3_CoreCount-Min=16&3_ClockSpeed-Min=3500&2_VTX=True&0_VTD=True&2_StatusCodeText=4&0_StatusCodeText1=3,4)
    (On 2020-02-04, no result.)
  - Higher-density systems, with 2+ servers in a chassis (e.g.
    Supermicro Twin solutions), might allow using cheaper CPUs that
    don't support multi-processor setups.
* RAM:
  + 232 GB = 29 GB × 8 Jenkins workers
  + Jenkins VM + host system + a few accessory VMs
  = round to 256 GB; 192 GB should work with super fast storage
    (at least 4 × NVMe × 2 for RAID-1) if that's cheaper
* storage:
  + 480 GB = 60 GB × 8 Jenkins workers
  + 600 GB for the Jenkins artifacts store
  + 70 GB for the APT cacher (make it cache ISO history too)
  + Jenkins VM + host system + a few accessory VMs
  = round to 1.5 TB × 2 (RAID-1)

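For the record, a minimal Python sketch of the sizing arithmetic above,
using only the per-worker figures and rounding targets already listed:

    # RAM and storage sizing for 8 Jenkins workers, per the figures above.
    WORKERS = 8
    ram_gb = 29 * WORKERS                  # 232 GB for the workers themselves
    storage_gb = 60 * WORKERS + 600 + 70   # workers + artifacts + APT cacher = 1150 GB

    # The Jenkins VM, the host system and a few accessory VMs come on top,
    # hence the proposed rounding:
    print(f"RAM: {ram_gb} GB + overhead → 256 GB (192 GB may do with very fast storage)")
    print(f"storage: {storage_gb} GB + overhead → 1.5 TB × 2 (RAID-1)")
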
Cost estimate: $14-16k, depending on the chosen memory and storage setup
(based on <https://www.thinkmate.com/system/rax-xt24-42s1>
or <https://www.thinkmate.com/system/superserver-2029bt-dnc0r>).

<a id="gamer-option"></a>

## Bare metal gaming setup in a server housing dedicated to CI

CPUs targeted at the gaming market are likely to provide the processing
power we need, at a much lower price than regular server hardware. In
particular, the new generation of AMD Threadripper CPUs seems to be
a good match. This would involve placing a gaming motherboard in
a server casing.

Pros:

* No initial development nor new skills to learn: we can run our test
  suite in exactly the same way as we currently do.
* Can provide hardware redundancy in case lizard suddenly dies.
* We control the hardware and have a good relationship with
  a friendly colocation provider.
* Relatively cheap.

Cons:

* On-going cost for hosting a second server.
* No IPMI.
* Will require some research into cooling.

Extra options:

* We could get a KVM over IP device to compensate for the lack of IPMI.
* Or we could wait for a server board with a TRX40 socket and IPMI
  support to enter the market.

Specs:

* The Threadripper 3960X has 24 cores at a 3.8 GHz base frequency. This
  easily meets our requirement of 16 cores at ≥ 3.5 GHz base frequency.
* A board like the Gigabyte TRX40 Aorus Xtreme can take 256 GB of DDR4
  ECC memory and two PCIe 4.0 x8 1.6 TB drives for storage.

Cost estimate: roughly €1k for the motherboard, €1.5k for the CPU, €2k
for the memory, €1.5k for the storage, and €0.5k for housing and extras,
i.e. ≈ €6.5k for the fanciest version with super fast storage. Cutting
down on the storage and motherboard could bring this down to roughly €5k.

<a id="hacker-option"></a>

## Custom-built cluster of consumer-grade hardware dedicated to CI, a.k.a. the hacker option

For example, we could stuff 4 × Intel NUC or similar together in
a custom case, with whatever cooling, PoE and network boot system this
high-density cluster would need. Each of these nodes should be able to
run 2 Jenkins workers.

### Pros and cons

Pros:

* Potentially scalable: if there's room left we can add more nodes in
  the future.
* As fast as, or even faster than, server-grade hardware.
* We already have one such node to play with, and benchmark results for it.

Cons:

* Lots of initial research and development: casing, cooling, hosting,
  power over Ethernet, network boot, remote administration.
* High initial money investment.
* Hosting this is a hard sell for colocation providers. For example,
  the folks who'll do hardware debugging and parts replacement onsite
  may frown upon the extremely non-standard setup.
* On-going cost for hosting this cluster.
* Needs some development work to fully benefit from the performance
  improvements.

### Availability

* Intel: in the ninth generation (NUC9i7, availability: 2020Q1), the
  NUC9V7QNX is the one that supports vPro.
* Other vendors have started selling UCFF boards/kits with fast CPUs.
* Similar smallish [[!wikipedia Computer_form_factor desc="form
  factors"]] would be worth investigating, e.g. there are plenty of
  [[!wikipedia Mini-ITX]] options on the market that could give us
  the high density we need.
  - [Supermicro X11SCL-IF](https://www.supermicro.com/products/motherboard/X11/X11SCL-iF.cfm), $200
    + <https://www.intel.com/content/www/us/en/products/processors/xeon/e-processors.html?Intel+Technology=900393> 300€
    + 2 × 16 GB RAM Unbuffered ECC UDIMM, DDR4-2666MHz (e.g. CT16G4WFD8266) $400
    + CPU heatsink SNK-P0049A4 $50
    + mini-ITX case with power supply (e.g. Antec ISK310-150) 95€
    + SSD M.2 PCIe 3.0 x4 250 GB 90€
  - ASRock mini-ITX motherboards with IPMI that support suitable CPUs:
    E3C226D2I, E3C236D4I-44E85, EPC612D4I
  - The Supermicro X10SDV series is affordable but only supports CPUs
    that are too slow for us.
  - Nicer Supermicro options, e.g. X11SSV-M4F, are too expensive
    (1000€ for the motherboard only).
* The Supermicro SuperServer E300-9D is very tempting (IPMI) but more
  expensive than a NUC.

Cost estimate: 7k€ for hardware (including shipping to the USA and
import taxes) and dozens of hours of custom work.

### Remote management

* AMT (vPro) can be a pain as it shares the Ethernet interface
  with the OS. IPMI would be a big plus as it integrates well
  into the colo's existing setup.

<a id="hacker-option-benchmark"></a>

### Benchmarking results

Summary:

- Twice as fast as lizard with 1 Jenkins executor on the node, able to
  run one build or test job at a time, without nested virtualization.

- With 2 Jenkins executors on the node, each in its own VM that's able
  to run one build or test job at a time, with nested virtualization,
  and qcow2 disks (probably slower than the LVs used on lizard):
  - light load, i.e. only one concurrent build/test: each build + test
    takes 43% less time than on lizard
  - heavy load, i.e. two concurrent builds/tests: each build + test
    takes 44% less time than on lizard

For detailed numbers, see above on this page: look for "E-2134".

So if we had, say, 4 such boxes in a case, each with 2 Jenkins workers,
in 24h they would build, reproduce and test ~55 branches, while during
the same period the 9 VMs on lizard build, reproduce and test ~40
branches. Even adding only 2 such boxes would increase the maximum
throughput of our CI by about 69% and immensely lower latency during
heavy load times.

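A minimal Python sketch of the arithmetic behind that "69%" figure,
assuming throughput scales linearly with the number of boxes:

    # Back-of-the-envelope throughput check, using the figures above.
    lizard_branches_per_day = 40       # 9 VMs on lizard: build + reproduce + test
    four_boxes_branches_per_day = 55   # 4 boxes with 2 Jenkins workers each

    per_box = four_boxes_branches_per_day / 4                     # ≈ 13.75 branches/day
    with_two_extra_boxes = lizard_branches_per_day + 2 * per_box  # ≈ 67.5 branches/day
    gain = with_two_extra_boxes / lizard_branches_per_day - 1     # ≈ 0.69, i.e. +69%
    print(f"{per_box=:.2f} {with_two_extra_boxes=:.1f} gain={gain:.0%}")
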
### Required development work

The above benchmarks assume perfect load distribution, which requires
Jenkins slaves that can run both build and test jobs. This works fine
in limited local testing, but it is not what we have set up on lizard
at the moment: there's a risk that a failed test job leaves the system
in a bad shape and breaks the following build job, as we don't reboot
after build jobs. We might need to either make test jobs more robust on
failure ([[!tails_ticket 17216]]), or start rebooting the Jenkins
slaves after build jobs as well.

To utilize such nodes more fully, we may need to raise the number of
vCPUs allocated to TailsToaster. For example, assuming 8 hyperthreads
per box, and 2 VMs per box, with our current settings (2 vCPUs
allocated to TailsToaster, and Cucumber eats another vCPU):

- when running 1 single test suite job,
  up to 3 hyperthreads out of 8 are used;

- when running 2 concurrent test suite jobs,
  up to 2 × 3 = 6 hyperthreads out of 8 are used.

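A minimal Python sketch of that accounting; the number of TailsToaster
vCPUs is the tunable being discussed, and the other values are the
assumptions stated above:

    # Rough hyperthread accounting for one box, as described above.
    HYPERTHREADS_PER_BOX = 8
    CUCUMBER_VCPUS = 1  # Cucumber eats roughly one extra vCPU per job

    def used_hyperthreads(concurrent_jobs, tailstoaster_vcpus=2):
        """Upper bound on busy hyperthreads for a number of test suite jobs."""
        return concurrent_jobs * (tailstoaster_vcpus + CUCUMBER_VCPUS)

    for jobs in (1, 2):
        print(jobs, "job(s):", used_hyperthreads(jobs), "of", HYPERTHREADS_PER_BOX)
    # With 3 vCPUs per TailsToaster, 2 concurrent jobs would want 8 of 8 hyperthreads.
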
Raising this number may make the test suite run faster, at least when
there's only 1 test suite job running. It might even make it faster, in
most scenarios, when 2 concurrent test suite jobs are running: they
generally won't be using that much CPU power at the same time. OTOH it
could introduce some test fragility.

We should first measure if it's worth the effort.

## Run builds and/or tests in the cloud

<div class="caution">
Some of the following numbers are outdated in the sense that they don't
take into account that we'll soon build and test USB images as well,
which mostly impacts the amount of data that needs to be 1. transferred
between the Jenkins master and workers; 2. stored on the Jenkins master.
</div>

EC2 does not yet support running KVM inside its VMs. Both Azure and
Google Cloud support it. OpenStack supports it too, as long as the
cloud is run on KVM (e.g.
[OVH Public Cloud](https://www.openstack.org/marketplace/public-clouds/ovh-group/ovh-public-cloud/),
[OSU Open Source Lab](https://osuosl.org/services/hosting/details/)).
ProfitBricks would likely work too, as their cloud is based on KVM.

There's a Jenkins plugin for every major cloud provider that allows
starting instances on demand when needed, up to pre-defined limits,
and shutting them down after a configurable idle time.

After building an ISO, we copy artifacts from the ISO builder to the
Jenkins master (and thus to nightly.t.b.o), and then from the Jenkins
master to another ISO builder (if the branch _Needs Validation_ or is
one of our main branches) and to one ISO tester (in any case), which
run the downstream jobs. These copies are blocking operations in our
feedback loop. So:

- If the network connection between pieces of our CI system were
  too slow, the performance benefits of building and testing
  faster may vanish.

  Assuming a 1.2 GB ISO, 3.5 minutes should be enough for a copy
  (based on benchmarking a download of a Debian ISO image from lizard)
  ⇒ 2 or 3 × 3.5 = 7 or 10.5 minutes per ISO build; compared to 1.5
  minutes to the Jenkins master + 20 seconds to the 2nd ISO builder +
  15 seconds to the ISO tester ≈ 2 minutes on lizard currently.
  In the worst case (Jenkins master on our infrastructure, Jenkins
  workers in the cloud) it adds 5 or 8.5 minutes to the feedback loop,
  which is certainly not negligible but is not a deal breaker either.
  (See the sketch after this list.)

- If data transfers between pieces of our CI system cost money, we
  would need to estimate how much these copies would cost. On OVH
  Public Cloud, data transfers to/from the Internet are included in the
  price of the instance, so let's ignore this.

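A minimal Python sketch of the copy-time estimate above (the per-copy
duration and the lizard timings are the figures just quoted):

    # Extra artifact-copy time in the feedback loop, cloud vs. lizard.
    CLOUD_COPY_MIN = 3.5                        # one 1.2 GB ISO copy
    LIZARD_TOTAL_MIN = 1.5 + 20 / 60 + 15 / 60  # master + 2nd builder + tester ≈ 2 min

    for copies in (2, 3):  # 2 copies without a 2nd build, 3 copies with one
        cloud_total = copies * CLOUD_COPY_MIN
        print(f"{copies} copies: {cloud_total:.1f} min in the cloud, "
              f"+{cloud_total - LIZARD_TOTAL_MIN:.1f} min vs. lizard")
    # → 7.0 min (+4.9) and 10.5 min (+8.4), i.e. the "5 or 8.5 minutes" above.
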
One way to avoid this problem entirely is to run our Jenkins master
and nightly.t.b.o in the cloud as well.

Pros:

* Scalable as much as we can (afford), both to react to varying
  workloads in the short term (some days we build and test tons of
  ISO images, some days a lot fewer), and to adjust to changing needs
  in the long term.
* No initial money investment.
* No hardware failures we have to deal with.
* We can try various instance types until we find the right one, as
  opposed to bare metal, which requires careful planning and
  somewhat-informed guesses (mistakes in this area can only be fixed
  years later: for example, choosing low-voltage CPUs, which are
  suboptimal for our workload).
* Frees lots of resources on our current virtualization host, which
  can be reused for other purposes. And if we don't need these
  resources, then our next bare metal server can be much cheaper,
  both in terms of initial investment and on-going costs (it will
  suck less power).

Cons:

* We need to learn how to manage systems in the cloud, how to deal
  with billing, and how to control these systems from Jenkins.

* On-going cost: renting resources costs money every month.

Very rough estimate, assuming we run all ISO builds and tests on
dynamic OVH `C2-15` instances (4 vCores at 3.1 GHz, 15 GB RAM, 100 GB
SSD), assuming they perform exactly like my local Jenkins (4 i7-6770HQ
vCores at 2.60 GHz, 15 GB RAM), and assuming that no VAT applies:

- builds & tests:
  (30 minutes/build × 450 builds/month + 105 minutes/test × 350 tests/month)
  / 60 × 0.173€ = 145€/month
- second build for reproducibility ([[!tails_ticket 13436]]):
  30 minutes / 60 × 250 builds/month × 0.173€ = 22€/month
- total = 167€/month

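A minimal Python sketch of that arithmetic, using only the per-job
durations, job counts and hourly price listed above:

    # Rough monthly cost of running all CI jobs on dynamic OVH C2-15 instances.
    PRICE_PER_HOUR_EUR = 0.173

    builds_and_tests = (30 * 450 + 105 * 350) / 60 * PRICE_PER_HOUR_EUR  # ≈ 145 €/month
    second_builds = 30 / 60 * 250 * PRICE_PER_HOUR_EUR                   # ≈ 22 €/month
    total = builds_and_tests + second_builds                             # ≈ 167 €/month
    print(f"{builds_and_tests:.0f} + {second_builds:.0f} ≈ {total:.0f} €/month")
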
Now, to be more accurate:

- Likely these instances will be faster than my local Jenkins,
  thanks to a higher CPU clock rate, which should lower the actual
  costs; but only actual testing will give us more precise numbers.

- Running a well-chosen number of static instances would probably
  lower these costs, thanks to the discount when paying per month.
  Also, booting a dynamic instance and configuring it takes some
  additional time, which costs money and decreases performance.

  We need to evaluate how many static instances (kept running at
  all times and paid per-month) we run and how many dynamic
  instances (spawned on demand and paid per-hour) we allow. E.g.
  on OVH Public Cloud, a dynamic C2-15 instance costs more than
  a static one once it runs more than 50% of the time. Thanks to
  the _Cluster Statistics_ Jenkins plugin, once we run this in the
  cloud we'll have the data we need to optimize this; it should be
  easy to script it so we can update these settings from time
  to time.

- We need to add the cost of hosting our Jenkins master and
  nightly.t.b.o in the same cloud, or the cost of transferring
  build artifacts between that cloud and lizard.

  Our Jenkins master is currently allocated 2.6 GB of RAM and
  2 vCPUs. An OVH S1-8 (13€/month) or B2-7 (22€/month) static
  instance should be enough. We estimated that 300 GB of storage
  would be enough at least until the end of 2018. Our metrics say
  that the storage volume that hosts Jenkins artifacts often makes
  good use of 1000-2000 IOPS, so an OVH "High Speed Volume"
  (0.08€/month/GB) would be better suited, even though in practice
  only a small part of these 300 GB needs to be that fast, and
  possibly a slower "Classic Volume" might perform well enough.
  So:

  * worst case: 22 + 300 × 0.08 = 46€/month
  * best case: 13 + 300 × 0.04 = 25€/month

- In theory we could keep running _some_ of our builds on our own
  infra instead of in the cloud: one option is that the cloud
  would only be used during peak load times for builds (but always
  used for tests, in order to hopefully fix our test suite
  brittleness problems). But if we do that, we don't improve ISO
  build performance in most cases, and the build artifacts copy
  problems surface, which costs performance and some development
  time to optimize things a bit:

  * If we run the Jenkins master on our own infra: only artifacts
    of ISO builds run in the cloud during peak load times need to
    be downloaded to lizard; we could force the 2nd ISO build to
    run locally so we avoid having to upload these artifacts to
    the cloud, or we could optimize the 2nd ISO build job to
    retrieve the 1st ISO only when the 2 ISOs differ and we need
    to run diffoscope on them (in which case we also need to
    download the 2nd ISO to archive it on the Jenkins master).
    But all ISO build artifacts must be uploaded to the nodes
    that run the test suite in the cloud.

  * If we run the Jenkins master in the cloud: most of the time
    we need to upload there the ISOs built on lizard; then we
    need to download them again for the 2nd build on lizard as
    well (unless we do something clever to keep them around for
    the 2nd build, or force the 2nd build to run in the cloud
    too); but at this point they're already available out there
    in the cloud for the test suite downstream job. During peak
    times, the difference is that ISOs built in the cloud are
    already there for everything else that follows (assuming we
    force the 2nd build to run in the cloud too).

* We need to trust a third party somewhat.

* To make the whole thing more flexible and easier to manage, it
  would be good to have the same nodes able to run both builds and
  tests. Not sure what it would take and what the consequences
  would be.

We could request a grant from the cloud provider to experiment with
this approach. See
[Arturo's report about how OONI took advantage of the AWS grant program](https://lists.torproject.org/pipermail/tor-project/2017-August/001391.html).

## Dismissed options

### Replace lizard

Dismissed: our CI workload has needs that are too specific and are
better served by dedicated hardware; trying to host everything on
a single box leads to crazy hardware specs that are hard to match.

Pros:

* No initial development nor new skills to learn: we can run our test
  suite in exactly the same way as we currently do.
* On-going cost increases only slightly (we probably won't get
  low-voltage CPUs this time).
* We can sell the current hardware while it's still current, and get
  some of our bucks back.

Cons:

* High initial money investment.
* Hard to tell whether this would fix our test suite fragility
  problems, and we'll only know after we've spent lots of money.
* Hard to specify what hardware we need. If we get it wrong, we'll
  likely have to wait another 5 years before we try again.

<a id="plan"></a>

# The plan

1. <strike>Check how the sysadmins team feels about the cloud option: are
   there any blockers, for example wrt. ethics, security, privacy,
   anything else?</strike> → we're no big fans of using other people's
   computers, but if that's the best option we can do it.

2. Keep gathering data about our needs while going through the next steps:
   - upcoming services [intrigeri] (last updated: 2020-02-04)

3. Describe our needs for each option:
   - 2nd bare metal server-grade machine:
     * <strike>hardware specs [intrigeri]</strike> [DONE]
     * <strike>hosting needs (high power consumption)</strike> [DONE]
   - the gamer option:
     * <strike>hardware specs [groente]</strike> [DONE]
     * <strike>hosting needs (high power consumption)</strike> [DONE]
     * KVM over IP?
   - <strike>the hacker option: hosting</strike>
     [CANCELED: we won't do the hacker option]

5. Ask potential hardware/VM donors (e.g. ProfitBricks) if they
   would happily satisfy our needs for free or with a big discount.
   If they can do that, then let's do it. Otherwise, keep reading.
   [WIP: intrigeri]

6. <strike>Benchmark how our workload would be handled by the options we
   have no data about yet:</strike>
   - <strike>Rent a bare metal server for a short time, run CI jobs in
     a realistic way, measure.</strike> [CANCELED: we have data about the
     Intel E-2134, which is good enough.]
   - Find a friend who's really into gaming and has a Threadripper setup,
     run some CI jobs in a realistic way, measure.
   - <strike>Rent cloud VMs for a short time, run CI jobs in a realistic
     way, measure.</strike> [CANCELED: we won't do cloud.]

7. Decide what we do and look into the details, e.g.:
   - bare metal options: where to host it?
   - <strike>cloud: refresh list of suitable providers, e.g. check if more
     providers offer nested KVM, if the OSU Open Source Lab offering
     is ready, and ask friendly potential cloud providers such as
     universities, HPC cluster admins, etc.</strike> [CANCELED: we won't do
     cloud]