This is about [[!tails_ticket 11680]] and related tickets.

[[!toc levels=3]]

# Rationale

In 2016 we gave our main server some more RAM, as a temporary solution
to cope with our workload, and as a way to learn about how to scale
it. See [[blueprint/hardware_for_automated_tests_take2]] for our
reasoning, lots of benchmark results, and conclusions.

It's working relatively well so far, but we may need to upgrade again
soonish, and improvements are always welcome in the contributor UX
area:

 * Since October 2017, when Jenkins has built an ISO from one of our
   main branches or from a branch that is _Ready for QA_, we rebuild
   it in a slightly different build environment to ensure it can be
   [[rebuilt reproducibly|blueprint/reproducible_builds]].

   This substantially increased the number of ISO images we build,
   which sometimes creates congestion in our CI pipeline (see below
   for details).
 * On our current setup, large numbers of automated test cases are
   brittle and had to be disabled on Jenkins (`@fragile`), which has
   quite a few problematic consequences: it decreases the value our
   CI system brings to our development process, it is demotivating for
   test suite developers, it decreases the confidence developers have
   in our test suite, and it forces developers to run the full test
   suite elsewhere when they really want to validate a branch.
   Interestingly, we don't see this much brittleness anywhere else,
   not even on a replica of our Jenkins setup that also uses nested
   virtualization.
 * As we add more automated tests, and re-enable tests previously
   flagged as fragile, a full test run takes longer and longer.

   We're now up to 200 minutes per run. We can't make it faster anymore
   by adding RAM or by adding CPUs to the ISO testers, but faster CPU
   cores would fix that: the same test suite takes only 105 minutes on
   a replica of our Jenkins setup, also using nested virtualization,
   with a poor Internet connection but a faster CPU.
 * Building our website takes a long while (12 minutes on our ISO
   builders, i.e. 20% of the entire ISO build time), which makes ISO
   builds take longer than they could. This will get worse as new
   languages are added to our website. This is a single-threaded task,
   so adding more CPU cores or RAM would not help: only faster CPU
   cores would fix that. For example, the ISO build takes only 38
   minutes (including 6-7 minutes for building the website) on
   a replica of our Jenkins setup, also using nested virtualization,
   with a poor Internet connection but faster CPU cores.
 * Waiting time in queue for ISO build and test jobs is acceptable
   most of the time, but too high during peak load periods:

   - between 2017-06-17 and 2017-12-17:
     - 4% of the test jobs had to wait for more than 1 hour.
     - 1% of the test jobs had to wait for more than 2 hours.
     - 2% of the ISO build jobs had to wait more than 1 hour.

   - between 2018-05-01 and 2018-11-30:
     - We've run 3342 ISO test jobs; median duration: 195 minutes.
     - 7% of the test jobs had to wait for more than 1 hour.
     - 3% of the test jobs had to wait for more than 2 hours.
     - We've run 3355 successful ISO build jobs; median duration: 60 minutes.
     - 7.2% of the ISO build jobs had to wait more than 15 minutes.
     - 2% of the ISO build jobs had to wait more than 1 hour.
     - We've run 3355 `reproducibly_build_*` jobs; median duration: 70 minutes.
     - 10% of the `reproducibly_build_*` jobs had to wait more than 15 minutes.
     - 3.2% of the `reproducibly_build_*` jobs had to wait more than 1 hour.

   That's not many jobs in absolute terms, but this congestion happens
   precisely when we need results from our CI infrastructure ASAP, be
   it because there's intense ongoing development or because we're
   reviewing and merging lots of branches close to a code freeze.
   These delays therefore hurt our development and release process.
   (A sketch after this list shows how such figures can be recomputed
   from Jenkins data.)

 * Our current server was purchased at the end of 2014. The hardware
   can last quite a few more years, but we should plan (at least
   budget-wise) for replacing it when it turns 5 years old, i.e. by
   the end of 2019 at the latest.
 * The Tails community keeps needing new services; some of them need
   to be hosted on hardware we control for security/privacy reasons
   (which is not the case for our CI system):
   - Added already: self-host our website
   - Will be added soon:
     - [[!tails_ticket 16121 desc="Schleuder"]]
     - [[!tails_ticket 15919 desc="Redmine"]]
   - Under consideration:
     - [[!tails_ticket 14601 desc="Matomo"]] will require huge amounts
       of resources and put quite some load on the system where we run
       it; it needs to be hosted on hardware we control.
     - [[!tails_ticket 9960 desc="Request tracker for help desk"]]:
       unknown resource requirements; needs to be hosted on hardware
       we control.
   - WIP and will need more resources once they reach production
     status or are used more often:
     - Weblate is a serious CPU and I/O consumer; besides, part of its
       job is to rebuild the website, which is affected by the
       single-threaded performance limitations described above.
     - survey platform
 * We're allowing more and more people to use our CI infrastructure.
   This may eventually increase its resource needs.
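
For reference, the waiting-time and duration figures quoted above can
be recomputed from Jenkins data. Here is a minimal sketch, assuming
a hypothetical CSV export (the file name and column names are made up
for illustration, e.g. something produced via the Jenkins API):

    #!/usr/bin/env python3
    # Minimal sketch: recompute queue-wait statistics like the ones
    # quoted above. The CSV file name and columns (wait_seconds,
    # duration_seconds) are assumptions, not an actual Jenkins format.
    import csv
    import statistics

    def summarize(path):
        waits, durations = [], []
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                waits.append(float(row["wait_seconds"]))
                durations.append(float(row["duration_seconds"]))
        total = len(waits)
        print(f"{total} jobs; median duration: "
              f"{statistics.median(durations) / 60:.0f} minutes")
        for hours in (1, 2):
            over = sum(1 for w in waits if w > hours * 3600)
            print(f"waited > {hours} hour(s): {100 * over / total:.1f}%")

    if __name__ == "__main__":
        summarize("test_jobs_2018-05-01_2018-11-30.csv")  # hypothetical export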

# Options

## Bare metal server dedicated to CI

It is hard to tell whether this would fix our test suite fragility
problems, and hard to specify what hardware we need. If we get it
wrong, we will likely have to wait another 5 years before we can try
again ⇒ we need to rent essentially the exact hardware we're
considering, so we can benchmark it before buying.

Pros:

 * No initial development nor skills to learn: we can run our test
   suite in exactly the same way as we currently do.
 * Can provide hardware redundancy in case lizard suddenly dies.
 * We control the hardware and have a good relationship with
   a friendly colocation facility.

Cons:

 * High initial money investment… unless we get it sponsored or get
   a big discount from a vendor.
 * On-going cost for hosting a second server.

Extra options:

 * If we want to drop nested virtualization to get more performance,
   then we have non-negligible development costs and hard sysadmin
   problems to solve ([[!tails_ticket 9486]]):

   - We currently _reboot_ isotesters between test suite runs ⇒ if we
     go this way we need to learn how to clean up after various kinds
     of test suite failure.
   - Our test suite currently assumes only one instance is running on
     a given system ⇒ if we go this way we have to remove
     this limitation.

Specs:

 * For simplicity's sake, the following assumes we have Jenkins
   workers that are each able to run all the kinds of jobs we have.
 * CPU: assuming 2 cores (4 hyperthreads) per Jenkins worker, for
   8 workers, we need 16 cores at 3.5 GHz base frequency, which is
   roughly equivalent to 4 × the Intel NUC mentioned below.
   Our options are:
    - 4 × [quad-core CPU](https://ark.intel.com/Search/FeatureFilter?productType=processors&CoreCountMin=4&ClockSpeedMhzMin=3500&StatusCodeId=4&MarketSegment=Server&ExtendedPageTables=true&VTD=true)
      (on 2018-12-01, one result: [Xeon Gold 5122](https://ark.intel.com/products/120475/Intel-Xeon-Gold-5122-Processor-16-5M-Cache-3-60-GHz-) → 4×105 W = 420 W)
    - 2 × [octo-core CPUs](https://ark.intel.com/Search/FeatureFilter?productType=processors&CoreCountMin=8&ClockSpeedMhzMin=3500&StatusCodeId=4&MarketSegment=Server&ExtendedPageTables=true&VTD=true)
      (on 2018-12-01, one result: [Xeon Gold 6144](https://ark.intel.com/products/124943/Intel-Xeon-Gold-6144-Processor-24-75M-Cache-3-50-GHz-) → 2×150 W = 300 W)
    - 1 × [16-core CPU](https://ark.intel.com/Search/FeatureFilter?productType=processors&CoreCountMin=16&ClockSpeedMhzMin=3500&StatusCodeId=4&MarketSegment=Server&ExtendedPageTables=true&VTD=true)
      (2018-12-01: no result)
    - Higher-density systems, with 2+ servers in a chassis e.g.
      Supermicro Twin solutions, might allow using cheaper CPUs that
      don't support multi-processor setups.
 * RAM:  
   + 208 GB = 26 GB × 8 Jenkins workers  
   + Jenkins VM + host system + a few accessory VMs  
   = round to 256 GB; 192 GB might be feasible with super fast storage
   (at least 4 × NVMe × 2 for RAID-1) if that's cheaper
 * storage:  
   + 480 GB = 60 GB × 8 Jenkins workers  
   + 600 GB for the Jenkins artifacts store  
   +  70 GB for APT cacher (make it cache ISO history too)  
   + Jenkins VM + host system + a few accessory VMs  
   = round to 1.5 TB × 2 (RAID-1); the sketch below recomputes
   these figures
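
As a sanity check, here is a minimal sketch that recomputes the sizing
figures above; the per-worker numbers (2 cores, 26 GB RAM, 60 GB disk)
and the fixed overheads are the ones listed in this blueprint, the
rest is plain arithmetic:

    # Minimal sketch recomputing the sizing figures listed above.
    WORKERS = 8

    cores = 2 * WORKERS                # 2 cores (4 hyperthreads) per worker
    ram_gb = 26 * WORKERS              # 26 GB per Jenkins worker
    disk_gb = 60 * WORKERS + 600 + 70  # workers + artifacts store + APT cacher

    print(f"CPU: {cores} cores at 3.5 GHz base frequency")
    print(f"RAM: {ram_gb} GB for the workers"
          " + Jenkins VM, host system, accessory VMs => round to 256 GB")
    print(f"Storage: {disk_gb} GB"
          " + Jenkins VM, host system, accessory VMs"
          " => round to 1.5 TB, × 2 for RAID-1")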

## Custom-built cluster of consumer-grade hardware dedicated to CI, aka. the hacker option

For example, we could stuff 4-6 × Intel NUC or similar
together in a custom case, with whatever cooling, PoE and network boot
system this high-density cluster would need. Each of these nodes
should be able to run 2 Jenkins workers.

Pros:

 * Potentially scalable: if there's room left we can add more nodes in
   the future.
 * Probably as fast as server-grade hardware.

Cons:

 * Lots of initial research and development: casing, cooling, hosting,
   power over Ethernet, network boot, remote administration.
 * High initial money investment (given the research and development
   costs, we can't really try this option: either we go for it or
   we don't).
 * Hosting this is a hard sell for colocation facilities.
 * We need to buy a node in order to measure how it would perform
   (as opposed to server-grade hardware that can be rented).

   OTOH:

   - We already have data about the Intel NUC NUC6i7KYK, so if we
     pick a similar enough CPU we can reuse that.
   - If we buy one such machine to try this out and decide not to go
     for this option, this computer can likely be put to good use
     by a Tails developer or sysadmin.
 * On-going cost for hosting this cluster.

Availability:

 * Intel: as of 2018-12-01, none of the eighth-generation NUC8i7
   models support vPro, so the fastest models with vPro remain those
   that share
   an [i7-8650U](https://ark.intel.com/products/124968/Intel-Core-i7-8650U-Processor-8M-Cache-up-to-4-20-GHz-)
   CPU, the UCFF (4"x4") form factor, and a 12-24 V DC input voltage:
   - kit: [NUC7i7DNHE](https://ark.intel.com/products/130393/Intel-NUC-Kit-NUC7i7DNHE),
     [NUC7i7DNKE](https://ark.intel.com/products/130392/Intel-NUC-Kit-NUC7i7DNKE)
   - board: [NUC7i7DNBE](https://ark.intel.com/products/130394/Intel-NUC-Board-NUC7i7DNBE)
 * Other vendors have started selling UCFF boards/kits with fast CPUs.
 * Similar smallish [[!wikipedia Computer_form_factor desc="form
   factors"]] would be worth investigating, e.g. there are plenty of
   [[!wikipedia Mini-ITX]] options on the market that could give us
   the high density we need.

## Run builds and/or tests in the cloud

<div class="caution">
Some of the following numbers are outdated in the sense that they don't
take into account that we will soon build and test USB images as well,
which mostly impacts the amount of data that needs to be 1. transferred
between the Jenkins master and workers; 2. stored on the Jenkins master.
</div>

EC2 does not support running KVM inside their VMs yet. Both Azure and
Google Cloud support it. OpenStack supports it too as long as the
cloud is run on KVM (e.g.
[OVH Public Cloud](https://www.openstack.org/marketplace/public-clouds/ovh-group/ovh-public-cloud/),
and maybe later
[OSU Open Source Lab](https://osuosl.org/services/hosting/details/)).
ProfitBricks would likely work too as their cloud is based on KVM.

There's a Jenkins plugin for every major cloud provider that can
start instances on demand when needed, up to pre-defined limits, and
shut them down after a configurable idle time.
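
To illustrate the behaviour we would rely on (this is not the API or
configuration format of any actual Jenkins cloud plugin, and the limit
and timeout values are made up), here is a minimal sketch of such
a provisioning policy: spawn workers while jobs are queued, up to
a limit, and terminate workers once they have been idle for too long:

    # Minimal sketch of an on-demand provisioning policy: spawn cloud
    # workers while jobs are queued (up to a limit), terminate them
    # after a configurable idle time. Purely illustrative.
    from dataclasses import dataclass

    MAX_WORKERS = 8      # pre-defined limit (assumption)
    IDLE_TIMEOUT = 30    # minutes a worker may stay idle (assumption)

    @dataclass
    class Worker:
        idle_minutes: int = 0

    def tick(workers, queued_jobs):
        """One scheduling step (1 minute): returns the updated worker pool."""
        # Spawn new workers while jobs are waiting and the limit allows it.
        while queued_jobs > len(workers) and len(workers) < MAX_WORKERS:
            workers.append(Worker())
        # Busy workers reset their idle counter; idle ones age and
        # eventually get terminated.
        busy = min(queued_jobs, len(workers))
        for i, worker in enumerate(workers):
            worker.idle_minutes = 0 if i < busy else worker.idle_minutes + 1
        return [w for w in workers if w.idle_minutes <= IDLE_TIMEOUT]

    # Example: a one-hour burst of 10 queued jobs, then an idle hour.
    pool = []
    for queued in [10] * 60 + [0] * 60:
        pool = tick(pool, queued)
    print(f"workers left after the idle hour: {len(pool)}")  # 0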

After building an ISO, we copy artifacts from the ISO builder to the
Jenkins master (and thus to nightly.t.b.o), and then from the Jenkins
master to another ISO builder (if the branch is _Ready for QA_ or one
of our main branches) and to one ISO tester (in any case), which run
downstream jobs. These copies are blocking operations in our feedback
loop. So:

- If the network connection between pieces of our CI system were
  too slow, the performance benefits of building and testing
  faster might vanish.

  Assuming a 1.2 GB ISO, 3.5 minutes should be enough for a copy
  (based on benchmarking a download of a Debian ISO image from lizard)
  ⇒ 2 or 3 × 3.5 = 7 or 10.5 minutes per ISO build; compared to 1.5
  minutes to the Jenkins master + 20 seconds to the 2nd ISO builder +
  15 seconds to the ISO tester ≈ 2 minutes on lizard currently.
  In the worst case (Jenkins master on our infrastructure, Jenkins
  workers in the cloud) this adds 5 or 8.5 minutes to the feedback
  loop, which is certainly not negligible but is not a deal breaker
  either. (The sketch after this list redoes this arithmetic.)

- If data transfers between pieces of our CI system cost money, we
  would need to estimate how much these copies would cost. On OVH
  Public Cloud, data transfers to/from the Internet are included in
  the price of the instance, so let's ignore this.
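
Here is a minimal sketch of the copy-time arithmetic above; the
3.5 minutes per copy is the benchmarked figure quoted above, and the
implied bandwidth is derived from it:

    # Minimal sketch of the artifact-copy arithmetic above. The 3.5
    # minutes per copy is the benchmarked figure from this blueprint.
    ISO_GB = 1.2
    COPY_MINUTES = 3.5       # benchmarked time for one copy
    CURRENT_MINUTES = 2      # ~2 minutes for all copies on lizard today

    implied_mbit_s = ISO_GB * 8 * 1000 / (COPY_MINUTES * 60)
    print(f"implied bandwidth: ~{implied_mbit_s:.0f} Mbit/s")

    for copies in (2, 3):    # 2 copies, or 3 when a 2nd build is needed
        total = copies * COPY_MINUTES
        print(f"{copies} copies: {total:.1f} min,"
              f" ~{total - CURRENT_MINUTES:.1f} min added to the feedback loop")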

One way to avoid this problem entirely is to run our Jenkins master
and nightly.t.b.o in the cloud as well.

Pros:

 * Scalable as much as we can (afford), both to react to varying
   workloads on the short term (some days we build and test tons of
   ISO images, some days a lot fewer), and to adjust to changing needs
   on the long term.
 * No initial money investment.
 * No hardware failures we have to deal with.
 * We can try various instance types until we find the right one, as
   opposed to bare metal, which requires careful planning and
   somewhat-informed guesses (mistakes in this area can only be fixed
   years later: for example, choosing low-voltage CPUs, which are
   suboptimal for our workload).
 * Frees lots of resources on our current virtualization host, which
   can be reused for other purposes. And if we don't need these
   resources, then our next bare metal server can be much cheaper,
   both in terms of initial investment and on-going costs (it will
   draw less power).

Cons:

 * We need to learn how to manage systems in the cloud, how to deal
   with billing, and how to control these systems from Jenkins.

 * On-going cost: renting resources costs money every month.

   Very rough estimate, assuming we run all ISO builds and tests on
   dynamic OVH `C2-15` instances (4 vCores at 3.1 GHz, 15 GB RAM, 100
   GB SSD), assuming they perform exactly like my local Jenkins (4
   i7-6770HQ vCores at 2.60 GHz, 15 GB RAM), and assuming that no VAT
   applies (the sketch after this list of cons redoes this arithmetic):

    - builds & tests:
      (30 minutes/build * 450 builds/month + 105 minutes/test * 350 tests/month)
      / 60 * 0.173€ = 145€/month
    - second build for reproducibility ([[!tails_ticket 13436]]):
      30 minutes / 60 * 250 builds/month * 0.173€ =  22€/month
    - total = 167€/month

   Now, to be more accurate:

    - Likely these instances will be faster than my local Jenkins,
      thanks to higher CPU clock rate, which should lower the actual
      costs; but only actual testing will give us more
      precise numbers.

    - Running a well chosen number of static instances would probably
      lower these costs thanks to the discount when paying per month.
      Also, booting a dynamic instance and configuring it takes some
      additional time, which costs money and decreases performance.

      We need to evaluate how many static instances (kept running at
      all times and paid per-month) we run and how many dynamic
      instances (spawned on demand and paid per-hour) we allow. E.g.
      on OVH public cloud, a dynamic C2-15 instance costs more than
      a static one once it runs more than 50% of the time. Thanks to
      the _Cluster Statistics_ Jenkins plugin, once we run this in the
      cloud we'll have the data we need to optimize this; it should be
      easy to script it so we can update these settings from time
      to time.

    - We need to add the cost of hosting our Jenkins master and
      nightly.t.b.o in the same cloud, or the cost of transferring
      build artifacts between that cloud and lizard.

      Our Jenkins master is currently allocated 2.6 GB of RAM and
      2 vcpus. An OVH S1-8 (13€/month) or B2-7 (22€/month) static
      instance should be enough. We estimated that 300 GB of storage
      would be enough at least until the end of 2018. Our metrics say
      that the storage volume that hosts Jenkins artifacts often makes
      good use of 1000-2000 IOPS, so an OVH "High Speed Volume"
      (0.08€/month/GB) would be better suited even though in practice
      only a small part of these 300 GB needs to be that fast, and
      possibly a slower "Classic Volume" might perform well enough.
      So:

       * worst case: 22 + 300×0.08 = 46€/month
       * best case: 13 + 300×0.04 = 25€/month

    - In theory we could keep running _some_ of our builds on our own
      infra instead of in the cloud: one option is that the cloud
      would only be used during peak load times for builds (but always
      used for tests in order to fix our test suite brittleness
      problems, hopefully). But if we do that, we don't improve ISO
      build performance in most cases, and the build artifacts copy
      problems surface, which costs performance and some development
      time to optimize things a bit:

       * If we run the Jenkins master on our own infra: only artifacts
         of ISO builds run in the cloud during peak load times need to
         be downloaded to lizard; we could force the 2nd ISO build to
         run locally so we avoid having to upload these artifacts to
         the cloud, or we could optimize the 2nd ISO build job to
         retrieve the 1st ISO only when the 2 ISOs differ and we need
         to run diffoscope on them (in which case we also need to
         download the 2nd ISO to archive it on the Jenkins master).
         But all ISO build artifacts must be uploaded to the nodes
         that run the test suite in the cloud.

       * If we run the Jenkins master in the cloud: most of the time
         we need to upload there the ISOs built on lizard; then we
         need to download them again for the 2nd build on lizard as
         well (unless we do something clever to keep them around for
         the 2nd build, or force the 2nd build to run in the cloud
         too); but at this point they're already available out there
         in the cloud for the test suite downstream job. During peak
         times, the difference is that ISOs built in the cloud are
         already there for everything else that follows (assuming we
         force the 2nd build to run in the cloud too).

 * We need to trust a third-party somewhat.

 * To make the whole thing more flexible and easier to manage, it
   would be good to have the same nodes able to run both builds and
   tests. We're not sure what it would take nor what the consequences
   would be.
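
To make the on-going cost estimate above easier to check and update,
here is a minimal sketch of the arithmetic; all job counts, durations
and prices are the ones quoted above (dynamic OVH C2-15 at 0.173€/hour,
no VAT), nothing else is assumed:

    # Minimal sketch of the on-going cost arithmetic quoted above.
    HOURLY_RATE = 0.173                  # € per hour, dynamic OVH C2-15

    def monthly_cost(jobs_per_month, minutes_per_job):
        return jobs_per_month * minutes_per_job / 60 * HOURLY_RATE

    builds = monthly_cost(450, 30)       # ISO builds
    tests = monthly_cost(350, 105)       # test suite runs
    rebuilds = monthly_cost(250, 30)     # 2nd build for reproducibility

    print(f"builds + tests: {builds + tests:.0f}€/month")               # ~145
    print(f"reproducibility rebuilds: {rebuilds:.0f}€/month")           # ~22
    print(f"total for workers: {builds + tests + rebuilds:.0f}€/month") # ~167

    # Jenkins master + nightly.t.b.o hosted in the same cloud:
    best = 13 + 300 * 0.04               # S1-8 + 300 GB at 0.04€/GB/month
    worst = 22 + 300 * 0.08              # B2-7 + 300 GB High Speed Volume
    print(f"master + storage: {best:.0f} to {worst:.0f}€/month")        # 25-46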

We could request a grant from the cloud provider to experiment with
this approach.
See
[Arturo's report about how OONI took advantage of the AWS grant program](https://lists.torproject.org/pipermail/tor-project/2017-August/001391.html).

## Dismissed options

### Replace lizard

Dismissed: our CI workload has needs that are too specific and are
better served by dedicated hardware; trying to host everything on
a single box leads to crazy hardware specs that are hard to match.

Pros:

 * No initial development nor skills to learn: we can run our test
   suite in exactly the same way as we currently do.
 * On-going cost increases only slightly (we probably won't get
   low-voltage CPUs this time).
 * We can sell the current hardware while it's still current, and get
   some of our bucks back.

Cons:

 * High initial money investment.
 * Hard to tell whether this would fix our test suite fragility
   problems, and we'll only know after we've spent lots of money.
 * Hard to specify what hardware we need. If we get it wrong, likely
   we have to wait another 5 years before we try again.

<a id="plan"></a>

# The plan

1. <strike>Check how the sysadmins team feels about the cloud option: are
   there any blockers, for example wrt. ethics, security, privacy,
   anything else?</strike> → we're no big fans of using other people's
   computers but if that's the best option we can do it

2. Keep gathering data about our needs while going through the next steps:
   - upcoming services [intrigeri] (last updated: 2018-12-01)

3. Describe our needs for each option:
   - 2nd bare metal server-grade machine:
     * <strike>hardware specs [intrigeri]</strike> [DONE]
     * hosting needs (high power consumption) [intrigeri]
   - cloud (nested KVM, management tools in Debian), iff the new
     sysadmin hired in 2019 knows, or is excited to learn and use,
     such technologies.
   - the hacker option: hosting

5. Ask potential hardware/VMs donors (e.g. HPE, ProfitBricks) if they
   would happily satisfy our needs for free or with a big discount.
   If they can do that, then let's do it. Otherwise, keep reading.
   [intrigeri]

6. Benchmark how our workload would be handled by the options we have
   no data about yet:
   - Rent a bare metal server for a short time, run CI jobs in
     a realistic way, measure. [intrigeri]
   - Rent cloud VMs for a short time, run CI jobs in a realistic way,
     measure.

7. Decide what we do and look into the details e.g.:
   - bare metal options: where to host it?
   - cloud: refresh list of suitable providers, e.g. check if more
     providers offer nested KVM, if the OSU Open Source Lab offering
     is ready, and ask friendly potential cloud providers such as
     universities, HPC cluster admins, etc.