1. 13 Apr, 2020 1 commit
  2. 17 Feb, 2020 1 commit
  3. 16 Feb, 2020 1 commit
    • jenkins-slave.service: only enable a node if we marked it as temporarily offline (#10601) · 46ad3c62
      intrigeri authored
      The enable_node() method toggles the offline status if and only if the master
      thinks the node is currently offline, which can be the case:

       - either because the agent has not been started yet (which is always true at
         this stage) *and* the master has already noticed that the agent that was
         running before the reboot got disconnected: that's racy, so let's not rely
         on it, contrary to what the implementation did before this commit;
      
       - or because we marked the node as temporarily offline ourselves (before
         rebooting, in test_Tails_*).
      
      Let's eliminate at least one part of the race condition, by:
      
       - only fiddling with the temporarilyOffline state when we really need to,
         i.e. when we explicitly put the node temporarily offline ourselves;
         this makes us stop relying on the exact timing of when the Jenkins
         master registers the fact that an older agent got disconnected;
      
       - waiting for the Jenkins master to have registered this state change, before
         we start the agent; I'm not sure this is really needed, but I'm under the
         impression that the master processes requests to put a node back online in
         a non-blocking manner, which would race against the startup of the agent, and
         I'm not willing to bet on the opposite.
      
      I'm not sure this will be sufficient to fully fix this race condition.
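      A minimal sketch of the intended logic, assuming the python-jenkins library;
      the Jenkins URL, credentials and node name are made up for illustration:

          import time
          import jenkins

          server = jenkins.Jenkins('https://jenkins.example.org',
                                   username='node-admin', password='secret')
          node = 'isobuilder1.example.org'

          info = server.get_node_info(node)
          # Only touch the temporarilyOffline state if we put the node offline
          # ourselves before rebooting (in test_Tails_*).
          if info['temporarilyOffline']:
              server.enable_node(node)
              # Wait for the master to register the state change before the agent
              # is started, in case the request is processed asynchronously.
              while server.get_node_info(node)['temporarilyOffline']:
                  time.sleep(1)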
  4. 30 Dec, 2019 1 commit
    • jenkins-slave.service: enable the node *before* starting Jenkins · 1f60f1e0
      intrigeri authored
      It turns out that the enable_node() method, while exposed in the API as
      a setter, is actually implemented with a toggle operation, in a way that depends
      on the initial state:
      
              info = self.get_node_info(name)
              if not info['offline']:
                  return
              msg = ''
              self.jenkins_open(Request(
                  self._build_url(TOGGLE_OFFLINE, locals()), b''))
      
      This races against our ExecStart= in a TOCTOU manner. Before this commit, I saw
      enable_node() put my fastest Jenkins slaves *offline*. The logs are consistent
      with the following hypothesis:
      
      1. ExecStart= forks the Jenkins java process, which starts looking for
         a Jenkins master to connect to.
      2. enable_node() checks if the node is online. At this point, it is not,
         because the Jenkins slave did not manage to connect to the master yet.
         Therefore, enable_node() does not short-circuit.
      3. The Jenkins slave manages to connect to the master
         ⇒ the master now knows the node is online.
      4. enable_node() toggles the node's online status, which incorrectly
         puts it offline.
      
      … or something along these lines.
      
      To (hopefully) eliminate this race condition, let's do enable_node() *before* we
      start the Jenkins slave process, so they don't race against each other.
      
      Note that:
      
       - This will fully eliminate this race condition only if the API implements the
         Jenkins "toggle offline" operation in a blocking & synchronous way, both
         client-side and server-side.
      
       - This problem only happens because systemd has no idea what the state of the
         java process is, beyond the fact that it has started. If the Jenkins slave
         java process implemented the systemd startup notification protocol, then
         systemd could wait for the Jenkins slave to have successfully connected to
         the Jenkins master before ExecStartPost= is executed; therefore,
         enable_node() would not race against anything anymore.
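      For reference, the startup notification the second note refers to boils down
      to sending "READY=1" to the socket systemd passes in $NOTIFY_SOCKET. A minimal
      sketch in Python (the actual agent is a java process, so this is purely
      illustrative):

          import os
          import socket

          def notify_ready():
              # systemd startup notification protocol: once the service considers
              # itself started (here: connected to the Jenkins master), send READY=1
              # to the socket named in $NOTIFY_SOCKET.
              addr = os.environ.get('NOTIFY_SOCKET')
              if not addr:
                  return  # not started by systemd with Type=notify
              if addr.startswith('@'):
                  addr = '\0' + addr[1:]  # abstract namespace socket
              with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
                  sock.connect(addr)
                  sock.sendall(b'READY=1')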
  5. 28 Dec, 2019 1 commit
    • jenkins-slave: simplify by dropping the flag file · 0e83ad37
      intrigeri authored
      A flag file is yet another kind of state we need to manage,
      and thus yet another area where trouble can happen.
      
      Let's instead unconditionally put the node online when the service starts,
      which gives us better self-healing in corner cases.
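      A minimal sketch of this unconditional variant, assuming python-jenkins;
      the Jenkins URL, credentials and node name are made up:

          import jenkins

          server = jenkins.Jenkins('https://jenkins.example.org',
                                   username='node-admin', password='secret')
          # No flag file to consult: whenever the service starts, simply make sure
          # the node ends up online.
          server.enable_node('isobuilder1.example.org')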
  6. 27 Dec, 2019 7 commits
    • Fix syntax · ea84b162
      intrigeri authored
      "\" is interpreted by systemd; to pass a literal "\" to the shell,
      we need to escape it.
    • Fix copy'n'paste mistake · 6b8e97d0
      intrigeri authored
    • jenkins-slave: use a setter instead of a toggle to put the node back online · 39c72679
      intrigeri authored
      A toggle action depends on the initial state, which is inherently fragile
      and harder to reason about than a setter.
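      A generic illustration of the difference (not the actual Jenkins API): the
      outcome of a toggle depends on the current state, while a setter converges
      to the desired state no matter how often it runs:

          def toggle_offline(node):
              # Outcome depends on the (possibly stale) current state.
              node['temporarilyOffline'] = not node['temporarilyOffline']

          def set_online(node):
              # Idempotent: safe to call regardless of the current state.
              node['temporarilyOffline'] = False

          node = {'temporarilyOffline': False}
          toggle_offline(node)   # a node that was already online is now offline
          set_online(node)       # always ends up online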
    • jenkins-slave.service: put the correct node back online (#11295, #10601) · f503a529
      intrigeri authored
      On jenkins.lizard, the name of a node is its hostname.
      
      While on jenkins.sib, the name of a node is its FQDN (which is needed because
      its Jenkins slaves are on multiple machines).
    • Fix copy'n'paste mistake · b0559166
      intrigeri authored
      Let's use a different "offlineMessage" when bringing the node back up, so that
      we can more easily notice if this toggle action ever puts a node offline (while
      it should put it back online).
    • jenkins-slave.service: put node back online if needed (#11295, #10601) · bfb553f8
      intrigeri authored
      Waiting 30 seconds in the reboot_node job before putting the node
      back online is not always sufficient: sometimes jenkins-slave.service
      is still running at that point, and then a new test_Tails_ISO_*
      build is scheduled, which will fail early because we're rebooting
      the node.
      
      So instead:
      
       - I'll make reboot_node leave the node offline and create
         the /var/lib/jenkins/Taken_down_for_reboot flag file.
      
       - After the node has rebooted, when jenkins-slave.service starts,
         if the /var/lib/jenkins/Taken_down_for_reboot flag file exists,
         this new ExecStartPost= directive puts the node back online
         and deletes the flag file.
      
      This guarantees that the node is not brought back online
      before it has actually rebooted and is ready for new builds.
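      A minimal sketch of the ExecStartPost= helper described above, assuming
      python-jenkins; the Jenkins URL, credentials and node name are made up:

          from pathlib import Path
          import jenkins

          FLAG = Path('/var/lib/jenkins/Taken_down_for_reboot')

          if FLAG.exists():
              server = jenkins.Jenkins('https://jenkins.example.org',
                                       username='node-admin', password='secret')
              # The flag was left behind by reboot_node: the node has rebooted by
              # now, so put it back online and consume the flag.
              server.enable_node('isobuilder1.example.org')
              FLAG.unlink()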
  7. 27 Mar, 2019 1 commit
    • Make Jenkins slave service startup more reliable and fix its status information (Closes: #8508, #11858) · 4eb0497e
      intrigeri authored
      
      This fixes two related problems:
      
      1. The initscript tried to download the JAR only once. If the Jenkins master was
      not reachable at that time, the service would be considered to be running
      correctly, while in fact it did not work at all. Let's now retry downloading the
      JAR a few times, with a power-of-2 backoff delay (see the sketch below), in order
      to wait for the Jenkins master to show up. And if the download script eventually
      gives up, systemd (with Restart=always) will itself restart the service a number
      of times.
      
      2. The service status (reported by the unit file auto-generated from the
      initscript) was always "running", because that unit file had Type=forking and
      the initscript runs plenty of commands before actually starting the service,
      which makes systemd guess incorrectly what process it should monitor. This made
      it impossible for our monitoring system to alert us whenever this service was
      not functioning. Converting to a native systemd service with Type=simple makes
      systemd report correct service status information.
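      The real retry loop lives in a shell script; here is a minimal sketch of the
      backoff idea from point 1 in Python, with a hypothetical URL and retry count:

          import time
          import urllib.request

          def fetch_agent_jar(url, dest, attempts=6):
              for attempt in range(attempts):
                  try:
                      urllib.request.urlretrieve(url, dest)
                      return True
                  except OSError:
                      # power-of-2 backoff: 1s, 2s, 4s, ... between attempts,
                      # waiting for the Jenkins master to show up
                      time.sleep(2 ** attempt)
              return False  # give up; systemd's Restart=always takes over

          fetch_agent_jar('https://jenkins.example.org/jnlpJars/slave.jar',
                          '/var/lib/jenkins/slave.jar')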