Skip to content
  • intrigeri's avatar
    jenkins-slave.service: enable the node *before* starting Jenkins · 1f60f1e0
    intrigeri authored
    It turns out that the enable_node() method, while exposed in the API as
    a setter, is actually implemented with a toggle operation, in a way that depends
    on the initial state:
    
            info = self.get_node_info(name)
            if not info['offline']:
                return
            msg = ''
            self.jenkins_open(Request(
                self._build_url(TOGGLE_OFFLINE, locals()), b''))
    
    This races against our ExecStart= in a TOCTOU manner. Before this commit, I saw
    enable_node() put my fastest Jenkins slaves *offline*. The logs are consistent
    with the following hypothesis:
    
    1. ExecStart= forks the Jenkins java process, which starts looking for
       a Jenkins master to connect to.
    2. enable_node() checks if the node is online. At this point, it is not,
       because the Jenkins slave did not manage to connect to the master yet.
       Therefore, enable_node() does not short-circuit.
    3. The Jenkins slave manages to connect to the master
       ⇒ the master now knows the node is online.
    4. enable_node() toggles the node's online status, which incorrectly
       puts it offline.
    
    … or something along these lines.
    
    To (hopefully) eliminate this race condition, let's do enable_node() *before* we
    start the Jenkins slave process, so they don't race against each other.
    
    Note that:
    
     - This will fully eliminate this race condition only if the API implements the
       Jenkins "toggle offline" operation in a blocking & synchronous way, both
       client-side and server-side.
    
     - This problem only happens because systemd has no idea what's the state of the
       java process, besides the fact it has started. If the Jenkins slave java
       process implemented the systemd startup notification protocol, then allow
       systemd could wait for the Jenkins slave to have successfully connected to
       the Jenkins master, before ExecStartPost= is executed; therefore,
       enable_node() would not race against anything anymore.
    1f60f1e0