intrigeri authored
It turns out that the enable_node() method, while exposed in the API as
a setter, is actually implemented as a toggle operation, in a way that
depends on the initial state:

    info = self.get_node_info(name)
    if not info['offline']:
        return
    msg = ''
    self.jenkins_open(Request(
        self._build_url(TOGGLE_OFFLINE, locals()), b''))

This races against our ExecStart= in a TOCTOU manner. Before this
commit, I saw enable_node() put my fastest Jenkins slaves *offline*.
The logs are consistent with the following hypothesis:

1. ExecStart= forks the Jenkins java process, which starts looking for
   a Jenkins master to connect to.
2. enable_node() checks whether the node is online. At this point it is
   not, because the Jenkins slave has not managed to connect to the
   master yet. Therefore, enable_node() does not short-circuit.
3. The Jenkins slave manages to connect to the master ⇒ the master now
   knows the node is online.
4. enable_node() toggles the node's online status, which incorrectly
   puts it offline.

… or something along these lines.

To (hopefully) eliminate this race condition, let's run enable_node()
*before* we start the Jenkins slave process, so the two don't race
against each other.

Note that:

- This will fully eliminate the race condition only if the API
  implements the Jenkins "toggle offline" operation in a blocking &
  synchronous way, both client-side and server-side.
- This problem only happens because systemd has no idea what the state
  of the java process is, beyond the fact that it has started. If the
  Jenkins slave java process implemented the systemd startup
  notification protocol, then systemd could wait for the Jenkins slave
  to have successfully connected to the Jenkins master before
  ExecStartPost= is executed; enable_node() would then not race against
  anything anymore.
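For illustration, a setter-style wrapper around the toggle-based API
could check and then verify the resulting state instead of assuming the
toggle landed as intended. This is a sketch only: get_node_info() and
enable_node() are the real python-jenkins methods, but the
ensure_node_online() wrapper, its timeout parameter, and the retry loop
are hypothetical, not part of the library:

```python
import time


def ensure_node_online(server, name, timeout=30):
    """Hypothetical idempotent 'setter': bring a node online and
    verify the final state, rather than blindly toggling once."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not server.get_node_info(name)['offline']:
            return True          # already (or now) online: nothing to do
        server.enable_node(name)  # toggle; may still race, so re-check
        time.sleep(1)
    return False                 # node never reported online in time
```

Verifying the state after the toggle narrows the TOCTOU window but
cannot close it entirely; only ordering (enabling before starting the
slave) or a synchronous server-side setter removes the race.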
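For reference, the startup notification mentioned in the last point is
just a datagram sent to the socket named by $NOTIFY_SOCKET (the
sd_notify protocol, used with Type=notify units). A minimal sketch of
the sending side in Python, hedged as an illustration of the protocol
rather than anything the Jenkins slave actually implements:

```python
import os
import socket


def sd_notify(message):
    """Send an sd_notify datagram (e.g. 'READY=1') to systemd, if the
    process was started under a Type=notify unit; otherwise a no-op."""
    addr = os.environ.get('NOTIFY_SOCKET')
    if not addr:
        return  # not supervised by systemd with Type=notify
    if addr.startswith('@'):
        addr = '\0' + addr[1:]  # abstract-namespace socket
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), addr)
```

If the slave called sd_notify('READY=1') only once connected to the
master, systemd would delay ExecStartPost= until that point.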
1f60f1e0