The "Waiting for Persistent Storage to be ready" thing essentially waits for "tpscli is-active AdditionalSoftware" to return true, which can happen before we are done running the create-live-additional-software-config hook (hooks are run after a feature's "active" state is changed). And then, before this commit, we had 2 processes who were are writing the same file without coordination, and e.g. this TOCTU could happen:
- the AdditionalSoftware feature becomes active so asp-handle-package-changes stops waiting
- the hook checks for existence of the config file, and finds none, so it determines it should run its create function
- asp-handle-package-changes adds the package to the config file (creating it, in passing)
- the hook's create function runs "install […] /dev/null live-additional-software.conf", which silently overwrites the already config file that just got created with a new, empty one.
Maybe what happens in the failing cases I've seen is another kind of way of losing the race condition, we'll never know, but whatever: I think it's clear there is a race condition here, and somehow we're sometimes losing it.
In the previous implementation of Persistent Storage, it made some sense that live-persist and tails-persistence-setup ensured existence and proper permissions of this file, because that did that before any ASP operation, so this allowed the ASP code to operate in a known-good environment. But now that the new implementation cannot easily provide that anymore, so it can as well leave the config file alone, and leave to the ASP code the responsibility of creating it (which it de facto sometimes did already when we lost the race) and ensuring proper permissions (which it already does every time it updates the content of the file with the write_config function). This is what this commit does.
The only thing we lose here is the call to "setfacl --remove-all". If I got the Git history right, we've never programmatically set ACLs on that file, so I understand this call was only meant to ensure there's none, as expected. I believe this was inherited from generic live-persist checks, that checked for expected ACLs on other files/directories where it did matter. I don't think this is useful here in practice.
Closes #19926 (closed)