(and what we want to do to avoid similar problems in the future)
A few days ago, a broken xserver-xorg-core was uploaded to
dapper-updates. This caused, unsurprisingly, a large
number of bug reports. A fixed package was uploaded about 12 hours
after the first bug report came in. We also pointed all the country mirrors
(au.archive.ubuntu.com, etc.) to archive.ubuntu.com to speed up the
propagation of the fix. Even so, it hit far, far too many users.
Incidentally, the distro team is currently in a sprint in Wiesbaden, Germany, and we had a long discussion this morning about both how to handle such situations when they happen (and recover from them, mostly in the technical sense) and how to prevent them from happening in the first place. The latter is obviously more important, as we won’t have to recover if there is no problem to begin with. Note that the ideas listed below are only ideas, not finished procedures; we are going to write up a proper policy document.
Ideas for prevention ranged from using
$distro-proposed and an explicit
call for testers of that to getting more code review for updates.
The current review process for updates to
$distro-updates is a review
by the release manager, who accepts it into
$distro-updates. This
update passed that review even though it was faulty, so even if
we had another reviewer, it might have passed that review too.
$distro-proposed does not prevent the problem from affecting
anybody; it just changes the group affected, in the hope that the group
choosing to enable
$distro-proposed will be able to recover.
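
For readers who would want to be in that group, enabling the pocket is a one-line addition to APT's sources. A minimal sketch for dapper follows; the archive host and the component list are assumptions for illustration and should match the existing entries in your own /etc/apt/sources.list:

```
# Assumption: main archive host and all four components, shown for illustration only.
deb http://archive.ubuntu.com/ubuntu dapper-proposed main restricted universe multiverse
```

After the next apt-get update, packages from the pocket should show up as ordinary upgrades, so this is best limited to machines you are able to repair if an update turns out to be broken.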
Recovery ideas ranged from being able to put an update onto a user’s machine whether they wanted it or not, to snapshot/rollback support, to having some kind of “safe mode” that would give you a kdrive-based VESA X server and tools to fix your system. I’m not sure what we would do if we managed to break one of those tools (or the “safe mode” X server), though.
We all agreed that being open about the problem, the cause of the problem and how we are working to solve it is important. Downplaying the severity or making jokes or sarcastic comments about the fix (“Ooops, we did it again”) is bad and something we shouldn’t do. (And I don’t think we did it this time either.) No response is equally bad and something we should work hard to avoid.
Hopefully, we will have a procedure in place in a little while that
tells us how to handle such an emergency when it happens, and we will be
deploying safeguards to prevent it from happening again without
paralysing us and making us unable to deploy updates to
$distro-updates, as this is something we need to be able to do.
While this post is fairly critical of the current set of policies and procedures, I have tried hard to avoid pointing fingers at anybody in particular. I would much rather we had a positive and constructive discussion about how to avoid similar problems in the future than a discussion about who is to blame. The points and views in this post are also my own and not any kind of official communication from Canonical or Ubuntu.