Jiri Kosina: stable trees and pushy maintainers

August 4, 2013

Participants: Jiri Kosina, Josh Boyer, Greg KH, Ted Ts'o, Rafael J. Wysocki, Nicholas A. Bellinger, Steven Rostedt, John W. Linville, James Bottomley, H. Peter Anvin, Linus Torvalds, Guenter Roeck, Shuah Khan, Ingo Molnar, Li Zefan, Willy Tarreau, Rob Landley, David Lang, Tony Luck, Takashi Iwai, Mark Brown, Ben Hutchings, Paul Gortmaker, Jason Cooper, Dave Airlie, Kees Cook, Joe Perches, Kalle Valo, Jan Kara, Motohiro Kosaki, Trond Myklebust.

People tagged: Dave Miller

Jiri Kosina proposed discussing the criteria for deciding which patches go into the various stable trees. As one might guess from his title, Jiri believes that some less-than-stable patches are frequently added to the stable trees, for example, the random.c update in 3.0.41. This update included 902c098a, which Jiri says was buggy and not marked for -stable, and caused pain to distros, a complaint that Rafael J. Wysocki seconded. However, Greg KH pointed out that these commits were due to a security issue, to which Ted Ts'o provided a link. Jiri wondered why this wasn't in the changelog and why, given the wide deployment, such a rush was necessary. Ted pleaded “guilty” on the changelog's deficiencies, agreeing that even if there were security reasons to limit information flow, additional high-level technical information would have been good. However, Ted noted that he was not the person who pushed the patches to -stable, but speculated that getting these fixes into -stable might have been of great value for embedded devices. In any case, Ted believes that this patchset was an exception to the normal -stable processes.

Josh Boyer seconded Jiri's concerns, noting that the first few stable releases in a given series were significantly less stable than later releases, almost as if they were release candidates rather than releases. On the other hand, Josh also noted that there have recently been significant lags between fixes being posted to LKML and eventual appearance in mainline, and that there clearly needs to be a balance.

Greg KH asked that this discussion be kicked off immediately at “stable at vger.kernel.org” instead of waiting, but said that he would be up for an in-person discussion as well.

Theodore Ts'o noted that Linus has rather firmly stated that the only fixes that should be pushed to mainline after -rc2 (or -rc3 at the latest) are for regressions or for very serious data-integrity issues. At that time, the concern was that careless late-in-cycle fixes for unimportant bugs might generate far more serious bugs. Ted wonders if the pendulum has now swung too far in the other direction. John W. Linville agreed that there seems to be some oscillation in the rules and their interpretation, stating that “a good repetitive flogging and a restatement of the One True Way to handle these things might be worthwhile once again”. In contrast with the practices for late-rc bug fixing, Greg Kroah-Hartman said that he has been consistent in enforcing the rules documented at Documentation/stable_kernel_rules.txt, and that even the SCSI maintainers were finally following them. James Bottomley objected to Greg's “finally following them”, stating that the SCSI tree has had patches marked for -stable for quite some time. Rafael J. Wysocki further wondered why people complained to Greg rather than to the maintainer who marked the patch for -stable. Greg suggested that this was because he is a big, easy target. A key theme running through this discussion was differences of opinion as to what fixes should be included in -stable trees, including differences in risk assessment.

In this spirit, David Lang kicked off a debate as to what level of risk is acceptable by arguing that a regression rate of one per ten fixes is insufficient. Tony Luck pointed out that Linux testing is carried out by inflicting changes on a gradually increasing pool of users over a multi-year timeframe, which means that there is a tradeoff between timely fixes and avoidance of regressions. Linus Torvalds agreed, but added that testing is usually self-selecting, so that the initial tests are carried out by the people who suffered from the bug, and who are thus likely to report an improvement even if there is some negative side effect. Linus therefore suggested that only the most critical fixes should be immediately sent to -stable, and that others could wait so as to get more testing. Greg Kroah-Hartman said that people already mark stable patches as follows:

Cc: stable  # delay for 3.12-rc4

Greg's workflow respects this sort of notation, so it can be used whenever needed. Willy Tarreau suggested that unadorned Cc to -stable be deferred by default, so that a patch would need to be tagged specially to be immediately applied to -stable.
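As a rough illustration of how a script might recognize this notation, the following Python sketch scans a commit message for a “Cc: stable” line and extracts an optional “delay for” release. The function and type names are hypothetical, and the exact grammar that Greg's own scripts accept is not described here, so the patterns below are assumptions based only on the example above.

```python
import re
from typing import NamedTuple, Optional


class StableTag(NamedTuple):
    # Release named after "delay for", e.g. "3.12-rc4"; None if no delay was requested.
    delay_until: Optional[str]


# Assumed grammar: a line starting with "Cc: stable", optionally followed by
# an address and a "# delay for <release>" comment.
_CC_STABLE = re.compile(r"Cc:\s*stable\b(?P<rest>.*)", re.IGNORECASE)
_DELAY = re.compile(r"#\s*delay for\s+(?P<release>\S+)", re.IGNORECASE)


def parse_stable_tag(commit_message: str) -> Optional[StableTag]:
    """Return the first stable tag found in the message, or None if there is none."""
    for line in commit_message.splitlines():
        match = _CC_STABLE.match(line.strip())
        if match is None:
            continue
        delay = _DELAY.search(match.group("rest"))
        return StableTag(delay.group("release") if delay else None)
    return None
```

Under Willy's proposal, a tool built on such a parser would treat a tag with no annotation as deferred and would require an explicit marker for immediate application; that marker is not specified in the discussion, so it is not modeled here.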

H. Peter Anvin noted that it is not unusual for a patch to be flagged for -stable after Linus has pulled it to mainline. Peter would therefore like some out-of-band mechanism for flagging -stable patches. Greg said that such a mechanism already exists, namely sending the git SHA-1 to “stable at vger.kernel.org” along with the destination -stable trees. Greg also noted that some maintainers keep separate trees to maintain commits destined for -stable. However, Theodore Ts'o pointed out that Documentation/stable_kernel_rules.txt currently says that you should send the patch, not just the SHA-1, and that he had been doing just this without seeing any complaints. Guenter Roeck says that he does the same, but also adds the SHA-1 commit ID from mainline. H. Peter Anvin clarified his original request, stating that he wanted better automation of this process, suggesting something based on git notes. Linus said that while he was OK with maintainers using git notes locally (adding that they can be very powerful for certain workflows), he would neither pull them to nor push them from mainline. Steven Rostedt speculated that a process based on git notes could be made to work even given that Linus wasn't going to pull them into mainline, for example, by polling mainline and, upon seeing a commit appear there, checking the local tree for git notes. Shuah Khan added that such a process could include quick sanity tests to make sure that the flagged patches applied cleanly to the relevant -stable trees. Greg KH echoed Linus in saying that he would not be using git notes in his -stable trees, further asking if it was really all that hard to just remember what has been marked for -stable, for example, by placing the patch in a mailbox or a separate git tree. H. Peter Anvin argued that the value of something like git notes was that it preserved information on why and how the patch made it to -stable.
Ingo Molnar countered by saying that one advantage of the limited-time acknowledgment of review and testing contributions is that it encourages this review and testing to happen in a timely manner. Takashi Iwai would nevertheless like to see some sort of metadata linking a buggy commit with its fix, perhaps via tags or notes. Takashi also considered the option of linking from the fix to the buggy commit, but argued that this makes reverse mapping (of interest to bisection) harder. [Editor's note: It appears that there is great scope for creativity in workflows interacting with -stable.]
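The polling approach that Steven Rostedt speculated about might look something like the following Python sketch. It relies on the fact that a commit keeps its SHA-1 when Linus pulls it, so a note attached in the maintainer's local repository can be looked up once the commit shows up in mainline. The notes ref name “stable”, the function names, and the idea of storing the target trees in the note text are all assumptions made for illustration; no participant described an actual implementation.

```python
import subprocess
from typing import Callable, Dict, Iterable, List, Optional


def read_note(sha: str, notes_ref: str = "stable", repo: str = ".") -> Optional[str]:
    """Return the note attached to `sha` in the local repository, or None."""
    result = subprocess.run(
        ["git", "-C", repo, "notes", "--ref", notes_ref, "show", sha],
        capture_output=True, text=True,
    )
    # "git notes show" exits non-zero when the object has no note.
    return result.stdout.strip() if result.returncode == 0 else None


def new_mainline_commits(old_tip: str, new_tip: str, repo: str = ".") -> List[str]:
    """List commits reachable from new_tip but not from old_tip, oldest first."""
    result = subprocess.run(
        ["git", "-C", repo, "rev-list", "--reverse", f"{old_tip}..{new_tip}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()


def flagged_for_stable(
    commits: Iterable[str],
    lookup: Callable[[str], Optional[str]] = read_note,
) -> Dict[str, str]:
    """Map each newly arrived commit that carries a local note to its note text."""
    flagged = {}
    for sha in commits:
        note = lookup(sha)
        if note is not None:
            flagged[sha] = note
    return flagged
```

A periodic job could remember the last mainline tip it saw, call new_mainline_commits() with the old and new tips, and forward whatever flagged_for_stable() returns to the stable list. The sanity check that Shuah Khan mentioned, confirming that each flagged patch applies cleanly to the target trees, would be a natural extra step before sending.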

Nicholas A. Bellinger agreed with the danger that late-in-cycle fixes might reduce rather than increase stability, and gave an example from iSCSI where he delayed mainlining a fix for exactly this reason. The fix required too large a change and too much manual testing to justify addition to a late -rc release. Steven suggested git cherry-pick -x to place such commits into a separate branch of the maintainer's main git tree, but Linus expressed a strong preference either for identical SHA-1 IDs in a separate git tree or identical commit summary lines. Linus also said that he would much rather see a given fix committed twice by two maintainers than to have cross-maintainer dependencies, at least assuming that it is a reasonably small and contained fix.

Steven Rostedt suggested the following criteria for the -rc levels:

-rc1-3: Take all bug fixes.
-rc4-5: Take only regressions and more serious bugs.
-rc6-on: Take only critical bug fixes.
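Steven's three tiers can be written down as a small decision function. The Python sketch below is only one reading of the list above: the severity levels and all names are hypothetical, and in practice deciding how serious a bug is remains a judgment call for the maintainer.

```python
from enum import IntEnum


class Severity(IntEnum):
    MINOR = 1
    SERIOUS = 2
    CRITICAL = 3


def accept_fix(rc: int, severity: Severity, is_regression: bool = False) -> bool:
    """Decide whether a bug fix qualifies for the given -rc level."""
    if rc < 1:
        raise ValueError("rc must be 1 or greater")
    if rc <= 3:
        # -rc1-3: take all bug fixes.
        return True
    if rc <= 5:
        # -rc4-5: take only regressions and more serious bugs.
        return is_regression or severity >= Severity.SERIOUS
    # -rc6 onward: take only critical bug fixes.
    return severity >= Severity.CRITICAL
```

For example, accept_fix(4, Severity.MINOR, is_regression=True) accepts a minor regression fix at -rc4, while the same fix at -rc6 is rejected unless it is rated critical.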

The discussion must have become too boring for James Bottomley, who suggested dispensing with the “Cc: stable” tags entirely in favor of having the maintainer be directly responsible for sending patches to -stable. Steven Rostedt suggested keeping these tags, but changing the workflow so that patches are not accepted into -stable without the approval of the relevant maintainer. Willy Tarreau disagreed with Steven, stating that the current process already involves maintainer review. Paul Gortmaker suggested that -stable trees for older kernel versions should be more strict than the N-1 stable trees, given the higher risks inherent in applying patches to older kernel versions. The ensuing discussion raised concerns about the scalability of the current process (along with some contention over what “scalability” even meant in this context), concerns about losing patches needed for -stable, risks of bad patches appearing to apply without errors, challenges of managing -stable trees for old kernel versions, and tutorials on how the various -stable tree maintainers manage the workflow.

Several -stable maintainers noted that they simply took whatever patches Greg KH took, which caused Steven Rostedt to raise concerns about Greg's mortality (Steven also noted that because there are many -stable maintainers but only one Greg KH, efforts to address scalability should push work away from Greg and onto the -stable maintainers). Greg replied that his workflow was thoroughly and publicly documented and scripted, and that his requests for specific help have rarely been answered, but that his joining the Linux Foundation now lets him focus on -stable as a part of his day job.

Greg KH listed the following two issues that he had seen in the thread:

Patches that shouldn't be in -stable because they do not do anything.
Patches without a clear justification for backporting.

Greg will address the first issue by pushing back more firmly on such patches. Greg believes that the second issue is almost entirely due to fixes for security issues, and that this has been addressed by recent changes that ensure that distros become aware of security problems and their fixes. Greg also said that the subsystem maintainers have the final say in how their work feeds into -stable.

H. Peter Anvin suggested that the kernel-summit discussion be about different -stable workflows and what the maintainers' options are rather than about a specific proposal, to general acclamation. This included James, who expects to look into different workflows based on this discussion. Peter also offered to present on the -tip tree workflow.

Finally, H. Peter Anvin called out the risk of getting too hung up on policy. Different fixes at different times in different subsystems may need to be handled in different ways.