Greg Kroah-Hartman gave the keynote at this year's Linux Plumbers Conference (see "The Linux Ecosystem, what it is and where do you fit in it?").
Greg defined the "Linux ecosystem" as a series of interconnected projects, primarily the Linux kernel, GCC, X.org, binutils, glibc, and the Linux man pages. A GNU/Linux distribution includes much more software, parts of this ecosystem are available on other free Unix-like systems (including the *BSDs and OpenSolaris), and many embedded devices can do without X.org. Even so, this is the minimal set of projects common to GNU/Linux-based systems: the important and useful infrastructure shared by anything readily identifiable as a Linux-based system. Any distinguishing flavor of Linuxness comes from this combination of projects.
Who Contributes to the Linux Ecosystem?
The Linux Foundation published a report earlier this year about contributions to the Linux kernel (written by Greg, Jon Corbet from LWN, and Amanda McPherson from the Linux Foundation). The study demonstrates the tremendous rate of both change in, and contribution to, the Linux kernel.
Greg had updated numbers in his keynote. The Linux kernel has grown by 99324 patches in the past three years, from 2.6.15 to 2.6.27 -- that's almost a hundred changes every day. The contribution-tracking heuristics credit Red Hat with 11846 patches, Novell with 7222, Mandriva with 237, and Gentoo with 229.
The largest group of contributors to the Linux kernel is people not aligned with any particular business or company. They produced 17% of all tracked patches. Another 8.3% of all patches came from people with unknown alignment. Together, these two groups account for about a quarter of all contributions to the Linux kernel, contributions which apparently come from amateurs.
37% of the contributions to GCC (by the same metrics) are from amateurs.
Red Hat produces 26.8% of the patches to X.org, with 18.8% coming from users of unknown affiliation, 12% from Intel, and 2.1% from the NSA. Greg noted that tracking X.org patches is difficult because lead developer Keith Packard commits changes from several separate machines, which makes his contributions hard to collate. (Keith works for Intel.)
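The kernel figures quoted above are easy to check with a few lines of arithmetic. This is only a rough sketch: it assumes "three years" means 3 * 365 days, and the variable names are mine, not anything from Greg's slides or the report.

```python
# Figures quoted in the text for the 2.6.15 to 2.6.27 kernel releases.
total_patches = 99_324
unaffiliated_share = 17.0   # percent of patches from people with no company alignment
unknown_share = 8.3         # percent of patches from people of unknown alignment

# Assumption: "the past three years" is treated as exactly 3 * 365 days.
days = 3 * 365

patches_per_day = total_patches / days
amateur_share = unaffiliated_share + unknown_share

print(f"patches per day: {patches_per_day:.1f}")
print(f"unaffiliated plus unknown: {amateur_share:.1f}%")

# Per-company patch counts from the keynote, as a share of all tracked patches.
company_patches = {
    "Red Hat": 11_846,
    "Novell": 7_222,
    "Mandriva": 237,
    "Gentoo": 229,
}
company_share = {
    name: 100 * count / total_patches
    for name, count in company_patches.items()
}
for name, share in company_share.items():
    print(f"{name}: {share:.1f}%")
```

Under that assumption the rate works out to a little over ninety patches a day, which matches "almost a hundred", and the two unaffiliated groups together come to roughly a quarter of all patches.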
Is It the Size of the Contribution, Or...?
The statistics are interesting, but it's difficult to draw meaningful conclusions from them for two reasons. First, there are too many questions about contributor affiliations to draw definitive conclusions. Though O'Reilly pays me in part to be a subject matter expert on F/OSS, my role here is as an editor and writer, not a programmer. If I submit a patch to a free software project inspired by my work duties, should O'Reilly get corporate credit for my work? If I submit a patch outside of work, who should receive credit? What would happen if another company hired me to continue working on a free software project I participate in as a hobby?
Second, Canonical CTO Matt Zimmerman disagrees with the report's statistical conclusions. Red Hat is a large, well-established, and profitable company which can afford to hire and fund many developers. If Red Hat weren't producing as many patches, the community might rightly question the company's commitment to free software and the ecosystem. Yet I'm not sure it's possible to produce a meaningful metric of how much any existing company should contribute.
Users Who Don't Provide Feedback are Useless
Greg made a throwaway comment containing a point too important to get lost in arguments over statistics and sampling. Contributing to the health of common infrastructure is a primary duty of downstream parties.
Red Hat, Canonical, Slackware, Novell, Debian, Mandriva, IBM, Google, Dell, HP, MontaVista, and Gentoo all benefit from the timely, well-maintained, and featureful development of thousands of upstream projects. In return, these groups make the work of these projects available to millions of eager users. More users tend to mean more bug reports, more feature requests, and, above all, more feedback -- which is the primary benefit to upstream developers. Most of all, we want answers to simple questions. Does it work? Is it useful? What more can we do to delight you?
The best possibility is to receive a patch containing documentation, well-designed and well-implemented code, and appropriate tests for a bug or new feature. If it applies cleanly, builds and passes tests on the relevant systems, and fits with the project's goals, even better.
Yet even just knowing that a recent commit caused a test failure -- and getting debugging information from the relevant system -- is valuable. Canonical's kernel team may not have the expertise to diagnose and debug errors in the kernel's SCSI subsystem related to the use of a particular flag in combination with a new chipset present in the latest revision of a hard drive controller. (Few people do.) Yet with Ubuntu's larger userbase, Canonical's kernel team is likely to receive such a bug report even when the Linux kernel developers do not.
The process only works when Canonical's kernel team reports that bug and all appropriate debugging information upstream. Submitting patches upstream would be great, but a well-produced bug report from experienced developers and troubleshooters is likely sufficient information for upstream to find and fix the bug.
Not all bugs are worth reporting upstream, of course. It's difficult to fix unreproducible bugs, or bugs without debugging information. As well, distributors rarely distribute upstream's most recent code, so some fixes may be as simple as backporting patches from newer versions.
Some information must flow upstream, however. How much? Everything potentially valuable.
When Red Hat applied a development patch to its Perl without consulting the Perl 5 developers, the result was a 100-fold slowdown in certain circumstances. The problem is that no one upstream knew that Red Hat had integrated that patch into a stable release. That's especially unfortunate because the Perl 5 developers had seen the performance regression and superseded the patch before ever releasing it in a stable product.
When information flows only one way, the result is a fork in everything but name. Bugs get reported to the distribution's tracker, patches get applied to the distribution's version, and users use the distribution's packaged version while believing (based on the name) that they're using upstream's version. Though bugs and feature requests often get reported to the distribution, upstream may get blamed for the problems.
Sadly, the accepted wisdom of the Perl community, at least, is to build and install a custom version of Perl alongside the distribution's version. Users who know this have to maintain multiple Perl installations, while users who do not yet know this have to suffer potential downstream misconfigurations. It's not clear what value distributions provide in these scenarios.
Most free software licences allow this kind of divergence, and there are few legal mechanisms to prevent it (though Mozilla's trademark dustup with Debian is an interesting potential counterexample). Even so, the pragmatic arguments for maintaining regular contact with upstream are strong. It's better for the users. It pushes the responsibility for making good decisions to the most experienced people. It preserves the feedback cycle which is so important to successful community-driven development.
Sustainable Upstream Development
Distributions provide valuable services in packaging, distribution, and service -- this is especially true when integrating thousands of upstream projects into a coherent, unified whole. It's no wonder that far fewer users follow any project's own releases than rely on the packaged versions available for upgrade every few months.
Even as upstream often depends on downstream to make software available to millions of users, downstream depends on upstream to produce high-quality software which millions of users need, want, and use. Unlike middlemen who seemingly exist to get a cut of the markup difference between wholesale and retail prices, the separation of upstream and downstream often provides tangible advantages for all groups concerned.
Yet that separation cannot be complete, nor can the flow of information be unidirectional. Without users, a project need not exist. Without feedback from users, a project might as well not exist. Without credit, or distribution, or a steady stream of new developer interest, a project may well wither and vanish.
Concentrating on the amount of contribution from downstream to upstream misses a much more important point. The number of patches contributed upstream doesn't matter. What matters is the number of potential contributions of any kind which remain in downstream eddies. That number should be zero.