Open source projects are the backbone of today's technology. We find them in programming languages that are at the core of any software (e.g. Luna or Rust), frameworks that shape the web (e.g. Laravel, React or Angular), toolkits that help scientists handle and analyze data (e.g. numpy, pandas) or development tools used by software teams around the world (e.g. git), to name a few examples.
Do successful open source projects have common patterns?
Through this project, we hope to gain some insights into what contributes to the success of an open source project. This might allow us to identify common patterns among successful projects, and to formulate some guidelines that might be beneficial to other projects' success.
The number of commits per project is shown on the left. Can you spot the exponential trend in the distribution? Different projects have different sizes, which is why our results will be expressed in percentages whenever we compare projects directly.
Interesting tidbit: curl is the oldest project in this list and was first released in 1997. Tensorflow is one of the newer projects, appearing only three years ago (2015). Age isn't always related to the number of commits!
In git, an author is the person who wrote the changes and the committer is the person who applied them on behalf of the author. While they're often the same, it doesn't always have to be the case.
A quick look at the author and committer counts tells us that there are mostly small differences, with the number of authors always being larger than the number of committers. Why might this be? Why do some people commit others' work?
Not all projects are the same: some heavily leverage this feature, others not so much. Certain people might have a special role, supervising commits: they review others' work and then commit it on the author's behalf.
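As an illustration, here is a minimal sketch of how author/committer roles could be counted from a local clone of a repository (this is our own sketch, not necessarily the exact method used in the analysis; `%an` and `%cn` are git's author-name and committer-name format placeholders):

```python
import subprocess
from collections import Counter

# Read (author, committer) name pairs from the local git history.
log = subprocess.run(
    ["git", "log", "--pretty=format:%an\t%cn"],
    capture_output=True, text=True, check=True,
).stdout
pairs = [line.split("\t") for line in log.splitlines() if "\t" in line]

authors = {a for a, _ in pairs}
committers = {c for _, c in pairs}
print(f"{len(authors)} distinct authors, {len(committers)} distinct committers")

# "Commit supervisors": people who often commit work authored by someone else.
supervisors = Counter(c for a, c in pairs if a != c)
print("Top 3 commit supervisors:", supervisors.most_common(3))
```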
Let's take a closer look at this...
On the right, we're showing the top 3 "commit supervisors" for each project. Depending on the project, these supervisors have a different workload.
Interesting tidbit: some projects actually rely on bots for these roles, not necessarily humans (e.g. Caffe2 and Tensorflow).
We consider both the frequency of commits (average interval between consecutive commits) and a normalized frequency (average interval between consecutive commits per line of code committed). The normalized frequency takes the size of each commit into account, making it a more resilient measure (e.g. against frequent small commits vs. rare large commits).
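As a rough sketch of how the two measures can be computed (assuming a per-commit pandas DataFrame with a `timestamp` column and a `lines_changed` column; the column names are ours):

```python
import pandas as pd

def commit_frequencies(commits: pd.DataFrame):
    """Return (average interval between commits, average interval per committed line)."""
    commits = commits.sort_values("timestamp")
    span = commits["timestamp"].iloc[-1] - commits["timestamp"].iloc[0]
    avg_interval = span / (len(commits) - 1)               # plain frequency
    avg_per_line = span / commits["lines_changed"].sum()   # normalized by commit size
    return avg_interval, avg_per_line
```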
In short, these projects are extremely active. On average, a new commit is made at least every day! Put differently, a line of code is committed every 15 minutes on average!
Interesting tidbit: Rust has a new line committed every 26 seconds on average!
To check for time patterns in contributors' habits, we decided to take a look at daily and hourly distributions of the commits.
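A small sketch of how these distributions can be obtained (again assuming a per-commit DataFrame with a timezone-aware `timestamp` column, a naming choice of ours):

```python
import pandas as pd

def commit_distributions(commits: pd.DataFrame):
    """Share of commits per weekday and per hour of the day."""
    by_weekday = commits["timestamp"].dt.day_name().value_counts(normalize=True)
    by_hour = commits["timestamp"].dt.hour.value_counts(normalize=True).sort_index()
    return by_weekday, by_hour
```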
The daily commits distribution is shown below — you can select which project you're interested in from the dropdown list, or quickly scroll through them using your cursor keys.
Most projects see a sharp drop during the weekend: these are the projects owned or managed by companies, which makes sense given the working week, with most employees resting during the weekend.
Some notable exceptions, such as Pandas, Numpy and Flask (and a few others to a lesser extent), do not show this sharp drop. These actually happen to be mostly community-driven projects.
We found this out by manually researching each project's backing, but we reach similar results below with a data-driven analysis (email domains).
The hourly commits distribution is shown below — you can select which project you're interested in from the dropdown list, or quickly scroll through them using your cursor keys.
We see that all projects follow a similar hourly distribution: activity increases to reach a peak, then steadily decreases and stays near its minimum for roughly 6 to 8 hours.
This corresponds to most of the commits happening during the working period, with the least activity happening during the evening and at night.
For team-driven projects, much of the work happens during working hours. Contributors to other projects also follow daily schedules (i.e. they sleep at night).
While projects have contributors from all around the world, the majority is likely to come from specific countries (e.g. China, India and the US are ranked 1st, 2nd and 3rd respectively in terms of Internet users (source)), or from countries specific to that project (e.g. Nerv's distribution is shifted towards the right, which coincides with the fact that a large part of its community is Chinese).
To find out if some projects are heavily contributed to by specific communities or teams, we analyzed the proportion of commits whose committers or authors share non-common email domains (e.g. "@microsoft.com" would be considered, but "@gmail.com" would be ignored). We only focused on the top 5 domains for each project that are responsible for at least 1% of the project's commits.
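A hedged sketch of this filtering step (the list of "common" providers and the `author_email` column name are our own placeholders):

```python
import pandas as pd

# Providers that don't identify an organisation (illustrative list, not exhaustive).
COMMON_DOMAINS = {"gmail.com", "hotmail.com", "yahoo.com", "outlook.com",
                  "users.noreply.github.com"}

def top_domains(commits: pd.DataFrame, n=5, min_share=0.01) -> pd.Series:
    """Share of commits per non-common email domain (top n, at least min_share)."""
    domains = commits["author_email"].str.split("@").str[-1].str.lower()
    shares = domains[~domains.isin(COMMON_DOMAINS)].value_counts() / len(commits)
    return shares[shares >= min_share].head(n)
```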
We noticed that the projects can be divided into two groups: community-driven ones, and company-backed/sponsored ones. In all cases, we see the presence of a (relatively) small but vital development team.
A clear example of a company-backed project is CNTK, for which Microsoft employees author 86% of the project's commits! (Other examples include Spark, Caffe2 and Angular.)
On the other hand, an example of a community-driven project is Flask, where the project's core maintainers are responsible for 41% of all commits.
To see if other roles/jobs exist among contributors to a project, we analyze several features relating to the following potential roles: refactoring code, dealing with bugs, dealing with merges and dealing with issues reported on GitHub.
We consider code refactoring to be restructuring existing code without changing its functionality, which in our experience is accompanied by a non-negligible decrease in lines of code. We therefore analyzed each author's tendency to remove lines, keeping only authors with a high tendency to do so (at least 10% of their overall work consists of removing lines of code); this threshold makes the measure more robust.
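A minimal sketch of this heuristic, assuming per-commit insertion/deletion counts are available (column names are ours):

```python
import pandas as pd

def refactoring_authors(commits: pd.DataFrame, threshold=0.10) -> pd.Index:
    """Authors whose deletions make up at least `threshold` of their changed lines."""
    per_author = commits.groupby("author")[["insertions", "deletions"]].sum()
    removal_share = per_author["deletions"] / (per_author["insertions"]
                                               + per_author["deletions"])
    return removal_share[removal_share >= threshold].index
```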
We found that, on average, around 7% of authors mostly remove code. It seems that a small but distinct group of authors has a particular role: refactoring code.
Commit messages briefly summarize the content and aim of each commit. By analyzing commit messages for keywords indicating that the patch resolves an issue or a bug, we computed the percentage of authors dealing with issues for each project.
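As a sketch, a simple keyword match over commit messages could look like this (the keyword list is ours and may differ from the one actually used):

```python
import pandas as pd

# Illustrative keyword pattern for issue/bug-related commits.
ISSUE_KEYWORDS = r"\b(?:fix(?:es|ed)?|close[sd]?|resolve[sd]?|bug|issue)\b"

def issue_author_share(commits: pd.DataFrame) -> float:
    """Fraction of authors with at least one commit message matching the keywords."""
    hits = commits["message"].str.contains(ISSUE_KEYWORDS, case=False, regex=True)
    return commits.loc[hits, "author"].nunique() / commits["author"].nunique()
```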
We found out that, on average, 55% (stddev=10%) of authors are involved with dealing with issues / bugs.
This shows the importance of this activity, and indicates that the role is distributed over a wide range of authors. This makes sense: each author can be responsible for a portion of the project, and some authors focus on issues that affect them personally.
In short, dealing with issues is an important role which involves half of a project's contributors on average.
When multiple people work on a single project, merging their work and resolving any conflicts in their changes is often required. We conducted a similar analysis on commit messages, this time looking for keywords related to merging. We found that, on average, 19% (stddev=11%) of authors are involved in dealing with merges.
This suggests that dealing with merges is a non-negligible role involving around one-fifth of a project's contributors on average.
GitHub allows users to open issues on its web portal. This is commonly used to report bugs or request new features.
To check for the existence of this role, we analyze contributors responsible for closing a significant portion of issues (at least 5%) for each project.
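A sketch of this check, assuming a per-issue table with a `closed_by` field (as exposed by the GitHub API; the DataFrame layout is ours):

```python
import pandas as pd

def main_issue_closers(issues: pd.DataFrame, min_share=0.05) -> pd.Series:
    """Users who closed at least `min_share` of a project's closed issues."""
    closed = issues.dropna(subset=["closed_by"])
    shares = closed["closed_by"].value_counts(normalize=True)
    return shares[shares >= min_share]
```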
We noticed that, on average, a handful of users (mean=3, stddev=1.3) close a sizeable share of the issues. This suggests that dealing with issues reported by the community on GitHub is another common role present in all these projects.
Several online communities are available for developers to learn and share their programming knowledge. The success of a project might be related to how responsive and helpful its online community is. We investigated this by looking at two communities popular with programmers: StackOverflow and Reddit.
To explore each project's popularity on StackOverflow, we counted the number of questions related to it based on StackOverflow's tags. Tag counts are a reasonable proxy for a project's popularity.
To measure the responsiveness of the StackOverflow community, we calculated the percentage of questions with an accepted answer for each project. On average, 42% (stddev=13%) of StackOverflow questions have an accepted answer, which is a relatively good indicator of a community's responsiveness, for two reasons: first, the project has to be popular enough to have people asking questions about it; second, the community is active and responsive enough to answer around half of those questions in a satisfactory manner (other questions might also be answered, but the poster didn't mark an answer as accepted in those cases).
We were also curious about the average resolution time of questions on StackOverflow, i.e. the time taken for the accepted answer to be posted. The results show that, on average, it takes around 9 days for a question to receive an answer that is marked as accepted.
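A sketch of how both measures could be derived from a per-question table (the table and its column names are assumptions, e.g. built from the Stack Exchange data dump or API):

```python
import pandas as pd

def stackoverflow_stats(questions: pd.DataFrame):
    """Accepted-answer rate and mean time until the accepted answer was posted.

    Expects one row per question with 'creation_date' and a nullable
    'accepted_answer_date' column (both datetimes).
    """
    accepted = questions["accepted_answer_date"].notna()
    accepted_rate = accepted.mean()
    resolution_time = (questions.loc[accepted, "accepted_answer_date"]
                       - questions.loc[accepted, "creation_date"]).mean()
    return accepted_rate, resolution_time
```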
To get each project's popularity on Reddit, we counted the number of threads that mention it on a dozen large programming-related subreddits. Rust seems to be a favorite on Reddit! Surprisingly, Angular isn't as popular as we expected given the number of questions about it on StackOverflow. Despite that, it is generally safe to say all of these projects are still quite popular.
Plotting the thread counts per project over the last few years, we can observe a clear upward trend for the projects. This is a good indicator of their popularity, with most of them gaining popularity (or at least remaining relatively popular) over time.
(Some of the projects have a narrower scope than others and won't be mentioned as much in programming subreddits, but they still have a good presence in their own community; the r/mesos subreddit is an example of that.)
Interesting tidbit: you can see that the popularity of AngularJS (version 1, legacy) went down while that of Angular (version 2) went up during the last two years.
We wanted to get a sense of the communication atmosphere among project contributors, as well as between contributors and the wider community. We explore communication among project members through commit messages, and communication with the community through GitHub issues. The comments on GitHub issues also give us insights into the general tone of the back-and-forth between project members and the community. We perform sentiment analysis on each of these three categories.
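The exact sentiment tool isn't named here; as one possible sketch, NLTK's VADER analyzer can be used to label each text (the ±0.05 thresholds on the compound score are a common convention, and our choice):

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

def label_sentiment(text: str) -> str:
    """Classify a text as positive / negative / neutral from VADER's compound score."""
    score = sia.polarity_scores(text)["compound"]
    if score >= 0.05:
        return "positive"
    if score <= -0.05:
        return "negative"
    return "neutral"

# The same labelling can be applied to commit messages, issue bodies and issue
# comments, and the share of each label compared across the three categories.
```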
A sentiment analysis in commit messages, GitHub issues and comments shows that the neutral sentiment is highly dominating.
However, there are still some noticeable differences between these three categories. Commit messages are almost entirely neutral, which is expected since they serve a descriptive and informative function (concisely and objectively stating the commit's purpose). GitHub issues are also mostly neutral, but the share of positive sentiment (on average ~4.6%, stddev ~1.5%) is significantly higher than that of negative sentiment (on average ~0.5%, stddev ~0.5%).
In a nutshell, neutral sentiment dominates both communication among team members and communication with the community. When replying to GitHub issues in the form of comments, the sentiment tends to be more positive than negative.
These are interesting results, since the type of communication seems to play an important role in these community-driven projects. An objective (neutral) and non-negative tone suggests better communication between the community and project members, as well as among the project developers themselves.
We wanted to determine whether or not there were additional common features by further exploring the remaining data. We decided to look into commit summary length, the percentage of closed issues on GitHub and their average resolution time.
The first line of a commit message is considered a concise summary of the commit. Git guides recommend keeping commit summaries around 50 characters: enough to be informative, but not too long.
We calculated the average length of commit summaries for each project, and noticed that most seem to follow this recommendation. Going further and checking the percentage of commit summaries close to 50 characters (between 35 and 60), we found that on average 44% (stddev=6.6%) of summaries follow the recommendation.
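A small sketch of this check (assuming a `message` column holding full commit messages):

```python
import pandas as pd

def summary_length_stats(commits: pd.DataFrame):
    """Average summary length and share of summaries between 35 and 60 characters."""
    lengths = commits["message"].str.split("\n").str[0].str.len()
    return lengths.mean(), lengths.between(35, 60).mean()
```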
We calculated the percentage of closed issues on GitHub for each project, and found that on average 87% (stddev=11%) of issues are closed. This is a good indicator of how actively a project engages with its community, covering both bug reports/feature requests (issues) and community contributions (pull requests) that fix bugs or add features. On one hand, this directly improves the project by solving existing issues or adding features. On the other hand, it indirectly incentivizes the community to keep contributing, whether by reporting issues or by sending pull requests.
We define the resolution time of an issue as the time it took to close it. It is an important measure of the authors' responsiveness to the community, and of the activity going into solving problems and further improving the project.
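A sketch of this computation, assuming per-issue `created_at` and `closed_at` timestamps as returned by the GitHub API:

```python
import pandas as pd

def mean_resolution_time(issues: pd.DataFrame) -> pd.Timedelta:
    """Average time between opening and closing an issue (closed issues only)."""
    closed = issues.dropna(subset=["closed_at"])
    return (closed["closed_at"] - closed["created_at"]).mean()
```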
For each project, we computed the average resolution time for issues. Interestingly, issues take 57 days (stddev=40) to be resolved on average.
This is highly variable from one project to another, with some projects resolving issues much faster (a couple of weeks) and others much slower (a few months). The wide range of resolution times is likely due to the nature of each project and its issues, some being inherently more complex than others and therefore requiring more time to properly debug and fix. Still, the projects are responsive in most cases, with the median of the per-project average resolution time being around a couple of months. This makes sense considering the time required to inquire about an issue, debug it, fix it and test the final outcome.
In short: most issues are resolved within a couple of months on average. This is a good indicator of a project's activity and responsiveness.
Through this ADA project, we were able to identify a series of common features among successful GitHub projects. These might be useful as guidelines for future projects: