July 3, 2013 / milllss

Some notes on github Pull Requests

Pull requests are an interesting aspect of ‘social coding’ on github. Any user can fork any repository, work on it, push their changes, and submit a Pull Request to the parent repository. The Pull Request is reviewed by the parent repository’s owner(s) and, if it is accepted, the changes can be merged automatically into the parent repository.

This sequence of events is known as the ‘Fork to Pull’ model and there are already some studies of its prevalence on github. It allows individuals who are not part of a project to contribute in an ad hoc manner, sometimes leading to sustained contribution, sometimes a one-off event (known as ‘drive-by commits’).

Pull Requests on the github timeline – accessed through BigQuery

Pull Requests are one of the event types recorded on the github timeline (which has around 3.4 million events of this type). There follows a rudimentary exploration of what the data on pull requests can tell us, based on a table of the most recent 100k Pull Request Events.

The first thing to note is that a row of data is written to the timeline whenever a Pull Request is opened or closed, so each Pull Request will tend to have two ‘Pull Request Events’ recorded on the timeline. In the sample of 100k rows, 52,743 relate to the opening of a pull request, 46,630 to the closing of a pull request, and 627 to the reopening of a pull request. Of the 46,630 events which relate to the closing of a pull request, the pull request has been merged in 33,567 cases (72%), and not merged in the remaining 13,204 cases (28%). Are these cases where the pull request has been rejected by the repo’s owner/maintainer?
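
The breakdown above can be reproduced with a simple tally once the rows are exported. The following is only a rough sketch: the sample rows are made up, and the field names `payload_action` and `payload_pull_request_merged` are assumptions modelled on the flattened timeline column naming rather than a confirmed schema.

```python
from collections import Counter

# Hypothetical sample rows; the field names are assumptions, not a
# confirmed description of the timeline table.
events = [
    {"payload_action": "opened", "payload_pull_request_merged": False},
    {"payload_action": "closed", "payload_pull_request_merged": True},
    {"payload_action": "closed", "payload_pull_request_merged": False},
    {"payload_action": "reopened", "payload_pull_request_merged": False},
    {"payload_action": "closed", "payload_pull_request_merged": True},
]

def summarise(rows):
    """Count events per action and the merged share among closing events."""
    actions = Counter(row["payload_action"] for row in rows)
    closed = [row for row in rows if row["payload_action"] == "closed"]
    merged = sum(1 for row in closed if row["payload_pull_request_merged"])
    # Guard against a sample that contains no closing events at all.
    merged_share = merged / len(closed) if closed else 0.0
    return actions, merged, merged_share

actions, merged, merged_share = summarise(events)
print(dict(actions))
print(f"merged: {merged} of {actions['closed']} closed ({merged_share:.0%})")
```

Because opening and closing are separate rows, the merged share is taken over closing events only, not over all rows in the sample.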

When a pull request is submitted, github automatically checks whether it is technically possible to merge it (that is, whether the changes can be applied to the base branch without conflicts). The result of this check is recorded in the ‘payload_pull_request_mergeable’ variable, which is False for 3,391 of the Pull Request Events.

Each row of timeline data for pull requests has a host of variables relating to the ‘head’ (repo which contains the changes to be merged) and ‘base’ (repo which changes are to be merged into) repositories. In 13,963 Pull Request events the head repo is the same as the base repo. These cases represent the use of pull requests where they are not required (the changes could have been pushed or merged by the contributor without a pull request being lodged and approved) – a divergence from the prototypical fork and pull request model. Similarly, there are 6,556 rows of data in this sample (14% of all rows which relate to the closing of a request) where the user who created the pull request is the same user who merged it.
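
The two checks described above (head repo equal to base repo, and creator equal to merger) could be written roughly as follows. This is a sketch on invented records: the keys `head_repo`, `base_repo`, `author` and `merged_by` are illustrative names, not the actual timeline columns.

```python
# Hypothetical, simplified records; the keys are illustrative assumptions
# and do not match the real timeline column names.
pull_requests = [
    {"head_repo": "alice/project", "base_repo": "bob/project",
     "author": "alice", "merged_by": "bob"},
    {"head_repo": "bob/project", "base_repo": "bob/project",
     "author": "bob", "merged_by": "bob"},
    {"head_repo": "carol/project", "base_repo": "bob/project",
     "author": "carol", "merged_by": None},  # closed without being merged
]

def is_same_repo(pr):
    """True when the changes come from the repository they are merged into."""
    return pr["head_repo"] == pr["base_repo"]

def is_self_merged(pr):
    """True when the pull request was merged by the user who created it."""
    return pr["merged_by"] is not None and pr["merged_by"] == pr["author"]

same_repo = [pr for pr in pull_requests if is_same_repo(pr)]
self_merged = [pr for pr in pull_requests if is_self_merged(pr)]
print(len(same_repo), len(self_merged))
```

The two conditions are kept separate because they need not coincide: a maintainer can merge their own pull request from a fork, and a same-repo pull request can be merged by a colleague.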

A comment stumbled upon in the description of a repository (arsduo/newsgirl) supports the idea that some projects use pull requests (where they are not required) as a way to track changes to the master branch.

At 6Wunderkinder, no code goes into master except via pull request merged after group review. We’ve found this process to be very worthwhile both in ensuring quality (it’s a lot easier to raise a question in a safe group setting than one on one) and in diffusing knowledge of our systems to the entire team.

Pull Requests on ghtorrent

GHTorrent is (another) project which mirrors data from github. I came across it yesterday and I think it looks very useful – it stores the data in a variety of tables in a MySQL database, which can be queried in a browser window here or downloaded as SQL dumps. GHTorrent collects data through the github API (for specifics of data collection see this document) – the data it currently holds is similar in timespan and breadth to that which is available through BigQuery for the github timeline.

GHTorrent’s pull_requests table has 1,527,377 rows, each relating to a single pull request. GHTorrent stores the history of a pull request (its open and close/merge events) in a separate table (pull_request_history). Of all the pull requests, 1,393,610 (91%) have been closed; of these, 59% have been merged and 41% have not been merged.

15% of all pull requests on ghtorrent are ‘intra-branch’ (assuming these are the cases where base and head repo are the same). 61% of these have been merged, which is only slightly higher than the share for the set of all pull requests.

Finally, this document on ghtorrent has a description of their data collection procedures and database schema – and also some interesting observations about github API data. The following passage is of particular relevance here, specifically to the pull requests identified above which are closed but not merged.

Indeed, several projects choose to track the discussion on pull requests using Github’s facilities while doing the actual merge using git. This behaviour can be observed in projects where an unusually big number of pull requests are closed without being reported as merged. In such cases, we can deduce that a pull request has been merged by checking whether the commits (identified by their SHA id) appear in the main project’s repository (through a metadata query). However, this heuristic is not complete, as several projects use commit-squashing or even diff-based patching to transfer commits between branches, thereby losing authorship information.

This suggests that not all of the pull requests which have been closed without being merged have been ‘rejected’ – there is another method through which they could be merged without leaving a trace in terms of Pull Request events.
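
The heuristic in the quoted passage can be illustrated with a small sketch. It is not GHTorrent's implementation: the function name, the shortened SHA values and the decision to require every commit (rather than any one commit) to be present are all assumptions made for the example.

```python
def inferred_merged(pr_commit_shas, main_repo_shas):
    """Guess whether a closed pull request was merged outside github.

    Assumption for this sketch: the pull request counts as merged only if
    all of its commits appear in the main repository's history.
    """
    shas = set(pr_commit_shas)
    # An empty commit list gives no evidence, so treat it as not merged.
    return bool(shas) and shas.issubset(main_repo_shas)

# Hypothetical, shortened SHA ids standing in for the main repo's commits.
main_repo_shas = {"a1f3", "b27c", "c9d0", "e5aa"}

merged_with_git = ["b27c", "c9d0"]  # commits pushed directly using git
squashed = ["f001", "f002"]         # squashed on merge, so new SHAs were created

print(inferred_merged(merged_with_git, main_repo_shas))
print(inferred_merged(squashed, main_repo_shas))
```

The second case shows the limitation mentioned in the quote: after squashing or diff-based patching the original SHA ids never reach the main repository, so the check reports the pull request as unmerged even though its changes were accepted.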
