August 27, 2013 / milllss

More notes on Pull Requests – Base Repository perspective

This post follows on from the previous post about Pull Requests. Where that post considered individual pull requests, this one considers the same data from the perspective of the ‘base’ repo (the repo which is ‘receiving’ the pull request). The data for this post was obtained from the GitHub timeline on BigQuery, with results grouped by the base repo URL. In total there are 195,509 repos which received at least one pull request event in the time period covered by the timeline, which is around 5% of all the repos that appear in the timeline data.

First a note about the data: as it comes from the timeline, everything is event-based. An individual pull request typically has 2 events associated with it, one where it is opened and one where it is closed. However, there are individual pull request IDs which show up in more than 2 rows of data (i.e. they were opened, closed and re-opened, possibly several times). I’ll mostly be talking about repos’ numbers of distinct pull requests, but the nature of the data-set I pulled from BigQuery means I sometimes have to resort to talking about numbers of open/close events.
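To make the difference between events and distinct pull requests concrete, here is a minimal sketch in Python. The row layout `(base_repo, pull_request_id, action)` and the sample rows are hypothetical, made up for illustration; they are not the actual timeline schema or data.

```python
from collections import defaultdict

# Hypothetical event rows: (base_repo, pull_request_id, action).
# Pull request 1 is opened, closed, re-opened and closed again.
events = [
    ("alice/widgets", 1, "opened"),
    ("alice/widgets", 1, "closed"),
    ("alice/widgets", 1, "reopened"),
    ("alice/widgets", 1, "closed"),
    ("alice/widgets", 2, "opened"),
]


def summarise_events(rows):
    """Return {repo: (number of events, number of distinct pull requests)}."""
    event_counts = defaultdict(int)
    pull_request_ids = defaultdict(set)
    for repo, pr_id, _action in rows:
        event_counts[repo] += 1
        pull_request_ids[repo].add(pr_id)
    return {
        repo: (event_counts[repo], len(pull_request_ids[repo]))
        for repo in event_counts
    }


summary = summarise_events(events)
print(summary)  # {'alice/widgets': (5, 2)}
```

Five events collapse to two distinct pull requests here, which is why the two kinds of count can diverge for repos that re-open pull requests a lot.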

The following graph shows that the distribution of distinct Pull Requests between repositories is highly skewed. This data-set follows the 80/20 rule almost perfectly, with 80% of Pull Requests being received by 20% of the repositories (which had any pull requests).

pullrequest_dist_for_repos
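One way to check an 80/20 claim like this is to sort the repos by their number of distinct pull requests and see what share of the total the top fifth accounts for. The sketch below assumes a plain list of per-repo counts; the function name and the sample numbers are invented for illustration.

```python
def share_received_by_top(pr_counts, top_fraction=0.2):
    """Share of all pull requests received by the top `top_fraction` of repos."""
    counts = sorted(pr_counts, reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    # Always include at least one repo in the "top" group.
    n_top = max(1, round(len(counts) * top_fraction))
    return sum(counts[:n_top]) / total


# Hypothetical counts for five repos: one busy repo and four quiet ones.
example_counts = [5, 80, 5, 5, 5]
print(share_received_by_top(example_counts))  # 0.8
```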

I started off here by trying to quantify the number of repositories which follow one of the approaches to pull requests outlined in the previous post – repositories which use pull requests internally (i.e. intra-repo).

Internal Use of Pull Requests

The previous post identified ‘intra-repo’ Pull Requests as an interesting use of the pull request mechanism. These are pull requests which are made within the same repo. They are not technically necessary, because the user who created them would also have permission to push their changes directly to the repository. Instead they probably represent a deliberate decision by a project to use this feature as a way of tracking changes to the main branch of their repository. The graph below shows that the use of ‘intra-repo’ pull requests tends to be all-or-nothing for an individual repository (i.e. all or none of their pull requests are intra-repo).

Intra Repo Proportion

I defined a few types of repository in the data to count their frequency.

Type 1: Single Pull Request which is intra-repo: 12,656 repos (6.5%). I am treating these separately because the one-time use of an intra-repo pull request is more likely to represent an individual testing the pull request system than the repo making a deliberate choice to use intra-repo pull requests as a form of code review.

Type 2: Repo has more than 1 pull request and 95% or greater pull requests are intra-repo (‘purity of practice’): 15,916 repos (8%). These are repositories which use intra-repo pull requests and which have received no (or very few) pull requests from external repositories.

Example: ‘dotCMS/dotCMS‘ – has 1758 PullRequest OpenEvents, of which 1742 are intra-repo and 1604 are merged.

Type 3: Repo has more than 100 Intra-Repo pull requests OR a proportion which is 50% or greater (mixed practice): 4,683 repos (2.5%). I refer to these as ‘mixed practice’, but really it’s not up to the repository itself whether it receives external pull requests (which is what would change its classification from type 2 to type 3).

Example: ‘mozilla/browserid‘ – has 2768 PullRequest OpenEvents, of which 870 are intra-repo.

There are also pull requests which were merged by the same user who instigated them, so this would be another form of ‘internal use’ of pull requests. As you might imagine there’s a lot of overlap between ‘intra-repo’ pull requests and those merged by the user who created them. I added a fourth type which is secondary to those concerning intra-repo pull requests.

Type 4: Repo has 50% or greater pull requests which were merged by the same user who instigated them. In total there are 30619 repos which meet this criterion – 14235 are classified as type 4 (the remainder meet one of the intra-repo criteria).

Example: ‘angular/angular.js‘ – this one is interesting because it seems to receive a lot of pull requests from external repos but the only merges are ‘self’ merges. 937 Distinct Pull Requests from 433 Distinct Head Repos – 169 are merged, and 163 of these originated with the same user who accepted the merge (but are not intra-repo).
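The four types can be written down as a small rule-based classifier. This is only a sketch of the definitions as I read them, not the code used to produce the counts above. Two assumptions are worth flagging: the rules are applied in order (so a repo matching an intra-repo type is never type 4), and the type 4 proportion is taken over merged pull requests rather than all pull requests, which is what the angular.js figures seem to imply but is not stated outright. The `RepoStats` fields and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RepoStats:
    pull_requests: int  # distinct pull requests received
    intra_repo: int     # of those, how many came from the repo itself
    merged: int         # distinct pull requests that were merged
    self_merged: int    # merged by the same user who opened them


def classify(stats):
    """Assign a repo to one of the pull request 'types' (rules checked in order)."""
    if stats.pull_requests == 0:
        return "none"
    intra_share = stats.intra_repo / stats.pull_requests
    if stats.pull_requests == 1 and stats.intra_repo == 1:
        return "type 1"  # single intra-repo pull request
    if stats.pull_requests > 1 and intra_share >= 0.95:
        return "type 2"  # 'purity of practice'
    if stats.intra_repo > 100 or intra_share >= 0.5:
        return "type 3"  # 'mixed practice'
    # Assumption: the 50% threshold is measured against merged pull requests.
    if stats.merged > 0 and stats.self_merged / stats.merged >= 0.5:
        return "type 4"  # mostly self-merged
    return "other"


# Hypothetical repos, one per outcome.
print(classify(RepoStats(1, 1, 0, 0)))
print(classify(RepoStats(200, 198, 150, 100)))
print(classify(RepoStats(400, 120, 0, 0)))
print(classify(RepoStats(900, 0, 170, 160)))
print(classify(RepoStats(10, 1, 2, 0)))
```

Checking the rules in this order mirrors the way type 4 is described as secondary to the intra-repo types.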

Is there a relationship between repository activity-level and approach to pull requests?

One question we might ask about these ‘internal use’ approaches to pull requests is whether the repositories which adopt them tend to be of a particular size (i.e. small). To check this out I linked the repositories in this data-set to the previous ‘repo event census’ data-set and made a stacked bar plot with each bar showing repositories from one of the ‘event bins’ – bars being coloured to reflect the approaches to pull requests defined above. This graph suggests the absence of a strong relationship between repo activity level and approach to pull requests – with the exception of repos that had a single intra-repo pull request, these being much more common among the ‘low-activity’ bins.

type1-eventbins

Although this typology is interesting it leaves us in the dark about most of the repositories which have pull requests. From here I’ll be working with the full data-set aside from repos with just a single pull request.

Do repos deal with all of their incoming pull requests?

Closedprop

This graph considers the proportion of pull request open events which have a corresponding close event – the panes represent repos with a specified number of distinct pull requests. Most of the repositories have closed all of the pull requests which they received (with a high proportion of these closes involving the merging of the pull request). A small proportion of repositories which have 2-4 distinct pull requests don’t seem to be closing the pull requests they receive. Among repositories with at least 5 distinct pull requests very few have a low proportion of pull request closes. Perhaps users who are thinking about making a pull request on a repository look at its previous history of dealing with pull requests – being discouraged if they see a number of unresolved pull requests.
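As a rough illustration of the ‘closed proportion’ for a single repo, the sketch below works on hypothetical `(pull_request_id, action)` pairs. It simplifies things by treating a pull request as closed if it has any close event at all, so one that was re-opened and left open would still count as closed.

```python
def closed_proportion(events):
    """Proportion of opened pull requests that have at least one close event.

    `events` is an iterable of (pull_request_id, action) pairs for one repo.
    Returns None if the repo has no open events.
    """
    opened = set()
    closed = set()
    for pr_id, action in events:
        if action == "opened":
            opened.add(pr_id)
        elif action == "closed":
            closed.add(pr_id)
    if not opened:
        return None
    return len(opened & closed) / len(opened)


# Hypothetical repo: four pull requests opened, two of them closed.
example_events = [
    (1, "opened"), (1, "closed"),
    (2, "opened"),
    (3, "opened"), (3, "closed"),
    (4, "opened"),
]
print(closed_proportion(example_events))  # 0.5
```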

Do repos have a designated individual who deals with pull requests?

Any GitHub user with push access to a repo (a collaborator) can merge or close its pull requests. However, this seems like quite a specialised task, with the individual who deals with the pull request needing quite a high-level understanding of the project’s code and direction, so one individual may be designated to deal with incoming pull requests. In this data-set the majority of repositories (69% of those with at least 2 pull requests) have a single user who made all of the merges for the repository. The following graph shows the number of different users who accepted Pull Request merges for each repo, excluding repos where all merges were accepted by the same user.

mergers
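Counting the distinct users who accepted merges is a simple group-and-count. The sketch below assumes hypothetical `(base_repo, merging_user)` pairs, one per merge event, and only covers repos with at least one merge; the post does not spell out how repos without any merges were treated in the 69% figure.

```python
from collections import defaultdict


def mergers_per_repo(merges):
    """Return {repo: number of distinct users who merged pull requests}."""
    users = defaultdict(set)
    for repo, user in merges:
        users[repo].add(user)
    return {repo: len(names) for repo, names in users.items()}


def single_merger_share(merger_counts):
    """Share of repos where every merge was made by the same user."""
    if not merger_counts:
        return 0.0
    single = sum(1 for n in merger_counts.values() if n == 1)
    return single / len(merger_counts)


# Hypothetical merge events for three repos.
example_merges = [
    ("alice/widgets", "alice"),
    ("alice/widgets", "alice"),
    ("bob/gadgets", "bob"),
    ("org/tool", "carol"),
    ("org/tool", "dave"),
]
counts = mergers_per_repo(example_merges)
print(counts)
print(single_merger_share(counts))
```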

Base/Head repo relationships

One of the trends which is apparent in this data-set is that the number of distinct head repos is often much smaller than the number of distinct pull requests, i.e. base repos must be receiving multiple pull requests from the same head repo(s). For each base repository I divided the number of distinct pull requests by the number of distinct head repos (those which made a pull request on that base repository) to produce a rough ‘pull requests per distinct head repo’ variable. I have graphed this below for all repositories which received at least two pull requests – panes represent repositories with a given number of distinct pull requests.

pullsperheadrepo

This graph suggests that it is common for a base/head repo pair to account for more than one pull request. For example, where a base repo has received two pull requests these originate with the same head repo more often than not. This suggests the possibility of ongoing relationships between pairs (or sets) of repos, facilitated by GitHub’s fork and pull request mechanisms. Furthermore, it is possible to study the development of these relationships in considerable detail with the Timeline data, but I’ll leave that for another post as this one is long enough already.
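For reference, the ‘pull requests per distinct head repo’ variable described above boils down to a ratio of two distinct counts. A minimal sketch, assuming hypothetical `(pull_request_id, head_repo)` pairs for a single base repo:

```python
def pulls_per_head_repo(pull_requests):
    """Distinct pull requests divided by distinct head repos for one base repo.

    `pull_requests` is an iterable of (pull_request_id, head_repo) pairs.
    Returns None if there are no pull requests.
    """
    pr_ids = set()
    head_repos = set()
    for pr_id, head_repo in pull_requests:
        pr_ids.add(pr_id)
        head_repos.add(head_repo)
    if not head_repos:
        return None
    return len(pr_ids) / len(head_repos)


# Hypothetical base repo: three pull requests from two forks.
example_pulls = [
    (1, "bob/widgets"),
    (2, "bob/widgets"),
    (3, "carol/widgets"),
]
print(pulls_per_head_repo(example_pulls))  # 1.5
```

Using the angular/angular.js figures quoted earlier, 937 distinct pull requests from 433 distinct head repos works out at roughly 2.2 pull requests per head repo.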
