A seminar presentation by Matthew Fuller & Richard Mills of ongoing work on the Metacommunities of Code project
Wednesday March 12th 2014
16.00 – 18.00
Goldsmiths, Room RHB 143
The Metacommunities of Code project is an attempt to analyse code-sharing practices in free and open source software repositories, with a particular focus on GitHub. This presentation will discuss: the emergence of repositories as a developing form characteristic of contemporary forms of work; the electronic archive as a space of production; the use of statistical approaches within software studies; the material difficulties of working with and extracting highly mobile, commercially sensitive datasets; and some notes towards an analysis of the nature of code-sharing.
Matthew Fuller works at the Digital Culture Unit at the Centre for Cultural Studies, Goldsmiths. His most recent books are ‘Evil Media’ (with Andrew Goffey) and ‘Elephant and Castle’ and he is an editor of ‘Computational Culture, a journal of software studies’.
Richard Mills is a Researcher with a background in statistics based at Lancaster University. His PhD thesis was an analysis of Reddit.
Metacommunities of Code is a collaboration with Andrew Goffey, Adrian Mackenzie and Stuart Sharples.
The seminar is presented as part of the Data Practices series run by the Departments of Design and of Sociology at Goldsmiths.
This kind of supernova graph shows something perhaps worth more analysis: how organisations on GitHub fork repositories. That is, the graph shows the connections between all the organisations (as elicited from the GitHub event timeline data) and the repositories they fork (that is, make their own copy of). The idea is to get some picture of how organisations structure the flow of forks, and hence of code resources generally, on GitHub. Are organisations an organising element here? It’s a hard problem because the organisations have very different modes of existence on GitHub. Some, like ‘metacommunities’, are small informal groups of people; others, like ‘mozilla’, are large official foundations. As the network graph shows, much of the organisational activity on GitHub is not connected (the halo around the periphery), but the nucleus (coloured using a community-structure algorithm) shows a lot of connectivity that it would be good to unravel more.
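As a rough illustration of separating the disconnected halo from the connected nucleus, the sketch below builds an undirected graph from (organisation, forked repository) pairs and groups the nodes into connected components. The edge list and all names in it are invented for illustration and are not taken from the timeline data, and the community-structure colouring of the nucleus is not reproduced here.

```python
from collections import defaultdict

# Hypothetical (organisation, forked repository) pairs; not real timeline data.
fork_edges = [
    ("org:mozilla", "repo:a/lib"),
    ("org:metacommunities", "repo:a/lib"),
    ("org:metacommunities", "repo:b/tool"),
    ("org:small-group", "repo:c/site"),
]


def connected_components(edges):
    """Return the connected components as a list of node sets, largest first."""
    neighbours = defaultdict(set)
    for org, repo in edges:
        neighbours[org].add(repo)
        neighbours[repo].add(org)

    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from `start`.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        components.append(component)
    return sorted(components, key=len, reverse=True)


components = connected_components(fork_edges)
nucleus, halo = components[0], components[1:]
```

Treating the largest component as the nucleus and everything else as the halo is a simplification; on the real graph the nucleus would still need a community-detection step to be broken down further.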
This image on FoxNews
Here is a video on ‘repo’ from Fox Business News, describing the most heavily funded first round venture capital ever in Silicon Valley.
In a previous post we looked at different approaches to Pull Requests – and concluded by noting that repository pairings often involved multiple pull requests between a base/head repo pair. This post investigates these relationships further, and in particular considers the involvement of individual users. Pull requests allow any individual to compose and propose an amendment to any repository; where they are useful they represent a benefit to the receiving (base) repository. Through making useful pull requests an individual with no prior affiliation to a project could demonstrate their ability and usefulness to the project – and in doing so could potentially earn a place within the project. Is there any evidence that pull requests could serve as a ‘recruitment’ mechanism for projects in this manner – the first point of contact between a prospective employee and employer?
The starting point for this post is the data-set concerning 200k base/head repo pairings which had at least 4 pull requests between them – intra-repo pull requests are excluded. In this set there are 86,231 distinct base repos (indicating that base repos often had a ‘relationship’ with multiple heads) and 194,235 distinct head repos (indicating that head repos occasionally had a ‘relationship’ with multiple base repos).
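The selection described above can be sketched as follows. This is a minimal illustration, assuming each pull request is available as a (base repo, head repo) record; the records and the function name are hypothetical and this is not the BigQuery query actually used.

```python
from collections import Counter

# Hypothetical pull request records as (base_repo, head_repo); not real data.
pull_requests = (
    [("proj/core", "alice/core")] * 4
    + [("proj/core", "bob/core")] * 5
    + [("proj/core", "proj/core")] * 6   # intra-repo, excluded
    + [("other/app", "alice/core")] * 4
    + [("other/app", "carol/app")] * 2   # fewer than 4, excluded
)


def frequent_pairs(records, min_requests=4):
    """Count pull requests per base/head pair, dropping intra-repo requests
    and pairs with fewer than `min_requests` requests."""
    counts = Counter((base, head) for base, head in records if base != head)
    return {pair: n for pair, n in counts.items() if n >= min_requests}


pairs = frequent_pairs(pull_requests)
distinct_bases = {base for base, _ in pairs}
distinct_heads = {head for _, head in pairs}
```

Fewer distinct bases than pairs means some base repos are paired with several heads, which is the pattern the counts above point to.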
For each base/head pair a full list of users who made pushes on the respective repositories was extracted from BigQuery, along with variables relating to the timing and frequency of their pushes. The ids and associated variables for users who made the pull requests connecting these repos were also extracted. In total there are 196k records representing 100k users who made pull requests (some users were involved in multiple base/head repo relationships).
Are there users who pull request their way into a project?
In the data-set there are 25,678 cases where a user had recorded pushes to both the base and head repos (in addition to submitting at least one pull request between these) – but that doesn’t necessarily mean they earned contributor rights through pull requests. What would it look like in the data if a user had done so? The prototypical scenario might go as follows: the user (probably after forking first) makes pushes to what will become the head repo in the base/head pair; they then submit a pull request to the base repo (and given how this data-set was produced they likely made many such pull requests); some time after this there should be a ‘MemberEvent’ when they are officially made a contributor on the base repo; following this they may record pushes to the base repo in the pair directly. There are 3,416 cases which match this profile in the data.
However, the limiting factor here is the requirement of a MemberEvent observed after the user had made pushes and pull requests. These events are sparse in the data – only 5,538 users have a recorded ‘add memberevent’, whereas 25,678 users have at least one push to head and base repo of a pair – and therefore must have had contributor rights to the base repo (either events are missing from the data or they happened before the timeline data begins).
If the criteria are relaxed such that we only require the user to have made first a push on the head, then a pull request, then a push on the base – there are 14,348 users who meet it. This sequence of events is consistent with a user who pull requested their way into a project, but there are of course many other non-github forms of interaction which could have shaped these events.
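The strict and relaxed profiles can be sketched as a check on the order of a user's first events within one base/head pair. Everything here is an assumption made for illustration: the event labels, the integer timestamps, and the choice to compare only the first occurrence of each event type, which is just one possible reading of the criteria.

```python
def first_time(events, kind):
    """Earliest timestamp of the given event kind, or None if absent."""
    times = [t for t, k in events if k == kind]
    return min(times) if times else None


def matches_profile(events, require_member_event=True):
    """True if the first head push precedes the first pull request, which
    precedes the first base push. With `require_member_event`, a member-added
    event must also fall between the first pull request and first base push."""
    head = first_time(events, "push_head")
    request = first_time(events, "pull_request")
    base = first_time(events, "push_base")
    if head is None or request is None or base is None:
        return False
    if not head < request < base:
        return False
    if not require_member_event:
        return True
    member = first_time(events, "member_added")
    return member is not None and request < member < base


# Hypothetical event histories as (timestamp, kind) tuples.
with_member_event = [(1, "push_head"), (2, "pull_request"),
                     (3, "member_added"), (4, "push_base")]
without_member_event = [(1, "push_head"), (2, "pull_request"), (4, "push_base")]
base_push_first = [(1, "push_base"), (2, "push_head"), (3, "pull_request")]
```

Under these assumptions `without_member_event` fails the strict check but passes the relaxed one, which mirrors how the count rises once the MemberEvent requirement is dropped.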
These users made a median of 18 pushes to the head repo and 8 to the base repo, but the variance on these measures is very large. Once they had the ability to make pushes on the base repo they no longer needed to make pull requests, technically at least. 5,244 stopped making pull requests once they had made their first push to the base repo; for the remainder there is an overlap between the time when they were making pull requests and the time when they began making pushes on the base repo directly.
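One way to separate users who stopped making pull requests from those with an overlap is to compare each user's last pull request with their first base push. The sketch below assumes per-user lists of timestamps; the user ids, field names and values are hypothetical.

```python
from statistics import median

# Hypothetical per-user timestamps for one base/head pair each.
users = {
    "u1": {"pull_requests": [2, 3], "base_pushes": [5, 6]},
    "u2": {"pull_requests": [2, 7], "base_pushes": [5]},
    "u3": {"pull_requests": [1], "base_pushes": [4, 8, 9]},
}


def stopped_after_first_base_push(record):
    """True if every pull request predates the user's first base push."""
    return max(record["pull_requests"]) < min(record["base_pushes"])


stopped = sorted(u for u, r in users.items() if stopped_after_first_base_push(r))
overlapping = sorted(set(users) - set(stopped))
median_base_pushes = median(len(r["base_pushes"]) for r in users.values())
```

The median is used here, as in the text, because a few very active users would dominate a mean when the variance is this large.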
http://www.wired.com/wiredenterprise/2013/09/github-for-anything/?mbid=social11447144 The discussion of how the White House uses github is quite interesting. It suggests we could look more at the organisational diversity of github users.
This post follows on from the previous post about Pull Requests – where that post considered individual pull requests this will consider the same data from the perspective of the ‘base’ repo (the repo which is ‘receiving’ the pull request). The data for this post was obtained from the github timeline on BigQuery, with results being grouped by the base repo url. In total there are 195,509 repos which received at least one pull request event in the time period covered by timeline – that’s around 5% of all the repos which appear in the timeline data.
First a note about the data… as it comes from the timeline, everything is event-based: an individual pull request typically has 2 events associated with it, one where it is opened and one where it is closed. However, there are individual pull request IDs which show up in more than 2 rows of data (i.e. they were opened/closed/re-opened multiple times). I’ll mostly be talking about repos’ numbers of distinct pull requests but the nature of the data-set I pulled from BigQuery means sometimes I have to resort to talking about numbers of open/close events.
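The difference between counting distinct pull requests and counting open/close events can be illustrated with a small sketch. The event rows and the action labels ("opened", "closed", "reopened") are assumptions about how the timeline rows might look, not an extract from the data.

```python
from collections import Counter

# Hypothetical event rows as (pull_request_id, action).
events = [
    (101, "opened"), (101, "closed"),
    (102, "opened"), (102, "closed"), (102, "reopened"), (102, "closed"),
    (103, "opened"),
]

events_per_request = Counter(pr_id for pr_id, _ in events)

distinct_requests = len(events_per_request)
open_events = sum(1 for _, action in events if action in ("opened", "reopened"))
close_events = sum(1 for _, action in events if action == "closed")

# Requests appearing in more than 2 rows were re-opened at least once.
reopened = [pr_id for pr_id, n in events_per_request.items() if n > 2]
```

Here three distinct pull requests produce four open events and three close events, so the two kinds of count can diverge even on a tiny sample.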
The following graph shows that the distribution of distinct Pull Requests between repositories is highly skewed. This data-set follows the 80/20 rule almost perfectly, with 80% of Pull Requests being received by 20% of the repositories (which had any pull requests).
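The 80/20 figure can be checked with a calculation along these lines. This is a minimal sketch: the per-repository counts are made up so that the arithmetic is easy to follow, and the function name is hypothetical.

```python
def share_received_by_top(counts, top_fraction=0.2):
    """Share of all pull requests received by the most-requested
    `top_fraction` of repositories."""
    ordered = sorted(counts, reverse=True)
    top_n = max(1, round(len(ordered) * top_fraction))
    return sum(ordered[:top_n]) / sum(ordered)


# Hypothetical distinct pull request counts for ten repositories.
counts = [40, 40, 5, 5, 3, 3, 1, 1, 1, 1]
share = share_received_by_top(counts)
```

With these numbers the top two repositories (20% of ten) hold 80 of the 100 pull requests.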
I started off here by trying to quantify the number of repositories which follow one of the approaches to pull requests outlined in the previous post – repositories which use pull requests internally (i.e. intra-repo).
Internal Use of Pull Requests
The previous post identified ‘intra-repo’ Pull Requests as an interesting use of the pull request mechanism. These are pull requests which are made within the same repo; they are not technically necessary because the user who created them would also have permission to push their changes directly to the repository. Instead they probably represent a deliberate decision by a project to use this feature as a way of tracking changes to the main branch of their repository. The graph below shows that the use of ‘intra-repo’ pull requests tends to be all-or-nothing for an individual repository (i.e. all or none of their pull requests are intra-repo).
I defined a few types of repository in the data to count their frequency.
Type 1: Single Pull Request which is intra-repo: 12,656 repos (6.5%). I am treating these separately because the one-time use of an intra-repo pull request is more likely to represent an individual testing the pull request system rather than the repo making a deliberate choice to use intra-repo pull requests as a form of code review.
Type 2: Repo has more than 1 pull request and 95% or greater pull requests are intra-repo (‘purity of practice’): 15,916 repos (8%). These are repositories which use intra-repo pull requests and which have received no (or very few) pull requests from external repositories.
Example: ‘dotCMS/dotCMS’ – has 1758 PullRequest OpenEvents, of which 1742 are intra-repo and 1604 are merged.
Type 3: Repo has more than 100 Intra-Repo pull requests OR a proportion which is 50% or greater (mixed practice): 4683 repos (2.5%). I refer to these as ‘mixed practice’ but really it’s not up to the repository itself whether it receives external pull requests (changing its classification from type 2 to 3).
Example: ‘mozilla/browserid’ – has 2768 PullRequest OpenEvents, of which 870 are intra-repo.
There are also pull requests which were merged by the same user who instigated them, so this would be another form of ‘internal use’ of pull requests. As you might imagine there’s a lot of overlap between ‘intra-repo’ pull requests and those merged by the user who created them. I added a fourth type which is secondary to those concerning intra-repo pull requests.
Type 4: Repo has 50% or greater pull requests which were merged by the same user who instigated them. In total there are 30619 repos which meet this criterion – 14235 are classified as type 4 (the remainder meet one of the intra-repo criteria).
Example: ‘angular/angular.js’ – this one is interesting because it seems to receive a lot of pull requests from external repos but the only merges are ‘self’ merges. 937 Distinct Pull Requests from 433 Distinct Head Repos – 169 are merged, and 163 of these originated with the same user who accepted the merge (but are not intra-repo).
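The four types can be written as a small classifier. This is a sketch under stated assumptions rather than the procedure actually used: the types are tested in order 1 to 4, and the type 4 share is taken over merged pull requests, since that is the reading under which the angular/angular.js example qualifies (163 of 169 merges). The function and parameter names are hypothetical.

```python
def classify(total, intra, merged=0, self_merged=0):
    """Assign a repository to one of the four types, tested in order.

    total       -- number of pull requests received
    intra       -- number of those which are intra-repo
    merged      -- number of merged pull requests
    self_merged -- merged pull requests accepted by the user who made them
    """
    intra_share = intra / total
    if total == 1 and intra == 1:
        return "type 1"
    if total > 1 and intra_share >= 0.95:
        return "type 2"
    if intra > 100 or intra_share >= 0.5:
        return "type 3"
    # Assumption: the 50% threshold applies to merged pull requests.
    if merged > 0 and self_merged / merged >= 0.5:
        return "type 4"
    return "unclassified"


# Figures quoted in the examples above.
dotcms = classify(1758, 1742)
browserid = classify(2768, 870)
# The intra-repo count for angular/angular.js is not given; 0 is assumed here.
angular = classify(937, 0, merged=169, self_merged=163)
```

The examples above mix open-event counts and distinct pull request counts, so the inputs are only roughly comparable across repositories.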
Is there a relationship between repository activity-level and approach to pull requests?
One question we might ask about these ‘internal use’ approaches to pull requests is whether the repositories which adopt them tend to be of a particular size (i.e. small). To check this out I linked the repositories in this data-set to the previous ‘repo event census’ data-set and made a stacked bar plot with each bar showing repositories from one of the ‘event bins’ – bars being coloured to reflect the approaches to pull requests defined above. This graph suggests the absence of a strong relationship between repo activity level and approach to pull requests – with the exception of repos that had a single intra-repo pull request, these being much more common among the ‘low-activity’ bins.
Although this typology is interesting it leaves us in the dark about most of the repositories which have pull requests. From here I’ll be working with the full data-set aside from repos with just a single pull request.
Do repos deal with all of their incoming pull requests?
This graph considers the proportion of pull request open events which have a corresponding close event – the panes represent repos with a specified number of distinct pull requests. Most of the repositories have closed all of the pull requests which they received (with a high proportion of these closes involving the merging of the pull request). A small proportion of repositories which have 2-4 distinct pull requests don’t seem to be closing the pull requests they receive. Among repositories with at least 5 distinct pull requests very few have a low proportion of pull request closes. Perhaps users who are thinking about making a pull request on a repository look at its previous history of dealing with pull requests – being discouraged if they see a number of unresolved pull requests.
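The proportion plotted here can be computed per repository roughly as follows. This is a minimal sketch assuming event rows of the form (base repo, pull request id, action); the rows, action labels and function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical event rows as (base_repo, pull_request_id, action).
events = [
    ("a/x", 1, "opened"), ("a/x", 1, "closed"),
    ("a/x", 2, "opened"),
    ("b/y", 1, "opened"), ("b/y", 1, "closed"),
]


def closed_proportion(rows):
    """For each repo, the share of opened pull requests that also have
    a close event."""
    opened, closed = defaultdict(set), defaultdict(set)
    for repo, pr_id, action in rows:
        if action == "opened":
            opened[repo].add(pr_id)
        elif action == "closed":
            closed[repo].add(pr_id)
    return {
        repo: len(opened[repo] & closed[repo]) / len(opened[repo])
        for repo in opened
    }


proportions = closed_proportion(events)
```

Matching on pull request id rather than simply dividing close events by open events avoids counting a re-opened and re-closed request twice.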
Do repos have a designated individual who deals with pull requests?
Any github user who is registered as a contributor for a repo can close pull requests. However, this seems like quite a specialised task, with the individual who deals with the pull request needing quite a high-level understanding of the project’s code and direction – and therefore one individual may be designated to deal with incoming pull requests. In this data-set the majority of repositories (69% of those with at least 2 pull requests) have a single user who made all of the merges for the repository. The following graph shows the number of different users to accept Pull Request merges for each repo, excluding repos where all merges were accepted by the same user.
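Counting the distinct users who accepted merges for each repo could look something like the sketch below. The merge records and user names are hypothetical, and the share it produces has nothing to do with the 69% reported above.

```python
from collections import defaultdict

# Hypothetical merge records as (base_repo, user_who_merged).
merges = [
    ("a/x", "alice"), ("a/x", "alice"),
    ("b/y", "bob"), ("b/y", "carol"),
    ("c/z", "dan"),
]

mergers = defaultdict(set)
for repo, user in merges:
    mergers[repo].add(user)

# Repos where one user accepted every merge.
single_merger = sorted(repo for repo, who in mergers.items() if len(who) == 1)
share_single = len(single_merger) / len(mergers)
```

Repos left out of `single_merger` are the ones that would appear in a graph of repos with more than one merging user.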
Base/Head repo relationships
One of the trends which is apparent in this data-set is that the number of distinct head repos is often much smaller than the number of distinct pull requests – i.e. base repos must be receiving multiple pull requests from the same head repo(s). For each base repository I divided the number of distinct pull requests by the number of distinct head repos (to make a pull request on that base repository) to produce a rough ‘pull requests per distinct head repo’ variable. I have graphed this below for all repositories which received at least two pull requests – panes represent repositories with a given number of distinct pull requests.
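The ‘pull requests per distinct head repo’ variable is a simple ratio, sketched here under the assumption that each record gives the base repo, head repo and pull request id. The records and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical records as (base_repo, head_repo, pull_request_id).
pull_requests = [
    ("proj/core", "alice/core", 1),
    ("proj/core", "alice/core", 2),
    ("proj/core", "bob/core", 3),
    ("other/app", "carol/app", 1),
]


def requests_per_head(records):
    """Distinct pull requests divided by distinct head repos, per base repo."""
    request_ids, heads = defaultdict(set), defaultdict(set)
    for base, head, pr_id in records:
        request_ids[base].add(pr_id)
        heads[base].add(head)
    return {base: len(request_ids[base]) / len(heads[base]) for base in request_ids}


ratios = requests_per_head(pull_requests)
```

A ratio of 1.0 means every pull request came from a different head repo; anything above that means at least one head repo sent more than one.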
This graph suggests that it is common for a base/head repo pair to consist of more than one pull request. For example, where a base repo has received two pull requests these originate with the same head repo more often than not. This suggests the possibility of ongoing relationships between pairs (or sets) of repos, facilitated by github’s fork and pull request mechanisms. Furthermore, it is possible to study the development of these relationships in considerable detail with the Timeline data – but I’ll leave that for another post as this one is long enough already.