This kind of supernova graph shows something perhaps worth more analysis — how organisations on github fork repositories. That is, the graph shows the connections between all the organisations (as elicited from the Github event timeline data) and repositories they fork (that is, make their own copy). The idea is to get some picture of how organisations structure the flow of forks and hence of code resources generally on Github. Are organisations an organising element here? It’s a hard problem because the organisations have very different modes of existence on github. Some, like ‘metacommunities’ are small informal groups of people, others like ‘mozilla’ are large official foundations. As the network graph shows, much of the organisational activity on github is not connected (the halo around the periphery), but that nucleus (coloured using a community-structure algorithm) shows a lot of connectivity that it would be good to unravel more.
This image on FoxNews
Here is a video on ‘repo’ from Fox Business News, describing the most heavily funded first round venture capital ever in Silicon Valley.
In a previous post we looked at different approaches to Pull Requests – and concluded by noting that repository pairings often involved multiple pull requests between a base/head repo pair. This post investigates these relationships further, and in particular considers the involvement of individual users. Pull requests allow any individual to compose and propose an amendment to any repository; where they are useful, they represent a benefit to the receiving (base) repository. By making useful pull requests, an individual with no prior affiliation to a project could demonstrate their ability and usefulness to the project – and in doing so could potentially earn a place within the project. Is there any evidence that pull requests could serve as a ‘recruitment’ mechanism for projects in this manner – the first point of contact between a prospective employee and employer?
The starting point for this post is the data-set concerning 200k base/head repo pairings which had at least 4 pull requests between them – intra-repo pull requests are excluded. In this set there are 86,231 distinct base repos (indicating that base repos often had a ‘relationship’ with multiple heads) and 194,235 distinct head repos (indicating that head repos occasionally had a ‘relationship’ with multiple base repos).
For each base/head pair a full list of users who made pushes on the respective repositories was extracted from BigQuery, along with variables relating to the timing and frequency of their pushes. The ids and associated variables for users who made the pull requests connecting these repos were also extracted. In total there are 196k records representing 100k users who made pull requests (some users were involved in multiple base/head repo relationships).
Are there users who pull request their way into a project?
In the data-set there are 25,678 cases where a user had recorded pushes to both the base and head repos (in addition to submitting at least one pull request between these) – but that doesn’t necessarily mean they earned contributor rights through pull requests. What would it look like in the data if a user had done so? The prototypical scenario might go as follows: the user (probably forks first then) makes pushes to what will become the head repo in the base/head pair, they then submit a pull request to the base repo (and given how this data-set was produced they likely made many such pull requests) and some time after this there should be a ‘MemberEvent’ when they are officially made a contributor on the base repo; following this they may record pushes to the base repo in the pair directly. There are 3,416 cases which match this profile in the data.
However, the limiting factor here is the requirement of a MemberEvent observed after the user had made pushes and pull requests. These events are sparse in the data – only 5,538 users have a recorded ‘add memberevent’, whereas 25,678 users have at least one push to head and base repo of a pair – and therefore must have had contributor rights to the base repo (either events are missing from the data or they happened before the timeline data begins).
If the criteria are relaxed such that we only require the user to have made first a push on the head, then a pull request, then a push on the base – there are 14,348 users who meet it. This sequence of events is consistent with a user who pull requested their way into a project, but there are of course many other non-github forms of interaction which could have shaped these events.
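As a rough illustration of the two profiles described above, the following sketch checks whether one user's events for one base/head pair occur in the expected order. The function name, argument names and the use of plain dates are hypothetical, and the strict variant assumes that a MemberEvent only has to fall after the first pull request; the actual BigQuery extraction may have applied different conditions.

```python
from datetime import date


def pull_requested_in(head_pushes, pull_requests, base_pushes,
                      member_added=None, require_member_event=False):
    """Return True if the events fit the 'pull requested their way in' profile.

    head_pushes, pull_requests, base_pushes: lists of event dates for one
    user and one base/head pair (hypothetical input format).
    member_added: date of the 'added' MemberEvent on the base repo, if any.
    """
    # All three kinds of activity must be present at least once.
    if not (head_pushes and pull_requests and base_pushes):
        return False

    first_head_push = min(head_pushes)
    first_pull_request = min(pull_requests)
    first_base_push = min(base_pushes)

    # Relaxed profile: push on head, then pull request, then push on base.
    in_order = first_head_push < first_pull_request < first_base_push
    if not require_member_event:
        return in_order

    # Strict profile (assumption): a MemberEvent is also observed after
    # the first pull request.
    return in_order and member_added is not None and member_added > first_pull_request


# Example with made-up dates: matches the relaxed profile only.
example = pull_requested_in(
    head_pushes=[date(2013, 1, 1)],
    pull_requests=[date(2013, 1, 5)],
    base_pushes=[date(2013, 2, 1)],
)
```

Setting `require_member_event=True` corresponds to the narrower profile, which is why it depends so heavily on MemberEvents being present in the data.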
These users made a median of 18 pushes to the head repo and 8 to the base repo, but the variance on these measures is very large. Once they had the ability to make pushes on the base repo they no longer needed to make pull requests, technically at least. 5,244 stopped making pull requests once they had made their first push to the base repo; for the remainder there is an overlap between the time when they were making pull requests and the time when they began making pushes on the base repo directly.
http://www.wired.com/wiredenterprise/2013/09/github-for-anything/?mbid=social11447144 – the discussion of how the White House uses github is quite interesting. It suggests we could look more at the organisational diversity of github users.
This post follows on from the previous post about Pull Requests – where that post considered individual pull requests this will consider the same data from the perspective of the ‘base’ repo (the repo which is ‘receiving’ the pull request). The data for this post was obtained from the github timeline on BigQuery, with results being grouped by the base repo url. In total there are 195,509 repos which received at least one pull request event in the time period covered by timeline – that’s around 5% of all the repos which appear in the timeline data.
First a note about the data… as it’s timeline everything is event-based, an individual pull request typically has 2 events associated with it, one where it is opened and one where it is closed. However, there are individual pull request IDs which show up in more than 2 rows of data (i.e. they were opened/closed/re-opened multiple times). I’ll mostly be talking about repos’ numbers of distinct pull requests but the nature of the data-set I pulled from BigQuery means sometimes I have to resort to talking about numbers of open/close events.
The following graph shows that the distribution of distinct Pull Requests between repositories is highly skewed. This data-set follows the 80/20 rule almost perfectly, with 80% of Pull Requests being received by 20% of the repositories (which had any pull requests).
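To make the 80/20 claim concrete, here is a minimal sketch of how such a share can be computed from a list of per-repo pull request counts. The function name and the toy counts are made up for illustration; the real figures come from the BigQuery extract.

```python
def share_held_by_top(counts, top_fraction):
    """Share of the total held by the top `top_fraction` of items.

    counts: per-repo counts (e.g. distinct pull requests per base repo).
    top_fraction: e.g. 0.2 for the top 20% of repos.
    """
    ordered = sorted(counts, reverse=True)
    # Number of repos in the top group (at least one).
    k = max(1, round(len(ordered) * top_fraction))
    return sum(ordered[:k]) / sum(ordered)


# Toy example: five repos, one of which receives most of the pull requests.
toy_counts = [80, 5, 5, 5, 5]
top_share = share_held_by_top(toy_counts, 0.2)
```

A distribution that follows the 80/20 rule would give a value close to 0.8 for `top_fraction=0.2`.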
I started off here by trying to quantify the number of repositories which follow one of the approaches to pull requests outlined in the previous post – repositories which use pull requests internally (i.e. intra-repo).
Internal Use of Pull Requests
The previous post identified ‘intra-repo’ Pull Requests as an interesting use of the pull request mechanism. These are pull requests which are made within the same repo; they are not technically necessary because the user who created them would also have permission to push their changes directly to the repository. Instead they probably represent a deliberate decision by a project to use this feature as a way of tracking changes to the main branch of their repository. The graph below shows that the use of ‘intra-repo’ pull requests tends to be all-or-nothing for an individual repository (i.e. all or none of their pull requests are intra-repo).
I defined a few types of repository in the data to count their frequency.
Type 1: Single Pull Request which is intra-repo: 12,656 repos (6.5%). I am treating these separately because the one-time use of an intra-repo pull request is more likely to represent an individual testing the pull request system rather than the repo making a deliberate choice to use intra-repo pull requests as a form of code review.
Type 2: Repo has more than 1 pull request and 95% or greater pull requests are intra-repo (‘purity of practice’): 15,916 repos (8%). These are repositories which use intra-repo pull requests and which have received no (or very few) pull requests from external repositories.
Example: ‘dotCMS/dotCMS’ – has 1758 PullRequest OpenEvents, of which 1742 are intra-repo and 1604 are merged.
Type 3: Repo has more than 100 Intra-Repo pull requests OR a proportion which is 50% or greater (mixed practice): 4683 repos (2.5%). I refer to these as ‘mixed practice’ but really it’s not up to the repository itself whether it receives external pull requests (changing its classification from type 2 to 3).
Example: ‘mozilla/browserid’ – has 2768 PullRequest OpenEvents, of which 870 are intra-repo.
There are also pull requests which were merged by the same user who instigated them, so this would be another form of ‘internal use’ of pull requests. As you might imagine there’s a lot of overlap between ‘intra-repo’ pull requests and those merged by the user who created them. I added a fourth type which is secondary to those concerning intra-repo pull requests.
Type 4: Repo has 50% or greater pull requests which were merged by the same user who instigated them. In total there are 30619 repos which meet this criterion – 14235 are classified as type 4 (the remainder meet one of the intra-repo criteria).
Example: ‘angular/angular.js’ – this one is interesting because it seems to receive a lot of pull requests from external repos but the only merges are ‘self’ merges. 937 Distinct Pull Requests from 433 Distinct Head Repos – 169 are merged, and 163 of these originated with the same user who accepted the merge (but are not intra-repo).
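The four types can be written down as a small decision function, sketched here under some assumptions. The function and argument names are hypothetical, the types are checked in order so that type 4 only applies when none of the intra-repo criteria are met, and the type 4 share is taken relative to merged pull requests (an assumption, chosen because it is the reading under which the angular.js figures qualify).

```python
def classify_repo(total_prs, intra_repo_prs, merged_prs=0, self_merged_prs=0):
    """Assign a repo to one of the pull request types described above."""
    if total_prs <= 0:
        raise ValueError("repo must have at least one pull request")

    intra_share = intra_repo_prs / total_prs

    # Type 1: a single pull request which is intra-repo.
    if total_prs == 1 and intra_repo_prs == 1:
        return "type 1"
    # Type 2: more than one pull request, 95% or more intra-repo.
    if total_prs > 1 and intra_share >= 0.95:
        return "type 2"
    # Type 3: more than 100 intra-repo pull requests, or 50% or more.
    if intra_repo_prs > 100 or intra_share >= 0.5:
        return "type 3"
    # Type 4 (secondary): 50% or more merged by the user who opened them.
    # Assumption: the share is taken relative to merged pull requests.
    if merged_prs > 0 and self_merged_prs / merged_prs >= 0.5:
        return "type 4"
    return "other"


# Figures from the examples above (open-event counts used as a stand-in
# for distinct pull requests).
dotcms = classify_repo(total_prs=1758, intra_repo_prs=1742)
browserid = classify_repo(total_prs=2768, intra_repo_prs=870)
```

With these rules the dotCMS figures land in type 2 and the browserid figures in type 3, matching the examples given for each type.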
Is there a relationship between repository activity-level and approach to pull requests?
One question we might ask about these ‘internal use’ approaches to pull requests is whether the repositories which adopt them tend to be of a particular size (i.e. small). To check this out I linked the repositories in this data-set to the previous ‘repo event census’ data-set and made a stacked bar plot with each bar showing repositories from one of the ‘event bins’ – bars being coloured to reflect the approaches to pull requests defined above. This graph suggests the absence of a strong relationship between repo activity level and approach to pull requests – with the exception of repos that had a single intra-repo pull request, these being much more common among the ‘low-activity’ bins.
Although this typology is interesting it leaves us in the dark about most of the repositories which have pull requests. From here I’ll be working with the full data-set aside from repos with just a single pull request.
Do repos deal with all of their incoming pull requests?
This graph considers the proportion of pull request open events which have a corresponding close event – the panes represent repos with a specified number of distinct pull requests. Most of the repositories have closed all of the pull requests which they received (with a high proportion of these closes involving the merging of the pull request). A small proportion of repositories which have 2-4 distinct pull requests don’t seem to be closing the pull requests they receive. Among repositories with at least 5 distinct pull requests very few have a low proportion of pull request closes. Perhaps users who are thinking about making a pull request on a repository look at its previous history of dealing with pull requests – being discouraged if they see a number of unresolved pull requests.
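The measure plotted here can be sketched as follows, assuming each event is reduced to a (pull request id, action) pair with actions named "opened" and "closed". That input format is an assumption for illustration, not the shape of the original BigQuery result.

```python
def closed_proportion(events):
    """Proportion of opened pull requests with a corresponding close event.

    events: iterable of (pull_request_id, action) pairs for one base repo.
    Returns None if the repo has no 'opened' events in the data.
    """
    events = list(events)
    opened = {pr for pr, action in events if action == "opened"}
    closed = {pr for pr, action in events if action == "closed"}
    if not opened:
        return None
    # Using sets means a pull request that was closed more than once
    # is only counted once.
    return len(opened & closed) / len(opened)


# Toy example: three pull requests, one of which was never closed.
toy_events = [
    (1, "opened"), (1, "closed"),
    (2, "opened"),
    (3, "opened"), (3, "closed"),
]
toy_proportion = closed_proportion(toy_events)
```

Note that this simple version counts a pull request as closed even if it was later re-opened, so it slightly overstates how many are resolved.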
Do repos have a designated individual who deals with pull requests?
Any github user who is registered as a contributor for a repo can close pull requests. However, this seems like quite a specialised task, with the individual who deals with the pull request needing quite a high-level understanding of the project’s code and direction – and therefore one individual may be designated to deal with incoming pull requests. In this data-set the majority of repositories (69% of those with at least 2 pull requests) have a single user who made all of the merges for the repository. The following graph shows the number of different users to accept Pull Request merges for each repo, excluding repos where all merges were accepted by the same user.
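A minimal sketch of the count behind this figure, assuming the merge events have been reduced to (base repo, merging user) pairs. The names and the toy data are hypothetical.

```python
from collections import defaultdict


def mergers_per_repo(merges):
    """Number of distinct users who accepted merges, per base repo.

    merges: iterable of (base_repo, merging_user) pairs.
    """
    users = defaultdict(set)
    for repo, user in merges:
        users[repo].add(user)
    return {repo: len(names) for repo, names in users.items()}


def single_merger_share(merges):
    """Share of repos where every merge was accepted by the same user."""
    counts = mergers_per_repo(merges)
    if not counts:
        return None
    return sum(1 for n in counts.values() if n == 1) / len(counts)


# Toy example: repo 'a/x' has one merger, repo 'b/y' has two.
toy_merges = [("a/x", "alice"), ("a/x", "alice"), ("b/y", "bob"), ("b/y", "carol")]
toy_counts = mergers_per_repo(toy_merges)
toy_share = single_merger_share(toy_merges)
```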
Base/Head repo relationships
One of the trends which is apparent in this data-set is that the number of distinct head repos is often much smaller than the number of distinct pull requests – i.e. base repos must be receiving multiple pull requests from the same head repo(s). For each base repository I divided the number of distinct pull requests by the number of distinct head repos (to make a pull request on that base repository) to produce a rough ‘pull requests per distinct head repo’ variable. I have graphed this below for all repositories which received at least two pull requests – panes represent repositories with a given number of distinct pull requests.
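The ‘pull requests per distinct head repo’ variable is a simple ratio. A sketch, assuming the pull requests for one base repo are available as (pull request id, head repo) pairs (a hypothetical format):

```python
def prs_per_head_repo(pull_requests):
    """Distinct pull requests divided by distinct head repos for one base repo.

    pull_requests: iterable of (pull_request_id, head_repo) pairs.
    """
    pull_requests = list(pull_requests)
    distinct_prs = {pr for pr, _ in pull_requests}
    distinct_heads = {head for _, head in pull_requests}
    if not distinct_heads:
        return None
    return len(distinct_prs) / len(distinct_heads)


# Toy example: three pull requests coming from two head repos.
toy_ratio = prs_per_head_repo([(1, "alice/x"), (2, "alice/x"), (3, "bob/x")])
```

A value of 1 means every pull request came from a different head repo; larger values indicate repeated pull requests from the same head.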
This graph suggests that it is common for a base/head repo pair to consist of more than one pull request. For example, where a base repo has received two pull requests these originate with the same head repo more often than not. This suggests the possibility of ongoing relationships between pairs (or sets) of repos, facilitated by github’s fork and pull request mechanisms. Furthermore, it is possible to study the development of these relationships in considerable detail with the Timeline data – but I’ll leave that for another post as this one is long enough already.
The following is some exploration of the Github Timeline data hosted on Google BigQuery. The timeline data is event-based – every row represents an event (e.g. ‘PullRequestEvent’ for the opening or closing of a pull request, ‘MemberEvent’ when a user is added or removed as a contributor to a repo). In a previous post we saw the raw number of events of each different type on Timeline – this post will give some more context by looking at the distribution of events between repos.
All the data referenced in this post has been obtained by querying the Timeline table on BigQuery with a GROUP BY repository URL statement – it incorporates all of the Timeline data from its beginning until 10am on 7th June 2013.
Single Event Repos
In total there are 4,089,837 distinct repo URLs in the data. 838,443 of these only appear once in the data – and for 556,645 of these repos the single event is a CreationEvent, suggesting that these repos were created and then never used for anything. The other single-event repos were likely created before Timeline data begins and saw a single event recorded during the span of Timeline data – there are probably many further inactive repos which pre-date Timeline but have seen no activity since the Timeline data began (and therefore don’t appear in this data-set). I’m setting aside the single-event repos for now and they won’t feature in the rest of this post.
If you order the repos by number of Events there is something strange going on with some of the repos which have very high Event counts. Specifically, there are repos with very high counts for one or two types of event, and very low counts for every other type – they were often active for a short period of time and only had a handful of different user accounts associated with them in the whole Timeline. For instance, the ‘techradical/euro2012’ repo has 154232 events total, and 154224 of these are PushEvents – there are only 2 ‘Actors’ who have activity related to this repo and it was active for just over 1 month – it has no more than a handful of Watchers or Forks at any point. For another example see ‘lukaseder/jOOQ-trac-ticket-import-test’ – where one Actor racked up 40k IssueEvents in just over a week and the repo’s Size never rose above zero. These are fairly obviously cases where someone has created a repo and let some sort of bot loose on it, spamming the github timeline with meaningless events (another such repo is called ‘reach-github-limit’). For now I’m going to exclude 315,066 repos with a maximal size of 0 and another 40 which have suspicious patterns of activity.
Above are plots showing the distribution of events between repositories. Top left is a standard line graph, top right has logarithmic axes, bottom left is the inverse cumulative on logarithmic axes (if the distribution is Power Law the last two graphs should have lines which are straight diagonals – so not far off a power law here by the look of it). These plots include all the repos with at least 2 events.
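For reference, the inverse cumulative distribution in the bottom-left plot can be computed along these lines. This is a sketch with a hypothetical function name and toy counts, not the code used to draw the plots.

```python
from collections import Counter


def inverse_cumulative(counts):
    """For each distinct value x, the fraction of repos with count >= x.

    counts: per-repo event counts.
    Returns a list of (x, fraction) pairs in increasing order of x.
    """
    n = len(counts)
    frequency = Counter(counts)
    remaining = n  # repos with a count >= the current value
    points = []
    for value in sorted(frequency):
        points.append((value, remaining / n))
        remaining -= frequency[value]
    return points


# Toy example with four repos.
toy_points = inverse_cumulative([2, 2, 3, 5])
```

Plotting these pairs with both axes on a logarithmic scale gives a roughly straight diagonal line when the counts follow a power law.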
One oft-cited corollary of the power law is the 80/20 rule – for this data the most active 13% of repositories have between them 80% of all Events. The most active 0.5% of repositories account for 37.5% of all Events. As with any highly-skewed distribution – the activity ‘on github’ is happening largely on a relatively small set of repositories.
Types of Event – where is the data?
If we were to look at the distribution of each different type of Event they would all broadly follow the same highly skewed form – not terribly informative about what the distribution means for github in practice.
One of the questions we might want to ask is whether the repos with a lot of events are in some way different to those with a small or moderate number of events. I’ve placed repos into ‘logarithmic’ bins based on their total number of events (and expanded the largest bin so that it includes all of the 440 most active repos). Then, for each bin, I’ve calculated the total number of events of each type and expressed these as a proportion of that type of event in the full data-set. So, for example, 5% of all Push Events in the timeline data relate to the (440) repos in the top bin.
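The binning step can be sketched as below. The post does not state the base of the logarithmic bins, so base 2 is an assumption here, as are the function names and the dictionary input format; the special handling of the expanded top bin is left out.

```python
from collections import defaultdict


def log2_bin(total_events):
    """Bin index floor(log2(total_events)) for a positive integer count."""
    return total_events.bit_length() - 1


def share_by_bin(repos, event_type):
    """Share of all events of one type that falls into each activity bin.

    repos: mapping of repo name -> {event type: count}.
    Repos are binned by their total number of events across all types.
    """
    per_bin = defaultdict(int)
    for counts in repos.values():
        total = sum(counts.values())
        if total > 0:
            per_bin[log2_bin(total)] += counts.get(event_type, 0)
    grand_total = sum(per_bin.values())
    if grand_total == 0:
        return {}
    return {b: c / grand_total for b, c in sorted(per_bin.items())}


# Toy example: a low-activity repo and a more active, heavily watched one.
toy_repos = {
    "a/x": {"PushEvent": 3, "WatchEvent": 1},
    "b/y": {"PushEvent": 1, "WatchEvent": 15},
}
push_shares = share_by_bin(toy_repos, "PushEvent")
watch_shares = share_by_bin(toy_repos, "WatchEvent")
```

In the toy data most PushEvents sit in the lower bin while most WatchEvents sit in the higher one, which is the kind of contrast the graphs in this section are meant to show.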
This first graph relates to ‘owner events’ – Events which can only be instigated by a user who has contributor rights for the repo. The black line on these graphs shows the proportion of repositories which fall into each bin. The green and grey lines show that CreateEvents and MemberEvents occur mostly on repos which have a small-to-moderate total number of events. PushEvents are in a sense the raw materials of github, and most of these relate to moderately-active repos.
This second graph relates to ‘social events’ – actions which can be instigated by a user external to the repo. There is a different pattern to the distribution of these events, with the most highly active repos accounting for a disproportionate share of events like Pull Requests, Watches and Forks. This suggests that the repos in the top bins are not only active but also well-known – when we bin repos by their number of events these social events are central to the high-activity bins.
Latent Class Analysis of Repositories
To try and get a feel for the different ‘types’ of repository in this data-set I ran a Latent Class analysis on the 434,931 repositories with at least 33 events. I used counts of several event types as the explanatory variables – number of Push, Watch, Issue, Fork, Pull Request and Delete events. The graph below shows mean values for each of 18 latent classes – the percentage in the header of each pane is the percentage of repositories which were most likely to belong to that class/cluster.
There are a few broad patterns which I believe are informative here. Firstly, there are many classes (together accounting for a large proportion of all repositories) which are characterised by the dominance of Push events. Repositories in these clusters could be described as a-social – they have very low levels of all the social event types. Although these repositories are registered on github this is not affecting them – nobody is watching or forking the repository, it is not receiving pull requests. In practice these repositories are not gaining any of the potential benefits of being registered on github.
The remaining clusters are ‘social’ to varying degrees – clusters 15 and 16 represent repositories which have a high level of social events of all types considered here. Clusters 6, 10 and 17 (17 is dominated by a few outliers) are characterised by Issue events, suggesting that the main use these repositories have for github is as an issue tracker. There are also repository classes for which Watch events are their defining characteristic (clusters 4, 8, 11, 13, 16) – these repositories must be, or have been, relatively well-known among github users (perhaps they were featured somewhere which gained them publicity). A large number of Watch events is always accompanied by a substantial volume of other social event types, although this ratio fluctuates between classes.
Clusters 14, 15 and 16 have something to say about the highly active repositories (with more than 1,500 Push events each) which account for a lot of Github’s activity. These repositories can be ‘a-social’ (cluster 14) or they can have a substantial number of ‘social’ events, in the case of cluster 16 the number of push events being dwarfed by social events.
We could speculate that for a repo to be part of the ‘social coding scene’ on github it must first be seen (and watched) – whether a repo’s watchers will interact with it further is likely determined by characteristics of the repo itself (e.g. whether it is suited to receiving pull requests, whether it contains code which someone might wish to fork and develop further).
This analysis also suggests another way of looking at repositories – all of these effectively begin as ‘a-social’ and a significant proportion remain as such despite ongoing activity (Push events). It might be interesting to look at repos at the stage when they begin to receive ‘social’ events – with a view to understanding how this transition occurs.