August 16, 2013 / milllss

Github Repo ‘Events Census’

The following is some exploration of the Github Timeline data hosted on Google BigQuery. The timeline data is event-based – every row represents an event (e.g. ‘PullRequestEvent’ for the opening or closing of a pull request, ‘MemberEvent’ when a user is added or removed as a contributor to a repo). In a previous post we saw the raw number of events of each different type on Timeline – this post will give some more context by looking at the distribution of events between repos.

All the data referenced in this post has been obtained by querying the Timeline table on BigQuery with a GROUP BY repository URL statement – it incorporates all of the Timeline data from its beginning until 10am on 7th June 2013.
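As a rough illustration of what the GROUP BY step produces, the sketch below counts events per repository URL in plain Python. The field names `repository_url` and `type` and the sample rows are assumptions made for illustration; this is not the query that was actually run against BigQuery.

```python
from collections import Counter, defaultdict

# Hypothetical sample rows; the real Timeline table has many more columns.
events = [
    {"repository_url": "https://github.com/a/x", "type": "PushEvent"},
    {"repository_url": "https://github.com/a/x", "type": "WatchEvent"},
    {"repository_url": "https://github.com/b/y", "type": "CreateEvent"},
    {"repository_url": "https://github.com/a/x", "type": "PushEvent"},
]

def events_per_repo(rows):
    """Return {repo_url: Counter(event_type -> count)}, similar to grouping by URL and type."""
    grouped = defaultdict(Counter)
    for row in rows:
        grouped[row["repository_url"]][row["type"]] += 1
    return grouped

per_repo = events_per_repo(events)
# Total number of events for each repository URL.
totals = {url: sum(counts.values()) for url, counts in per_repo.items()}
print(totals)
```

Every later step in this post (single-event repos, bins, shares) can be derived from a per-repo table of this shape.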

Single Event Repos

In total there are 4,089,837 distinct repo URLs in the data. 838,443 of these only appear once in the data – and for 556,645 of these repos the single event is a CreateEvent, suggesting that these repos were created and then never used for anything. The other single-event repos were likely created before Timeline data begins and saw a single event recorded during the span of Timeline data – there are probably many further inactive repos which pre-date Timeline but have seen no activity since the Timeline data began (and therefore don’t appear in this data-set). I’m setting aside the single-event repos for now and they won’t feature in the rest of this post.
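To put those counts in proportion, here is the arithmetic as a small sketch. The three input numbers come from the paragraph above; the variable names are mine.

```python
# Counts quoted in the text.
total_repos = 4_089_837          # distinct repo URLs
single_event_repos = 838_443     # URLs that appear exactly once
single_create_repos = 556_645    # single-event repos whose one event is a create event

# Share of all repos that have a single event.
single_share = single_event_repos / total_repos
# Share of single-event repos where that event is the creation of the repo.
create_share = single_create_repos / single_event_repos
# Repos with at least 2 events, i.e. those carried forward in this post.
remaining_repos = total_repos - single_event_repos

print(f"single-event repos: {single_share:.1%} of all repos")
print(f"created and never used: {create_share:.1%} of single-event repos")
print(f"repos with 2+ events: {remaining_repos:,}")
```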

‘Spam’ repos

If you order the repos by number of Events there is something strange going on with some of the repos which have very high Event counts. Specifically, there are repos with very high counts for one or two types of event, and very low counts for every other type – they were often active for a short period of time and only had a handful of different user accounts associated with them in the whole Timeline. For instance, the ‘techradical/euro2012’ repo has 154,232 events in total, and 154,224 of these are PushEvents – there are only 2 ‘Actors’ who have activity related to this repo and it was active for just over 1 month – it has no more than a handful of Watchers or Forks at any point. For another example see ‘lukaseder/jOOQ-trac-ticket-import-test’ – where one Actor racked up 40k IssueEvents in just over a week and the repo’s Size never rose above zero. These are fairly obviously cases where someone has created a repo and let some sort of bot loose on it, spamming the github timeline with meaningless events (another such repo is called ‘reach-github-limit’). For now I’m going to exclude 315,066 repos with a maximum Size of 0 and another 40 which have suspicious patterns of activity.
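The pattern described above (one or two event types dominating, very few Actors) could be turned into a simple flag along the following lines. This is only a sketch: the thresholds `min_events`, `min_share` and `max_actors` are hypothetical values chosen for illustration, and the 40 repos excluded here were not necessarily picked with a rule like this. The `"Other"` key stands in for the 8 non-push events of ‘techradical/euro2012’, whose breakdown is not given.

```python
def top_two_share(counts):
    """Fraction of a repo's events that belong to its two most common event types."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top_two = sorted(counts.values(), reverse=True)[:2]
    return sum(top_two) / total

def looks_like_spam(counts, n_actors, min_events=10_000, min_share=0.99, max_actors=3):
    """Flag repos with many events, almost all of one or two types, from very few actors.

    All three thresholds are illustrative assumptions, not values from the analysis.
    """
    total = sum(counts.values())
    return (
        total >= min_events
        and top_two_share(counts) >= min_share
        and n_actors <= max_actors
    )

# Figures for 'techradical/euro2012' as quoted in the text; "Other" is a placeholder.
euro2012 = {"PushEvent": 154_224, "Other": 8}
# A made-up repo with a more varied mix of events and many actors.
ordinary = {"PushEvent": 9_000, "WatchEvent": 4_000, "ForkEvent": 1_500, "IssuesEvent": 2_500}

print(looks_like_spam(euro2012, n_actors=2))
print(looks_like_spam(ordinary, n_actors=40))
```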

Distribution of Events between Repos
[Figure: events_dist_for_repos]

Above are plots showing the distribution of events between repositories. Top left is a standard line graph, top right shows the same data on logarithmic axes, and bottom left is the inverse cumulative distribution on logarithmic axes. If the distribution follows a power law, the last two graphs should show straight diagonal lines, so by the look of it this is not far off a power law. These plots include all the repos with at least 2 events.
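For readers who want to reproduce the inverse cumulative plot, a minimal sketch follows. It computes, for each event count k, the fraction of repos with at least k events, and then fits a least-squares slope on log-log axes. The sample counts are made up, and a roughly straight line is only a visual hint of a power law, not a formal test.

```python
import math
from collections import Counter

def inverse_cumulative(event_counts):
    """Return sorted (k, fraction of repos with >= k events) pairs."""
    n = len(event_counts)
    freq = Counter(event_counts)
    points = []
    remaining = n  # repos with at least the current k events
    for k in sorted(freq):
        points.append((k, remaining / n))
        remaining -= freq[k]
    return points

def loglog_slope(points):
    """Least-squares slope of log10(fraction) against log10(k)."""
    xs = [math.log10(k) for k, _ in points]
    ys = [math.log10(p) for _, p in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Made-up per-repo event totals (all >= 2, as in the plots).
sample_counts = [2, 2, 2, 2, 3, 3, 4, 5, 8, 20]
points = inverse_cumulative(sample_counts)
slope = loglog_slope(points)
print(points)
print(slope)
```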

One oft-cited corollary of the power law is the 80/20 rule – for this data the most active 13% of repositories have between them 80% of all Events. The most active 0.5% of repositories account for 37.5% of all Events. As with any highly skewed distribution, the activity ‘on github’ is happening largely on a relatively small set of repositories.
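Figures like "the top 13% hold 80%" can be read off a sorted list of per-repo totals. The sketch below shows one way to do it; the function names and the ten sample totals are assumptions for illustration only.

```python
def share_of_top(event_counts, fraction):
    """Share of all events held by the most active `fraction` of repos."""
    ordered = sorted(event_counts, reverse=True)
    n_top = max(1, round(fraction * len(ordered)))
    return sum(ordered[:n_top]) / sum(ordered)

def smallest_fraction_for(event_counts, target_share):
    """Smallest fraction of repos (most active first) holding at least `target_share` of events."""
    ordered = sorted(event_counts, reverse=True)
    total = sum(ordered)
    running = 0
    for i, count in enumerate(ordered, start=1):
        running += count
        if running / total >= target_share:
            return i / len(ordered)
    return 1.0

# Made-up event totals for ten repos (sum is 100).
sample_counts = [50, 20, 10, 5, 5, 4, 3, 1, 1, 1]
print(share_of_top(sample_counts, 0.1))
print(smallest_fraction_for(sample_counts, 0.8))
```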

Types of Event – where is the data?

If we were to look at the distribution of each different type of Event they would all broadly follow the same highly skewed form – not terribly informative about what the distribution means for github in practice.

One of the questions we might want to ask is whether the repos with a lot of events are in some way different to those with a small or moderate number of events. I’ve placed repos into ‘logarithmic’ bins based on their total number of events (and expanded the largest bin so that it includes all of the 440 most active repos). Then, for each bin, I’ve calculated the total number of events of each type and expressed these as a proportion of that type of event in the full data-set. So, for example, 5% of all Push Events in the timeline data relate to the (440) repos in the top bin.
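The binning and per-type proportions described above can be sketched as follows. The base of the logarithm is not stated in this post, so base 2 is an assumption here, and the three sample repos are invented. The widening of the top bin to hold the 440 most active repos is not reproduced.

```python
from collections import Counter, defaultdict

def log2_bin(total_events):
    """Bin index floor(log2(total)); base 2 is an assumed choice."""
    return total_events.bit_length() - 1

def type_share_by_bin(per_repo):
    """For each bin, the fraction of each event type's overall total that falls in that bin."""
    overall = Counter()
    by_bin = defaultdict(Counter)
    for counts in per_repo.values():
        b = log2_bin(sum(counts.values()))
        by_bin[b].update(counts)   # add this repo's counts to its bin
        overall.update(counts)     # and to the full data-set totals
    return {
        b: {event_type: n / overall[event_type] for event_type, n in counts.items()}
        for b, counts in sorted(by_bin.items())
    }

# Invented per-repo event counts.
per_repo = {
    "a/small": {"PushEvent": 3, "CreateEvent": 1},                  # 4 events  -> bin 2
    "b/mid": {"PushEvent": 12, "WatchEvent": 2},                    # 14 events -> bin 3
    "c/big": {"PushEvent": 45, "WatchEvent": 18, "ForkEvent": 5},   # 68 events -> bin 6
}
shares = type_share_by_bin(per_repo)
print(shares)
```

For each event type the shares sum to 1 across bins, which is what lets the lines for different event types be compared on one graph.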

Shows the proportion of ‘Owner’ events which relate to repos in each of the event bins

This first graph relates to ‘owner events’ – Events which can only be instigated by a user who has contributor rights for the repo. The black line on these graphs shows the proportion of repositories which fall into each bin. The green and grey lines show that CreateEvents and MemberEvents occur mostly on repos which have a small-to-moderate total number of events. PushEvents are in a sense the raw materials of github, and most of these relate to moderately-active repos.

Shows the proportion of ‘Social’ events which relate to repos in each of the event bins

This second graph relates to ‘social events’ – actions which can be instigated by a user external to the repo. There is a different pattern to the distribution of these events, with the most highly active repos accounting for a disproportionate share of events like Pull Requests, Watches and Forks. This suggests that the repos in the top bins are not only active but also well-known – when we bin repos by their number of events these social events are central to the high-activity bins.

Latent Class Analysis of Repositories

To try and get a feel for the different ‘types’ of repository in this data-set I ran a Latent Class analysis on the 434,931 repositories with at least 33 events. I used counts of several event types as the explanatory variables – number of Push, Watch, Issue, Fork, Pull Request and Delete events. The graph below shows mean values for each of 18 latent classes – the percentage in the header of each pane is the percentage of repositories which were most likely to belong to that class/cluster.
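The percentages and means shown per class can be derived from a fitted model's membership probabilities. The sketch below does not fit a Latent Class model; it only shows the step of assigning each repo to its most likely class and then summarising each class. The posterior probabilities, repo names and push counts are all invented.

```python
from collections import Counter

# Invented posterior class-membership probabilities for 4 repos and 3 classes.
posteriors = {
    "a/x": [0.7, 0.2, 0.1],
    "b/y": [0.1, 0.8, 0.1],
    "c/z": [0.6, 0.3, 0.1],
    "d/w": [0.2, 0.2, 0.6],
}
# Invented Push event counts for the same repos.
push_counts = {"a/x": 40, "b/y": 300, "c/z": 60, "d/w": 35}

def modal_class(probs):
    """Index of the class with the highest membership probability."""
    return max(range(len(probs)), key=probs.__getitem__)

# Assign each repo to its most likely class.
assignment = {repo: modal_class(p) for repo, p in posteriors.items()}
sizes = Counter(assignment.values())

# Percentage of repos in each class (the pane headers in the graph).
class_pct = {c: 100 * n / len(assignment) for c, n in sizes.items()}
# Mean number of Push events per class (one of the plotted means).
class_mean_push = {
    c: sum(push_counts[r] for r, k in assignment.items() if k == c) / n
    for c, n in sizes.items()
}
print(class_pct)
print(class_mean_push)
```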

Mean values for each of 18 latent classes

There are a few broad patterns which I believe are informative here. Firstly, there are many classes (together accounting for a large proportion of all repositories) which are characterised by the dominance of Push events. Repositories in these clusters could be described as a-social – they have very low levels of all the social event types. Although these repositories are registered on github this is not affecting them – nobody is watching or forking the repository, it is not receiving pull requests. In practice these repositories are not gaining any of the potential benefits of being registered on github.

The remaining clusters are ‘social’ to varying degrees – clusters 15 and 16 represent repositories which have a high level of social events of all types considered here. Clusters 6, 10 and 17 (17 is dominated by a few outliers) are characterised by Issue events, suggesting that the main use these repositories have for github is as an issue tracker. There are also repository classes for which Watch events are their defining characteristic (clusters 4, 8, 11, 13, 16) – these repositories must be, or have been, relatively well-known among github users (perhaps they were featured somewhere which gained them publicity). A large number of Watch events is always accompanied by a substantial volume of other social event types, although this ratio fluctuates between classes.

Clusters 14, 15 and 16 have something to say about the highly active repositories (with more than 1,500 Push events each) which account for a lot of Github’s activity. These repositories can be ‘a-social’ (cluster 14), or they can have a substantial number of ‘social’ events; in cluster 16 the push events are dwarfed by social events.

We could speculate that for a repo to be part of the ‘social coding scene’ on github it must first be seen (and watched) – whether a repo’s watchers will interact with it further is likely determined by characteristics of the repo itself (e.g. whether it is suited to receiving pull requests, whether it contains code which someone might wish to fork and develop further).

This analysis also suggests another way of looking at repositories – all of these effectively begin as ‘a-social’ and a significant proportion remain as such despite ongoing activity (Push events). It might be interesting to look at repos at the stage when they begin to receive ‘social’ events – with a view to understanding how this transition occurs.

3 Comments

  1. lukaseder / Nov 20 2013 9:01 am

    I thought you might fancy some background info on this event:

    > ‘lukaseder/jOOQ-trac-ticket-import-test’ – where one Actor racked up 40k IssueEvents in just over a week and the repo’s Size never rose above zero.

    As the repository name says, this was a test to migrate tickets from a SourceForge trac instance to GitHub. I was using this useful utility here: https://github.com/trustmaster/trac2github

    Unfortunately, it had quite a few bugs / missing features in it, which is why I needed to run around 50 test runs before I could execute the actual migration. At the time, GitHub didn’t have a sort of “staging” or “playground” repository (they probably still don’t), where I could run these tests, so I needed to run them on “the real thing”.

    It’s really interesting that these tests had such side effects on your analyses though.

    • rian39 / Nov 20 2013 2:48 pm

      Luke
      thanks for that useful info. Makes me wonder how many events and repos on github come about in similar ways – Github says they have 8 million repos, but maybe there is substantial testing activity there

      Adrian

      • lukaseder / Nov 20 2013 2:51 pm

        I guess there is. Lots of little repos with no stars, few commits, not really useful stuff, either.

        Even that trac2github repo… to me it was incredibly useful, but no one knows about it.
