August 16, 2013 / milllss

Github Repo ‘Events Census’

The following is some exploration of the Github Timeline data hosted on Google BigQuery. The timeline data is event-based – every row represents an event (e.g. ‘PullRequestEvent’ for the opening or closing of a pull request, ‘MemberEvent’ when a user is added or removed as a contributor to a repo). In a previous post we saw the raw number of events of each different type on Timeline – this post will give some more context by looking at the distribution of events between repos.

All the data referenced in this post was obtained by querying the Timeline table on BigQuery with a GROUP BY repository URL statement – it incorporates all of the Timeline data from its beginning until 10am on 7th June 2013.
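
The shape of that query is just an aggregation over repository URLs; as an illustration (with made-up rows – the real work happens in BigQuery), a minimal stdlib Python analogue of the GROUP BY:

```python
from collections import Counter

# Toy stand-in for timeline rows: (repository_url, event_type) pairs.
# The real query GROUPs BY repository_url over the BigQuery timeline table.
rows = [
    ("github.com/a/one", "PushEvent"),
    ("github.com/a/one", "WatchEvent"),
    ("github.com/b/two", "CreateEvent"),
]

# Events per repo – the per-repo totals used throughout this post.
events_per_repo = Counter(url for url, _ in rows)
print(events_per_repo["github.com/a/one"])  # 2
```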

Single Event Repos

In total there are 4,089,837 distinct repo URLs in the data. 838,443 of these only appear once in the data – and for 556,645 of these repos the single event is a CreateEvent, suggesting that these repos were created and then never used for anything. The other single-event repos were likely created before the Timeline data begins and saw a single event recorded during its span – there are probably many further inactive repos which pre-date the Timeline but have seen no activity since the Timeline data began (and therefore don’t appear in this data-set). I’m setting aside the single-event repos for now and they won’t feature in the rest of this post.

‘Spam’ repos

If you order the repos by number of Events there is something strange going on with some of those which have very high Event counts. Specifically, there are repos with very high counts for one or two types of event and very low counts for every other type – they were often active for only a short period, with just a handful of user accounts associated with them across the whole Timeline. For instance, the ‘techradical/euro2012’ repo has 154,232 events in total, 154,224 of which are PushEvents – only 2 ‘Actors’ have any activity related to this repo, it was active for just over a month, and it never had more than a handful of Watchers or Forks. For another example see ‘lukaseder/jOOQ-trac-ticket-import-test’, where one Actor racked up 40k IssuesEvents in just over a week and the repo’s Size never rose above zero. These are fairly obviously cases where someone has created a repo and let some sort of bot loose on it, spamming the github timeline with meaningless events (another such repo is called ‘reach-github-limit’). For now I’m going to exclude the 315,066 repos with a maximum Size of 0, and another 40 which have suspicious patterns of activity.
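
A simple way to flag such repos programmatically is to look for a single event type dominating the total; a sketch with illustrative thresholds (not the criteria actually used above):

```python
from collections import Counter

def looks_spammy(event_counts, dominance=0.99, min_events=10000):
    """Flag repos where one event type accounts for nearly all activity.
    The thresholds here are illustrative, not the ones used in this post."""
    total = sum(event_counts.values())
    if total < min_events:
        return False
    return max(event_counts.values()) / total >= dominance

# techradical/euro2012: 154,224 of 154,232 events are PushEvents.
euro2012 = Counter({"PushEvent": 154224, "CreateEvent": 8})
print(looks_spammy(euro2012))  # True
```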

Distribution of Events between Repos
[Plot: distribution of events between repositories]

Above are plots showing the distribution of events between repositories. Top left is a standard line graph, top right has logarithmic axes, bottom left is the inverse cumulative on logarithmic axes (if the distribution is Power Law the last two graphs should have lines which are straight diagonals – so not far off a power law here by the look of it). These plots include all the repos with at least 2 events.
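
The inverse cumulative plot is straightforward to compute; a sketch of the empirical ‘fraction of repos with at least x events’ curve, on toy counts:

```python
def ccdf(counts):
    """For each distinct event count x, the fraction of repos with at
    least x events. On a power law, log(x) vs log(fraction) is roughly
    a straight line."""
    n = len(counts)
    return [(x, sum(1 for c in counts if c >= x) / n)
            for x in sorted(set(counts))]

# Toy heavy-tailed per-repo event counts (all >= 2, as in the plots).
points = ccdf([2, 2, 3, 5, 8, 13, 100])
print(points[0])  # (2, 1.0) – every repo has at least 2 events
```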

One oft-cited corollary of the power law is the 80/20 rule – for this data the most active 13% of repositories together account for 80% of all Events. The most active 0.5% of repositories account for 37.5% of all Events. As with any highly-skewed distribution – the activity ‘on github’ is happening largely on a relatively small set of repositories.
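
The 13%/80% figure comes from ordering repos by activity and accumulating events until 80% of the total is covered; a minimal version of that calculation on toy counts:

```python
def share_of_top(counts, event_share=0.8):
    """Fraction of repos (most active first) needed to account for
    `event_share` of all events."""
    ordered = sorted(counts, reverse=True)
    target = event_share * sum(ordered)
    running = 0
    for k, c in enumerate(ordered, start=1):
        running += c
        if running >= target:
            return k / len(ordered)

# Toy data: 5 repos, 100 events total; the top 2 repos cover 85 events.
print(share_of_top([60, 25, 10, 3, 2]))  # 0.4
```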

Types of Event – where is the data?

If we were to look at the distribution of each different type of Event they would all broadly follow the same highly skewed form – not terribly informative about what the distribution means for github in practice.

One of the questions we might want to ask is whether the repos with a lot of events are in some way different to those with a small or moderate number of events. I’ve placed repos into ‘logarithmic’ bins based on their total number of events (and expanded the largest bin so that it includes all of the 440 most active repos). Then, for each bin, I’ve calculated the total number of events of each type and expressed these as a proportion of that type of event in the full data-set. So, for example, 5% of all Push Events in the timeline data relate to the (440) repos in the top bin.
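
The binning itself is simple; a sketch of assigning repos to logarithmic bins and computing each bin’s share of one event type (toy numbers, and base-2 bins as an assumption – the post’s exact bin edges may differ):

```python
import math
from collections import defaultdict

def log_bin(total_events, base=2):
    """Logarithmic bin index for a repo's total event count."""
    return int(math.log(total_events, base))

# Toy repos: total event count and PushEvent count for each.
repos = [
    {"total": 3, "push": 2},
    {"total": 40, "push": 30},
    {"total": 5000, "push": 4000},
]

# Total PushEvents landing in each bin...
push_by_bin = defaultdict(int)
for r in repos:
    push_by_bin[log_bin(r["total"])] += r["push"]

# ...expressed as a proportion of all PushEvents in the data-set.
all_push = sum(r["push"] for r in repos)
shares = {b: p / all_push for b, p in push_by_bin.items()}
```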

Shows the proportion of ‘Owner’ events which relate to repos in each of the event bins

This first graph relates to ‘owner events’ – Events which can only be instigated by a user who has contributor rights for the repo. The black line on these graphs shows the proportion of repositories which fall into each bin. The green and grey lines show that CreateEvents and MemberEvents occur mostly on repos which have a small-to-moderate total number of events. PushEvents are in a sense the raw materials of github, and most of these relate to moderately-active repos.

Shows the proportion of ‘Social’ events which relate to repos in each of the event bins

This second graph relates to ‘social events’ – actions which can be instigated by a user external to the repo. There is a different pattern to the distribution of these events, with the most highly active repos accounting for a disproportionate share of events like Pull Requests, Watches and Forks. This suggests that the repos in the top bins are not only active but also well-known – when we bin repos by their number of events these social events are central to the high-activity bins.

Latent Class Analysis of Repositories

To try and get a feel for the different ‘types’ of repository in this data-set I ran a Latent Class analysis on the 434,931 repositories with at least 33 events. I used counts of several event types as the explanatory variables – number of Push, Watch, Issue, Fork, Pull Request and Delete events. The graph below shows mean values for each of 18 latent classes – the percentage in the header of each pane is the percentage of repositories which were most likely to belong to that class/cluster.
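
The LCA itself was run with a stats package, but the explanatory variables are easy to describe; a sketch of one plausible feature vector per repo (log-scaling is my assumption here – a common choice for heavy-tailed counts, not necessarily what was done):

```python
import math

# The six event-type counts used as explanatory variables.
EVENT_TYPES = ["push", "watch", "issue", "fork", "pull_request", "delete"]

def features(repo_counts):
    """log1p of each event-type count, taming the heavy tails;
    missing types count as zero."""
    return [math.log1p(repo_counts.get(t, 0)) for t in EVENT_TYPES]

vec = features({"push": 100, "watch": 3})
```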

Mean values for each of 18 latent classes

There are a few broad patterns which I believe are informative here. Firstly, there are many classes (together accounting for a large proportion of all repositories) which are characterised by the dominance of Push events. Repositories in these clusters could be described as a-social – they have very low levels of all the social event types. Although these repositories are hosted on github, that seems to have little effect on them – nobody is watching or forking the repository, and it is not receiving pull requests. In practice these repositories are not gaining any of the potential benefits of being registered on github.

The remaining clusters are ‘social’ to varying degrees – clusters 15 and 16 represent repositories which have a high level of social events of all types considered here. Clusters 6, 10 and 17 (17 is dominated by a few outliers) are characterised by Issue events, suggesting that the main use these repositories have for github is as an issue tracker. There are also repository classes for which Watch events are their defining characteristic (clusters 4, 8, 11, 13, 16) – these repositories must be, or have been, relatively well-known among github users (perhaps they were featured somewhere which gained them publicity). A large number of Watch events is always accompanied by a substantial volume of other social event types, although this ratio fluctuates between classes.

Clusters 14, 15 and 16 have something to say about the highly active repositories (those with more than 1,500 Push events each) which account for a lot of Github’s activity. These repositories can be ‘a-social’ (cluster 14) or they can have a substantial number of ‘social’ events – in cluster 16 the number of Push events is dwarfed by social events.

We could speculate that for a repo to be part of the ‘social coding scene’ on github it must first be seen (and watched) – whether a repo’s watchers will interact with it further is likely determined by characteristics of the repo itself (e.g. whether it is suited to receiving pull requests, whether it contains code which someone might wish to fork and develop further).

This analysis also suggests another way of looking at repositories – all of these effectively begin as ‘a-social’ and a significant proportion remain as such despite ongoing activity (Push events). It might be interesting to look at repos at the stage when they begin to receive ‘social’ events – with a view to understanding how this transition occurs.

July 3, 2013 / milllss

Some notes on github Pull Requests

Pull requests are an interesting aspect of ‘social coding’ on github. Any user can fork any repository, work on it, push their changes, and submit a Pull Request to the parent repository. The Pull Request is reviewed by the parent repository’s owner(s) and if accepted the changes can be automatically merged into the parent repository.

This sequence of events is known as the ‘Fork to Pull’ model and there are already some studies of its prevalence on github. It allows individuals who are not part of a project to contribute in an ad hoc manner, sometimes leading to sustained contribution, sometimes a one-off event (known as ‘drive-by commits’).

Pull Requests on the github timeline – accessed through BigQuery

Pull Requests are one of the event types recorded on the github timeline (which has around 3.4 million events of this type). There follows a rudimentary exploration of what the data on pull requests can tell us, based on a table of the most recent 100k Pull Request Events.

The first thing to note is that a row of data is written to the timeline whenever a Pull Request is opened or closed, so each Pull Request will tend to have two ‘Pull Request Events’ as recorded on the timeline. In the sample of 100k rows: 52,743 relate to the opening of a pull request, 46,630 relate to the closing of a pull request, 627 relate to the reopening of a pull request. Of the 46,630 events which relate to the closing of a pull request – the pull request has been merged in 33,567 cases (72%), and not merged in the remaining 13,204 cases (28%). Are these cases where the pull request has been rejected by the repo’s owner/maintainer?
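
Tallying the sample by action is straightforward once the rows are in hand; a toy version of that breakdown:

```python
from collections import Counter

# Toy PullRequestEvent rows: (action, merged flag on close).
events = [
    ("opened", None),
    ("opened", None),
    ("closed", True),
    ("closed", False),
    ("reopened", None),
]

# How many events of each action, as in the 52,743 / 46,630 / 627 split.
actions = Counter(action for action, _ in events)

# Among closing events, what fraction were merged?
closed = [merged for action, merged in events if action == "closed"]
merged_rate = sum(1 for m in closed if m) / len(closed)
print(actions["opened"], merged_rate)  # 2 0.5
```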

When a pull request is submitted, github automatically checks whether it can be merged cleanly into the base branch – the result of this check is recorded in the ‘payload_pull_request_mergeable’ variable, which is False for 3,391 of the Pull Request Events.

Each row of timeline data for pull requests has a host of variables relating to the ‘head’ (repo which contains the changes to be merged) and ‘base’ (repo which changes are to be merged into) repositories. In 13,963 Pull Request events the head repo is the same as the base repo. These cases represent the use of pull requests where they are not required (the changes could have been pushed or merged by the contributor without a pull request being lodged and approved) – a divergence from the prototypical fork and pull request model. Similarly, there are 6,556 rows of data in this sample (14% of all rows which relate to the closing of a request) where the user who created the pull request is the same user who merged it.
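
Both of those checks are simple field comparisons; a sketch over toy rows (the field names here are illustrative, not the exact timeline column names):

```python
# Toy pull request rows: head/base repos plus opening and merging users.
prs = [
    {"head_repo": "u/x", "base_repo": "u/x",
     "opened_by": "alice", "merged_by": "alice"},
    {"head_repo": "fork/x", "base_repo": "u/x",
     "opened_by": "bob", "merged_by": "alice"},
]

# Pull requests where head and base are the same repo.
intra_repo = [p for p in prs if p["head_repo"] == p["base_repo"]]

# Pull requests merged by the same user who opened them.
self_merged = [p for p in prs if p["opened_by"] == p["merged_by"]]
```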

A comment stumbled upon in the description of a repository (arsduo/newsgirl) supports the idea that some projects use pull requests (where they are not required) as a way to track changes to the master branch.

At 6Wunderkinder, no code goes into master except via pull request merged after group review. We’ve found this process to be very worthwhile both in ensuring quality (it’s a lot easier to raise a question in a safe group setting than one on one) and in diffusing knowledge of our systems to the entire team.

Pull Requests on ghtorrent

GHTorrent is (another) project which mirrors data from github. I came across it yesterday and I think it looks very useful – it stores the data in a variety of tables on a MySQL database, which can be queried in a browser window here or downloaded as sql dumps. GHTorrent collects data through the github API (for specifics of data collection see this document) – the data it currently holds is similar in timespan and breadth to that which is available through BigQuery for the github timeline.

GHTorrent’s pull_requests table has 1,527,377 rows, each relating to a single pull request. GHTorrent stores the history of a pull request (its open and close/merge events) in a separate table (pull_request_history). 1,393,610 (91%) of the pull requests have been closed, of these 59% have been merged and 41% have not been merged.

15% of all pull requests on ghtorrent are ‘intra-branch’ (assuming these are the cases where base and head repo are the same) – 61% of these have been merged, only slightly higher than for the set of all pull requests.

Finally, this document on ghtorrent has a description of their data collection procedures and database schema – and also some interesting observations about github API data. The following passage is of particular relevance here, specifically to the pull requests identified above which are closed but not merged.

Indeed, several projects choose to track the discussion on pull requests using Github’s facilities while doing the actual merge using git. This behaviour can be observed in projects where an usually big number of pull requests are closed without being reported as merged. In such cases, we can deduce that a pull request has been merged by checking whether the commits (identified by their SHA id) appear in the main project’s repository (through a metadata query). However, this heuristic is not complete, as several projects use commit-squashing or even diff-based patching to transfer commits between branches, thereby loosing authorship information.

This suggests that not all of the pull requests which have been closed without being merged have been ‘rejected’ – there is another method through which they could be merged without leaving a trace in terms of Pull Request events.
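
The GHTorrent heuristic can be sketched directly – check whether the pull request’s commits turn up in the base repository’s history, bearing in mind that squashing or patching defeats it:

```python
def merged_via_git(pr_commit_shas, base_repo_shas):
    """Heuristic from the GHTorrent documentation: a closed-but-unmerged
    pull request was probably merged with plain git if its commits
    (by SHA id) appear in the base repo. Squash merges and diff-based
    patches produce new SHAs, so this check misses them."""
    return any(sha in base_repo_shas for sha in pr_commit_shas)

# Toy base-repo commit history.
repo_history = {"abc123", "def456"}
print(merged_via_git(["def456"], repo_history))  # True
print(merged_via_git(["zzz999"], repo_history))  # False
```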

June 28, 2013 / rian39

mortardata/gitrec · GitHub

https://github.com/mortardata/gitrec

This github repository has some possibly useful methods for sorting out repositories without knowing what they are about. It also uses the github archive data.

June 13, 2013 / rian39

Event types on github as at 13 June 2013

type count
PushEvent 50291125
CreateEvent 14497263
WatchEvent 9842336
IssueCommentEvent 8913780
IssuesEvent 5795101
ForkEvent 3658234
GistEvent 3447041
PullRequestEvent 3428669
FollowEvent 2508526
GollumEvent 1885043
CommitCommentEvent 1070755
PullRequestReviewCommentEvent 796819
DeleteEvent 671510
MemberEvent 613788
DownloadEvent 301471
PublicEvent 94007
ForkApplyEvent 5628
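
These raw counts are easier to compare as shares of the total; a quick computation over the table above:

```python
# Event counts copied from the table above (as at 13 June 2013).
counts = {
    "PushEvent": 50291125, "CreateEvent": 14497263, "WatchEvent": 9842336,
    "IssueCommentEvent": 8913780, "IssuesEvent": 5795101,
    "ForkEvent": 3658234, "GistEvent": 3447041, "PullRequestEvent": 3428669,
    "FollowEvent": 2508526, "GollumEvent": 1885043,
    "CommitCommentEvent": 1070755, "PullRequestReviewCommentEvent": 796819,
    "DeleteEvent": 671510, "MemberEvent": 613788, "DownloadEvent": 301471,
    "PublicEvent": 94007, "ForkApplyEvent": 5628,
}

total = sum(counts.values())
shares = {t: c / total for t, c in counts.items()}
print(f"{shares['PushEvent']:.1%}")  # PushEvents are ~46.6% of all events
```
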
June 11, 2013 / rian39

Bigquery treatments of githubarchive data

A presentation on uses of Google bigquery on githubarchive data at a Strata conference last year lists a number of different examples:

  1. GitHub activity dashboard
  2. Commit Logs From Last Night
  3. ProgLangVisualise
  4. Programming language popularity
  5. Organisation and programming languages
  6. Map of Github commits
  7. Digging for fork and pull requests
  8. On Testing Culture in GitHub Projects
  9. Something on using DNA sequence alignment methods

Slides for the talk are here: http://www.igvita.com/slides/2012/big…

Nearly all of the examples are interesting for us, and some are worth following up.

May 28, 2013 / milllss

Information on Open Source projects – Ohloh and FLOSSMetrics

In this post I will introduce two projects which use the data in repositories (among other sources) to produce metrics/descriptions relating to Open Source software projects. Both projects have the goal (among others) of allowing one to quickly gauge the ‘health’ of an Open Source project.

Ohloh is produced by Black Duck Software on a continuing basis. It uses the code stored on Subversion, CVS and Git repositories (among others) to produce profiles of projects which are being tracked. At the moment there are around 160k Git projects on Ohloh… so it’s not attempting to profile all of the projects on these repositories. The page for a given project has various measures – including the languages used by the project and its activity levels (lines of code, number of commits/contributors) over time. Ohloh also collects ratings and reviews for projects from its own members, and there are wiki-type pages where ohloh users can produce descriptions of the projects.

Ohloh also presents information on the basis of people, for a given individual one can see which projects they are involved in, which languages they write in, and when they have committed code on each project. There is also an ohloh-specific method of awarding ‘kudos’ to developers for their work on projects. At a more general level one can also compare the prevalence of different languages across all of the projects being tracked by Ohloh.

FLOSSMetrics was funded by the European Commission from 2006-2009 – data collected by the project can be obtained through the Melquiades sister site. FLOSSMetrics took a 3-pronged approach to the analysis of projects – looking at SCM (Source Code Management) repositories, Mailing Lists associated with the projects, and Bug Trackers. In total there are 2,630 projects for which data of at least one type has been collected – for 1,527 of these the SCM system has been analysed. There are far fewer projects profiled on FLOSSMetrics than Ohloh, and the profiles themselves are less descriptive (and obviously quite dated now). As part of FLOSSMetrics several useful-looking tools were developed and there is also comprehensive documentation on how these tools work, how they can be used, and reports on some analyses which were conducted on data collected/processed with these tools. I think CVSAnalY looks particularly useful – it takes the history log for a project and generates a database containing fine-grained data on how that project has developed over time (e.g. actions, commits, contributors).

Ohloh also provides access to some of the tools it uses to generate metrics, and an API through which its own data can be queried. I haven’t fully explored these yet so cannot comment on their utility.

May 21, 2013 / rian39

A git-focused study of the Linux kernel

‘Effort estimation of FLOSS projects: a study of the Linux kernel’ by Andrea Capiluppi & Daniel Izquierdo-Cortázar (Empirical Software Engineering (2013) 18:60–88, DOI 10.1007/s10664-011-9191-7) offers some useful methods for working on github repositories. They were trying to measure the different patterns of work on the linux kernel, as hosted on Github.

One really useful observation they make is that ‘In comparison with other configuration management systems (such as CVS or SVN), a Git repository retains the information about both the authors and their local submission dates, rather than aggregating the latter into the central server’s time (Bird et al. 2009). With this information, it is possible to group the developers’ effort based on the effective time of the day when such actions were performed. This provides valuable information when a distributed, trans-national development approach is considered’ (61-62).

Seems to offer a very fine-grained way of looking at changes in repositories.