Embedding the GitHub contribution graph

Egor Bulychev, source{d}

background

About me

Data Scientist, Machine Learning @ source{d}
Education: robotics
Switched to ML/DS in 2013

Plan

goo.gl/gtZbyc

Motivation
Dataset description
Algorithms: Swivel & node2vec
Training details
Results

Motivation

Labeled data is expensive
Unlabeled data is available
Most of the data has some structure
In most of the cases: item & context
This structural information can be represented as embeddings
We want to encode developer-repository relations in continuous vectors

Where to get the data?

Commit datasets:

GHTorrent (145M commits)
commits table in Google BigQuery, equivalent to GHTorrent
GitHub Archive (event stream, size unknown)
source{d}'s 452M-commit dataset with de-fuzzy-forking

BigQuery will charge you money.

Dataset description

16.6M repositories
6.6M developers
452M commits
Bipartite graph - developers & repositories
Non-planar graph

Graph visualization

How this was drawn.

Adjacency matrix

C_ij is the number of commits done by developer i to repository j

C_ij = C_ji (the matrix is symmetric)

Fraction of non-zero elements: < 10^-7
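
A minimal sketch of how such a sparse, symmetric matrix could be stored, assuming commits arrive as (developer, repository) pairs. The function names and the toy data are illustrative and do not come from the talk.

```python
from collections import Counter


def build_adjacency(commits):
    """Count commits per (developer, repository) pair and mirror each entry.

    `commits` is an iterable of (developer, repository) pairs, one per commit.
    Only non-zero cells are stored, which keeps a very sparse matrix tractable.
    """
    matrix = Counter()
    for dev, repo in commits:
        matrix[(dev, repo)] += 1
        matrix[(repo, dev)] += 1  # keep C_ij == C_ji
    return matrix


def density(matrix):
    """Fraction of non-zero cells in the square (developers + repositories) matrix."""
    nodes = {node for pair in matrix for node in pair}
    return len(matrix) / len(nodes) ** 2


# Toy data (hypothetical): three developers, two repositories.
commits = [
    ("dev:alice", "repo:a/x"),
    ("dev:alice", "repo:a/x"),
    ("dev:bob", "repo:a/x"),
    ("dev:carol", "repo:b/y"),
]
C = build_adjacency(commits)
```

On the toy data the density is high; on the real graph almost every cell is zero, so a dense array would not be an option.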

Algorithms

What to use: Swivel

Complexity: O(size of the co-occurrence matrix), useful on huge corpora
Handling unobserved co-occurrences: identity matching
Parallelism: co-occurrence matrix -> shards
Quality: best results (according to paper)
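
A rough illustration of cutting a sparse co-occurrence matrix into shards, so that each (row shard, column shard) block can be processed independently. The strided assignment over frequency-sorted indices is an assumption made for this sketch; the talk does not say how the shards are cut, and `shard_indices` and `submatrix` are hypothetical helper names.

```python
def shard_indices(frequencies, num_shards):
    """Assign row (or column) indices to shards.

    Indices are sorted by descending frequency and dealt out round-robin,
    so every shard mixes frequent and rare items.
    """
    order = sorted(range(len(frequencies)), key=lambda i: -frequencies[i])
    return [order[s::num_shards] for s in range(num_shards)]


def submatrix(cooc, row_shard, col_shard):
    """Dense block of a sparse {(i, j): count} matrix for one shard pair."""
    return [[cooc.get((i, j), 0) for j in col_shard] for i in row_shard]
```

Each dense block is small enough to fit on one device, and unobserved pairs show up in it as explicit zeros.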

What to use: node2vec + fastText

Complexity: O(size of the corpus)
Handling unobserved co-occurrences: character n-grams
Parallelism: like word2vec
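
To make the character n-gram idea concrete, here is a small sketch of fastText-style subword extraction: the token is wrapped in boundary markers and every substring of length 3 to 6 is collected. This is a simplified illustration, not fastText's actual code, and `char_ngrams` is a hypothetical name.

```python
def char_ngrams(token, min_n=3, max_n=6):
    """Character n-grams of a token wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{token}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


# Names from the same organization share many n-grams.
shared = set(char_ngrams("torch/nn")) & set(char_ngrams("torch/cunn"))
```

Because a token's vector is built from its n-gram vectors, two names that never co-occur can still end up close if they share substrings.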

Training details

Filter out repositories with fewer than 6 contributors, which leaves:
- 222k repositories
- 1.2M developers
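
The filtering step can be sketched as follows, assuming the graph is held as a mapping from (developer, repository) to commit count. The representation and the name `filter_graph` are assumptions for illustration.

```python
from collections import defaultdict


def filter_graph(edges, min_contributors=6):
    """Keep only repositories with at least `min_contributors` distinct developers.

    `edges` maps (developer, repository) -> number of commits.
    Developers left without any repository disappear from the result as well.
    """
    contributors = defaultdict(set)
    for dev, repo in edges:
        contributors[repo].add(dev)
    return {
        (dev, repo): n
        for (dev, repo), n in edges.items()
        if len(contributors[repo]) >= min_contributors
    }
```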

Training details

Filtering
Graph -> corpus
Corpus -> co-occurrence matrix
Co-occurrence matrix -> shards
Random init
Iterative approximation of PMI
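
A sketch of the quantity being approximated: co-occurrence counts taken from a text corpus, then exact PMI for the observed pairs. The window size of 1 and the function names are assumptions; the training itself fits embedding dot products to these values iteratively rather than computing them in closed form.

```python
import math
from collections import Counter


def cooccurrences(lines, window=1):
    """Symmetric co-occurrence counts of tokens at most `window` positions apart."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for i, left in enumerate(tokens):
            for right in tokens[i + 1:i + 1 + window]:
                counts[(left, right)] += 1
                counts[(right, left)] += 1
    return counts


def pmi(counts):
    """Pointwise mutual information for every observed pair.

    PMI(i, j) = log(x_ij * D / (x_i * x_j)), where x_i and x_j are marginal
    counts and D is the sum of all cells. The matrix is symmetric, so row
    sums double as column sums.
    """
    row_sums = Counter()
    for (i, _), x in counts.items():
        row_sums[i] += x
    total = sum(counts.values())
    return {
        (i, j): math.log(x * total / (row_sums[i] * row_sums[j]))
        for (i, j), x in counts.items()
    }


counts = cooccurrences(["repo dev_1 repo dev_1 repo dev_2"])
scores = pmi(counts)
```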

Training details

Filtering
Graph -> corpus
Random init
Iterative approximation of PMI

Training details

for repo in repos:
    for dev in repo.developers:
        ...  # loop body not shown on the slide
result: "repo dev_1 repo dev_1 repo dev_2"
result: ≈2 GB corpus
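
One possible way to fill in the loop, assuming each "repo dev" pair is repeated once per commit. That is only one reading of the example output, and the input format and the name `graph_to_corpus` are hypothetical.

```python
def graph_to_corpus(repos):
    """Yield one text line per repository.

    `repos` maps a repository name to {developer: number_of_commits}.
    Assumption: each "repo dev" pair is repeated once per commit.
    """
    for repo, developers in repos.items():
        tokens = []
        for dev, n_commits in developers.items():
            tokens.extend([repo, dev] * n_commits)
        yield " ".join(tokens)


lines = list(graph_to_corpus({"repo": {"dev_1": 2, "dev_2": 1}}))
```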

Training details

def random_walk(node, max_len):
    # There could be many random walk strategies
    # Our case: probability n1 -> n2 ~ n_commits(n1, n2) / n_commits(n2)
    return __alg__(node, max_len)
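
A runnable sketch of one such strategy. It assumes the graph is a symmetric mapping from node to {neighbor: number of commits} and reads n_commits(n2) as the total number of commits touching n2; both are assumptions about the notation, not details confirmed by the talk.

```python
import random


def random_walk(graph, node, max_len, rng=random):
    """Walk up to `max_len` nodes, starting at `node`.

    The step n1 -> n2 is drawn with weight n_commits(n1, n2) / n_commits(n2),
    so heavily shared edges are favored while very busy targets are damped.
    """
    walk = [node]
    while len(walk) < max_len:
        neighbors = graph.get(walk[-1])
        if not neighbors:
            break  # isolated node: nothing to step to
        candidates = list(neighbors)
        weights = [
            neighbors[n2] / sum(graph[n2].values()) for n2 in candidates
        ]
        walk.append(rng.choices(candidates, weights=weights, k=1)[0])
    return walk


# Toy bipartite graph (hypothetical): one repository, two developers.
toy = {
    "repo1": {"dev1": 3, "dev2": 1},
    "dev1": {"repo1": 3},
    "dev2": {"repo1": 1},
}
walk = random_walk(toy, "repo1", 6, random.Random(0))
```

Since the graph is bipartite, every walk alternates between repositories and developers.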

Training details

for node in graph:
    for _ in range(n_random_walks):
        ...  # loop body not shown on the slide
result: "repo1 dev1 repo42 dev137 repo167 dev42"
result: 10 random walks * 80 max_len ≈6 GB corpus
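
The loop can be completed along these lines: every walk becomes one line of the text corpus. `uniform_walk` is a deliberately simple stand-in walker and `walks_to_corpus` is a hypothetical name; the defaults mirror the 10 walks and maximum length of 80 quoted above.

```python
import random


def uniform_walk(graph, node, max_len, rng):
    """Placeholder walker: picks the next neighbor uniformly at random."""
    walk = [node]
    while len(walk) < max_len and graph[walk[-1]]:
        walk.append(rng.choice(sorted(graph[walk[-1]])))
    return walk


def walks_to_corpus(graph, n_random_walks=10, max_len=80,
                    walk_fn=uniform_walk, seed=0):
    """Yield one space-separated line per random walk, `n_random_walks` per node."""
    rng = random.Random(seed)
    for node in graph:
        for _ in range(n_random_walks):
            yield " ".join(walk_fn(graph, node, max_len, rng))


toy = {"repo1": {"dev1"}, "dev1": {"repo1"}}
lines = list(walks_to_corpus(toy, n_random_walks=2, max_len=4))
```

The resulting lines can be written to a file and passed to any word2vec-style trainer as ordinary sentences.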

Training details

Hardware:
- 2 * Titan X 2016
- 2 * GTX 1080 Ti
- (Brewery)

Training details

Swivel:
- Embedding: 100
- Epochs: 40
- Duration: ≈0.5 hour preprocessing + ≈24 hours training
node2vec:
- Embedding: 100
- Epochs: 40
- Duration: ≈6 hours preprocessing + ≈2 hours fastText

Results: similar to tensorflow/tensorflow

0.7710 google/skflow
0.7211 tensorflow/models
0.5922 otherlab/geode
0.5459 pthulhu/eigen
0.5367 Luracast/Restler
0.5341 Blosc/c-blosc2

Results: similar to torch/nn

0.9398 torch/cunn
0.9082 torch/cutorch
0.8838 torch/torch.github.io
0.8836 torch/rocks
0.8813 torch/torch7
0.8563 torch/threads-ffi

0.9379 torch/cunn
0.9233 torch/torch7
0.9167 torch/rocks
0.9106 torch/image
0.9093 soumith/cudnn.torch
0.9029 torch/cutorch
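
Rankings like these can be produced with a nearest-neighbor query over the trained vectors. The sketch below assumes the scores are cosine similarities, which the talk does not state explicitly; the function names and toy vectors are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def most_similar(embeddings, query, topn=6):
    """Rank all other items by cosine similarity to `query`, best first."""
    target = embeddings[query]
    scored = [
        (cosine(target, vec), name)
        for name, vec in embeddings.items()
        if name != query
    ]
    return sorted(scored, reverse=True)[:topn]


# Toy 2-dimensional embeddings (hypothetical values).
toy = {"a": [1.0, 0.0], "b": [2.0, 0.0], "c": [0.0, 1.0]}
ranking = most_similar(toy, "a")
```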

Results: similar to zmatsh/Docker

Results: similar to BeerWithMe/beer-finder

Questions?

mail: egor@sourced.tech
LinkedIn: www.linkedin.com/in/egor-bulychev-028237118
GitHub: github.com/EgorBu
Twitter: twitter.com/egor_bu