Embedding the GitHub contribution graph

Egor Bulychev, source{d}

About me

Plan

goo.gl/gtZbyc
  1. Motivation
  2. Dataset description
  3. Algorithms
  4. Algorithms: Swivel & node2vec
  5. Training details
  6. Results

Motivation

  1. Labeled data is expensive
  2. Unlabeled data is available
  3. Most of the data has some structure
  4. In most cases: item & context
  5. This structural information can be represented as embeddings
  6. We want to encode developer-repository relations as continuous vectors

Where to take the data?

Commit datasets:

BigQuery will charge you money.

Dataset description

Graph visualization

How this was drawn.

Adjacency matrix

C_ij is the number of commits made by developer i to repository j

C_ij = C_ji

Fraction of non-zero elements: < 10^-7
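
A minimal sketch of how such a symmetric, sparse matrix could be assembled, assuming the input is a list of (developer, repository, commit count) records. The record layout, the `dev:`/`repo:` prefixes, the developer names and the commit counts are all hypothetical illustration choices, not the format used in the talk.

```python
from collections import defaultdict


def build_adjacency(commits):
    """Build a symmetric sparse adjacency matrix as a dict of dicts.

    commits: iterable of (developer, repository, n_commits) records.
    This record layout is an assumption made for the example.
    """
    index = {}                  # node name -> row/column number
    matrix = defaultdict(dict)  # row -> {column: commit count}

    def node_id(name):
        # Developers and repositories share one index space.
        return index.setdefault(name, len(index))

    for dev, repo, n in commits:
        i = node_id("dev:" + dev)
        j = node_id("repo:" + repo)
        # Store both directions so that C_ij == C_ji.
        matrix[i][j] = matrix[i].get(j, 0) + n
        matrix[j][i] = matrix[j].get(i, 0) + n
    return index, matrix


def density(index, matrix):
    """Fraction of non-zero cells in the full N x N matrix."""
    n = len(index)
    nnz = sum(len(row) for row in matrix.values())
    return nnz / (n * n)


# Toy records with made-up developers and counts.
commits = [
    ("alice", "tensorflow/tensorflow", 12),
    ("bob", "tensorflow/tensorflow", 3),
    ("alice", "torch/nn", 5),
]
index, C = build_adjacency(commits)
toy_density = density(index, C)
```

On the toy input the matrix is far denser than the real one. With millions of nodes and only a handful of repositories per developer, the fraction of non-zero cells drops to the order quoted above, which is why only the non-zero cells are stored.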

Algorithms

What to use: Swivel

What to use: node2vec + fastText

Training details

    Swivel pipeline
  1. Filtering
  2. Graph -> corpus
  3. Corpus -> co-occurrence matrix
  4. Co-occurrence matrix -> shards
  5. Random init
  6. Iterative approximation of PMI
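
To make steps 2, 3 and 6 concrete, the sketch below computes co-occurrence counts from a toy corpus and then the exact PMI values that the embedding dot products are trained to approximate. It assumes that every ordered pair of tokens in the same sentence counts as one co-occurrence. That windowing rule, the function names and the toy tokens are assumptions for illustration. Sharding and the gradient-based training itself are not shown.

```python
import math
from collections import Counter
from itertools import permutations


def cooccurrences(corpus):
    """Count ordered token pairs that share a sentence.

    corpus: list of sentences, each a list of node tokens.
    Whole-sentence windows are an assumption of this sketch.
    """
    counts = Counter()
    for sentence in corpus:
        for a, b in permutations(sentence, 2):
            counts[(a, b)] += 1
    return counts


def pmi(counts):
    """PMI(a, b) = log(x_ab * total / (x_a* * x_*b)) for observed pairs."""
    row = Counter()
    col = Counter()
    for (a, b), x in counts.items():
        row[a] += x
        col[b] += x
    total = sum(counts.values())
    return {
        (a, b): math.log(x * total / (row[a] * col[b]))
        for (a, b), x in counts.items()
    }


# Toy corpus with made-up tokens.
corpus = [["alice", "repoA", "repoB"], ["bob", "repoA"]]
counts = cooccurrences(corpus)
scores = pmi(counts)
```

Only observed pairs get a PMI value here. Pairs that never co-occur have an undefined PMI (the logarithm of zero), and Swivel handles those cells with a separate loss term instead of a regression target.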

Training details

    node2vec pipeline
  1. Filtering
  2. Graph -> corpus
  3. Random init
  4. Iterative approximation of PMI
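
The "Graph -> corpus" step in node2vec is done with biased second-order random walks. Below is a minimal sketch of such a walk, assuming the graph is stored as a dict of dicts with commit counts as edge weights. The function name, the graph layout and the toy nodes are hypothetical; `p` and `q` are the return and in-out parameters from node2vec.

```python
import random


def node2vec_walk(graph, start, length, p=1.0, q=1.0, rng=random):
    """One biased random walk of at most `length` nodes.

    graph: {node: {neighbor: weight}}, assumed symmetric.
    p: return parameter, q: in-out parameter.
    """
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = sorted(graph[cur])
        if not neighbors:
            break  # dead end, stop early
        if len(walk) == 1:
            # First step: plain weighted choice.
            weights = [graph[cur][n] for n in neighbors]
        else:
            prev = walk[-2]
            weights = []
            for n in neighbors:
                w = graph[cur][n]
                if n == prev:
                    w /= p  # stepping back to the previous node
                elif n not in graph[prev]:
                    w /= q  # moving further away from the previous node
                # Otherwise n is also adjacent to prev: weight unchanged.
                # In a bipartite developer-repository graph this case
                # cannot occur, since the graph has no triangles.
                weights.append(w)
        walk.append(rng.choices(neighbors, weights=weights)[0])
    return walk


# Toy bipartite graph with made-up nodes and commit counts.
graph = {
    "dev:alice": {"repo:a": 2, "repo:b": 1},
    "dev:bob": {"repo:a": 1},
    "repo:a": {"dev:alice": 2, "dev:bob": 1},
    "repo:b": {"dev:alice": 1},
}
rng = random.Random(0)
corpus = [node2vec_walk(graph, node, 5, p=1.0, q=0.5, rng=rng)
          for node in sorted(graph)]
```

Each walk becomes one "sentence" of node tokens, so the resulting corpus can be written out one walk per line and passed to a skip-gram trainer such as fastText.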

Results: similar to tensorflow/tensorflow

Results: similar to torch/nn

Results: similar to zmatsh/Docker

Results: similar to BeerWithMe/beer-finder

Questions?