Embedding the GitHub contribution graph

Egor Bulychev, source{d}


About me


  1. Motivation
  2. Dataset description
  3. Algorithms
  4. Algorithms: Swivel & node2vec
  5. Training details
  6. Results


  1. Labeled data is expensive
  2. Unlabeled data is available
  3. Most of the data has some structure
  4. In most of the cases: item & context
  5. This structural information can be represented as embeddings
  6. We want to encode
    • developer-repository relations in continuous vectors

Where to get the data?

Commit datasets:

BigQuery will charge you money.

Dataset description

Graph visualization

How this was drawn.

Adjacency matrix

C_ij is the number of commits made by developer i to repository j

C_ij = C_ji

Fraction of non-zero elements: < 10^-7
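A minimal sketch of how such a symmetric, very sparse matrix could be stored, assuming the commits arrive as (developer, repository, count) tuples. The function names, the "dev:" / "repo:" prefixes and the toy data are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def build_adjacency(commits):
    """Build a symmetric sparse matrix as a dict of dicts.

    commits: iterable of (developer, repository, n_commits) tuples
    (an assumed input format, not the talk's actual data layout).
    """
    adj = defaultdict(dict)
    for dev, repo, n in commits:
        # Prefix the names so developers and repositories share one index space.
        d, r = "dev:" + dev, "repo:" + repo
        adj[d][r] = adj[d].get(r, 0) + n
        adj[r][d] = adj[d][r]  # keep C_ij = C_ji
    return dict(adj)


def density(adj):
    """Fraction of non-zero cells in the full square matrix."""
    n = len(adj)
    nonzero = sum(len(row) for row in adj.values())
    return nonzero / (n * n)


# Toy data, purely illustrative.
commits = [
    ("alice", "org/a", 3),
    ("bob", "org/a", 1),
    ("alice", "org/b", 2),
    ("alice", "org/a", 4),
]
adj = build_adjacency(commits)
print(adj["dev:alice"]["repo:org/a"])  # 7
print(density(adj))                    # 0.375
```

Only the non-zero cells are stored, which is what makes a matrix with a non-zero fraction below 10^-7 tractable at all.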


What to use: Swivel

What to use: node2vec + fastText

Training details

    Swivel pipeline
  1. Filtering
  2. Graph -> corpus
  3. Corpus -> co-occurrence matrix
  4. Co-occurrence matrix -> shards
  5. Random init
  6. Iterative approximation of PMI
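The quantity that the last step approximates can be sketched as follows. This only computes pointwise mutual information (PMI) for the observed cells of a small co-occurrence matrix; it is not the Swivel implementation, which trains embeddings so that their dot products approach these values and treats unobserved cells with a separate loss term. The function name and the toy counts are hypothetical.

```python
import math


def pmi_matrix(cooc):
    """PMI for every observed cell.

    cooc: dict mapping (row, col) -> positive co-occurrence count.
    PMI(i, j) = log(x_ij * total / (row_sum_i * col_sum_j))
    """
    row_sum, col_sum, total = {}, {}, 0
    for (i, j), x in cooc.items():
        row_sum[i] = row_sum.get(i, 0) + x
        col_sum[j] = col_sum.get(j, 0) + x
        total += x
    return {
        (i, j): math.log(x * total / (row_sum[i] * col_sum[j]))
        for (i, j), x in cooc.items()
    }


# Toy symmetric co-occurrence counts, purely illustrative.
cooc = {
    ("x", "y"): 2, ("y", "x"): 2,
    ("x", "z"): 2, ("z", "x"): 2,
}
pmi = pmi_matrix(cooc)
print(pmi[("x", "y")])  # log(2 * 8 / (4 * 2)) = log 2
```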

Training details

    node2vec pipeline
  1. Filtering
  2. Graph -> corpus
  3. Random init
  4. Iterative approximation of PMI
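The "Graph -> corpus" step can be sketched with node2vec-style biased random walks. This is a simplified illustration, not the code used in the talk: the graph format (a dict of dicts with edge weights), the function names and the parameter defaults are assumptions. The return parameter p and the in-out parameter q follow the node2vec definition: the weight of stepping back to the previous node is divided by p, and the weight of stepping to a node not adjacent to the previous one is divided by q.

```python
import random


def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=random):
    """One biased random walk over a weighted graph given as a dict of dicts."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        prev = walk[-2] if len(walk) > 1 else None
        neighbors = sorted(adj[cur])
        if not neighbors:
            break  # dead end
        weights = []
        for nxt in neighbors:
            w = float(adj[cur][nxt])
            if prev is not None:
                if nxt == prev:
                    w /= p  # return to the previous node
                elif nxt not in adj[prev]:
                    w /= q  # move away from the previous node
            weights.append(w)
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


def build_corpus(adj, walks_per_node, length, p=1.0, q=1.0, rng=random):
    """Each walk becomes one 'sentence' whose tokens are node names."""
    return [
        node2vec_walk(adj, node, length, p, q, rng)
        for _ in range(walks_per_node)
        for node in sorted(adj)
    ]


# Toy developer-repository graph, purely illustrative.
adj = {
    "dev:alice": {"repo:a": 7, "repo:b": 2},
    "dev:bob": {"repo:a": 1},
    "repo:a": {"dev:alice": 7, "dev:bob": 1},
    "repo:b": {"dev:alice": 2},
}
walks = build_corpus(adj, walks_per_node=2, length=5, p=2.0, q=0.5,
                     rng=random.Random(0))
for w in walks[:3]:
    print(" ".join(w))
```

Writing one walk per line with space-separated tokens gives a plain-text file of the kind a word-embedding tool such as fastText accepts as training input.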


Results: similar to tensorflow/tensorflow

Results: similar to torch/nn

Results: similar to zmatsh/Docker

Results: similar to BeerWithMe/beer-finder
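A "similar to X" query over trained embeddings can be sketched as a nearest-neighbor search. The sketch assumes cosine similarity, which the slides do not state, and uses hypothetical two-dimensional vectors with placeholder names rather than any real result.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(embeddings, query, k=3):
    """Names of the k items closest to `query`, best first."""
    q = embeddings[query]
    scored = [
        (cosine(q, vec), name)
        for name, vec in embeddings.items()
        if name != query
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Placeholder embeddings, not taken from the talk.
toy = {
    "repo_a": [1.0, 0.0],
    "repo_b": [0.9, 0.1],
    "repo_c": [0.0, 1.0],
    "repo_d": [-1.0, 0.0],
}
print(most_similar(toy, "repo_a", k=2))  # ['repo_b', 'repo_c']
```

A linear scan like this is fine for a toy example. At the scale of all GitHub repositories an approximate nearest-neighbor index would likely be needed.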
