Technological Stack for Machine Learning on Source Code
Egor Bulychev, source{d}
Content
- About source{d}
- Master plan
- Common details & bblfsh
- Topic modeling
- Similar repository search
- Future & opportunities
- Neural code suggestion
Master plan
- Clone all software repositories in the world
- Extract Abstract Syntax Trees (AST)
- Crazy data science things
- ???
- PROFIT
Machine learning on source code
Approaches to extracting information from source code
- Treat it as regular text and feed it to the black box
- Tokenize it using regexps (Pygments, highlight.js)
- Parse ASTs
- Compile
Universal Abstract Syntax Tree (UAST)
Shared node roles for every language
Code presented in the same format
pip3 install bblfsh
python3 -m bblfsh -f /path/to/file
You can try bblfsh/dashboard for UAST visualization
Source code classification
Problem: given millions of repositories, map their files to languages
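A minimal sketch of the problem, mapping files to languages by extension alone. Real classifiers (e.g. github/linguist) add content-based heuristics and a Bayesian classifier for ambiguous extensions; the table here is an illustrative subset, not a real driver list.

```python
import os

# Illustrative extension table (assumption: a tiny subset for the sketch).
EXT_TO_LANG = {".py": "Python", ".java": "Java", ".go": "Go", ".js": "JavaScript"}

def classify(path):
    # Look up the lowercased extension; fall back to "Unknown".
    return EXT_TO_LANG.get(os.path.splitext(path)[1].lower(), "Unknown")
```

Extension lookup alone misclassifies ambiguous cases (`.m` is Objective-C or MATLAB), which is why content-based heuristics are needed at scale.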
Identifiers in source code
cat = models.Cat()
cat.meow()
def quick_sort(arr):
...
Splitting
UpperCamelCase -> (upper, camel, case)
camelCase -> (camel, case)
FRAPScase -> (fraps, case)
SQLThing -> (sql, thing)
_Astra -> (astra)
CAPS_CONST -> (caps, const)
_something_SILLY_ -> (something, silly)
blink182 -> (blink)
FooBar100500Bingo -> (foo, bar, bingo)
Man45var -> (man, var)
method_name -> (method, name)
Method_Name -> (method, name)
101dalms -> (dalms)
101_dalms -> (dalms)
101_DalmsBug -> (dalms, bug)
101_Dalms45Bug7 -> (dalms, bug)
wdSize -> (size, wdsize)
Glint -> (glint)
foo_BAR -> (foo, bar)
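Most of the rules above can be sketched with a regex-based splitter: drop non-letters, then break each fragment into camel-case humps and acronym runs. This is a simplification; edge cases such as FRAPScase and the wdSize short-fragment merging need extra heuristics beyond this sketch.

```python
import re

def split_identifier(name):
    # Drop digits, underscores and other non-letters first.
    parts = re.split(r"[^a-zA-Z]+", name)
    words = []
    for part in parts:
        # Acronym runs ("SQL"), capitalized words ("Thing"), lowercase runs ("camel").
        for match in re.finditer(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+", part):
            words.append(match.group(0).lower())
    return words
```

For example, `split_identifier("FooBar100500Bingo")` yields `["foo", "bar", "bingo"]`, matching the table above.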
Stemming
We apply the Snowball stemmer to words longer than 6 characters
Weighted BOW
Let's interpret every repository as a weighted bag-of-words
We calculate TF-IDF to weigh the occurring identifiers
tensorflow/tensorflow:
tfreturn 67.78
oprequires 63.97
doblas 63.71
gputools 62.34
tfassign 61.55
opkernel 60.72
sycl 57.06
hlo 55.72
libxsmm 54.82
tfdisallow 53.67
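The weighting step can be sketched in a few lines: term frequency within a repository times the inverse document frequency across repositories. Function and variable names here are hypothetical; a real pipeline would use a library such as scikit-learn.

```python
import math
from collections import Counter

def tfidf_weights(repos):
    # repos: {repo_name: [identifier, ...]} after splitting and stemming.
    n = len(repos)
    df = Counter()  # document frequency: in how many repos each identifier occurs
    for identifiers in repos.values():
        df.update(set(identifiers))
    weights = {}
    for name, identifiers in repos.items():
        tf = Counter(identifiers)  # term frequency within this repository
        weights[name] = {w: tf[w] * math.log(n / df[w]) for w in tf}
    return weights
```

Identifiers that appear in every repository get weight 0, so generic names drop out and distinctive ones (like `sycl` or `hlo` above) dominate.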
Topic modeling
We can run topic modeling on these features
Paper on arXiv
Run on the whole world -> funny tagging
Run per ecosystem -> useful exploratory search
Topic #36 "Movies" →
Clustering TM vectors
- Topic modeling gives you vectors for every word and every file/repository
- Cluster them with src-d/kmcuda or facebookresearch/faiss
- Map them to a regular grid with src-d/lapjv
- Flatten dots for a chosen developer
CTO Máximo Cuadros' profile →
Topic modeling: Usage
$ docker run srcd/github_topics tensorflow/tensorflow
word2vec
- A family of algorithms that map words to dense vectors
- These vectors support linear algebra operations that make semantic sense
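The classic illustration of "linear algebra that makes sense" is the analogy king - man + woman ≈ queen. A toy sketch with hand-picked 2-D vectors (purely illustrative; real word2vec vectors have hundreds of dimensions and are learned from data):

```python
import math

# Toy 2-D "embeddings" with hand-picked values (assumption: not a trained model).
vecs = {
    "king": (0.9, 0.8),
    "queen": (0.9, 0.2),
    "man": (0.1, 0.8),
    "woman": (0.1, 0.2),
    "sort": (0.4, 0.5),
}

def cosine(a, b):
    # Cosine similarity between two 2-D vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman, computed component-wise.
target = tuple(k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"]))
best = max((word for word in vecs if word not in {"king", "man", "woman"}),
           key=lambda word: cosine(vecs[word], target))
# With these toy values the nearest remaining word is "queen".
```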
Identifier embeddings
- Use UAST to find context of identifier
- Consider every preprocessed identifier as a word
- Build the giant co-occurrence matrix
- Factor (embed) it
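The co-occurrence step above can be sketched as: for every UAST context window, count each pair of identifiers that appear together. A minimal sketch (the real pipeline builds a sparse matrix at much larger scale):

```python
from collections import Counter
from itertools import combinations

def cooccurrence(contexts):
    # contexts: one list of preprocessed identifiers per UAST context window.
    cooc = Counter()
    for window in contexts:
        for a, b in combinations(window, 2):
            cooc[a, b] += 1
            cooc[b, a] += 1  # keep the matrix symmetric
    return cooc
```

The resulting counts are what an engine like Swivel factors into dense embeddings.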
Swivel
A really good embedding engine that works on co-occurrence matrices
Scales with the size of the vocabulary
Like GloVe, but better - Paper on arXiv
We forked the TensorFlow implementation
nBOW
What if we combine bag-of-words model with embeddings?
We focus on finding related repositories
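In the nBOW (normalized bag-of-words) model, each repository becomes a probability distribution over its vocabulary. A minimal sketch (names are hypothetical):

```python
from collections import Counter

def nbow(identifiers):
    # Normalize counts so the weights sum to 1: the repository becomes a
    # probability distribution over its vocabulary, the input format for WMD.
    counts = Counter(identifiers)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}
```

Combined with embeddings, each repository is then a weighted cloud of points in the embedding space, which is exactly what Word Mover's Distance compares.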
Similar repository search:
Word Mover's Distance (WMD)
- Based on the paper
- An application of Earth Mover's Distance to sentences
- Works on the nBOW model
WMD calculation in a nutshell
WMD evaluation is O(N³); it becomes slow at N ≈ 100
- Sort samples by centroid distance
- Evaluate first k WMDs
- For every subsequent sample, solve the relaxed LP, which gives a lower bound
- If it is greater than the farthest among the k NN, go next
- Otherwise, evaluate the WMD
This lets us skip ~95% of WMD evaluations on average
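The pruning loop above can be sketched generically. The distance functions are passed in as parameters; in the real system `wmd` is the exact O(N³) solve, `lower_bound` the relaxed LP, and `centroid_dist` the distance between nBOW centroids (all names here are hypothetical):

```python
def knn_wmd(query, candidates, k, wmd, lower_bound, centroid_dist):
    # 1. Sort candidates by the cheap centroid distance.
    order = sorted(candidates, key=lambda c: centroid_dist(query, c))
    # 2. Evaluate the exact (expensive) WMD for the first k candidates.
    neighbors = sorted((wmd(query, c), c) for c in order[:k])
    # 3. For each remaining candidate, consult the cheap lower bound first.
    for c in order[k:]:
        if lower_bound(query, c) >= neighbors[-1][0]:
            continue  # cannot beat the current k-th neighbor: skip exact WMD
        d = wmd(query, c)
        if d < neighbors[-1][0]:
            neighbors[-1] = (d, c)
            neighbors.sort()
    return neighbors
```

Toy check: with documents as numbers, `wmd = abs(a - b)` and a lower bound of half that, `knn_wmd(0, [5, 1, 3, 2, 9], 2, ...)` keeps 1 and 2 while skipping most exact evaluations.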
Similar repository search: usage
Babelfish currently supports only Python and Java (see the driver status page)
pip3 install vecino
python3 -m vecino https://github.com/pytorch/pytorch
Examples
Future & opportunities
- Dependency tree building + libraries.io
- Robust copy-paste detection
- Source code generation in narrow area (pix2code)
- Automatic test generation
- Coding assistance: refactoring, suggestion, snippet search, review
- Many, many more
Neural code suggestion
- Legacy approach: work on the sequential token stream
- Modern approach: TreeLSTM
- PyTorch, TensorFlow/Fold - slow, immature
- We are still investigating
Summary
- There is a fun set of tools and libraries devoted to MLoSC
- They cover collection, extraction, engineering, training and running
- There are datasets and pretrained models too
- We believe that they will help developers program better and more efficiently