Technological Stack for Machine Learning on Source Code

Egor Bulychev, source{d}


Read this on your device: goo.gl/4bgBX2

Content

About source{d}

Master plan

Machine learning on source code

Approaches to extracting information from source code

  1. Treat it as regular text and feed it to the black box
  2. Tokenize it using regexps (Pygments, highlight.js); see the sketch after this list
  3. Parse ASTs
  4. Compile
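
Approach 2 in practice: a minimal sketch of lexing a file with Pygments (the file path is hypothetical):

			from pygments.lexers import get_lexer_for_filename
			from pygments.token import Token

			path = "example.py"  # hypothetical input file
			code = open(path).read()
			lexer = get_lexer_for_filename(path)
			# keep only identifier-like tokens (Token.Name and its subtypes)
			names = [value for ttype, value in lexer.get_tokens(code)
			         if ttype in Token.Name]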


Universal Abstract Syntax Tree (UAST)

Shared node roles for every language

Code is represented in the same format across languages

			pip3 install bblfsh
			python3 -m bblfsh -f /path/to/file
		
You can try bblfsh/dashboard for UAST visualization
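
The same parse is available from Python; a minimal sketch, assuming a local bblfshd on the default port and the v1 client API:

			from bblfsh import BblfshClient

			client = BblfshClient("0.0.0.0:9432")  # local bblfshd, default port
			response = client.parse("/path/to/file.py")
			print(response.uast)  # nodes carry the shared UAST roles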

Source code classification

Problem: given millions of repositories, map their files to languages
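
The slide states the problem only; as a naive baseline sketch, Pygments can already guess a language from the filename plus the contents:

			from pygments.lexers import guess_lexer_for_filename

			# ".m" is ambiguous (Objective-C vs. Matlab); Pygments scores the
			# candidate lexers against the file contents to break the tie
			code = open("main.m").read()  # hypothetical input file
			print(guess_lexer_for_filename("main.m", code).name)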

Identifiers in source code

Names rule.

Adam Tornhill, Your Code as a Crime Scene
			cat = models.Cat()
			cat.meow()
		
			def quick_sort(arr):
			    ...
		

Splitting

We split every identifier into words: quick_sort → quick, sort; FooBar → foo, bar
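
A minimal splitting sketch (the regexes are assumptions; any snake_case/CamelCase splitter will do):

			import re

			def split_identifier(name):
			    # break on underscores and other separators first,
			    # then split the CamelCase runs inside each part
			    words = []
			    for part in re.split(r"[^0-9a-zA-Z]+", name):
			        words += re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|[0-9]+", part)
			    return [w.lower() for w in words if w]

			split_identifier("quick_sort")  # -> ['quick', 'sort']
			split_identifier("FooBar")      # -> ['foo', 'bar']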

Stemming

We apply the Snowball stemmer to words longer than 6 characters
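
A sketch with NLTK's Snowball implementation (NLTK itself is an assumption; any Snowball stemmer works):

			from nltk.stem.snowball import SnowballStemmer

			stemmer = SnowballStemmer("english")

			def stem(word):
			    # short words are usually already roots; leave them alone
			    return stemmer.stem(word) if len(word) > 6 else word

			[stem(w) for w in ["gradients", "sort"]]  # -> ['gradient', 'sort']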

Weighted BOW

Let's interpret every repository as a weighted bag-of-words

We calculate TF-IDF to weigh the occurring identifiers

tensorflow/tensorflow:
    tfreturn	67.78
  oprequires	63.97
      doblas	63.71
    gputools	62.34
    tfassign	61.55
    opkernel	60.72
        sycl	57.06
         hlo	55.72
     libxsmm	54.82
  tfdisallow	53.67
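
Weights like the ones above can be produced with scikit-learn's TfidfVectorizer (an assumption; the production pipeline may weight differently):

			from sklearn.feature_extraction.text import TfidfVectorizer

			# one "document" per repository: its split and stemmed
			# identifiers joined by spaces (toy data for illustration)
			docs = {
			    "repo/a": "tensor graph kernel tensor session",
			    "repo/b": "request route handler request server",
			}
			tfidf = TfidfVectorizer()
			matrix = tfidf.fit_transform(docs.values())
			weights = dict(zip(tfidf.get_feature_names_out(),
			                   matrix[0].toarray()[0]))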

Topic modeling

We can run topic modeling on these features

Paper on arXiv

Run on the whole world → funny tagging

Run per ecosystem → useful exploratory search

Topic #36 "Movies" →
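
The paper gives the full setup; as a generic stand-in, here is a sketch with gensim's LdaModel over the same identifier bags (gensim is an assumption):

			from gensim import corpora, models

			# identifier bags per repository (toy data for illustration)
			texts = [["tensor", "graph", "kernel"],
			         ["request", "route", "handler"]]
			dictionary = corpora.Dictionary(texts)
			corpus = [dictionary.doc2bow(t) for t in texts]
			lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)
			print(lda.print_topics())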

Clustering TM vectors

CTO Máximo Cuadros' profile →
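
A sketch of the clustering step, assuming each repository is reduced to its vector of topic weights (k-means is an assumption; any clustering algorithm works):

			import numpy as np
			from sklearn.cluster import KMeans

			topic_vectors = np.random.rand(1000, 64)  # placeholder TM vectors
			labels = KMeans(n_clusters=32, n_init=10).fit_predict(topic_vectors)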

Topic modeling: Usage

			$ docker run srcd/github_topics tensorflow/tensorflow
		

word2vec

Identifier embeddings

Swivel

A really good embedding engine that works on co-occurrence matrices

Scales with the size of the vocabulary

Like GloVe, but better: paper on arXiv

We forked the TensorFlow implementation
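
Swivel consumes a word co-occurrence matrix; a minimal sketch of building one from identifier sequences (the window size is an assumption):

			import numpy as np

			def cooccurrence_matrix(sentences, vocab, window=5):
			    index = {w: i for i, w in enumerate(vocab)}
			    m = np.zeros((len(vocab), len(vocab)))
			    for sent in sentences:
			        for i, w in enumerate(sent):
			            lo = max(0, i - window)
			            hi = min(len(sent), i + window + 1)
			            for j in range(lo, hi):
			                if j != i:
			                    m[index[w], index[sent[j]]] += 1
			    return m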

Linux kernel embeddings

Launch TensorFlow Projector

nBOW

What if we combine the bag-of-words model with embeddings?

We focus on finding related repositories

Similar repository search:

Word Mover's Distance (WMD)

WMD calculation in a nutshell

WMD evaluation is O(N³) and becomes slow around N≈100

  1. Sort the samples by centroid distance to the query
  2. Evaluate the true WMD for the first k samples
  3. For every subsequent sample, solve the relaxed LP, which yields a lower bound
  4. If that bound is greater than the k-th nearest WMD found so far, skip the sample
  5. Otherwise, evaluate the full WMD

This lets us skip about 95% of the WMD evaluations on average
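
A sketch of the pruned k-NN search; it assumes nBOW vectors over a shared vocabulary, an embedding matrix emb, a precomputed pairwise distance matrix dist, and pyemd as the exact solver (all assumptions):

			import numpy as np
			from pyemd import emd  # exact EMD solver; any other works too

			def wcd(x, y, emb):
			    # 1. Word Centroid Distance: the cheap bound used for sorting
			    return np.linalg.norm(emb.T @ (x - y))

			def rwmd(x, y, dist):
			    # 3. relaxed LP: every word ships all of its mass to the
			    # closest word on the other side; a lower bound on the WMD
			    i, j = x > 0, y > 0
			    d = dist[np.ix_(i, j)]
			    return max((x[i] * d.min(axis=1)).sum(),
			               (y[j] * d.min(axis=0)).sum())

			def knn_wmd(query, docs, emb, dist, k=10):
			    order = sorted(range(len(docs)),
			                   key=lambda n: wcd(query, docs[n], emb))
			    # 2. evaluate the true WMD for the first k samples
			    best = sorted((emd(query, docs[n], dist), n) for n in order[:k])
			    for n in order[k:]:
			        # 4. skip if the bound cannot beat the k-th nearest WMD
			        if rwmd(query, docs[n], dist) >= best[-1][0]:
			            continue
			        d = emd(query, docs[n], dist)  # 5. full evaluation
			        if d < best[-1][0]:
			            best[-1] = (d, n)
			            best.sort()
			    return best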

Similar repository search: usage

Babelfish currently supports only Python and Java: status

			pip3 install vecino
			python3 -m vecino https://github.com/pytorch/pytorch
		
Examples

Future & opportunities

Neural code suggestion

Summary

Poster resources

Bonus: Hercules

GitHub

Bonus: blog posts

Thank you

My contacts:

source{d} has a community slack!