Machine Learning on Source Code

Machine Learning for Code

Egor Bulychev

#MLonCode

source{d}

Plan

What's MLonCode?
Why is MLonCode hard?
Basics
Similar repository search
Code review
Developer similarity

What's MLonCode?

Stereotypes

Awesome MLonCode

Program Synthesis
Program Translation
Code Suggestion and Completion
Program Repair and Bug Detection

Why is MLonCode hard?

Data

Common datasets

Public Git Archive

Size: 6 TB
Number of repositories: 260k+ (out of 96M+) from GitHub
Number of files: 136M+
LOC: ~28 billion

Tools

Parsing

def nearest_neighbors(self, origin, k=10,
                      skipped_stop=0.99, throw=True):
    # origin can be either a text query or an id
    if isinstance(origin, (tuple, list)):
        words, weights = origin
        weights = numpy.array(weights, dtype=numpy.float32)
        index = None
        avg = self._get_centroid(words, weights, force=True)
    else:
        index = origin
        words, weights = self._get_vocabulary(index)
        avg = self._get_centroid_by_index(index)
    if avg is None:
        raise ValueError(
            "Too little vocabulary for %s: %d" % (index, len(words)))

⬤ keywords
⬤ identifiers
⬤ literals
⬤ strings
⬤ comments
⬤ reserved
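A minimal sketch of how token classes like the ones above can be counted for Python source, using only the standard `tokenize` and `keyword` modules. The sample source and the mapping from token types to class names are assumptions made for illustration; they are not the tooling used for the slides.

```python
import io
import keyword
import tokenize
from collections import Counter

# Hypothetical sample to classify
SOURCE = '''
def set_visible(self, visible=True):
    # toggle visibility
    self._visible = visible
    return "ok"
'''

def classify_tokens(source):
    """Yield (token_class, text) pairs for a Python source string."""
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type == tokenize.NAME:
            # NAME covers both language keywords and user identifiers
            kind = "keyword" if keyword.iskeyword(tok.string) else "identifier"
        elif tok.type == tokenize.STRING:
            kind = "string"
        elif tok.type == tokenize.NUMBER:
            kind = "literal"
        elif tok.type == tokenize.COMMENT:
            kind = "comment"
        else:
            continue  # skip operators, newlines and indentation tokens
        yield kind, tok.string

counts = Counter(kind for kind, _ in classify_tokens(SOURCE))
print(counts)
```

Even in this tiny sample, identifiers are the most frequent class.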

AST

#words

Why so many identifiers?

            _tcp_socket_connect -> [tcp, socket, connect]
            AuthenticationError -> [authentication, error]
            set_visible -> [set, visible]

Code naturalness

class ???:
    def connect(self, dbname, user, password, host, port):
        # ...
    def query(self, sql):
        # ...
    def close(self):
        # ...

class Database:
    def connect(self, dbname, user, password, host, port):
        # ...
    def query(self, sql):
        # ...
    def close(self):
        # ...

Take away

Identifier ~ combinations of words
Raw code is text
Parsed code is AST
Non-linear structure of AST
Code naturalness
370 programming languages

Basics

Identifier splitter

Approaches

Heuristic
Machine Learning

Heuristic - split by

Underscores: set_name -> [set, name]
Case changes: SetName -> [set, name]
etc.

Result: 49M -> 3M, but...

BUFFERFLAG_CODECCONFIG -> [bufferflag, codecconfig]
metamodelength -> [metamodelength]
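The heuristic can be sketched with one regular expression. This is an illustrative splitter, not the one used to produce the numbers on the slide; the pattern and the function name `split_identifier` are assumptions. It reproduces both the successful splits and the two failure cases.

```python
import re

# One optional capital followed by lowercase letters, or a run of capitals
# that is not followed by a lowercase letter, or a run of digits.
# Underscores and other separators are simply never matched.
_PART = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|[0-9]+")

def split_identifier(identifier):
    """Split on underscores and case changes, then lowercase the parts."""
    return [part.lower() for part in _PART.findall(identifier)]

for name in ("set_name", "SetName", "_tcp_socket_connect",
             "BUFFERFLAG_CODECCONFIG", "metamodelength"):
    print(name, "->", split_identifier(name))
```

Names without any visual boundary stay glued together, which is what motivates the learned splitter.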

Machine learning - CNN/LSTM

Data - 49M identifiers
Preprocessing
Train NN

FooBar -> X, y

X	f	o	o	b	a	r
y	0	0	0	1	0	0
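A sketch of the character-level encoding shown above: X is the lowercased character sequence and y marks the characters where a new token starts (the first character stays 0). The helper names `make_sample` and `join_prediction` are hypothetical, and the assumption that training pairs come from identifiers with known, non-empty tokens is mine.

```python
def make_sample(tokens):
    """Turn known tokens, e.g. ["Foo", "Bar"], into (X, y) training data."""
    X = list("".join(tokens).lower())
    y = [0] * len(X)
    pos = 0
    for token in tokens[:-1]:
        pos += len(token)
        y[pos] = 1  # a new token starts at this character
    return X, y

def join_prediction(X, y):
    """Inverse step: apply (predicted) labels to recover the tokens."""
    tokens, current = [], ""
    for ch, label in zip(X, y):
        if label and current:
            tokens.append(current)
            current = ""
        current += ch
    if current:
        tokens.append(current)
    return tokens

X, y = make_sample(["Foo", "Bar"])
print(X, y)
print(join_prediction(X, y))
```

A sequence model such as a BiLSTM would then be trained to predict y from X, and `join_prediction` would be applied to its output.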

BiLSTM

Result: 49M -> 1M

BUFFERFLAG_CODECCONFIG -> [buffer, flag, codec, config]
metamodelength -> [meta, mode, length]

Identifier embeddings

Nearest to “foo”

afoo
qux
myfoo
baz
mfoo
wibble
dofoo
quux
testing

Analogies

“send” - “receive” + “pop” = “push”

“database” - “query” + “tune” = “settings”

$ \begin{split} V_1 \Leftrightarrow & \,\texttt{"foo"} \\ \\ V_2 \Leftrightarrow & \,\texttt{"bar"} \\ \\ V_3 \Leftrightarrow & \,\texttt{"integrate"} \end{split} $

$ distance(V_1, V_2) < distance(V_1, V_3) $

$ distance(V_i, V_j) = \arccos \frac{V_i \cdot V_j}{\left\lVert V_i \right\rVert \left\lVert V_j \right\rVert} $
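The distance above is the angle between two vectors. A small sketch with the standard library only; the 2-D vectors are made-up toy values, not real embeddings.

```python
import math

def angular_distance(u, v):
    """arccos of the cosine similarity between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))          # scalar product
    norm_u = math.sqrt(sum(a * a for a in u))       # norm of u
    norm_v = math.sqrt(sum(b * b for b in v))       # norm of v
    cosine = dot / (norm_u * norm_v)
    cosine = max(-1.0, min(1.0, cosine))            # guard against rounding
    return math.acos(cosine)

# Hypothetical toy embeddings
v_foo = [1.0, 0.2]
v_bar = [0.9, 0.3]
v_integrate = [-0.1, 1.0]

print(angular_distance(v_foo, v_bar))
print(angular_distance(v_foo, v_integrate))
```

With these toy values "foo" is closer to "bar" than to "integrate", which is the property the embeddings are meant to have.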

Scalar product

Norm

How to estimate $V_i \cdot V_j$ ?

class Database:
    def connect(self, user, password, host, port):
        self._tcp_socket_connect(host, port)
        try:
            self._authenticate(user, password)
        except AuthenticationError as e:
            self.socket.close()
            raise e from None

Splitting and normalization

            _tcp_socket_connect -> [tcp, socket, connect]
            AuthenticationError -> [authentication, error]
            authentication, authenticate -> authenticate

class Database:
    def connect(self, user, password, host, port):
        self._tcp_socket_connect(host, port)
        try:
            self._authenticate(user, password)
        except AuthenticationError as e:
            self.socket.close()
            raise e from None

database, connect², user², password², host², port², tcp, socket², authenticate², error, close

connect², user², password², host², port², tcp, socket², authenticate², error, close

connect, user, password, host, port

tcp, socket, connect, host, port

authenticate², user, password, error, socket, close

authenticate, user, password

authenticate, error, socket, close

authenticate, error

socket, close

Incidence matrix $C_{ij}$

$C_{ij} =$ number of times identifiers $i$ and $j$ occurred together in the same scope
Also known as the co-occurrence matrix
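A sketch of building such a matrix from per-scope identifier bags. The bags are taken from the `Database.connect` example; counting each unordered pair once per scope is an assumption made for illustration, and the real counting scheme may weight scopes differently.

```python
from collections import Counter
from itertools import combinations

# Identifier bags of five scopes from the Database.connect example
scopes = [
    ["connect", "user", "password", "host", "port"],
    ["tcp", "socket", "connect", "host", "port"],
    ["authenticate", "user", "password"],
    ["authenticate", "error"],
    ["socket", "close"],
]

def cooccurrence(scopes):
    """Count how many scopes contain each pair of distinct identifiers."""
    counts = Counter()
    for scope in scopes:
        for a, b in combinations(sorted(set(scope)), 2):
            counts[a, b] += 1
            counts[b, a] += 1  # keep the matrix symmetric
    return counts

C = cooccurrence(scopes)
print(C["host", "port"], C["socket", "close"], C["user", "error"])
```

A sparse mapping keyed by pairs is used because almost all pairs of identifiers never co-occur.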

Pointwise Mutual Information (PMI)

$$ V_i \cdot V_j = PMI_{ij} = \log\frac{C_{ij} \sum C}{\sum_{k = 1}^N C_{ik}\sum_{k = 1}^N C_{jk}} $$
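The PMI formula can be evaluated directly on a small dense matrix. The 3x3 counts below are made-up toy values, and returning negative infinity for pairs that never co-occur is a simplification of this sketch.

```python
import math

def pmi_matrix(C):
    """PMI for a symmetric co-occurrence matrix given as a list of lists."""
    n = len(C)
    row_sums = [sum(row) for row in C]   # sum_k C_ik
    total = sum(row_sums)                # sum over the whole matrix
    pmi = [[float("-inf")] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if C[i][j] > 0:
                pmi[i][j] = math.log(
                    C[i][j] * total / (row_sums[i] * row_sums[j]))
    return pmi

# Hypothetical counts for the identifiers host, port, user
C_toy = [
    [0, 2, 1],
    [2, 0, 1],
    [1, 1, 0],
]
pmi = pmi_matrix(C_toy)
print(pmi[0][1], pmi[0][2])
```

Zero counts have no finite PMI, so they need special treatment when embeddings are trained on the matrix.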

Representation Learning on Explicit Matrix

Stochastic Gradient Descent

Swivel

Multi-GPU
Multi-node
Quality tricks
Swivel: Improving Embeddings by Noticing What's Missing by Shazeer et al.
TensorFlow implementation

Similar repository search

torch/nn

src-d/go-git

Pipeline

Clone the repositories.
Classify and parse source code.
Preprocess identifiers: split, normalize, filter, stem.
Prepare embeddings - co-occurrence matrix & Swivel.
Weighted BOW per repository.
WMD.

Training identifier embeddings

Stage	Time	Resources	Size on disk
Cloning 143k repos	3 days	20x2 cores, 256 GB RAM	2.6 TB
Dataset	4 days	20x2 cores, 256 GB RAM	2 TB (31 GB in xz)
fastprep	2 days	16x2 cores, 256 GB RAM	20 GB
Swivel	14 hours	2 Titan X'2016 + 2 1080Ti	5.6 GB

Embeddings

Weighted nBOW

Let's interpret every repository as a weighted bag-of-words.

We calculate TF-IDF to weight the occurring identifiers.

tensorflow/tensorflow:

    tfreturn	67.780249
  oprequires	63.968142
      doblas	63.714424
    gputools	62.337396
    tfassign	61.545926
    opkernel	60.721556
        sycl	57.064558
         hlo	55.723587
     libxsmm	54.820668
  tfdisallow	53.666890
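A sketch of TF-IDF weighting over repositories. The three toy repositories and the exact formula (raw term frequency times log of N over document frequency) are assumptions; the slides do not state which TF-IDF variant produced the numbers above.

```python
import math
from collections import Counter

# Hypothetical repositories, each reduced to its list of identifier tokens
repos = {
    "repo_a": ["tensor", "kernel", "tensor", "socket"],
    "repo_b": ["socket", "connect", "host"],
    "repo_c": ["tensor", "graph", "graph", "graph"],
}

def tfidf(repos):
    """Return {repo: {token: weight}} with weight = tf * log(N / df)."""
    n = len(repos)
    df = Counter()
    for tokens in repos.values():
        df.update(set(tokens))  # count each token once per repository
    weights = {}
    for name, tokens in repos.items():
        tf = Counter(tokens)
        weights[name] = {t: c * math.log(n / df[t]) for t, c in tf.items()}
    return weights

weights = tfidf(repos)
for name, bag in weights.items():
    top = sorted(bag.items(), key=lambda kv: -kv[1])[:2]
    print(name, top)
```

Tokens that are frequent in one repository but rare elsewhere get the highest weights, as in the tensorflow/tensorflow listing.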

Word Mover's Distance

WMD calculation in a nutshell

WMD evaluation is O(N³) and becomes slow at N≈100.

Sort samples by centroid distance.
Evaluate first k WMDs.
For every subsequent sample, solve the relaxed LP, which gives a lower bound on the WMD.
If it is greater than the farthest among the k NN, go next.
Otherwise, evaluate the WMD.

This avoids 95% of WMD evaluations on average.
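The pruning loop can be sketched on a toy problem. To stay self-contained, the example uses 1-D "embeddings", where the exact earth mover's distance has a closed form, instead of a real LP solver. The data, the helper names and the 1-D simplification are all assumptions; only the control flow mirrors the steps above.

```python
import heapq
import random

def centroid(doc):
    """Weighted mean of a document given as [(position, weight), ...]."""
    return sum(x * w for x, w in doc)

def emd_1d(a, b):
    """Exact earth mover's distance in 1-D: integral of |CDF_a - CDF_b|."""
    events = sorted([(x, w) for x, w in a] + [(x, -w) for x, w in b])
    total, cdf_diff, prev = 0.0, 0.0, events[0][0]
    for x, w in events:
        total += abs(cdf_diff) * (x - prev)
        cdf_diff += w
        prev = x
    return total

def relaxed_1d(a, b):
    """Relaxed problem: every point sends all its mass to the nearest
    point of the other document. This never exceeds the exact distance."""
    def one_way(src, dst):
        return sum(w * min(abs(x - y) for y, _ in dst) for x, w in src)
    return max(one_way(a, b), one_way(b, a))

def nearest(query, samples, k):
    """Return the k nearest samples and the number of exact evaluations."""
    order = sorted(range(len(samples)),
                   key=lambda i: abs(centroid(query) - centroid(samples[i])))
    heap, evaluated = [], 0  # max-heap of (-distance, index)
    for i in order:
        if len(heap) == k and relaxed_1d(query, samples[i]) > -heap[0][0]:
            continue  # bound is already worse than the k-th best: skip
        d = emd_1d(query, samples[i])
        evaluated += 1
        if len(heap) < k:
            heapq.heappush(heap, (-d, i))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, i))
    return sorted((-nd, i) for nd, i in heap), evaluated

def random_doc(rng, size=4):
    center = rng.uniform(0, 100)
    return [(center + rng.uniform(-1, 1), 1.0 / size) for _ in range(size)]

rng = random.Random(0)
samples = [random_doc(rng) for _ in range(200)]
query = random_doc(rng)
result, evaluated = nearest(query, samples, k=5)
print(result, "exact evaluations:", evaluated, "of", len(samples))
```

Because the relaxed value never exceeds the exact distance, skipping a sample whose bound is above the current k-th best cannot change the result.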

Code review

Typo correction in code identifiers

class ClasName:
    @classmethod
    def function_name(cls) -> str:
        varyable_name = "Hello, I'm ClassName object!"
        return varyable_name

Typo correction

funktion ➙ function

GetValu ➙ GetValue

str_lenght ➙ str_length

Typo correction in code identifiers

class ClassName:
    @classmethod
    def function_name(cls) -> str:
        variable_name = "Hello, I'm ClassName object!"
        return variable_name

The correction pipeline

Candidates generation

Find candidates with SymSpell
The identifier to which the token belongs provides the context
Features: based on the frequencies, embeddings, similarity and edit distance between the elements
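Candidate lookup can be illustrated with a simplified symmetric-delete index, the idea behind SymSpell. This is a sketch and not the SymSpell library: the vocabulary is made up, only deletions up to distance 1 are indexed, and the final edit-distance check that a full implementation performs is left out.

```python
from collections import defaultdict

def deletes(word, max_distance=1):
    """The word itself plus all strings with up to max_distance deletions."""
    variants, frontier = {word}, {word}
    for _ in range(max_distance):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        variants |= frontier
    return variants

def build_index(vocabulary, max_distance=1):
    """Map every delete variant to the vocabulary words that produce it."""
    index = defaultdict(set)
    for word in vocabulary:
        for variant in deletes(word, max_distance):
            index[variant].add(word)
    return index

def candidates(token, index, max_distance=1):
    """Vocabulary words sharing at least one delete variant with the token."""
    found = set()
    for variant in deletes(token, max_distance):
        found |= index.get(variant, set())
    return found

# Hypothetical vocabulary of correct tokens
vocabulary = ["function", "value", "length", "variable", "class"]
index = build_index(vocabulary)

for token in ("funktion", "valu", "lenght", "varyable", "clas"):
    print(token, "->", candidates(token, index))
```

Only deletions are generated on both sides, which keeps the index small compared with enumerating all insertions and substitutions.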

Candidates ranking

Binary classification model on tuples (identifier, token, candidate):
- label=1 if the candidate is the correct suggestion
- label=0 otherwise
Group by the token and sort
XGBoost is used for binary classification

Suggestions

If the top suggestion is the token itself, do not correct it
Otherwise, return the corrections

Training and testing

Train and test datasets are sampled from the dataset of identifiers
Typos are random and artificial
The dataset itself contains typos

Results on a token level

Detection precision - 98%
Detection recall - 82%
Top3 correction accuracy - 95%

Note: the test dataset is not a ground truth.

Developer similarity

Pipeline

Clone the organization.
Classify and parse source code.
Extract identifiers per file.
Extract experience per developer - languages, files, etc.
Topic modeling.
Usage: KD-tree, clustering and so on.

From git history

Dev\Files	File 1	File 2	File 3
Dev 1	0	0.1	0.8
Dev 2	0.4	0.1	0.5

From head revision

Files\Tokens	Token 1	Token 2	Token 3
File 1	0	0.1	0.8
File 2	0.4	0.1	0.56
File 3	0.43	0.18	0.9

Result after multiplication

Dev\Token	Token 1	Token 2	Token 3
Dev 1	0.384	0.154	0.776
Dev 2	0.255	0.14	0.826
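The developer-by-token matrix is the product of the two matrices above. A sketch in plain Python using the example values from the "From git history" and "From head revision" tables; the helper `matmul` is written for illustration only.

```python
# Rows: developers, columns: files (from git history)
dev_files = [
    [0.0, 0.1, 0.8],
    [0.4, 0.1, 0.5],
]

# Rows: files, columns: tokens (from head revision)
file_tokens = [
    [0.0, 0.1, 0.8],
    [0.4, 0.1, 0.56],
    [0.43, 0.18, 0.9],
]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

# Rows: developers, columns: tokens
dev_tokens = matmul(dev_files, file_tokens)
for name, row in zip(("Dev 1", "Dev 2"), dev_tokens):
    print(name, [round(v, 3) for v in row])
```

Each developer row can then be fed to topic modeling, a KD-tree or clustering like any other feature vector.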

Summary

MLonCode is fun
There is data
There are tools
Community is forming

Ideas? Doubts? Write to us!

Thank you

https://bit.ly/2lOQAw0
