Machine Learning on Source Code

Egor Bulychev, source{d}.

Machine Learning for Code

#MLonCode

Plan

What's MLonCode?

Stereotypes

Awesome MLonCode

Why is MLonCode hard?

Data

Common datasets

Public Git Archive

Tools

Parsing

import numpy

def nearest_neighbors(self, origin, k=10,
                      skipped_stop=0.99, throw=True):
    # origin can be either a text query or an id
    if isinstance(origin, (tuple, list)):
        # text query: a (words, weights) pair
        words, weights = origin
        weights = numpy.array(weights, dtype=numpy.float32)
        index = None
        avg = self._get_centroid(words, weights, force=True)
    else:
        # id of an already indexed item
        index = origin
        words, weights = self._get_vocabulary(index)
        avg = self._get_centroid_by_index(index)
    if avg is None:
        raise ValueError(
            "Too little vocabulary for %s: %d" % (index, len(words)))

    keywords
    identifiers
    literals
    strings
    comments
    reserved
			

AST

#words

Why so many identifiers?

    _tcp_socket_connect -> [tcp, socket, connect]
    AuthenticationError -> [authentication, error]
    set_visible -> [set, visible]
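The splits above can be reproduced with a small heuristic. This is a minimal sketch, assuming the rule is "split on underscores and on case changes"; the function name `split_identifier` and the exact regular expression are illustrative, not the splitter used in the talk.

```python
import re

# Illustrative pattern (an assumption, not the talk's exact rule):
#   [A-Z]+(?![a-z])  runs of capitals such as "HTTP" in "HTTPServer"
#   [A-Z]?[a-z]+     a lowercase word with an optional leading capital
#   [0-9]+           runs of digits
_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")

def split_identifier(name):
    """Split an identifier into lowercase sub-tokens."""
    return [part.lower() for part in _PART.findall(name)]

for name in ("_tcp_socket_connect", "AuthenticationError", "set_visible"):
    print(name, "->", split_identifier(name))
```

Underscores and other separators are simply never matched, so they disappear from the output.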
        

Code naturalness

class ???:
    def connect(self, dbname, user, password, host, port):
        # ...
    def query(self, sql):
        # ...
    def close(self):
        # ...

class Database:
    def connect(self, dbname, user, password, host, port):
        # ...
    def query(self, sql):
        # ...
    def close(self):
        # ...
        

Take away

Basics

Identifier splitter

Approaches

Heuristic - split by

Result: 49M -> 3M, but...

Machine learning - CNN/LSTM

FooBar -> X, y
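One way to read "FooBar -> X, y" is as a character-level labeling task. The sketch below assumes X is the sequence of characters of the joined identifier and y marks the positions where a new sub-token starts; the function name `make_sample` and this encoding are assumptions, since the slide does not spell them out.

```python
def make_sample(tokens):
    """Build (X, y) from the known sub-tokens of one identifier.

    X: list of characters of the concatenated tokens.
    y: 1 at every position where a new token starts (except position 0),
       0 elsewhere. This labeling scheme is an assumption.
    """
    chars = "".join(tokens)
    labels = [0] * len(chars)
    position = 0
    for token in tokens[:-1]:
        position += len(token)
        labels[position] = 1  # the next token begins here
    return list(chars), labels

X, y = make_sample(["foo", "bar"])
print(X)  # ['f', 'o', 'o', 'b', 'a', 'r']
print(y)  # [0, 0, 0, 1, 0, 0]
```

A sequence model such as a BiLSTM can then be trained to predict y from X.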

BiLSTM

Result: 49M -> 1M

Identifier embeddings

Nearest to “foo”

Analogies

“send” - “receive” + “pop” = “push”

“database” - “query” + “tune” = “settings”

\( \begin{split} V_1 \Leftrightarrow & \,\texttt{"foo"} \\ \\ V_2 \Leftrightarrow & \,\texttt{"bar"} \\ \\ V_3 \Leftrightarrow & \,\texttt{"integrate"} \end{split} \)

\( distance(V_1, V_2) < distance(V_1, V_3) \)

\( distance(V_i, V_j) = \arccos \frac{V_i \cdot V_j}{\left\lVert V_i \right\rVert \left\lVert V_j \right\rVert} \)

Scalar product
Norm
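The distance formula translates directly to NumPy. A minimal sketch; the clipping step is an addition that guards against rounding errors pushing the cosine slightly outside [-1, 1].

```python
import numpy as np

def angular_distance(v_i, v_j):
    """arccos of the cosine similarity between two vectors, in radians."""
    v_i = np.asarray(v_i, dtype=np.float64)
    v_j = np.asarray(v_j, dtype=np.float64)
    cosine = np.dot(v_i, v_j) / (np.linalg.norm(v_i) * np.linalg.norm(v_j))
    # Clip to the valid arccos domain to absorb floating point error.
    return float(np.arccos(np.clip(cosine, -1.0, 1.0)))

print(angular_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors: pi / 2
```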

How to estimate \(V_i \cdot V_j\) ?

class Database:
    def connect(self, user, password, host, port):
        self._tcp_socket_connect(host, port)
        try:
            self._authenticate(user, password)
        except AuthenticationError as e:
            self.socket.close()
            raise e from None
        

Splitting and normalization

    _tcp_socket_connect -> [tcp, socket, connect]
    AuthenticationError -> [authentication, error]
    authentication, authenticate -> authenticate
        
Contexts extracted from the class above, one per nested scope (×2 marks a token that occurs twice):

database, connect ×2, user ×2, password ×2, host ×2, port ×2, tcp, socket ×2, authenticate ×2, error, close

connect ×2, user ×2, password ×2, host ×2, port ×2, tcp, socket ×2, authenticate ×2, error, close

connect, user, password, host, port

tcp, socket, connect, host, port

authenticate ×2, user, password, error, socket, close

authenticate, user, password

authenticate, error, socket, close

authenticate, error

socket, close

Incidence matrix \(C_{ij}\)
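A minimal sketch of how \(C_{ij}\) could be filled from token contexts, reading it as a symmetric co-occurrence count. It assumes every pair of distinct tokens sharing a context adds 1, regardless of how often each token repeats inside it; the talk does not state the exact weighting, and the name `cooccurrence_matrix` is illustrative.

```python
from itertools import combinations

import numpy as np

def cooccurrence_matrix(contexts):
    """Count how many contexts each pair of distinct tokens shares."""
    vocabulary = sorted({token for context in contexts for token in context})
    index = {token: i for i, token in enumerate(vocabulary)}
    C = np.zeros((len(vocabulary), len(vocabulary)), dtype=np.int64)
    for context in contexts:
        # Each unordered pair of distinct tokens counts once per context.
        for a, b in combinations(sorted(set(context)), 2):
            C[index[a], index[b]] += 1
            C[index[b], index[a]] += 1
    return vocabulary, C

vocabulary, C = cooccurrence_matrix([
    ["authenticate", "error", "socket", "close"],
    ["authenticate", "error"],
    ["socket", "close"],
])
print(vocabulary)
print(C)
```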

Pointwise Mutual Information (PMI)

$$ V_i \cdot V_j = PMI_{ij} = \log\frac{C_{ij} \sum C}{\sum_{k = 1}^N C_{ik}\sum_{k = 1}^N C_{jk}} $$
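The PMI formula can be evaluated for a whole matrix at once. A minimal sketch: the function name `pmi` is illustrative, and pairs that never co-occur (\(C_{ij} = 0\)) come out as negative infinity here, which a real pipeline would have to handle separately.

```python
import numpy as np

def pmi(C):
    """PMI_ij = log(C_ij * sum(C) / (sum_k C_ik * sum_k C_jk))."""
    C = np.asarray(C, dtype=np.float64)
    total = C.sum()
    row_sums = C.sum(axis=1)
    # log(0) yields -inf for pairs that never co-occur; silence the warning.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.log(C * total / np.outer(row_sums, row_sums))

print(pmi([[0, 2], [2, 0]]))
```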

Representation Learning on Explicit Matrix

Stochastic Gradient Descent
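To illustrate the idea of learning vectors whose scalar products reproduce an explicit matrix, here is a plain SGD on squared error. This is a simplified sketch and not Swivel: the loss, the learning rate, the initialization, and the names `fit_embeddings` and `squared_error` are all assumptions made for the example.

```python
import numpy as np

def squared_error(V, target):
    """Sum of squared errors over finite off-diagonal entries."""
    mask = np.isfinite(target) & ~np.eye(len(target), dtype=bool)
    return float(((V @ V.T - target)[mask] ** 2).sum())

def fit_embeddings(target, dim=2, lr=0.05, epochs=200, seed=0):
    """Fit V so that V[i] @ V[j] approximates target[i, j] for i != j."""
    rng = np.random.default_rng(seed)
    n = target.shape[0]
    V = rng.normal(scale=0.1, size=(n, dim))
    pairs = [(i, j) for i in range(n) for j in range(n)
             if i != j and np.isfinite(target[i, j])]
    for _ in range(epochs):
        # One stochastic pass: visit every pair once in random order.
        for idx in rng.permutation(len(pairs)):
            i, j = pairs[idx]
            error = V[i] @ V[j] - target[i, j]
            grad_i, grad_j = error * V[j], error * V[i]
            V[i] -= lr * grad_i
            V[j] -= lr * grad_j
    return V

# Toy target built from known vectors, only to exercise the code.
U = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
target = U @ U.T
V = fit_embeddings(target)
print(squared_error(V, target))
```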

Swivel

Similar repository search

torch/nn

src-d/go-git

Pipeline

  1. Clone the repositories.
  2. Classify and parse source code.
  3. Preprocess identifiers: split, normalize, filter, stem.
  4. Prepare embeddings: co-occurrence matrix and Swivel.
  5. Weighted BOW per repository.
  6. WMD.

Training identifier embeddings

Stage              | Time     | Resources                  | Size on disk
Cloning 143k repos | 3 days   | 20x2 cores, 256 GB RAM     | 2.6 TB
Dataset            | 4 days   | 20x2 cores, 256 GB RAM     | 2 TB (31 GB in xz)
fastprep           | 2 days   | 16x2 cores, 256 GB RAM     | 20 GB
Swivel             | 14 hours | 2 Titan X'2016 + 2 1080Ti  | 5.6 GB

Embeddings

Weighted nBOW

Let's interpret every repository as a weighted bag-of-words.

We use TF-IDF to weight the identifiers that occur in it.

tensorflow/tensorflow:
    tfreturn	67.780249
  oprequires	63.968142
      doblas	63.714424
    gputools	62.337396
    tfassign	61.545926
    opkernel	60.721556
        sycl	57.064558
         hlo	55.723587
     libxsmm	54.820668
  tfdisallow	53.666890
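A minimal sketch of such weights, assuming the common variant "raw count times log(N / document frequency)" with each repository treated as one document. The talk does not say which TF-IDF variant was used, and the repository names and tokens below are made up for the example.

```python
import math
from collections import Counter

def tfidf_bags(repos):
    """Map repository name -> {token: TF-IDF weight}."""
    n_repos = len(repos)
    document_frequency = Counter()
    for tokens in repos.values():
        document_frequency.update(set(tokens))
    bags = {}
    for name, tokens in repos.items():
        counts = Counter(tokens)
        bags[name] = {
            token: count * math.log(n_repos / document_frequency[token])
            for token, count in counts.items()
        }
    return bags

bags = tfidf_bags({
    "repo_a": ["tensor", "tensor", "socket"],  # hypothetical token lists
    "repo_b": ["socket", "git"],
})
print(bags["repo_a"])
```

A token that occurs in every repository gets weight 0 under this variant, so only distinctive identifiers carry weight.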

Word Mover's Distance

WMD calculation in a nutshell

WMD evaluation is \(O(N^3)\) and becomes slow at N≈100.

  1. Sort samples by centroid distance.
  2. Evaluate the first k WMDs.
  3. For every subsequent sample, solve the relaxed LP, which gives a lower bound on the WMD.
  4. If the bound is greater than the distance to the farthest of the current k nearest neighbors, skip the sample.
  5. Otherwise, evaluate the WMD.

This avoids 95% of WMD evaluations on average.
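The pruning loop can be sketched independently of how the distances are computed. The version below takes three callables as parameters (`centroid_distance`, `lower_bound`, `exact_distance`), all hypothetical placeholders; it relies on the relaxed problem being a lower bound on the exact distance, which is what makes skipping safe.

```python
import heapq

def knn_with_pruning(query, samples, k,
                     centroid_distance, lower_bound, exact_distance):
    """Return (k nearest as (distance, sample) pairs, exact evaluations)."""
    ordered = sorted(samples, key=lambda s: centroid_distance(query, s))
    heap = []  # max-heap via negated distances: (-distance, position, sample)
    evaluated = 0
    for position, sample in enumerate(ordered):
        if len(heap) >= k:
            farthest = -heap[0][0]
            # Cheap bound already exceeds the worst kept neighbor: skip.
            if lower_bound(query, sample) > farthest:
                continue
        distance = exact_distance(query, sample)
        evaluated += 1
        if len(heap) < k:
            heapq.heappush(heap, (-distance, position, sample))
        elif distance < -heap[0][0]:
            heapq.heapreplace(heap, (-distance, position, sample))
    neighbors = sorted(((-neg, s) for neg, _, s in heap), key=lambda t: t[0])
    return neighbors, evaluated

# Toy stand-ins: numbers instead of documents, |q - s| instead of WMD.
neighbors, evaluated = knn_with_pruning(
    0, [5, 1, 9, 2, 7, 3], k=2,
    centroid_distance=lambda q, s: abs(q - s),
    lower_bound=lambda q, s: 0.5 * abs(q - s),
    exact_distance=lambda q, s: abs(q - s),
)
print(neighbors, evaluated)  # [(1, 1), (2, 2)] 3
```

In the toy run only 3 of the 6 samples need the exact distance; the other 3 are rejected by the bound.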

Code review

Typo correction inside code identifiers

class ClasName:
    @classmethod
    def function_name(cls) -> str:
        varyable_name = "Hello, I'm ClassName object!"
        return varyable_name

Typo correction

funktion -> function

GetValu -> GetValue

str_lenght -> str_length

Typo correction inside code identifiers

class ClassName:
    @classmethod
    def function_name(cls) -> str:
        variable_name = "Hello, I'm ClassName object!"
        return variable_name

The correction pipeline

Candidate generation

  1. Find candidates with SymSpell.
  2. Use the identifier to which the token belongs as the context.
  3. Compute features based on frequencies, embeddings, similarity, and edit distance between the elements.
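To make the candidate step concrete, here is a brute-force lookup by edit distance. It is a stand-in and not SymSpell, which reaches the same kind of result much faster through precomputed deletions; the plain Levenshtein distance, the threshold of 2, and the toy vocabulary are assumptions for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]

def candidates(token, vocabulary, max_distance=2):
    """Vocabulary words within max_distance edits, closest first."""
    scored = ((edit_distance(token, word), word) for word in vocabulary)
    return sorted(pair for pair in scored if pair[0] <= max_distance)

print(candidates("funktion", ["function", "fraction", "junction"]))
# [(1, 'function'), (2, 'junction')]
```

The distance itself is one of the features a ranking model could use on top of frequencies and embeddings.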

Candidate ranking

Suggestions

Training and testing

Results on a token level

Note: the test dataset is not ground truth.

Developer similarity

Pipeline

  1. Clone the organization.
  2. Classify and parse source code.
  3. Extract identifiers per file.
  4. Extract experience per developer - languages, files, etc.
  5. Topic modeling.
  6. Usage: KD-tree, clustering and so on.

From git history

Dev \ Files | File 1 | File 2 | File 3
Dev 1       | 0      | 0.1    | 0.8
Dev 2       | 0.4    | 0.1    | 0.5

From head revision

Files \ Tokens | Token 1 | Token 2 | Token 3
File 1         | 0       | 0.1     | 0.8
File 2         | 0.4     | 0.1     | 0.56
File 3         | 0.43    | 0.18    | 0.9

Result after multiplication

Dev \ Tokens | Token 1 | Token 2 | Token 3
Dev 1        | 0.384   | 0.154   | 0.776
Dev 2        | 0.255   | 0.14    | 0.826
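The multiplication step is a single matrix product. A minimal sketch using the values from the two input tables above; the variable names are illustrative.

```python
import numpy as np

# Rows: developers, columns: files (from git history).
dev_files = np.array([
    [0.0, 0.1, 0.8],
    [0.4, 0.1, 0.5],
])

# Rows: files, columns: tokens (from the head revision).
file_tokens = np.array([
    [0.0, 0.1, 0.8],
    [0.4, 0.1, 0.56],
    [0.43, 0.18, 0.9],
])

# Rows: developers, columns: tokens.
dev_tokens = dev_files @ file_tokens
print(dev_tokens)
```

Each developer row becomes a weighted sum of the token rows of the files that developer touched.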

Summary

Ideas? Doubts? Write to us!

Thank you

https://bit.ly/2lOQAw0