Machine Learning on Source Code

Egor Bulychev, source{d}.

@egor_bu

#MLonCode

🚀 Machine Learning

GitHub

Machine Learning on Source Code

MLonCode

Examples

Code naturalness

class ???:
    def connect(self, dbname, user, password, host, port):
        # ...
    def query(self, sql):
        # ...
    def close(self):
        # ...

class Database:
    def connect(self, dbname, user, password, host, port):
        # ...
    def query(self, sql):
        # ...
    def close(self):
        # ...

class Foo:
    def bar(self, qux):
        # ...
    def baz(self, waldo):
        # ...
    def do(self, really):
        # ...

Projects similar to MariaDB/server?


Details

Exploratory search

Similar code detection

Code autocompletion

Code style

class foobar:
    def connecttoserver(self):
        myserverhost = globalconfig.server.host
        
class FooBar:
    def connect_to_server(self):
        myServerHost = globalConfig.server.host
        

Many other applications

Your code as a crime scene

Data

Datasets

PGA

Details in the paper.

Tools for MLonCode

Plumbing

source{d} engine

GitHub

>>> # `spark` is an existing SparkSession, e.g. the one provided by the pyspark shell.
>>> from sourced.engine import Engine
>>> engine = Engine(spark, "/path/to/siva/files", "siva")
>>> engine.repositories.references.head_ref \
...     .commits.tree_entries.blobs \
...     .classify_languages() \
...     .select("blob_id", "path", "lang") \
...     .show()

AST

How to parse

Universal AST

dashboard.bblf.sh

>>> engine.repositories.references.head_ref \
...     .commits.tree_entries.blobs \
...     .classify_languages() \
...     .filter('lang = "Python"') \
...     .extract_uasts() \
...     .query_uast('//*[@roleIdentifier]') \
...     .extract_tokens("result", "tokens") \
...     .select("blob_id", "path", "tokens")

Powerful

modelforge

GitHub

Machine Learning

hercules

GitHub

MLonCode in hercules

source{d} ml

GitHub

Solved practical problems

Identifier embeddings

\( \begin{split} V_1 \Leftrightarrow & \,\texttt{"foo"} \\ \\ V_2 \Leftrightarrow & \,\texttt{"bar"} \\ \\ V_3 \Leftrightarrow & \,\texttt{"integrate"} \end{split} \)

\( distance(V_1, V_2) < distance(V_1, V_3) \)

\( distance(V_i, V_j) = \arccos \frac{V_i \cdot V_j}{\left\lVert V_i \right\rVert \left\lVert V_j \right\rVert} \)

Scalar product
Norm
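
A minimal sketch of the distance above, in plain Python. The three vectors are hypothetical, hand-picked values used only to illustrate the inequality; they are not trained embeddings.

```python
import math

def angular_distance(u, v):
    """Angle between u and v: arccos of the scalar product over the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] so rounding errors never push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)

# Hypothetical toy vectors for "foo", "bar" and "integrate".
v_foo, v_bar, v_integrate = [1.0, 0.1], [0.9, 0.2], [0.0, 1.0]
print(angular_distance(v_foo, v_bar) < angular_distance(v_foo, v_integrate))
```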

How to estimate \(V_i \cdot V_j\) ?

class Database:
    def connect(self, user, password, host, port):
        self._tcp_socket_connect(host, port)
        try:
            self._authenticate(user, password)
        except AuthenticationError as e:
            self.socket.close()
            raise e from None
        

Splitting and normalization

            _tcp_socket_connect -> [tcp, socket, connect]
            AuthenticationError -> [authentication, error]
            authentication, authenticate -> authenticate
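
The two steps above could be sketched as follows. The regular expression and the TOY_STEMS table are assumptions made for illustration; a real pipeline would plug in a proper stemmer instead of a one-entry lookup.

```python
import re

# Stand-in for a real stemmer; it covers only the single example shown above.
TOY_STEMS = {"authentication": "authenticate"}

def split_identifier(name):
    """Split on non-letters and camelCase humps, then lowercase the parts."""
    parts = []
    for chunk in re.split(r"[^A-Za-z]+", name):
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", chunk))
    return [part.lower() for part in parts]

def normalize(name):
    """Split an identifier and map every part to its (toy) stem."""
    return [TOY_STEMS.get(part, part) for part in split_identifier(name)]

print(split_identifier("_tcp_socket_connect"))  # ['tcp', 'socket', 'connect']
print(split_identifier("AuthenticationError"))  # ['authentication', 'error']
print(normalize("AuthenticationError"))         # ['authenticate', 'error']
```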
        
Identifiers found in each scope of the code above (×2 marks an identifier that occurs twice):

Whole class: database, connect ×2, user ×2, password ×2, host ×2, port ×2, tcp, socket ×2, authenticate ×2, error, close

Method connect, signature and body: connect ×2, user ×2, password ×2, host ×2, port ×2, tcp, socket ×2, authenticate ×2, error, close

Signature of connect: connect, user, password, host, port

Call self._tcp_socket_connect(host, port): tcp, socket, connect, host, port

try/except statement: authenticate ×2, user, password, error, socket, close

Call self._authenticate(user, password): authenticate, user, password

except clause: authenticate, error, socket, close

Exception type AuthenticationError: authenticate, error

Call self.socket.close(): socket, close

Co-occurrence (incidence) matrix \(C_{ij}\): how often identifiers \(i\) and \(j\) appear in the same scope
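
One way to fill such a matrix from per-scope identifier lists, as a rough sketch. It assumes the simplest counting rule, one count per scope that contains both identifiers; the weighting used in practice may differ. The scopes are four of the identifier lists from the Database example.

```python
from collections import Counter
from itertools import combinations

def cooccurrences(scopes):
    """Count, for every unordered identifier pair, the scopes containing both."""
    counts = Counter()
    for identifiers in scopes:
        # Sorting makes the pair key independent of the order inside a scope.
        for a, b in combinations(sorted(set(identifiers)), 2):
            counts[a, b] += 1
    return counts

scopes = [
    ["authenticate", "user", "password"],
    ["authenticate", "error", "socket", "close"],
    ["authenticate", "error"],
    ["socket", "close"],
]
C = cooccurrences(scopes)
print(C["authenticate", "error"])  # 2
print(C["close", "socket"])        # 2
```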

Pointwise Mutual Information (PMI)

$$ V_i \cdot V_j = \mathrm{PMI}_{ij} = \log\frac{C_{ij} \sum_{k, l = 1}^N C_{kl}}{\sum_{k = 1}^N C_{ik} \sum_{k = 1}^N C_{jk}} $$
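
The PMI formula can be checked on a tiny matrix. This is only a sketch: the 3×3 matrix is made up, and returning minus infinity for pairs that never co-occur is an assumption of this example.

```python
import math

def pmi(C, i, j):
    """Pointwise mutual information of identifiers i and j.

    C is a symmetric co-occurrence matrix given as a list of rows.
    """
    if C[i][j] == 0:
        return float("-inf")  # the pair was never seen together
    total = sum(sum(row) for row in C)
    return math.log(C[i][j] * total / (sum(C[i]) * sum(C[j])))

# Made-up counts for three identifiers.
C = [
    [0, 2, 0],
    [2, 0, 1],
    [0, 1, 0],
]
print(pmi(C, 0, 1))  # log(2 * 6 / (2 * 3)) = log 2
```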

Representation Learning on Explicit Matrix

Stochastic Gradient Descent

Swivel
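
To illustrate the idea of learning vectors whose scalar products reproduce an explicit matrix, here is a deliberately simplified sketch. It runs plain stochastic gradient descent on a squared error and is not Swivel's actual objective; the target values, dimensionality, step count and learning rate are all arbitrary assumptions.

```python
import random

def factorize(target, dim=2, steps=2000, lr=0.05, seed=0):
    """Fit vectors V so that V[i]·V[j] approximates target[(i, j)] by plain SGD."""
    rng = random.Random(seed)
    n = 1 + max(max(i, j) for i, j in target)
    vectors = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(n)]
    pairs = list(target.items())
    for _ in range(steps):
        (i, j), value = rng.choice(pairs)  # one random matrix cell per step
        vi, vj = vectors[i], vectors[j]
        error = sum(a * b for a, b in zip(vi, vj)) - value
        # Gradient of 0.5 * error**2 with respect to vi is error * vj, and vice versa.
        vectors[i] = [a - lr * error * b for a, b in zip(vi, vj)]
        vectors[j] = [b - lr * error * a for a, b in zip(vi, vj)]
    return vectors

def squared_error(target, vectors):
    """Sum of squared differences between scalar products and target values."""
    return sum(
        (sum(a * b for a, b in zip(vectors[i], vectors[j])) - value) ** 2
        for (i, j), value in target.items()
    )

# Made-up PMI-like targets for three identifiers.
target = {(0, 1): 0.7, (1, 2): 0.7, (0, 2): -0.5}
print(squared_error(target, factorize(target, steps=0)))  # error at initialization
print(squared_error(target, factorize(target)))           # error after training
```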

Nearest to “foo”

Analogies

“bug” - “test” + “expect” = “suppress”

“database” - “query” + “tune” = “settings”

“send” - “receive” + “pop” = “push”
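
Analogies like these come from vector arithmetic followed by a nearest-neighbour search. The sketch below uses hypothetical three-dimensional vectors chosen by hand so that the last analogy works out; trained embeddings are learned, not hand-written, and have many more dimensions.

```python
import math

# Hypothetical, hand-picked vectors for illustration only.
EMBEDDINGS = {
    "send": [1.0, 0.0, 0.0],
    "receive": [0.0, 1.0, 0.0],
    "pop": [0.0, 1.0, 1.0],
    "push": [1.0, 0.0, 1.0],
    "close": [0.2, 0.2, -1.0],
}

def cosine(u, v):
    """Cosine similarity: scalar product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(query, exclude=()):
    """Word whose vector is most similar to the query vector."""
    candidates = [word for word in EMBEDDINGS if word not in exclude]
    return max(candidates, key=lambda word: cosine(query, EMBEDDINGS[word]))

def analogy(a, b, c):
    """Word closest to a - b + c, ignoring the three input words."""
    va, vb, vc = EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c]
    query = [x - y + z for x, y, z in zip(va, vb, vc)]
    return nearest(query, exclude={a, b, c})

print(analogy("send", "receive", "pop"))  # push
```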

Typos

Summary

Idea? Doubt? Write to us!

Thank you

bit.ly/2MAU7qG