Technological Stack for Machine Learning on Source Code
Egor Bulychev, source{d}
Content
- About source{d}
- Master plan
- Common details & bblfsh
- Topic modeling
- Similar repository search
- Future & opportunities
- Neural code suggestion
Master plan
- Clone all software repositories in the world
- Extract Abstract Syntax Trees (AST)
- Crazy data science things
- ???
- PROFIT
Machine learning on source code
Approaches to extracting information from source code
- Treat it as regular text and feed it to the black box
- Tokenize it using regexps (Pygments, highlight.js)
- Parse ASTs
- Compile
Universal Abstract Syntax Tree (UAST)
Shared node roles for every language
Code presented in the same format
pip3 install bblfsh
python3 -m bblfsh -f /path/to/file
You can try bblfsh/dashboard for UAST visualization
Source code classification
Problem: given millions of repositories, map their files to languages
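A minimal sketch of the problem, mapping files to languages by extension alone. Real classifiers (e.g. github/linguist) add content-based heuristics and a Bayesian classifier for ambiguous extensions; the table here is an illustrative subset, not a real driver list.

```python
import os

# Illustrative extension table (assumption: a tiny subset for the sketch).
EXT_TO_LANG = {".py": "Python", ".java": "Java", ".go": "Go", ".js": "JavaScript"}

def classify(path):
    # Look up the lowercased extension; fall back to "Unknown".
    return EXT_TO_LANG.get(os.path.splitext(path)[1].lower(), "Unknown")
```

Extension lookup alone misclassifies ambiguous cases (`.m` is Objective-C or MATLAB), which is why content-based heuristics are needed at scale.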
Identifiers in source code
cat = models.Cat()
cat.meow()
def quick_sort(arr):
...
Splitting
UpperCamelCase -> (upper, camel, case)
camelCase -> (camel, case)
FRAPScase -> (fraps, case)
SQLThing -> (sql, thing)
_Astra -> (astra)
CAPS_CONST -> (caps, const)
_something_SILLY_ -> (something, silly)
blink182 -> (blink)
FooBar100500Bingo -> (foo, bar, bingo)
Man45var -> (man, var)
method_name -> (method, name)
Method_Name -> (method, name)
101dalms -> (dalms)
101_dalms -> (dalms)
101_DalmsBug -> (dalms, bug)
101_Dalms45Bug7 -> (dalms, bug)
wdSize -> (size, wdsize)
Glint -> (glint)
foo_BAR -> (foo, bar)
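Most of the rules above can be sketched with a regex-based splitter: drop non-letters, then break each fragment into camel-case humps and acronym runs. This is a simplification; edge cases such as FRAPScase and the wdSize short-fragment merging need extra heuristics beyond this sketch.

```python
import re

def split_identifier(name):
    # Drop digits, underscores and other non-letters first.
    parts = re.split(r"[^a-zA-Z]+", name)
    words = []
    for part in parts:
        # Acronym runs ("SQL"), capitalized words ("Thing"), lowercase runs ("camel").
        for match in re.finditer(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+", part):
            words.append(match.group(0).lower())
    return words
```

For example, `split_identifier("FooBar100500Bingo")` yields `["foo", "bar", "bingo"]`, matching the table above.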
Stemming
We apply the Snowball stemmer to words longer than 6 characters
Weighted BOW
Let's interpret every repository as a weighted bag-of-words
We calculate TF-IDF to weigh the occurring identifiers
tensorflow/tensorflow:
tfreturn 67.78
oprequires 63.97
doblas 63.71
gputools 62.34
tfassign 61.55
opkernel 60.72
sycl 57.06
hlo 55.72
libxsmm 54.82
tfdisallow 53.67
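The weighting step can be sketched in a few lines: term frequency within a repository times the inverse document frequency across repositories. Function and variable names here are hypothetical; a real pipeline would use a library such as scikit-learn.

```python
import math
from collections import Counter

def tfidf_weights(repos):
    # repos: {repo_name: [identifier, ...]} after splitting and stemming.
    n = len(repos)
    df = Counter()  # document frequency: in how many repos each identifier occurs
    for identifiers in repos.values():
        df.update(set(identifiers))
    weights = {}
    for name, identifiers in repos.items():
        tf = Counter(identifiers)  # term frequency within this repository
        weights[name] = {w: tf[w] * math.log(n / df[w]) for w in tf}
    return weights
```

Identifiers that appear in every repository get weight 0, so generic names drop out and distinctive ones (like `sycl` or `hlo` above) dominate.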
Topic modeling
We can run topic modeling on these features
Paper on arXiv
Run on the whole world -> funny tagging
Run per ecosystem -> useful exploratory search
Topic #36 "Movies" →
Clustering TM vectors
- Topic modeling gives you vectors for every word and every file/repository
- Cluster them with src-d/kmcuda or facebookresearch/faiss
- Map them to a regular grid with src-d/lapjv
- Flatten dots for a chosen developer
CTO Máximo Cuadros' profile →
Topic modeling: Usage
$ docker run srcd/github_topics tensorflow/tensorflow
word2vec
- A family of algorithms that map words to dense vectors
- These vectors support linear algebra operations that make semantic sense
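The classic illustration of "linear algebra that makes sense" is the analogy king - man + woman ≈ queen. A toy sketch with hand-picked 2-D vectors (purely illustrative; real word2vec vectors have hundreds of dimensions and are learned from data):

```python
import math

# Toy 2-D "embeddings" with hand-picked values (assumption: not a trained model).
vecs = {
    "king": (0.9, 0.8),
    "queen": (0.9, 0.2),
    "man": (0.1, 0.8),
    "woman": (0.1, 0.2),
    "sort": (0.4, 0.5),
}

def cosine(a, b):
    # Cosine similarity between two 2-D vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman, computed component-wise.
target = tuple(k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"]))
best = max((word for word in vecs if word not in {"king", "man", "woman"}),
           key=lambda word: cosine(vecs[word], target))
# With these toy values the nearest remaining word is "queen".
```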
Identifier embeddings
- Use UAST to find context of identifier
- Consider every preprocessed identifier as a word
- Build the giant co-occurrence matrix
- Factor (embed) it
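The co-occurrence step above can be sketched as: for every UAST context window, count each pair of identifiers that appear together. A minimal sketch (the real pipeline builds a sparse matrix at much larger scale):

```python
from collections import Counter
from itertools import combinations

def cooccurrence(contexts):
    # contexts: one list of preprocessed identifiers per UAST context window.
    cooc = Counter()
    for window in contexts:
        for a, b in combinations(window, 2):
            cooc[a, b] += 1
            cooc[b, a] += 1  # keep the matrix symmetric
    return cooc
```

The resulting counts are what an engine like Swivel factors into dense embeddings.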
Swivel
A really good embedding engine that works on co-occurrence matrices
Scales with the size of the vocabulary
Like GloVe, but better - Paper on arXiv
We forked the TensorFlow implementation
nBOW
What if we combine bag-of-words model with embeddings?
We focus on finding related repositories
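In the nBOW (normalized bag-of-words) model, each repository becomes a probability distribution over its vocabulary. A minimal sketch (names are hypothetical):

```python
from collections import Counter

def nbow(identifiers):
    # Normalize counts so the weights sum to 1: the repository becomes a
    # probability distribution over its vocabulary, the input format for WMD.
    counts = Counter(identifiers)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}
```

Combined with embeddings, each repository is then a weighted cloud of points in the embedding space, which is exactly what Word Mover's Distance compares.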
Similar repository search:
Word Mover's Distance (WMD)
- Based on the paper
- An application of Earth Mover's Distance to sentences
- Works on the nBOW model
WMD calculation in a nutshell
WMD evaluation is O(N³); it becomes slow at N ≈ 100
- Sort samples by centroid distance
- Evaluate first k WMDs
- For every subsequent sample, solve the relaxed LP, which gives a lower bound
- If it is greater than the farthest among the k NN, go next
- Otherwise, evaluate the WMD
This lets us skip ~95% of WMD evaluations on average
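The pruning loop above can be sketched generically. The distance functions are passed in as parameters; in the real system `wmd` is the exact O(N³) solve, `lower_bound` the relaxed LP, and `centroid_dist` the distance between nBOW centroids (all names here are hypothetical):

```python
def knn_wmd(query, candidates, k, wmd, lower_bound, centroid_dist):
    # 1. Sort candidates by the cheap centroid distance.
    order = sorted(candidates, key=lambda c: centroid_dist(query, c))
    # 2. Evaluate the exact (expensive) WMD for the first k candidates.
    neighbors = sorted((wmd(query, c), c) for c in order[:k])
    # 3. For each remaining candidate, consult the cheap lower bound first.
    for c in order[k:]:
        if lower_bound(query, c) >= neighbors[-1][0]:
            continue  # cannot beat the current k-th neighbor: skip exact WMD
        d = wmd(query, c)
        if d < neighbors[-1][0]:
            neighbors[-1] = (d, c)
            neighbors.sort()
    return neighbors
```

Toy check: with documents as numbers, `wmd = abs(a - b)` and a lower bound of half that, `knn_wmd(0, [5, 1, 3, 2, 9], 2, ...)` keeps 1 and 2 while skipping most exact evaluations.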
Similar repository search: usage
Babelfish currently supports only Python and Java (see the driver status page)
pip3 install vecino
python3 -m vecino https://github.com/pytorch/pytorch
Examples
Future & opportunities
- Dependency tree building + libraries.io
- Robust copy-paste detection
- Source code generation in narrow area (pix2code)
- Automatic test generation
- Coding assistance: refactoring, suggestion, snippet search, review
- Many, many more
Neural code suggestion
- Legacy approach: work on the sequential token stream
- Modern approach: TreeLSTM
- PyTorch, TensorFlow/Fold - slow, immature
- We are still investigating
Summary
- There is a fun set of tools and libraries devoted to MLoSC
- They cover collection, extraction, engineering, training and running
- There are datasets and pretrained models too
- We believe that they will help developers program better and more efficiently