Workshop given at DevFest Nantes 2019.
Slides: on gDrive
OSS tools covered:
Abstract
Machine Learning on Source Code (MLonCode) is an emerging and exciting research domain which stands at the sweet spot between deep learning, natural language processing, social science, and programming.
During this two-hour workshop, we are going to show you, step by step, how to extract insights from codebases by shedding light on these crucial aspects:
- What information is available in your code
- How to extract this information
- What you can do with this knowledge: which tasks MLonCode can solve
- Which models can be used to solve them
To get our hands dirty, we will solve several example tasks, using source{d}, an open source stack to gain insights from codebases:
- Suggest function names automatically
- Cluster developers
- Search projects by similarity
Prerequisites: a laptop with Docker installed. We will provide an image to all participants.
- Docker
Import Docker images (works offline):
docker load -i images/jupyter.tgz
docker load -i images/gitbase.tgz
docker load -i images/bblfshd-with-drivers.tgz
docker images
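If more archives are added to `images/` later, the loads can be scripted instead of typed one by one. The function below is a minimal sketch, assuming every archive sits in a single directory and ends in `.tgz`; the `load_images` name and the `DOCKER` override variable are illustrative and not part of the workshop material.

```shell
#!/usr/bin/env bash
# Sketch: load every image archive found in a directory (default: images/).
# DOCKER can be overridden (for example DOCKER=echo) to get a dry run;
# that variable is an assumption of this sketch, not a workshop setting.
load_images() {
  local dir="${1:-images}"
  local archive
  for archive in "$dir"/*.tgz; do
    # Skip the unexpanded glob when the directory holds no .tgz file.
    [ -e "$archive" ] || continue
    "${DOCKER:-docker}" load -i "$archive"
  done
}
```

Calling `load_images images` followed by `docker images` should list the same images as the three explicit commands.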
Run bblfsh
docker run \
--detach \
--rm \
--name devfest_bblfshd \
--privileged \
--publish 9432:9432 \
bblfsh/bblfshd:v2.15.0-drivers \
--log-level DEBUG
Run gitbase
docker run \
--detach \
--rm \
--name devfest_gitbase \
--publish 3306:3306 \
--link devfest_bblfshd:devfest_bblfshd \
--env BBLFSH_ENDPOINT=devfest_bblfshd:9432 \
--env MAX_MEMORY=1024 \
--volume "$(pwd)/repos/git-data:/opt/repos" \
srcd/gitbase:v0.24.0-rc2
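Both containers start detached, so the command returns before the services accept connections. The helper below is a hedged sketch of a readiness check on the published ports (9432 for bblfshd, 3306 for gitbase); `wait_for_port` is a hypothetical name, and the approach relies on bash's `/dev/tcp` redirection rather than on anything the workshop provides.

```shell
#!/usr/bin/env bash
# Sketch: poll a TCP port until it accepts connections or a timeout expires.
# Uses bash's /dev/tcp pseudo-device, so it needs bash rather than plain sh.
wait_for_port() {
  local host="$1" port="$2" timeout="${3:-30}"
  local waited=0
  # The subshell opens a connection on file descriptor 3 and closes it on exit.
  until (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; do
    # Give up once the allotted number of seconds has been spent.
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 1
    waited=$((waited + 1))
  done
}
```

For example, `wait_for_port 127.0.0.1 9432 && wait_for_port 127.0.0.1 3306` returns success only once both ports are open.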
Run the jupyter image
docker run \
--rm \
--name devfest_jupyter \
--publish 8888:8888 \
--link devfest_bblfshd:devfest_bblfshd \
--link devfest_gitbase:devfest_gitbase \
--volume "$(pwd)/notebooks:/devfest/notebooks" \
--volume "$(pwd)/repos:/devfest/repos" \
mloncode/devfest
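The bblfshd and gitbase containers run detached with `--rm`, so they keep running after the notebook session ends and are removed only once stopped. The following is a small tear-down sketch using the container names from the commands above; the `stop_workshop` name and the `DOCKER` override are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: stop the three workshop containers by the names used above.
# Jupyter goes first because it links to the other two.
stop_workshop() {
  local name
  for name in devfest_jupyter devfest_gitbase devfest_bblfshd; do
    # "|| true" keeps going when a container is already stopped or absent.
    "${DOCKER:-docker}" stop "$name" || true
  done
}
```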
With make
To build the workshop image and launch the 3 required containers
make build-and-run
To only launch the 3 required containers
make
We are going to use the top 50 repositories from the Apache Software Foundation throughout this workshop.
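Once gitbase is running, the repositories mounted from `repos/git-data` can be inspected through its MySQL-compatible interface. The wrapper below is a sketch under several assumptions: a `mysql` client is installed on the host, gitbase still accepts its default `root` user without a password, and port 3306 is published as in the command above. `gitbase_query` and the `MYSQL` override are illustrative names.

```shell
#!/usr/bin/env bash
# Sketch: send one SQL statement to the gitbase container over TCP.
# MYSQL can be overridden (for example MYSQL=echo) to print the arguments
# instead of connecting; that variable is an assumption of this sketch.
gitbase_query() {
  "${MYSQL:-mysql}" --host=127.0.0.1 --port=3306 --user=root --execute="$1"
}
```

A quick sanity check is `gitbase_query 'SELECT COUNT(*) FROM repositories'`, which should report the number of repositories gitbase has indexed.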
Notebook 1: data collection pipeline
Build a vector model for projects and developers using Topic Modelling of code identifiers.
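Topic modelling works on a bag of words, which here is built from code identifiers. The notebooks rely on the source{d} stack for extraction; the pipeline below is only a rough regex-based approximation meant to show the idea. It splits camelCase and snake_case identifiers into lowercase subtokens and counts them. The `identifier_counts` name is hypothetical, and the pattern also picks up language keywords.

```shell
#!/usr/bin/env bash
# Sketch: turn source text read on stdin into subtoken counts.
identifier_counts() {
  grep -oE '[A-Za-z_][A-Za-z0-9_]*' |        # crude identifier match (keywords too)
    sed -E 's/([a-z0-9])([A-Z])/\1 \2/g' |   # camelCase -> camel Case
    tr 'A-Z_' 'a-z ' |                       # lowercase; snake_case -> snake case
    tr -s ' ' '\n' |                         # one subtoken per line
    grep -v '^$' |                           # drop blanks left by leading underscores
    sort | uniq -c | sort -k1,1nr -k2        # count, most frequent first
}
```

Running `identifier_counts < SomeFile.java` (any source file) prints one `count subtoken` pair per line; these per-file counts are the kind of input a topic model consumes.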
Notebook 2: project and developer similarities
Train an NMT seq2seq model to predict method names from the identifiers in method bodies.