This repository is the main codebase for the Mosaic project, containing both the utility used to generate and convert graphs as well as the main graph engine, Mosaic.
This version of Mosaic should be runnable on most Linux-based systems. Mosaic builds for both the CPU-based version as well as the Xeon Phi-based version from almost the same code-base, although the Xeon Phi-based version requires a specialized hardware setup and a modified Kernel to support our 9p-based file system. If you would like to setup the Xeon Phi-based version, please get in touch with us.
- Cmake 2.8 or newer
- A recent gcc (4.8+) or icc
- Googletest needs to be installed, this can be done following these instructions.
To build Mosaic, follow these steps:
$ cd src
$ make cmake
$ make
This will build the binaries of Mosaic, in a release and debug configuration into the build-folder.
Mosaic supports multiple configurations, for different machines, to co-exist
and to be picked via the hostname of the machine.
As such, we recommend localizing the configuration in the config/
directory
to suit the needs of your environment (e.g., if your server is called foo
,
create config/config_foo.py
).
We provide an example with the configuration of our ramjet
server, as
presented in the paper (config/config_ramjet.py
).
In the paper, we use six datasets (3 real-world, 3 synthetic):
- uk-2007: http://law.di.unimi.it/webdata/uk-2007-05/
- Twitter: http://an.kaist.ac.kr/traces/WWW2010.html
- Hyperlinks 2014: http://webdatacommons.org/hyperlinkgraph/2014-04/download.html
- rmat-24, rmat-27 and rmat-trillion are generated using the Mosaic generator itself, with the same parameters as graph500.
A smaller test dataset we use is twitter_rv_small.
These graphs are already included in the configuration of Mosaic.
If you wish to add another graph dataset, it has to be added in the
configuration files.
This is done adding the appropriate settings in config/default.py
(i.e., the
variables SG_INPUT_WEIGHTED
, SG_INPUT_FILE
and SG_GRAPH_SETTINGS_DELIM
),
following a similar pattern as the other graphs.
For faster generation, Mosaic can read graphs from a binary format (the same as GridGraph, in case one happens to use both for benchmarking purposes). We provide a script to convert txt-based graphs to the binary format:
$ cd src/tools/scripts
$ ./convert_graph.py --input /data/graph/input/twitter-small/twitter_rv_small.net --output /data/graph/input/twitter-small/twitter_rv_small.bin
With the configuration setup correctly, generating the Mosaic-internal format
of a graph can be started via the src/tools/scripts/generate_graph.py
script.
Examples of this conversion:
$ cd src/tools/scripts
$ ./generate_graph.py --dataset uk2007 --binary --in-memory
$ ./generate_graph.py --dataset uk2007 --binary --in-memory --weighted
$ ./generate_graph.py --dataset rmat-27 --rmat --in-memory
Mosaic supports using multiple disk-paths for input files.
This is controlled via the parameter SG_NMIC
which sets the number of edge
processors that will be started.
Each edge processor can read either of the tile and meta data from a
different partition.
This is controlled by the SG_DATAPATH_VERTEX_ENGINE
configuration (for the
meta-data) and the SG_DATAPATH_EDGE_ENGINE
configuration for the tile data.
After setting these parameters, the input-data from the converted graph
datasets needs to be copied into the correct file system locations (as set by
the SG_DATAPATH_VERTEX_ENGINE
and SG_DATAPATH_EDGE_ENGINE
configuration).
This is done via the rebalance script, which needs to be run anytime the
data path or the number of edge processors (SG_NMIC
configuration) is changed.
$ cd src/tools/scripts
$ ./rebalance_input.py --dataset twitter-small
$ ./rebalance_input.py --dataset twitter-small-weighted
This optimization enables Mosaic to skip the computation for tiles without
active vertices and edges.
To build the index structure needed for this optimization (a very useful
optimization for BFS, Connected Components, ...), run the following commands
(specific to every dataset, here twitter-small
):
$ cd src/tools/scripts
$ ./run_tiles_indexer.py --dataset twitter-small
Finally, Mosaic is ready to run, e.g. for executing Pagerank or Single-Source
Shortest Path on the twitter-small
dataset:
$ cd src/tools/scripts
$ ./run_mosaic.py --dataset twitter-small --algorithm pagerank --max-iteration 20
$ ./run_mosaic.py --dataset twitter-small-weighted --algorithm sssp --max-iteration 50
Mosaic will present statistics on the execution time per iteration and optionally log the results for later analysis.
To run a benchmark suite of algorithms and datasets as well as a quick analysis of these, we provide two scripts, you might need to customize the datasets and algorithms executed and analyzed by these scripts to match your configuration:
$ cd src/tools/scripts
$ ./run_benchmark.py
$ ./parse_log.py
- Steffen Maass [email protected]
- Changwoo Min [email protected]
- Sanidhya Kashyap [email protected]
- Woonhak Kang [email protected]
- Mohan Kumar [email protected]
- Taesoo Kim [email protected]
- EuroSys'17 (Best Student Paper):
@inproceedings{maass:mosaic,
title = {{Mosaic: Processing a Trillion-Edge Graph on a Single Machine}},
author = {Steffen Maass and Changwoo Min and Sanidhya Kashyap and Woonhak Kang and Mohan Kumar and Taesoo Kim},
booktitle = {Proceedings of the 12th European Conference on Computer Systems (EuroSys)},
month = apr,
year = 2017,
address = {Belgrade, RS},
}