How to run
Reproduce the experimental results in the paper
Run scripts/setup_tbb.sh to set up the TBB library path before you run TurboFlux.
Each script in the scripts directory reproduces the results of one of the experiments in Figures 6-16 of the paper.
exp1_lsbench_tree_queries.sh: Figure 6
exp2_lsbench_graph_queries.sh: Figure 7
exp3_varying_insertion_rate.sh: Figure 8
exp4_varying_dataset_size.sh: Figure 9
exp5_subgraph_isomorphism.sh: Figure 10
exp6_varying_deletion_rate.sh: Figure 11
exp7_comparision_with_incisomat.sh: Figure 12
exp8_netflow_queries.sh: Figures 13 and 14
exp9_netflow_queries_from_EDBT15.sh: Figures 15 and 16
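As a sketch, a full reproduction run for Figure 6 chains the setup script and the corresponding experiment script. The block below only assembles and prints the two commands instead of executing them, since running them requires the TurboFlux source tree.

```shell
# Sketch only: print the command sequence for reproducing Figure 6.
# Paths are taken from this README; execute them from the repository root.
setup_cmd="scripts/setup_tbb.sh"                 # set up the TBB library path first
exp_cmd="scripts/exp1_lsbench_tree_queries.sh"   # then reproduce Figure 6
printf '%s\n' "$setup_cmd" "$exp_cmd"
```

The same pattern applies to exp2 through exp9 for the remaining figures.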
Files in the scripts/timeout/<dataset>/<system> directory list the queries for which system <system> reached the two-hour timeout on the <dataset> dataset.
As stated in the paper, we excluded the queries that timed out on any of our competitors.
In the experiments using Netflow (Figures 13 and 14), some of the queries are extremely non-selective, so even TurboFlux reached the timeout on the queries in the scripts/timeout/netflow/TurboFlux directory.
We reported this when we submitted the first version, and it was reviewed by the SIGMOD reviewers; the statement was omitted when we added the additional experimental results using the Netflow queries from EDBT15 [1] (Figures 15 and 16) in the revised version. (The following paragraph is excerpted from the original submission of the paper.)
After executing scripts/<exp_name>.sh, the results are stored in the experiments/<exp_name> directory.
Files in the experiments/<exp_name> directory: log files showing the matching results and some performance-related measurements. See the output format section below for details.
Files in experiments/<exp_name>/average directory: average elapsed time for incremental subgraph matching (cost($M(\Delta g, q)$)) (in milliseconds) and intermediate result size (in bytes) for each query size, insertion/deletion rate, or dataset size.
cost($M(\Delta g, q)$): since each update operation is processed very quickly and very frequently, the exact query processing time is difficult to measure directly. We therefore measure the elapsed time for processing the graph update stream with and without query processing, and take the difference as the query processing time.
intermediate result size: the size of the DCG ($g_{DCG}$). We compute it by traversing the DCG, counting the number of DCG edges, and multiplying by the size of a DCG edge.
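The difference-based measurement of cost($M(\Delta g, q)$) can be illustrated with made-up numbers: run the stream once with query processing and once with the -u option (data graph update only), then subtract the two elapsed times. The two timings below are hypothetical, not measured values.

```shell
# Illustrative arithmetic for cost(M(dg, q)); both timings are made up.
t_with_query=5230   # hypothetical elapsed time WITH query processing (ms)
t_update_only=1870  # hypothetical elapsed time with -u, update only (ms)
cost=$(( t_with_query - t_update_only ))
echo "cost = ${cost} ms"
```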
Test your own query and data
Run programs/turboflux [OPTIONS] -q <query_graph_file> -d <initial_data_graph_file> -s <graph_update_stream_file> -c <num_operations> in a bash shell.
The detailed configurations are as follows.
[OPTIONS]: three options to control the program behavior (defaults: off)
-u: update the data graph only, without query processing. This option can be used to measure the elapsed time for updating the data graph.
-i: use subgraph isomorphism semantics instead of graph homomorphism.
-p: prune duplicates in graph update stream. This option only affects data updates and does not affect query processing. This option was used in the experiments using Netflow dataset in the paper (Figure 13-16).
-q <query_graph_file>: the path to the query graph file.
-d <initial_data_graph_file>: the path to the initial data graph file.
-s <graph_update_stream_file>: the path to the graph update stream file.
-c <num_operations>: the number of update operations to process, starting from the first line of <graph_update_stream_file>.
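Putting the options together, an invocation might look like the command below. The input file names and the operation count are placeholders for illustration, not files shipped with the repository, so the block only assembles and prints the command.

```shell
# Example invocation assembled from the options above (file names are
# placeholders): subgraph isomorphism semantics (-i), processing the first
# 100,000 update operations of the stream.
cmd="programs/turboflux -i -q my_query.graph -d my_data.initial -s my_data.stream -c 100000"
echo "$cmd"
```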
Input/Output Format
Input format
Running TurboFlux requires three files: 1) a query graph, 2) an initial data graph, and 3) a graph update stream.
Query graph
The first line is the header and should be t # s <id>, where <id> is an arbitrary non-negative integer.
Next N lines represent the N query vertices, sorted by query vertex id. Each line should be v <id> <label> <dvid>, where <id> is the query vertex id (starting from 0), <label> is the query vertex label, and <dvid> is the id of the data vertex bound to this query vertex.
The default value of <label> and <dvid> is -1.
Next M lines represent the M query edges. Each line should be e <id1> <id2> <label>, representing a directed edge with label <label> from the query vertex with id <id1> to the query vertex with id <id2>.
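A minimal example file in this format, written via a heredoc: a three-vertex path query with hypothetical labels (10, 20, 30), no vertex bound to a data vertex (<dvid> = -1), and two edges with label 5.

```shell
# Write an example query graph (labels and edges are made up for illustration).
cat > /tmp/example_query.graph <<'EOF'
t # s 0
v 0 10 -1
v 1 20 -1
v 2 30 -1
e 0 1 5
e 1 2 5
EOF
```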
The query graph files used in the paper's experiments are in querysets directory.
Files in querysets/lsbench/x1/<type>_<size> directory: Figures 6-12
Files in querysets/lsbench/x10 or x100/T_6 directory: Figure 8
Files in querysets/netflow/SIGMOD18/<type>_<size> directory: Figures 13 and 14
Files in querysets/netflow/EDBT15/<type>_<size> directory: Figures 15 and 16
Initial data graph
The first line is the header and should be t # <id>, where <id> is an arbitrary non-negative integer.
Next N lines represent N data vertices in increasing order of data vertex id. Each line should be v <id> <label1> <label2> … where <id> is the data vertex id and <label1>, <label2>, … are data vertex labels (i.e., a data vertex can have multiple labels). The default value of this set of labels is 0.
Next M lines represent M data edges and have the same format as the query edges.
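A minimal example initial data graph in this format, with made-up labels: four data vertices, of which vertex 3 carries two labels, followed by three edges in the same e <id1> <id2> <label> format as query edges.

```shell
# Write an example initial data graph (contents are illustrative only).
cat > /tmp/example_data.initial <<'EOF'
t # 0
v 0 10
v 1 20
v 2 30
v 3 10 20
e 0 1 5
e 1 2 5
e 3 1 5
EOF
```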
The initial data graph files used in the paper's experiments are in datasets directory.
datasets/lsbench/x1/lsbench.initial : Figures 6-12
datasets/lsbench/x10 or x100/lsbench.initial : Figure 8
datasets/netflow/netflow.initial : Figures 13 and 14
This file contains only 2,686,910 triples because we removed duplicates from the original dataset, which contains 16,668,683 triples.
The file before deduplication is attached as datasets/netflow_original/netflow.initial. We used the deduplicated file for convenience.
datasets/netflow_EDBT15/netflow.initial : Figures 15 and 16
In EDBT15 [1], each distinct IP address is mapped to a data vertex, unlike our setting, where each distinct (IP address, port) pair is mapped to a data vertex.
This file contains only 2,031,490 triples because we removed duplicates from the original dataset, which contains 16,668,683 triples.
The file before deduplication is attached as datasets/netflow_EDBT15_original/netflow.initial. We used the deduplicated file for convenience.
Graph update stream
Each line represents an ordered update operation to the data graph.
If a line has format v <id> <label>, it represents a vertex label insertion (inserting <label> to data vertex <id>).
If a line has format e <id1> <id2> <label>, it represents an edge insertion (an edge from <id1> to <id2> whose label is <label>).
If there is a negative sign in front of <id> (or <id1> and <id2>), the line represents a deletion. Be careful: -<id> represents a deletion for vertex <id>-1, not <id> itself (the offset is needed because -0 cannot be distinguished from 0). For example, v -3 4 means to delete vertex label 4 from the vertex with id 2, and e -4 -2 5 means to delete the edge from vertex 3 to vertex 1 whose label is 5.
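A small example stream in this format (the operations are hypothetical and applied in order), together with the off-by-one decoding of deletion ids: negate the encoded id, then subtract 1.

```shell
# Write an example update stream: an edge insertion, a vertex label
# insertion, a vertex label deletion, and an edge deletion.
cat > /tmp/example.stream <<'EOF'
e 3 1 5
v 4 20
v -3 4
e -4 -2 5
EOF
# Decoding a deletion id: -<id> refers to vertex <id>-1.
# For the line "v -3 4" the target vertex is -(-3) - 1 = 2.
enc=-3
target=$(( -enc - 1 ))
echo "v -3 4 deletes label 4 from vertex ${target}"
```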
The graph update stream files used in the paper's experiments are in datasets directory.
datasets/lsbench/x1/lsbench.stream.insertion : Figures 6-10 and 12
datasets/lsbench/x1/lsbench.stream.deletion.[2|4|6|8|10] : Figure 11
datasets/lsbench/x10 or x100/lsbench.stream.insertion : Figure 8
datasets/netflow/netflow.stream.insertion : Figures 13 and 14
datasets/netflow_EDBT15/netflow.stream.insertion : Figures 15 and 16
Output format
The following shows the output format.
(with -u option)
finishing loading the initial data graph ($g_0$) and reading graph update stream ($\Delta g$)
<start>
processing the graph update stream
<end>
time for processing the graph update stream (ms) : <time>
(without -u option)
finishing loading the initial data graph ($g_0$) and reading graph update stream ($\Delta g$)
time for building the initial DCG ($g_{DCG}$) (ms) : <time>
<start>
processing the graph update stream
<end>
# positive negative matches : <num>
time for processing the graph update stream (ms) : <time>
time for reorganizing the data graph by lazy reorganization (ms) : <time> (this must be excluded from pure incremental processing cost)
size of the final DCG ($g_{DCG}$) (bytes, calculated by traversing the DCG) : <size>
<time> in time for building the initial DCG ($g_{DCG}$) (ms) : <time> shows the elapsed time for building the DCG for the initial data graph, in milliseconds.
<num> in # positive/negative matches : <num> shows the total number of positive and negative matches.
<time> in time for processing the graph update stream (ms) : <time> shows the elapsed time for processing the graph update stream, in milliseconds.
<time> in time for reorganizing the data graph by lazy reorganization (ms) : <time> shows the elapsed time for reorganizing the data graph, in milliseconds.
To reduce the data graph update cost, we lazily reorganize only the accessed parts of the data graph at query time.
Therefore, this time is excluded from the reported time for processing the graph update stream.
<size> in size of the final DCG ($g_{DCG}$) (bytes, calculated by traversing the DCG) : <size> shows the size of the DCG in bytes after applying all updates in the graph update stream.
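When post-processing logs, the numeric value can be pulled out of such a line with shell parameter expansion. The sample line below is constructed to match the format described above; it is not captured TurboFlux output.

```shell
# Extract the number after " : " from an output line of the format above.
line="time for processing the graph update stream (ms) : 1234"
elapsed="${line##*: }"   # strip everything up to and including the last ": "
echo "elapsed=${elapsed}"
```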
If you have any inquiries or bug reports, please email me at kmkim@dblab.postech.ac.kr.
[1] Sutanay Choudhury, Lawrence B. Holder, George Chin Jr., Khushbu Agarwal, John Feo: A Selectivity based approach to Continuous Pattern Detection in Streaming Graphs. EDBT 2015: 157-168