User Manual
Date Format
The input dataset needs to be preprocessed into a text file with the following format:
2 integers at the first line: the number n of points and the dimensionality d.
d + 1 integers in every subsequent line: the point's id, and its coordinates, where the id is in the range [1, n], and each coordinate is an integer in the range of [0, 10^5].
In total, there are n + 1 lines where the numbers at each line are separated by a space.
For example, the first 6 lines of a sample dataset 2D_Visula.ds are:
1000 2
1 10133 92066
2 7133 92600
3 8666 91666
4 7333 90266
5 7933 92000
User Manual for Version 2.0
Options
-algo {integer} the algorithm you choose
-n {integer} the number of points
-d {integer} the dimensionality of points
-ds {string} the file path of the dataset
-rf {string} the folder path of the clustering result
-r {double} the radius epsilon
-k {integer} the core point threshold minPts
-rho {double} the approximation ratio for approximate DBSCAN
-wqr {0,1} the flag indicating whether to output the cluster result to a file (if so, the file is written to the folder specified by -rf). If the flag is 0 (default), then no output.
-vd {0,1} the flag is useful only when the program is used to generate synthetic data. Value 1 indicates that the clusters created should have varying densities, whereas 0 (default) means analogous densities for all clusters.
Algorithm Commands
1. Run our exact algorithm
Parameter list: -algo 1 -r -k -ds [-rf] [-wqr]
2. Run our approximate algorithm
Parameter list: -algo 2 -r -k -rho -ds [-rf] [-wqr]
3. Run the original DBSCAN algorithm with an R-tree
Parameter list: -algo 3 -r -k -ds [-rf] [-wqr]
4. Run the GriDBSCAN algorithm
Parameter list: -algo 4 -r -k -ds [-rf] [-wqr]
5. Run our 2D "wave front" algorithm
Parameter list: -algo 5 -r -k -ds [-rf] [-wqr]
6. Run our 2D "Delaunay triangulation" algorithm
Parameter list: -algo 6 -r -k -ds [-rf] [-wqr]
7. Run Gunawan's 2D algorithm
Parameter list: -algo 7 -r -k -ds [-rf] [-wqr]
8. Generate a synthetic "sead spreader" dataset.
Parameter list: -algo 0 -ds -n -d -vd
Let us take “2D_Visual.ds” as an example. Suppose that we want to do clustering with eps = 5000 and minPts = 20. First, ensure that you have the "write permission" on the folder of the program and its subfolders.
To run our exact algorithm without outputting the clustering result to a file:
./DBSCAN -algo 1 -r 5000 -k 20 -ds “./2D_Visual.ds”
To run our exact algorithm and output the clustering result to “./2D_Visual_minPts=20_eps=5000.0_OurExact”:
./DBSCAN -algo 1 -r 5000 -k 20 -ds “./2D_Visual.ds” -rf “./2D_Visual” -wqr 1
Note that the string "_minPts=20_eps=5000.0_OurExact” will be automatically appended.
To run our approximate algorithm with rho = 0.001 and output the result to “./2D_Visual_minPts=20_eps=5000.0_OurApprox”
./DBSCAN -algo 2 -r 5000 -k 20 -rho 0.001 -ds “./2D_Visual.ds” -rf “./2D_Visual” -wqr 1
Output Format
The first line has two integers: denote the first as Num and the second as clNum . If there are noise points, then Num = clNum + 1; otherwise, Num = clNum . The value clNum equals the number of clusters.
The subsequent lines are divided into Num “blocks”. Each of the first clNum blocks corresponds to a cluster, whereas the last, (clNum + 1)-st, block (if exists) contains the noise points.
In each block, the first line is the number of points in the cluster (or the number of noise points). Denote this value by clSize.
The rest of the block has clSize numbers, each of which is the ID of a point in the cluster (or a noise point).
For example, consider the output file below:
5 4
158
747 748 749 ...
106
275 276 277 ...
378
377 378 379 ...
274
1 7 8 ...
84
2 3 4 ...
This file shows that there are 4 clusters whose sizes are respectively 158, 106, 378 and 274, and 84 noise points. The first three point IDs in Cluster 1 are 747, 748 and 749.