User Manual

Date Format

The input dataset needs to be preprocessed into a text file with the following format:

  • 2 integers at the first line: the number n of points and the dimensionality d.

  • d + 1 integers in every subsequent line: the point's id, and its coordinates, where the id is in the range [1, n], and each coordinate is an integer in the range of [0, 10^5].

In total, there are n + 1 lines where the numbers at each line are separated by a space.

For example, the first 6 lines of a sample dataset 2D_Visula.ds are:

1000 2

1 10133 92066

2 7133 92600

3 8666 91666

4 7333 90266

5 7933 92000


User Manual for Version 2.0

Options

-algo {integer} the algorithm you choose

-n {integer} the number of points

-d {integer} the dimensionality of points

-ds {string} the file path of the dataset

-rf {string} the folder path of the clustering result

-r {double} the radius epsilon

-k {integer} the core point threshold minPts

-rho {double} the approximation ratio for approximate DBSCAN

-wqr {0,1} the flag indicating whether to output the cluster result to a file (if so, the file is written to the folder specified by -rf). If the flag is 0 (default), then no output.

-vd {0,1} the flag is useful only when the program is used to generate synthetic data. Value 1 indicates that the clusters created should have varying densities, whereas 0 (default) means analogous densities for all clusters.

Algorithm Commands

1. Run our exact algorithm

Parameter list: -algo 1 -r -k -ds [-rf] [-wqr]

2. Run our approximate algorithm

Parameter list: -algo 2 -r -k -rho -ds [-rf] [-wqr]

3. Run the original DBSCAN algorithm with an R-tree

Parameter list: -algo 3 -r -k -ds [-rf] [-wqr]

4. Run the GriDBSCAN algorithm

Parameter list: -algo 4 -r -k -ds [-rf] [-wqr]

5. Run our 2D "wave front" algorithm

Parameter list: -algo 5 -r -k -ds [-rf] [-wqr]

6. Run our 2D "Delaunay triangulation" algorithm

Parameter list: -algo 6 -r -k -ds [-rf] [-wqr]

7. Run Gunawan's 2D algorithm

Parameter list: -algo 7 -r -k -ds [-rf] [-wqr]



8. Generate a synthetic "sead spreader" dataset.

Parameter list: -algo 0 -ds -n -d -vd

Let us take “2D_Visual.ds” as an example. Suppose that we want to do clustering with eps = 5000 and minPts = 20. First, ensure that you have the "write permission" on the folder of the program and its subfolders.

To run our exact algorithm without outputting the clustering result to a file:

./DBSCAN -algo 1 -r 5000 -k 20 -ds “./2D_Visual.ds”

To run our exact algorithm and output the clustering result to “./2D_Visual_minPts=20_eps=5000.0_OurExact”:

./DBSCAN -algo 1 -r 5000 -k 20 -ds “./2D_Visual.ds” -rf “./2D_Visual” -wqr 1

Note that the string "_minPts=20_eps=5000.0_OurExact” will be automatically appended.

To run our approximate algorithm with rho = 0.001 and output the result to “./2D_Visual_minPts=20_eps=5000.0_OurApprox”

./DBSCAN -algo 2 -r 5000 -k 20 -rho 0.001 -ds “./2D_Visual.ds” -rf “./2D_Visual” -wqr 1

Output Format

  • The first line has two integers: denote the first as Num and the second as clNum . If there are noise points, then Num = clNum + 1; otherwise, Num = clNum . The value clNum equals the number of clusters.

  • The subsequent lines are divided into Num “blocks”. Each of the first clNum blocks corresponds to a cluster, whereas the last, (clNum + 1)-st, block (if exists) contains the noise points.

    • In each block, the first line is the number of points in the cluster (or the number of noise points). Denote this value by clSize.

    • The rest of the block has clSize numbers, each of which is the ID of a point in the cluster (or a noise point).

For example, consider the output file below:

5 4

158

747 748 749 ...

106

275 276 277 ...

378

377 378 379 ...

274

1 7 8 ...

84

2 3 4 ...

This file shows that there are 4 clusters whose sizes are respectively 158, 106, 378 and 274, and 84 noise points. The first three point IDs in Cluster 1 are 747, 748 and 749.