Outlier mining in large data sets

* Block Nested Loop algorithm and Local Outlier Factor algorithm, flowchart and implementation in C#

// in progress


Outlier detection has become an important problem in many industrial and financial applications. Data objects that differ significantly from the remaining data objects are referred to as outliers, and outlier detection is concerned with discovering such exceptional behaviour. Outliers arise from mechanical faults, changes in system behaviour, fraudulent behaviour, instrument error, or simply human error. Detecting them can reveal system faults and fraud, and it allows errors to be identified and removed so that they no longer contaminate the data set, which leaves cleaner data for processing.

My program implements two algorithms. The first finds distance-based outliers using nested loops: each point is ranked by the distance to its nearest neighbor, and the top points in this ranking are declared outliers. The second detects density-based local outliers with the Local Outlier Factor (LOF) algorithm. It is local in that the degree of outlierness assigned to an object depends on how isolated the object is with respect to its surrounding neighborhood.

"System Outliers Mining" contains both algorithms. Testing demonstrates that LOF can find outliers which appear to be meaningful but cannot otherwise be identified with existing approaches. The density-based algorithm is more powerful than the distance-based scheme when a data set contains diverse characteristics.
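The two ideas described above can be sketched in a few lines. This is an illustrative Python sketch, not the C# code of the program. The function names (`knn_distances`, `top_distance_outliers`, `lof_scores`) and the parameters `k` and `n` are hypothetical. It uses a plain nested loop without the blocking and pruning of a real Block Nested Loop implementation, takes exactly `k` neighbors per point (ties at the k-th distance are ignored), and assumes there are no more than `k` duplicate points.

```python
import math

def knn_distances(points, k):
    """Brute-force nested loop: for every point, the k nearest
    (distance, index) pairs, sorted by increasing distance."""
    result = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j)
                       for j, q in enumerate(points) if j != i)
        result.append(dists[:k])
    return result

def top_distance_outliers(points, k, n):
    """Distance-based scheme: rank points by the distance to their
    k-th nearest neighbor and return the indices of the top n."""
    neigh = knn_distances(points, k)
    ranked = sorted(range(len(points)),
                    key=lambda i: neigh[i][-1][0], reverse=True)
    return ranked[:n]

def lof_scores(points, k):
    """Density-based scheme: simplified Local Outlier Factor.
    Scores near 1 mean 'as dense as the neighbors'; larger scores
    mean the point is more isolated than its neighborhood."""
    neigh = knn_distances(points, k)
    k_dist = [nb[-1][0] for nb in neigh]  # distance to k-th neighbor
    lrd = []  # local reachability density of each point
    for i in range(len(points)):
        # reachability distance of i from neighbor j: max(k_dist(j), d(i, j))
        reach = [max(k_dist[j], d) for d, j in neigh[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF = average ratio of the neighbors' density to the point's own density
    return [sum(lrd[j] for _, j in neigh[i]) / (len(neigh[i]) * lrd[i])
            for i in range(len(points))]

# Four points in a tight square plus one far-away point (index 4).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
outliers = top_distance_outliers(points, k=2, n=1)  # [4]
scores = lof_scores(points, k=2)  # scores[4] is much larger than 1
```

In this toy data set both schemes flag the same point. The difference only shows up when clusters of different density are present: a single global distance ranking favours points in the sparse cluster, while LOF compares each point with the density of its own neighborhood.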

download program:


download article:

LOF Identifying Density-Based Local Outliers.pdf

Parallel Alg For Distance And Density-based Outliers.pdf

LSC(and LOF)-Mine: Algorithm for Mining Local Outliers.pdf

Mining DistanceBased Outliers in Near Linear Time with Randomization and a Simple Pruning Rule.pdf

All-Nearest-Neighbors Queries in Spatial Databases.pdf

Novelty Detection for Robot Neotaxis.pdf

A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.pdf