Clustering Algorithms

HDBSCAN Clustering

Visualisation of HDBSCAN clustering

Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) is a clustering algorithm that groups data points based on density, identifying clusters of varying shapes and sizes while effectively handling noise and outliers.

Hyperparameters tuned:

min_cluster_size: The minimum number of points needed to form a cluster. Larger values yield fewer, coarser clusters.
min_samples: Determines how conservative the clustering is by influencing the definition of core points. Higher values lead to stricter cluster formation and more points labelled as noise.
cluster_selection_epsilon: A distance threshold below which clusters are not split any further, so nearby micro-clusters are merged instead of being reported separately.
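
The effect of min_samples can be illustrated with the core distance and mutual reachability distance that HDBSCAN builds its hierarchy on. The following is a minimal pure-Python sketch on made-up toy points, not the code used for the results on this page. The function names are hypothetical, and the neighbour count here excludes the point itself, a convention on which libraries differ.

```python
import math

def core_distances(points, min_samples):
    """Distance from each point to its min_samples-th nearest *other* point."""
    cores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        cores.append(dists[min_samples - 1])
    return cores

def mutual_reachability(points, min_samples):
    """Pairwise max(core(a), core(b), d(a, b)) as a nested list."""
    cores = core_distances(points, min_samples)
    n = len(points)
    return [
        [
            0.0 if i == j
            else max(cores[i], cores[j], math.dist(points[i], points[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]

# Hypothetical toy data: a tight group of four points plus one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]

cores_k1 = core_distances(points, min_samples=1)
cores_k3 = core_distances(points, min_samples=3)
mreach_k3 = mutual_reachability(points, min_samples=3)
```

Raising min_samples can only increase each core distance, so points in sparse regions are pushed farther from everything else and are more readily treated as noise.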

HDBSCAN Steps Completed:

Data was scaled using Standard Scaler
UMAP applied to reduce dimensionality from 768 to 50
HDBSCAN Clustering completed on 2,293,146 embeddings:
- Execution time: 3,769 seconds (a little over an hour on 2 GPUs (L40), 8 CPUs, 128 GB)
- 101 clusters were identified
- Noise Points: 218,660 (9.54%)
UMAP applied again to reduce from 50 to 3 dimensions for visualization in Plotly
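
The cluster count and noise share reported above can be derived from a label array. The sketch below assumes the common convention that noise points carry the label -1; the page does not say which HDBSCAN implementation was used, and the helper name and toy labels are hypothetical. The last lines simply re-check the arithmetic of the reported figures.

```python
from collections import Counter

def summarize_labels(labels):
    """Count clusters and noise points, assuming noise is labelled -1."""
    counts = Counter(labels)
    noise = counts.pop(-1, 0)
    total = len(labels)
    return {
        "n_clusters": len(counts),
        "noise_points": noise,
        "noise_pct": 100.0 * noise / total if total else 0.0,
    }

# Hypothetical toy labels, not the real output.
toy_labels = [0, 0, 1, -1, 1, 2, -1, 0]
summary = summarize_labels(toy_labels)

# Arithmetic check of the figures reported on this page.
reported_noise_pct = round(100.0 * 218_660 / 2_293_146, 2)  # 9.54
minutes, seconds = divmod(3_769, 60)                        # 62 min 49 s
```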

K-means Clustering

K-means Clustering Visualization

We clustered unlabeled audio embeddings into 20 clusters using k-means and visualized them after a UMAP projection to 2 dimensions.

How it works

Initialization: We initialize k centroids with k-means++ initialization.

Iterative Update: We first assign each embedding to its nearest centroid. Then we move each cluster's centroid to the mean of the embeddings assigned to it. The two steps are repeated until the assignments stop changing.

Hyperparameters: initialization method, k (number of clusters)
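
The initialization and update steps above can be sketched in plain Python. This is an illustrative toy implementation on made-up 2D points, not the code used for the results on this page; all function names are hypothetical, and a real run on millions of embeddings would rely on an optimized library.

```python
import math
import random

def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: favour points far from the centroids chosen so far."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        d2 = [min(math.dist(p, c) ** 2 for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]

def update(points, labels, centroids):
    """Move each centroid to the mean of its members (unchanged if empty)."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
        else:
            new_centroids.append(old)
    return new_centroids

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = kmeans_pp_init(points, k, rng)
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centroids)
        if new_labels == labels:  # converged: assignments stopped changing
            break
        labels = new_labels
        centroids = update(points, labels, centroids)
    return labels, centroids

# Hypothetical toy data: two well-separated groups.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.1)]
labels, centroids = kmeans(points, k=2)
```

On this toy input the first three points end up in one cluster and the last three in the other; which cluster receives index 0 depends on the seed.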

K-means Steps Completed:

Hyperparameters: We used the elbow method to choose k. For initialization, we tried plain random initialization and k-means++ initialization and observed that the latter gave better results.
We tried both the standard k-means algorithm and the mini-batch version. The mini-batch implementation significantly reduced training time.
Before clustering, the embeddings were reduced from 768 to 50 dimensions using UMAP.
- Clustering in the original 768 dimensions led to imbalanced clusters, a few of which had fewer than 5 points
The final visualization of the clusters was produced by reducing to 2D with UMAP.
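
The elbow method plots the within-cluster sum of squared distances (inertia) against k and looks for the point where the curve flattens. The page does not say how the elbow was read off, so the sketch below is only one common heuristic, stated as an assumption: pick the k whose point lies farthest below the straight line joining the two ends of the curve. The function names and the inertia values are made up for illustration.

```python
import math

def inertia(points, labels, centroids):
    """Within-cluster sum of squared distances for a given clustering."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

def elbow_k(ks, inertias):
    """Return the k farthest below the chord joining the curve's endpoints."""
    # Normalise both axes to [0, 1] so the result does not depend on units.
    k_span = ks[-1] - ks[0]
    i_span = inertias[0] - inertias[-1]
    xs = [(k - ks[0]) / k_span for k in ks]
    ys = [(v - inertias[-1]) / i_span for v in inertias]
    # After normalisation the chord is x + y = 1; the gap below it is 1 - x - y.
    gaps = [1.0 - x - y for x, y in zip(xs, ys)]
    return ks[gaps.index(max(gaps))]

# Hypothetical inertia curve, NOT the values from this project.
toy_ks = [1, 2, 3, 4, 5, 6, 7, 8]
toy_inertias = [100.0, 60.0, 35.0, 20.0, 17.0, 15.0, 14.0, 13.0]
best_k = elbow_k(toy_ks, toy_inertias)  # 4 for these toy values
```

In practice the curve is often inspected visually as well, since the elbow is not always sharp.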

Spectrograms

A spectrogram is like a "frequency map" of sound over time: it shows which frequencies are present and how they change, which helps in analyzing and understanding audio signals. We use it for pattern recognition by examining the spectrograms that correspond to the embeddings in a cluster.

Examples: bird calls, animal sounds, etc.

We selected three random embeddings from cluster 3 and plotted their spectrograms.

The spectrograms show similar patterns even though they come from different files.

Time of the Day Analysis

To perform the analysis, timestamps were converted to local Peru time (UTC-05:00).
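
A minimal sketch of such a conversion is shown below. It assumes the original timestamps are in UTC, which this page does not state, and uses a fixed UTC-05:00 offset. The function name and the example timestamp are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset for Peru local time as given above (UTC-05:00).
PERU = timezone(timedelta(hours=-5), name="UTC-05:00")

def to_peru_time(timestamp):
    """Convert an ISO 8601 timestamp string to Peru local time."""
    dt = datetime.fromisoformat(timestamp)
    if dt.tzinfo is None:
        # Assumption: timestamps without an offset are treated as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(PERU)

# Hypothetical recording time: 03:30 UTC falls on the previous evening locally.
local = to_peru_time("2024-06-15T03:30:00+00:00")
```

Note that a recording made shortly after midnight UTC belongs to the previous calendar day in local time, which matters when grouping by date.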

Time of day breakdown was based on the following assumptions regarding Peru in mid-June:

Sunrise: 06:15 - 06:30
Sunset: 17:45 - 18:00

Time of Day Distribution

5-6 Dawn
6-10 Morning
10-14 Midday
14-17 Afternoon
17-18 Dusk
18-4 Night
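
The breakdown above can be applied to a local hour with a small lookup. The sketch below assumes each range includes its start hour and excludes its end hour, which the list does not specify. The hour from 4 to 5 is not covered by any range, so it is returned as "Unassigned" instead of being guessed. The function name is hypothetical.

```python
def time_of_day(hour):
    """Map a local hour (0-23) to the time-of-day label used above."""
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be in 0..23, got {hour}")
    if 5 <= hour < 6:
        return "Dawn"
    if 6 <= hour < 10:
        return "Morning"
    if 10 <= hour < 14:
        return "Midday"
    if 14 <= hour < 17:
        return "Afternoon"
    if 17 <= hour < 18:
        return "Dusk"
    if hour >= 18 or hour < 4:  # the night range wraps past midnight
        return "Night"
    return "Unassigned"  # 4-5 is not covered by the list above

labels_by_hour = {h: time_of_day(h) for h in range(24)}
```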
