Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) is a clustering algorithm that groups data points based on density, identifying clusters of varying shapes and sizes while effectively handling noise and outliers.
Hyperparameters tuned:
min_cluster_size: The minimum number of points needed to form a cluster. Affects the granularity of clustering.
min_samples: Determines how conservative the clustering is by influencing the definition of core points. Higher values restrict clusters to denser regions and cause more points to be labeled as noise.
cluster_selection_epsilon: A distance threshold below which clusters are not split further. Clusters separated by less than this distance are merged, which avoids fragmenting dense regions into many micro-clusters.
HDBSCAN Steps Completed:
Data was standardized using StandardScaler
UMAP applied to reduce dimensionality from 768 to 50
HDBSCAN Clustering completed on 2,293,146 embeddings:
Execution time: 3,769 seconds (a little over an hour on 2 L40 GPUs, 8 CPUs, 128 GB)
101 clusters were identified
Noise Points: 218,660 (9.54%)
UMAP applied again to reduce from 50 to 3 dimensions for visualization in Plotly
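The cluster count and noise share reported above can be derived from the label array alone. The sketch below assumes the common convention that noise points carry the label -1; the function name `summarize_labels` is hypothetical, and the toy labels are not the real output.

```python
from collections import Counter

def summarize_labels(labels, noise_label=-1):
    """Return (number of clusters, number of noise points, noise percentage)."""
    counts = Counter(labels)
    n_noise = counts.get(noise_label, 0)
    # Every distinct label except the noise label counts as one cluster.
    n_clusters = len(counts) - (1 if noise_label in counts else 0)
    total = len(labels)
    noise_pct = 100.0 * n_noise / total if total else 0.0
    return n_clusters, n_noise, noise_pct

# Toy labels for illustration only.
toy_summary = summarize_labels([0, 0, 1, -1, 1, -1, 2, 2])

# Arithmetic check of the reported noise share: 218,660 of 2,293,146 points.
reported_pct = round(100 * 218_660 / 2_293_146, 2)
print(toy_summary, reported_pct)
```

The last lines recompute the noise percentage from the two counts given in the text.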
We clustered the unlabeled audio embeddings into 20 clusters using k-means and visualized the result after a UMAP projection to 2 dimensions.
Initialization: We initialize the k centroids with k-means++ initialization.
Iterative update: We first assign each embedding to its nearest centroid, then move each centroid to the mean of the embeddings assigned to it. The two steps are repeated until convergence.
Hyperparameters: initialization method, k (number of clusters)
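The initialization and update steps described above can be sketched in a few lines of NumPy. This is an illustrative toy implementation on synthetic 2D points, not the code used for the results; the function names, tolerance, and iteration cap are assumptions.

```python
import numpy as np

def nearest_centroid(X, centroids):
    """Index of the closest centroid for every row of X."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans_pp_init(X, k, rng):
    """k-means++ seeding: sample new centroids proportionally to squared distance."""
    n = X.shape[0]
    centroids = [X[rng.integers(n)]]
    for _ in range(1, k):
        chosen = np.array(centroids)
        d2 = ((X[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        centroids.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centroids)

def kmeans(X, k, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    centroids = kmeans_pp_init(X, k, rng)
    for _ in range(n_iter):
        labels = nearest_centroid(X, centroids)          # assignment step
        new_centroids = np.array([
            # Update step; an empty cluster keeps its previous centroid.
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        converged = np.allclose(new_centroids, centroids, atol=tol)
        centroids = new_centroids
        if converged:
            break
    return centroids, nearest_centroid(X, centroids)

# Toy data: three well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
true_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([rng.normal(loc=c, scale=0.2, size=(50, 2)) for c in true_centers])
centroids, labels = kmeans(X, k=3)
print(centroids.round(2))
```

On data this cleanly separated the loop stops after a handful of iterations; on real embeddings the iteration cap and tolerance matter more.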
K-means Steps Completed:
Hyperparameters: We used the elbow method to choose k. For initialization, we tried random initialization and k-means++ initialization and observed that the latter gave better results.
We tried both the standard k-means algorithm and the mini-batch version. The mini-batch implementation significantly reduced training time.
The data was prepared for clustering by reducing the embedding dimension from 768 to 50 using UMAP.
Clustering in the original dimension led to imbalanced clusters, a few of which had fewer than 5 points.
The final visualization of the clusters was done by reducing to 2D using UMAP.
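As an illustration of the elbow method combined with the mini-batch variant, the sketch below sweeps k on synthetic data and records the inertia (within-cluster sum of squares) for each value. It assumes scikit-learn's `MiniBatchKMeans`; the data, the range of k, and the batch size are placeholders, not the settings used here.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)

# Toy data with three true groups (stand-in for the reduced embeddings).
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(300, 2)) for c in centers])

inertias = {}
for k in range(1, 7):
    model = MiniBatchKMeans(
        n_clusters=k,
        init="k-means++",   # the initialization that worked better in the text
        n_init=3,
        batch_size=256,     # each update uses a random subset of the data
        random_state=0,
    )
    model.fit(X)
    inertias[k] = float(model.inertia_)

# The "elbow" is the k after which inertia stops dropping sharply.
for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.1f}")
```

Plotting inertia against k makes the bend easier to spot than reading the numbers, and the choice remains a judgment call when the curve is smooth.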
We selected three random embeddings from cluster 3 and plotted their spectrograms.
The spectrograms show similar patterns even though they come from different files.
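A rough sketch of this inspection step follows. It picks random members of one cluster and computes a magnitude spectrogram with a plain short-time Fourier transform in NumPy. The helper names, sample rate, and window settings are assumptions, and a synthetic tone stands in for the audio clip behind each embedding, since the mapping from embedding to audio is not described here.

```python
import numpy as np

def pick_cluster_members(labels, cluster_id, n=3, seed=0):
    """Return n distinct random indices of points assigned to cluster_id."""
    members = np.flatnonzero(np.asarray(labels) == cluster_id)
    rng = np.random.default_rng(seed)
    return rng.choice(members, size=n, replace=False)

def magnitude_spectrogram(wave, n_fft=512, hop=128):
    """Short-time Fourier magnitudes, shape (n_fft // 2 + 1, n_frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Toy labels: choose three random members of cluster 3.
labels = np.array([3, 1, 3, 3, 0, 3, 2, 3])
chosen = pick_cluster_members(labels, cluster_id=3)

# Synthetic one-second 1 kHz tone at an assumed 16 kHz sample rate.
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
wave = np.sin(2 * np.pi * 1_000 * t)
spec = magnitude_spectrogram(wave)
peak_bin = int(spec.mean(axis=1).argmax())
print(chosen, spec.shape, peak_bin * sample_rate / 512)
```

In practice each chosen index would be mapped back to its source audio segment before computing and plotting the spectrogram.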
To perform the analysis, timestamps were converted to local Peru time (UTC-05:00).
Time of day breakdown was based on the following assumptions regarding Peru in mid-June:
Sunrise: 06:15 - 06:30
Sunset: 17:45 - 18:00
5-6 Dawn
6-10 Morning
10-14 Midday
14-17 Afternoon
17-18 Dusk
18-4 Night
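The conversion and the time-of-day breakdown above can be expressed as a small lookup. This sketch assumes UTC input timestamps, a fixed UTC-05:00 offset, and half-open hour ranges (for example, 06:00 up to but not including 10:00). The listed bins do not cover 04:00 to 05:00, so the function returns `None` for that hour rather than guessing; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

PERU_TZ = timezone(timedelta(hours=-5))  # fixed offset, as stated in the text

# (start hour, end hour, label), read as start <= hour < end (assumption).
DAY_BINS = [
    (5, 6, "Dawn"),
    (6, 10, "Morning"),
    (10, 14, "Midday"),
    (14, 17, "Afternoon"),
    (17, 18, "Dusk"),
]

def time_of_day(ts_utc):
    """Map a timezone-aware UTC timestamp to a time-of-day label in Peru time."""
    hour = ts_utc.astimezone(PERU_TZ).hour
    for start, end, label in DAY_BINS:
        if start <= hour < end:
            return label
    if hour >= 18 or hour < 4:
        return "Night"       # the 18-4 bin wraps past midnight
    return None              # 04:00-05:00 is not covered by the listed bins

example = datetime(2023, 6, 15, 11, 20, tzinfo=timezone.utc)  # 06:20 local
print(time_of_day(example))
```

The example date is arbitrary; any timezone-aware timestamp can be passed in.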