Data Prep

Data Format

Clustering is an unsupervised machine learning technique that groups similar data points according to a measure of similarity or distance. Because that measure is computed with mathematical operations, the data must be represented in a numerical format. Clustering algorithms then identify patterns in the data and group together points whose features or attributes are similar.

Clustering algorithms typically compare data points using measures such as Euclidean distance, Manhattan distance, or cosine similarity. These measures are defined on numeric vectors, so categorical or text data cannot be used directly and must first be converted to numbers. Clustering therefore takes unlabeled numeric data as its input.
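As an illustration of why numeric input is needed, the three measures named above can be sketched in plain Python. The function names and the two sample points `p` and `q` are made up for this example and are not taken from the dataset.

```python
import math

def euclidean(a, b):
    # Straight-line distance: square root of the summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of the absolute differences along each dimension.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Cosine of the angle between the two vectors (1.0 means same direction).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Two hypothetical numeric data points.
p = [1.0, 2.0]
q = [4.0, 6.0]

print(euclidean(p, q))          # 5.0
print(manhattan(p, q))          # 7.0
print(cosine_similarity(p, q))  # about 0.992
```

Each function subtracts or multiplies feature values, which is why a value such as a borough name cannot be passed in without first being encoded as a number.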

Sample Data used for Clustering

The analysis aims to answer the following question:

Predict the most popular Pickup and Drop points and the total amount using clustering.

Only a subset of columns is used for this analysis. PU_Borough is the label; it is left out of the clustering input and used only to verify the clustering results.
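A minimal sketch of separating the label from the clustering input is given here. Only the column name PU_Borough comes from this page; the feature columns `trip_distance` and `total_amount`, the row values, and the helper `split_features_and_labels` are assumptions made for illustration.

```python
LABEL_COLUMN = "PU_Borough"  # label column named on this page

# Hypothetical rows; the feature columns and values are illustrative only.
rows = [
    {"PU_Borough": "Manhattan", "trip_distance": 1.2, "total_amount": 9.8},
    {"PU_Borough": "Queens", "trip_distance": 10.5, "total_amount": 41.3},
    {"PU_Borough": "Brooklyn", "trip_distance": 3.4, "total_amount": 16.0},
]

def split_features_and_labels(rows, label_column):
    """Return (features, labels): numeric feature rows and the held-out labels."""
    feature_names = [name for name in rows[0] if name != label_column]
    features = [[float(row[name]) for name in feature_names] for row in rows]
    labels = [row[label_column] for row in rows]
    return features, labels

features, labels = split_features_and_labels(rows, LABEL_COLUMN)
print(features)  # numeric input for the clustering algorithm
print(labels)    # kept aside to check the clusters afterwards
```

Keeping the labels in a separate list makes it possible to compare cluster assignments against the boroughs once clustering is done, without the label influencing the distances.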

The image below shows a sample of the main dataset.

The image below shows the dataset used for clustering, with the labels removed.

The sample data can be found here.

Because the full dataset is very large, a sample of it is used for clustering; that sample can be found here.
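One way to draw such a sample is uniform random sampling with a fixed seed, sketched below. The page does not say how the sample was actually taken, so the method, the helper `sample_rows`, the seed, and the sample size are all assumptions.

```python
import random

def sample_rows(rows, n, seed=42):
    """Return up to n rows chosen uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the same sample is drawn each run
    return rng.sample(rows, min(n, len(rows)))

# Stand-in for the full dataset: 1000 hypothetical row identifiers.
full_dataset = list(range(1000))

subset = sample_rows(full_dataset, 50)
print(len(subset))  # 50
```

Fixing the seed keeps the sample reproducible, so later clustering runs can be compared on the same rows.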
