Clustering is one of the most common exploratory data analysis techniques used to get an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar while data points in different clusters are very different. In other words, we try to find homogeneous subgroups within the data such that data points in each cluster are as similar as possible according to a similarity measure such as euclidean-based distance or correlation-based distance. The decision of which similarity measure to use is application-specific.
Clustering analysis can be done on the basis of features where we try to find subgroups of samples based on features or on the basis of samples where we try to find subgroups of features based on samples. We’ll cover here clustering based on features. Clustering is used in market segmentation; where we try to find customers that are similar to each other whether in terms of behaviors or attributes, image segmentation/compression; where we try to group similar regions together, document clustering based on topics, etc.
Unlike supervised learning, clustering is considered an unsupervised learning method since we don’t have the ground truth to compare the output of the clustering algorithm to the true labels to evaluate its performance. We only want to try to investigate the structure of the data by grouping the data points into distinct subgroups.
In this post, we will cover only Kmeans which is considered as one of the most used clustering algorithms due to its simplicity.
What is K-means Clustering?
The term “k-means” was first used by james macqueen in 1967 as part of his paper on “some methods for classification and analysis of multivariate observations”. the standard algorithm was also used in bell labs as part of a technique in pulse code modulation in 1957. it was also published by in 1965 by e. w. forgy and typically is also known as the lloyd-forgy method.
K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K
The results of the K-means clustering algorithm are:
The centroids of the K clusters, which can be used to label new data
Labels for the training data (each data point is assigned to a single cluster)
Where k-means clustering algorithm is used?
k-mean clustering algorithm is used in Machine Learning models where we have to do unsupervised learning with improper historical data , so for that case we use k-means clustering algorithm.
What are the basic steps for K-means clustering?
- Step 1: Choose the number of clusters k.
- Step 2: Select k random points from the data as centroids.
- Step 3: Assign all the points to the closest cluster centroid.
- Step 4: Re-compute the centroids of newly formed clusters.
- Step 5: Repeat steps 3 and 4.
How does k-means clustering work?
The k-means clustering algorithm attempts to split a given anonymous data set (a set containing no information as to class identity) into a fixed number (k) of clusters.
Initially k number of so called centroids are chosen. A centroid is a data point (imaginary or real) at the center of a cluster. In Praat each centroid is an existing data point in the given input data set, picked at random, such that all centroids are unique (that is, for all centroids ci and cj, ci ≠ cj). These centroids are used to train a KNN Classifier. The resulting classifier is used to classify (using k = 1) the data and thereby produce an initial randomized set of clusters. Each centroid is thereafter set to the arithmetic mean of the cluster it defines. The process of classification and centroid adjustment is repeated until the values of the centroids stabilize. The final centroids will be used to produce the final classification/clustering of the input data, effectively turning the set of initially anonymous data points into a set of data points, each with a class identity.
Κ-means clustering algorithm inputs are the number of clusters Κ and the data set. The algorithm starts with initial estimates for the Κ centroids, which can either be randomly generated or randomly selected from the data set. The algorithm then iterates between two steps:
1. Data assignment step:
Each centroid defines one of the clusters. In this step, each data point based on the squared Euclidean distance is assigned to its nearest centroid. If 𝑐𝑖ci is the collection of centroids in set C, then each data point x is assigned to a cluster based on
where dist( · ) is the standard (L2) Euclidean distance.
2. Centroid update step:
Centroids are recomputed by taking the mean of all data points assigned to that centroid’s cluster.
The algorithm iterates between step one and two until a stopping criteria is met (no data points change clusters, the sum of the distances is minimized, or some maximum number of iterations is reached).
This algorithm may converge on a local optimum. Assessing more than one run of the algorithm with randomized starting centroids may give a better outcome.
If the true label is not known in advance, then K-Means clustering can be evaluated using Elbow Criterion , Silhouette Coefficient , cross-validation, information criteria, the information theoretic jump method, and the G-means algorithm. .
Elbow Criterion Method:
The idea behind elbow method is to run k-means clustering on a given dataset for a range of values of k (e.g k=1 to 10), for each value of k, calculate sum of squared errors (SSE).
Calculate the mean distance between data points and their cluster centroid. Increasing the number of clusters(K) will always reduce the distance to data points, thus decrease this metric, to the extreme of reaching zero when K is as same as the number of data points. So the goal is to choose a small value of k that still has a low SSE.
We run the algorithm for different values of K(say K = 10 to 1) and plot the K values against SSE(Sum of Squared Errors). And select the value of K for the elbow point.
Silhouette Coefficient Method:
A higher Silhouette Coefficient score relates to a model with better-defined clusters. The Silhouette Coefficient is defined for each sample and is composed of two scores:
- The mean distance between a sample and all other points in the same class.
- The mean distance between a sample and all other points in the next nearest cluster.
The Silhouette Coefficient is for a single sample is then given as:
To find the optimal value of k for KMeans, loop through 1..n for n_clusters in KMeans and calculate Silhouette Coefficient for each sample.
A higher Silhouette Coefficient indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.
K-Means algorithm uses Eucledean Distance, other popular distance metrics in Machine Learning are:
- Cosine distance : It determines the cosine of the angle between the point vectors of the two points in the n dimensional space. Closer the point vectors are by angle, the higher is the Cosine Similarity
- Manhattan distance : is the total sum of the difference between the x-coordinates and y-coordinates.
Both the RMSE and the MAE are ways to measure the distance between two vectors: the vector of predictions and the vector of target values. Various distance measures, or norms, are possible:
- Computing the root of a sum of squares (RMSE) corresponds to the Euclidian norm: it is the notion of distance you are familiar with. It is also called the ℓ2 norm(…)
- Computing the sum of absolutes (MAE) corresponds to the ℓ1 norm,(…). It is sometimes called the Manhattan norm because it measures the distance between two points in a city if you can only travel along orthogonal city blocks.
1. document classification
cluster documents in multiple categories based on tags, topics, and the content of the document. this is a very standard classification problem and k-means is a highly suitable algorithm for this purpose. the initial processing of the documents is needed to represent each document as a vector and uses term frequency to identify commonly used terms that help classify the document. the document vectors are then clustered to help identify similarity in document groups.
2. delivery store optimization
optimize the process of good delivery using truck drones by using a combination of k-means to find the optimal number of launch locations and a genetic algorithm to solve the truck route as a traveling salesman problem.
3. identifying crime localities
with data related to crimes available in specific localities in a city, the category of crime, the area of the crime, and the association between the two can give quality insight into crime-prone areas within a city or a locality.
4. fantasy league stat analysis
analyzing player stats has always been a critical element of the sporting world, and with increasing competition, machine learning has a critical role to play here. as an interesting exercise, if you would like to create a fantasy draft team and like to identify similar players based on player stats, k-means can be a useful option.
5. insurance fraud detection
machine learning has a critical role to play in fraud detection and has numerous applications in automobile, healthcare, and insurance fraud detection. utilizing past historical data on fraudulent claims, it is possible to isolate new claims based on its proximity to clusters that indicate fraudulent patterns. since insurance fraud can potentially have a multi-million dollar impact on a company, the ability to detect frauds is crucial.
6. customer segmentation
clustering helps marketers improve their customer base, work on target areas, and segment customers based on purchase history, interests, or activity monitoring. here is a white paper on how telecom providers can cluster pre-paid customers to identify patterns in terms of money spent in recharging, sending sms, and browsing the internet. the classification would help the company target specific clusters of customers for specific campaigns.
7. rideshare data analysis
the publicly available uber ride information dataset provides a large amount of valuable data around traffic, transit time, peak pickup localities, and more. analyzing this data is useful not just in the context of uber but also in providing insight into urban traffic patterns and helping us plan for the cities of the future.
8. cyber-profiling criminals
cyber-profiling is the process of collecting data from individuals and groups to identify significant co-relations. the idea of cyber profiling is derived from criminal profiles, which provide information on the investigation division to classify the types of criminals who were at the crime scene.
9. automatic clustering of it alerts
large enterprise it infrastructure technology components such as network, storage, or database generate large volumes of alert messages. because alert messages potentially point to operational issues, they must be manually screened for prioritization for downstream processes. clustering of data can provide insight into categories of alerts and mean time to repair, and help in failure predictions.
9. call record detail analysis
a call detail record (cdr) is the information captured by telecom companies during the call, sms, and internet activity of a customer. this information provides greater insights about the customer’s needs when used with customer demographics. in this article , you will understand how you can cluster customer activities for 24 hours by using the unsupervised k-means clustering algorithm. it is used to understand segments of customers with respect to their usage by hours.
1. Spam filter
Do you know the junk folder in your email inbox? It is the place where emails have been identified as spam by the algorithm.
Many machine learning courses, such as Andrew Ng’s famed Coursera course, use the spam filter as an example of unsupervised learning and clustering.
What the problem is: Spam emails are at best an annoying part of modern-day marketing techniques, and at worst, an example of people phishing for your personal data. To avoid getting these emails in your main inbox, email companies use algorithms. The purpose of these algorithms is to flag an email as spam correctly or not.
How clustering works: K-Means clustering techniques have proven to be an effective way of identifying spam. The way that it works is by looking at the different sections of the email (header, sender, and content). The data is then grouped.
These groups can then be classified to identify which are spam. Including clustering in the classification process improves the accuracy of the filter to 97%. This is excellent news for people who want to be sure they’re not missing out on your favorite newsletters and offers.
2. Classifying network traffic
Imagine you want to understand the different types of traffic coming to your website. You are particularly interested in understanding which traffic is spam or coming from bots.
What the problem is: As more and more services begin to use APIs on your application, or as your website grows, it is important you know where the traffic is coming from. For example, you want to be able to block harmful traffic and double down on areas driving growth. However, it is hard to know which is which when it comes to classifying the traffic.
How clustering works: K-means clustering is used to group together characteristics of the traffic sources. When the clusters are created, you can then classify the traffic types. The process is faster and more accurate than the previous Autoclass method. By having precise information on traffic sources, you are able to grow your site and plan capacity effectively.
3. Fantasy Football and Sports
Ok so up until this point we have looked into different business problems and how clustering algorithms have been applied to solve them. But now for the critical issues — fantasy football!
What is the problem: Whom should you have in your team? Which players are going to perform best for your team and allow you to beat the competition? The challenge at the start of the season is that there is very little if any data available to help you identify the winning players.
How clustering works: When there is little performance data available to train your model on, you have an advantage for unsupervised learning. In this type of machine learning problem, you can find similar players using some of their characteristics. This has been done using K-Means clustering. Ultimately this means you can get a better team more quickly at the start of the year, giving you an advantage.
Kmeans algorithm is good in capturing structure of the data if clusters have a spherical-like shape. It always try to construct a nice spherical shape around the centroid. That means, the minute the clusters have a complicated geometric shapes, kmeans does a poor job in clustering the data. We’ll illustrate three cases where kmeans will not perform well.
First, kmeans algorithm doesn’t let data points that are far-away from each other share the same cluster even though they obviously belong to the same cluster. Below is an example of data points on two different horizontal lines that illustrates how kmeans tries to group half of the data points of each horizontal lines together.
Kmeans clustering is one of the most popular clustering algorithms and usually the first thing practitioners apply when solving clustering tasks to get an idea of the structure of the dataset. The goal of k means is to group data points into distinct non-overlapping subgroups. It does a very good job when the clusters have a kind of spherical shape. However, it suffers as the geometric shapes of clusters deviate from spherical shapes. Moreover, it also doesn’t learn the number of clusters from the data and requires it to be pre-defined. To be a good practitioner, it’s good to know the assumptions behind algorithms/methods so that you would have a pretty good idea about the strength and weaknesses of each method. This will help you decide when to use each method and under what circumstances. This post, covered both strengths, weaknesses, and some evaluation methods related to kmeans.
Below are the main takeaways:
- Scale/standardize the data when applying k means algorithm.
- Elbow method in selecting number of clusters doesn’t usually work because the error function is monotonically decreasing for all ks.
- K means gives more weight to the bigger clusters.
- Kmeans assumes spherical shapes of clusters (with radius equal to the distance between the centroid and the furthest data point) and doesn’t work well when clusters are in different shapes such as elliptical clusters.
- If there is overlapping between clusters, k means doesn’t have an intrinsic measure for uncertainty for the examples that belong to the overlapping region in order to determine for which cluster to assign each data point.
- Kmeans may still cluster the data even if it can’t be clustered such as data that comes from uniform distributions.
Thank you! keep learning! keep growing! keep sharing!