A Comparative Study of Various Clustering Algorithms
This article involves the use of different cluster analysis algorithms, with the target of clustering/grouping the senators using their votes as features for 15 different bills in the 114th United States Congress. Finally, the performance of the algorithms in clustering will be compared. I used the following data and was inspired by the studies here.
Wikipedia’s definition for cluster analysis is: “Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups.”
Another (more machine learning-wise) definition is: “Clustering or cluster analysis is an unsupervised learning problem.” This technique is most often used to understand the interesting patterns in data, such as groups of people based on their behavior.
Cluster analysis can be achieved by various algorithms that differ significantly in their understanding of what constitutes a cluster and how to efficiently find them. Cluster analysis is not an automatic task, but an iterative process of optimization that involves trial and failure. It is often necessary to modify data pre-processing and model parameters until the result achieves the desired properties.
There are many different algorithms for cluster analysis — some of them going back 1930s, so it may be beneficial to try different clustering algorithms in machine learning and compare their efficiency in this specific clustering task. In this paper, I will introduce the results of the following algorithms:
· K Means Clustering,
· Hierarchical Clustering,
· Gaussian Mixture Models.
I tried to concentrate on the following issues for the selected dataset and algorithms:
- How many clusters (parties) are there?
- Was the algorithm successful in detecting the clusters?
- Comparison of the performances of these algorithms.
Visual exploratory analysis of the results using Pandas and Seaborn give a better picture about the efficiency of clustering algorithms.
The first step is to import and clean the data (if needed) using pandas before starting the cluster analysis.
According to the dataset, there were 54 Republicans, 44 Democrats and 2 Independents.
There are 100 rows (senators) and 18 columns (attributes) of data. 15 columns cover the votes of the senators for the bills and the first three columns are the name, party and state of the senator. Our target is to estimate the party, so the first three columns will be dropped.
The dataset is clean (there are no NaNs, Dtype are correct), so we will directly start by checking the Hopkins statistic in order to have an idea about the cluster tendency of the dataset. The rule of the thumb is that if the Hopkins statistic is above 0.5 than the dataset is not clusterable. If the value of H is close to zero, then we can reject null hypothesis and deduce that dataset contains meaningful clusters.
Applying the K-Means algorithm without scaling showed that the algorithm made a very precise prediction for the Republicans (and clustered all of them correctly). For the Democrats, most probably a few senators (around 3) did not vote according to the Party decisions and the algorithm assumed them as small cluster.
A crucial stage for any unsupervised algorithm will be to determine the optimal number of clusters into which the data may be clustered. Elbow Method is one of the most popular methods to determine the optimal value of k. Always considering the target, we are trying to figure out the number of clusters -number of clusters of senators who are voting as a group. According to the graph below, there are two clusters/groups, which was quite expected.
Elbow Method — Yellowbrick Visualizing
There are other codes used for estimating the number of clusters; Yellowbrick is the most popular. Yellowbrick gave the same result for the number of clusters, which is 2.
I run the Yellowbrick algorithm once again; this time using a possible cluster range between 2 and 10. Interestingly, Yellowbrick’s advice was to use 4 clusters.
The results say that (when the k-search range is between 2 and 10), there are 4 clusters; although there are 2 parties and 2 independent senators. As I mentioned before, most probably, some senators do not always like taking orders while voting (party-line voting). Increasing the k size may give us a feeling about the sub-divisions in the parties (or in just one party).
Checking for 4 clusters (parties/groups) tell us that, Republican party senators are still a tight cluster, but Democrat party senators tend to vote as one big group and two smaller clusters.
Silhouette is a method of interpretation and validation of consistency within clusters of data. Silhouette value is a measure of how similar an object is to its own cluster (intra-cluster distance) compared to other clusters (inter-cluster distance). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to the neighboring clusters.
Well, here comes the interesting part. Silhouette score for two clusters/parties is 0.73 and 0.82 for four clusters/parties. Remembering the definition of the Silhouette score again, a higher value means that four clusters may be considered as a fact; which again reminds us the possibility of the voting of the Democrats again party line.
K-Means Clustering — Scaled
I used the scaled set in order to compare the results with the unscaled data. The results were quite similar. Most often than not, the literature says that, for K-Means algorithms, scaling provides (most often than not) more precise results. In cases of combining data of similar types, scaling may not be necessary; which is most probably our case.
Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. It is an unsupervised clustering algorithm and tries to group similar objects (in our case, senators) into groups called clusters.
Once the above coding is used to run the algorithm, it is possible to get quite a visual diagram, which is called a dendrogram. Two different approaches, “Ward” and “complete”, were used in order to observe their effect on the clustering.
According to the dendrograms, 2 clusters were selected as the possible number of clusters when “complete” was used as the linkage parameter in the algorithm and 4 clusters were selected when “ward” was used (“ward” is also the default value of the linkage parameter).
Remembering that a higher Silhouette Score value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters, two clusters/groups/party groups came out as the more probable result from the Hierarchical Clustering algorithm.
This result does not match with what the K-Means Clustering algorithm produced. The interpretation of the Hierarchical Clustering results may imply (opposite of the K-Means Algorithm results) that, there are two densely packed clusters (party groups) — not four.
Gaussian Mixture Models
As the final algorithm, Gaussian Mixture Model (GMM) was used. GMMs assume that there are a certain number of Gaussian distributions, and each of these distributions represent a cluster. Hence, a Gaussian Mixture Model tends to group the data points belonging to a single distribution together. GMMs are probabilistic models and use the soft clustering approach for distributing the points in different clusters. Unlike K-Means, Gaussian mixture models account for variance and returns the probability that a data point belongs to each of the K clusters.
It may not be the subject of this article, but there is a distinction between probability and likelihood: Probability attaches to possible results; likelihood attaches to hypotheses. In this analysis, the hypothesis was, the dataset is probably clustered into 2–4 groups and log-likelihood values were used to find out the stronger possibility.
Converged log-likelihood values and number of iterations needed for the model to converge for 2 and 4 clusters are shown on the left and running the GMM algorithm for 2 and 4 possible clusters, the distribution of senators are shown on the right.
GMM results are closer to Hierarchical Clustering Analysis results, which imply that two clusters are more probable than four clusters.
In this article, I used three different Unsupervised Machine Learning (clustering) algorithms with the purpose of predicting the party membership of the senators according to their votes for 15 bills in the 114th US Congress. The analysis provides evidence that:
- Hopkins statistic showed that, the data was not uniformly distributed; on the contrary there were clusters.
- All three algorithms provided similar results showing that Republican Senators came out as a tight cluster, which means that their voting was quite similar.
- Yellowbrick results implied a possible four clusters instead of two.
- On the other hand, checking for more than 2 clusters proved that, for some bills, some Democrats had the tendency to vote with the opposing party, which most probably does not mean two more clusters. A reasonable explanation will be, not following party-line voting, but voting with the other group only for a few bills.
- Scaling did not improve the results significantly for the K-Means Clustering.
- Silhouette scores for K-Means Clustering and Hierarchical Clustering methods did not match.
You can access to this article and similar ones here.