K-Means is an interesting, simple, and fairly intuitive algorithm. It turns out it can do more than just clustering.

This is a pretty short post, but an interesting one.

K-Means is an elegant algorithm. It's easy to understand (place a few random centroids, then repeatedly assign points to the nearest centroid and move each centroid to the mean of its points) and it works well in practice. When I first learned about it, I remember being fascinated. It was stylish. But over time my interest faded as I noticed numerous limitations: its preference for spherical clusters, its linearity, and, particularly annoying in EDA scenarios, the fact that it doesn't find the optimal number of clusters itself, so you have to tune that parameter too. Then, a couple of years ago, I learned a few cool tricks for using K-Means. So here they are.
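For readers who want the mechanics spelled out, here is a bare-bones sketch of the assign-and-update loop described above, written from scratch in NumPy. The helper name `kmeans` and the two-blob toy data are my own illustration, not scikit-learn's implementation:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=17):
    """Bare-bones K-Means: pick random points as initial centroids,
    then alternate between assigning points and moving centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged
        centroids = new_centroids
    return centroids, labels

# two well-separated toy blobs
X = np.vstack([np.random.default_rng(0).normal(0, 1, (50, 2)),
               np.random.default_rng(1).normal(10, 1, (50, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids.shape)  # (2, 2)
```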

First, we need a baseline. I'll use the breast cancer dataset, but you can play with any other dataset.

from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import numpy as np
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=17)
svm = LinearSVC(random_state=17)
svm.fit(X_train, y_train)
svm.score(X_test, y_test) # should be ~0.93

So what is this cool trick that piqued my interest in K-Means?

K-Means can be used as a source of new features.

How, you ask? Well, K-Means is a clustering algorithm, right? You can add the predicted cluster as a new categorical feature.

Let’s try now.

# imports from the example above
svm = LinearSVC(random_state=17)
kmeans = KMeans(n_clusters=3, random_state=17)
X_clusters = kmeans.fit_predict(X_train).reshape(-1, 1)
svm.fit(np.hstack([X_train, X_clusters]), y_train)
svm.score(np.hstack([X_test, kmeans.predict(X_test).reshape(-1, 1)]), y_test)
# should be ~0.937

These features are categorical. Instead, we can ask the model to produce the distances to all cluster centers, which should (hopefully) give more informative features.

# imports from the example above
svm = LinearSVC(random_state=17)
kmeans = KMeans(n_clusters=3, random_state=17)
X_clusters = kmeans.fit_transform(X_train)
# ^^^^^^^^^
# Notice the `transform` instead of `predict`
# Scikit-learn has supported this method since at least version 0.15
svm.fit(np.hstack([X_train, X_clusters]), y_train)
svm.score(np.hstack([X_test, kmeans.transform(X_test)]), y_test)
# should be ~0.727

Wait, what went wrong? Could it be that the distances to the centers correlate with the existing features?

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
columns = [
    'mean radius', 'mean texture', 'mean perimeter', 'mean area',
    'mean smoothness', 'mean compactness', 'mean concavity',
    'mean concave points', 'mean symmetry', 'mean fractal dimension',
    'radius error', 'texture error', 'perimeter error', 'area error',
    'smoothness error', 'compactness error', 'concavity error',
    'concave points error', 'symmetry error', 'fractal dimension error',
    'worst radius', 'worst texture', 'worst perimeter', 'worst area',
    'worst smoothness', 'worst compactness', 'worst concavity',
    'worst concave points', 'worst symmetry', 'worst fractal dimension',
    'distance to cluster 1', 'distance to cluster 2', 'distance to cluster 3',
]
data = pd.DataFrame.from_records(np.hstack([X_train, X_clusters]), columns=columns)
sns.heatmap(data.corr())
Note the last 3 columns, especially the last one, and their color on each row. Source: author of the post; see the code snippet above to reproduce it.

You've probably heard that we want the features of a dataset to be as independent of each other as possible. The reason is that many machine learning models assume this independence to keep the algorithm simpler. More information on this topic can be found here and here. The bottom line is that redundant information, at least in linear models, makes the model unstable, which in turn is likely to hurt performance. I have noticed this problem on several occasions, sometimes even with nonlinear models, and removing correlated features from the dataset tends to improve model performance slightly.
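As a quick illustration of pruning correlated features, here is one common recipe: compute the absolute correlation matrix, keep only its upper triangle so each pair is seen once, and drop every column that correlates too strongly with an earlier one. The 0.95 threshold is an arbitrary choice of mine, not a recommendation from scikit-learn:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
df = pd.DataFrame(X)

# absolute correlations, upper triangle only (each pair counted once)
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# drop any column that correlates > 0.95 with an earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)
print(df.shape[1], '->', df_reduced.shape[1])
```

On this dataset the recipe does remove columns, since radius, perimeter, and area features are almost perfectly correlated.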

Back to our main topic. Given that the new features do correlate with some of the existing ones, what if we use only the distances to the cluster centers as features? Would that work?

# imports from the example above
svm = LinearSVC(random_state=17)
kmeans = KMeans(n_clusters=3, random_state=17)
X_clusters = kmeans.fit_transform(X_train)
svm.fit(X_clusters, y_train)
svm.score(kmeans.transform(X_test), y_test) # should be ~0.951

Much better. This example shows that we can use K-Means as a dimensionality reduction technique. Neat.
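To make the dimensionality-reduction view concrete: `fit_transform` returns one distance per centroid, so the 30 original features collapse into `n_clusters` columns. A minimal sketch:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
kmeans = KMeans(n_clusters=3, random_state=17)
X_dist = kmeans.fit_transform(X)

print(X.shape)       # (569, 30) -- original feature space
print(X_dist.shape)  # (569, 3)  -- one distance per cluster center
```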

So far so good. But the pièce de résistance is still to come.

K-Means can be used to replace the kernel trick

You heard me right. For instance, you can make K-Means fit many more centroids than the data has features. Many more.

# imports from the example above
svm = LinearSVC(random_state=17)
kmeans = KMeans(n_clusters=250, random_state=17)
X_clusters = kmeans.fit_transform(X_train)
svm.fit(X_clusters, y_train)
svm.score(kmeans.transform(X_test), y_test) # should be ~0.944

Well, not quite as good, but still pretty decent. In practice, this approach shines when you have a lot of data. Your mileage may vary in terms of predictive performance; for example, I once used this method with n_clusters=1000 and it worked better than with just a few clusters.

SVMs are known to be slow to train on large datasets. Painfully slow. Been there, done that. That is exactly why there are numerous techniques for approximating the kernel trick with far fewer computational resources.

So let's compare how this K-Means trick fares against the classic kernel SVM and some alternative kernel approximation methods.

The code below is inspired by these two scikit-learn examples.

import matplotlib.pyplot as plt
import numpy as np
from time import time
from sklearn.datasets import load_breast_cancer
from sklearn.svm import LinearSVC, SVC
from sklearn import pipeline
from sklearn.kernel_approximation import RBFSampler, Nystroem, PolynomialCountSketch
from sklearn.preprocessing import MinMaxScaler, Normalizer
from sklearn.model_selection import train_test_split
from sklearn.cluster import MiniBatchKMeans

mm = pipeline.make_pipeline(MinMaxScaler(), Normalizer())
X, y = load_breast_cancer(return_X_y=True)
X = mm.fit_transform(X)
data_train, data_test, targets_train, targets_test = train_test_split(X, y, random_state=17)

We test the 3 kernel approximation methods available in scikit-learn against the K-Means trick, with a linear SVM and a kernel-trick SVM as baselines.

# Create a classifier: a support vector classifier
kernel_svm = SVC(gamma=.2, random_state=17)
linear_svm = LinearSVC(random_state=17)
# create pipeline from kernel approximation and linear svm
feature_map_fourier = RBFSampler(gamma=.2, random_state=17)
feature_map_nystroem = Nystroem(gamma=.2, random_state=17)
feature_map_poly_cm = PolynomialCountSketch(degree=4, random_state=17)
feature_map_kmeans = MiniBatchKMeans(random_state=17)
fourier_approx_svm = pipeline.Pipeline([("feature_map", feature_map_fourier), ("svm", LinearSVC(random_state=17))])
nystroem_approx_svm = pipeline.Pipeline([("feature_map", feature_map_nystroem), ("svm", LinearSVC(random_state=17))])
poly_cm_approx_svm = pipeline.Pipeline([("feature_map", feature_map_poly_cm), ("svm", LinearSVC(random_state=17))])
kmeans_approx_svm = pipeline.Pipeline([("feature_map", feature_map_kmeans), ("svm", LinearSVC(random_state=17))])

Collect the timing and score of each of our pipelines.

# fit and predict using linear and kernel svm:
kernel_svm_time = time()
kernel_svm.fit(data_train, targets_train)
kernel_svm_score = kernel_svm.score(data_test, targets_test)
kernel_svm_time = time() - kernel_svm_time

linear_svm_time = time()
linear_svm.fit(data_train, targets_train)
linear_svm_score = linear_svm.score(data_test, targets_test)
linear_svm_time = time() - linear_svm_time

sample_sizes = 30 * np.arange(1, 10)
fourier_scores = []
nystroem_scores = []
poly_cm_scores = []
kmeans_scores = []
fourier_times = []
nystroem_times = []
poly_cm_times = []
kmeans_times = []
for D in sample_sizes:
    # make every approximation produce D output features
    nystroem_approx_svm.set_params(feature_map__n_components=D)
    fourier_approx_svm.set_params(feature_map__n_components=D)
    poly_cm_approx_svm.set_params(feature_map__n_components=D)
    kmeans_approx_svm.set_params(feature_map__n_clusters=D)

    start = time()
    nystroem_approx_svm.fit(data_train, targets_train)
    nystroem_times.append(time() - start)

    start = time()
    fourier_approx_svm.fit(data_train, targets_train)
    fourier_times.append(time() - start)

    start = time()
    poly_cm_approx_svm.fit(data_train, targets_train)
    poly_cm_times.append(time() - start)

    start = time()
    kmeans_approx_svm.fit(data_train, targets_train)
    kmeans_times.append(time() - start)

    fourier_scores.append(fourier_approx_svm.score(data_test, targets_test))
    nystroem_scores.append(nystroem_approx_svm.score(data_test, targets_test))
    poly_cm_scores.append(poly_cm_approx_svm.score(data_test, targets_test))
    kmeans_scores.append(kmeans_approx_svm.score(data_test, targets_test))

Now let's plot all the collected results.

plt.figure(figsize=(16, 4))
accuracy = plt.subplot(211)
timescale = plt.subplot(212)
accuracy.plot(sample_sizes, nystroem_scores, label="Nystroem approx. kernel")
timescale.plot(sample_sizes, nystroem_times, '--', label='Nystroem approx. kernel')
accuracy.plot(sample_sizes, fourier_scores, label="Fourier approx. kernel")
timescale.plot(sample_sizes, fourier_times, '--', label='Fourier approx. kernel')
accuracy.plot(sample_sizes, poly_cm_scores, label="Polynomial Count-Min approx. kernel")
timescale.plot(sample_sizes, poly_cm_times, '--', label='Polynomial Count-Min approx. kernel')
accuracy.plot(sample_sizes, kmeans_scores, label="K-Means approx. kernel")
timescale.plot(sample_sizes, kmeans_times, '--', label='K-Means approx. kernel')
# horizontal lines for exact rbf and linear kernels:
accuracy.plot([sample_sizes[0], sample_sizes[-1]],
[linear_svm_score, linear_svm_score], label="linear svm")
timescale.plot([sample_sizes[0], sample_sizes[-1]], [linear_svm_time, linear_svm_time], '--', label='linear svm')
accuracy.plot([sample_sizes[0], sample_sizes[-1]],
[kernel_svm_score, kernel_svm_score], label="rbf svm")
timescale.plot([sample_sizes[0], sample_sizes[-1]], [kernel_svm_time, kernel_svm_time], '--', label='rbf svm')

And some more plot adjustments to make it beautiful.

# legends and labels
accuracy.set_title("Classification accuracy")
timescale.set_title("Training times")
accuracy.set_xlim(sample_sizes[0], sample_sizes[-1])
accuracy.set_ylim(np.min(fourier_scores), 1)
timescale.set_xlabel("Sampling steps = transformed feature dimension")
accuracy.set_ylabel("Classification accuracy")
timescale.set_ylabel("Training time in seconds")
Meh. So was it all for nothing? Source: author of the post; see the code snippet above to reproduce it.

You know what? Not at all. While it's the slowest, K-Means as an approximation of the RBF kernel is still a good option. I'm serious. Even better, you can use the special K-Means flavor in scikit-learn, MiniBatchKMeans, which is one of the few algorithms that support the .partial_fit method. Combining it with a machine learning algorithm that also has .partial_fit, such as PassiveAggressiveClassifier, makes for a rather interesting solution.

Note that the beauty of .partial_fit is twofold. First, it allows algorithms to be trained in an out-of-core manner, i.e., on more data than fits in RAM. Second, depending on the type of problem, if the model never (well, in principle) needs to be retrained from scratch, it can also keep learning right where it is deployed. This is called online learning, and it's very interesting. It's reportedly what some Chinese companies do, and it can be quite helpful in AdTech, because you learn whether your ad recommendation was right or wrong within seconds.

So here's a small example of this online kernel-approximation approach.

from sklearn.cluster import MiniBatchKMeans
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
import numpy as np

def batch(iterable, n=1):
    # source: https://stackoverflow.com/a/8290508/5428334
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx:min(ndx + n, l)]

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=17)
kmeans = MiniBatchKMeans(n_clusters=100, random_state=17)
# K-Means has a constraint: n_clusters <= n_samples to fit
pac = PassiveAggressiveClassifier(random_state=17)

for x_batch, y_batch in zip(batch(X_train, n=100), batch(y_train, n=100)):
    kmeans.partial_fit(x_batch)         # fit K-Means a bit
    x_dist = kmeans.transform(x_batch)  # obtain distances
    pac.partial_fit(x_dist, y_batch, classes=[0, 1])
    # fit the classifier a bit; we need to specify the classes
    print(pac.score(kmeans.transform(X_test), y_test))
    # 0.909 after 100 samples
    # 0.951 after 200 samples
    # 0.951 after 300 samples
    # 0.944 after 400 samples
    # 0.902 after 426 samples

# VS

kmeans = MiniBatchKMeans(n_clusters=100, random_state=17)
pac = PassiveAggressiveClassifier(random_state=17)
pac.fit(kmeans.fit_transform(X_train), y_train)
pac.score(kmeans.transform(X_test), y_test)  # should be ~0.951

So you've made it to the end. Hopefully your ML toolkit is now a bit richer. You may have heard of the so-called "no free lunch" theorem; basically, there is no silver-bullet model, in this case for ML problems. Maybe the methods presented in this post won't work for your next project, but they will for the one after it. So just try them and see for yourself. And if you need an online learning algorithm, there's a good chance the K-Means kernel approximation is roughly the right tool for you.

By the way, another blog post is already in the works, also on ML. Even nicer, among many other great things, it features a pretty interesting way to use K-Means. But no spoilers for now. Stay tuned.

Finally, if you are reading this, thank you! If you want to leave feedback or just have questions, HMU on Twitter or leave a comment.

Some links you may find interesting


Special thanks to @dgaponcic for style checks and content review, and thanks to @anisoara_ionela for reviewing the grammar of this article more thoroughly than any artificial intelligence ever could. You are the best ❤

PS I think you noticed all these random_states in the code. If you're wondering why I added them, it's to make the code samples reproducible. Tutorials often don't do this, which leaves room for cherry-picking, where the author presents only the best results; when the reader tries to reproduce them, they either can't or it takes a lot of time. But know this: you can play with the random_state values and get very different results. For example, if you run the snippet combining the original features with the distances to three cluster centers (the one scoring 0.727) with a random seed of 41 instead of 17, you get an accuracy of 0.944. So yes, random_state, or however random seeds are called in your framework of choice, is an important aspect to keep in mind, especially when doing research.
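If you want to see the seed sensitivity for yourself, here is a minimal sketch that reruns the features-plus-distances setup with two different seeds. For simplicity I reuse one seed for the split, K-Means, and the SVM all at once, and I deliberately don't hard-code the resulting scores, since they depend on your scikit-learn version:

```python
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import numpy as np

X, y = load_breast_cancer(return_X_y=True)
scores = {}
for seed in (17, 41):
    # one seed drives the split, the clustering, and the classifier
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    km = KMeans(n_clusters=3, random_state=seed)
    svm = LinearSVC(random_state=seed)
    svm.fit(np.hstack([X_tr, km.fit_transform(X_tr)]), y_tr)
    scores[seed] = svm.score(np.hstack([X_te, km.transform(X_te)]), y_te)
print(scores)
```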

