Fast k-medoids clustering in Python
This package is a wrapper around the fast Rust k-medoids package, implementing the FasterPAM and FastPAM algorithms along with the baseline k-means-style and PAM algorithms.
All algorithms expect a distance matrix and the number of clusters as input; hence, they can be used with arbitrary dissimilarity functions.
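Because only the dissimilarity matrix matters, any metric can be used. A minimal sketch, assuming numpy and scipy are available (the Manhattan distance here is just an illustration):

import numpy as np
import kmedoids
from scipy.spatial.distance import pdist, squareform

X = np.random.rand(100, 3)                # 100 random points in 3 dimensions
diss = squareform(pdist(X, "cityblock"))  # square Manhattan dissimilarity matrix
c = kmedoids.fasterpam(diss, 5)           # k=5 clusters with FasterPAM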
If you use this code in scientific work, please cite the papers in the References. Thank you.
Installation
Installation with pip
Pre-built packages are available on PyPI (https://pypi.org/project/kmedoids/) and can be installed with pip install kmedoids.
Compilation from source
You need to have Rust and Python 3 installed.
Installation uses maturin to compile and install the Rust extension. Maturin is best used within a Python virtual environment.
# activate your desired virtual environment first
pip install maturin
git clone https://github.com/kno10/python-kmedoids.git
cd python-kmedoids
# build and install the package:
maturin develop --release
To validate the installation, run the integration tests:
python -m unittest discover tests
This procedure uses the latest git version from https://github.com/kno10/rust-kmedoids. If you want to use local modifications to the Rust code, you need to provide the source folder of the Rust module in Cargo.toml by setting the path= option of the kmedoids dependency.
Example
import kmedoids
c = kmedoids.fasterpam(distmatrix, 5)
print("Loss is:", c.loss)
Using the sklearn-compatible API
Note that KMedoids defaults to the 'precomputed' metric, expecting a pairwise distance matrix. If you have sklearn installed, you can use metric='euclidean'.
import kmedoids
km = kmedoids.KMedoids(5, method='fasterpam')
c = km.fit(distmatrix)
print("Loss is:", c.inertia_)
MNIST (10k samples)
import kmedoids
import numpy
import time
from sklearn.datasets import fetch_openml
from sklearn.metrics.pairwise import euclidean_distances
X, _ = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
X = X[:10000]
diss = euclidean_distances(X)
start = time.time()
fp = kmedoids.fasterpam(diss, 100)
print("FasterPAM took: %.2f ms" % ((time.time() - start)*1000))
print("Loss with FasterPAM:", fp.loss)
start = time.time()
pam = kmedoids.pam(diss, 100)
print("PAM took: %.2f ms" % ((time.time() - start)*1000))
print("Loss with PAM:", pam.loss)
Implemented Algorithms
FasterPAM (Schubert and Rousseeuw, 2020, 2021)
FastPAM1 (Schubert and Rousseeuw, 2019, 2021)
PAM (Kaufman and Rousseeuw, 1987) with BUILD and SWAP
Alternating (k-means-style approach)
BUILD (Kaufman and Rousseeuw, 1987)
Silhouette (Rousseeuw, 1987)
Note that the k-means style “alternating” algorithm yields rather poor result quality.
FasterPAM
- kmedoids.fasterpam(diss, medoids, max_iter=100, init='random', random_state=None, n_cpu=-1)
FasterPAM k-medoids clustering
This is an accelerated version of PAM clustering that eagerly performs any improving swap found, and contains the O(k) improvement to find the best swaps faster.
References:
Erich Schubert, Peter J. Rousseeuw:
Fast and Eager k-Medoids Clustering: O(k) Runtime Improvement of the PAM, CLARA, and CLARANS Algorithms.
Information Systems (101), 2021, 101804. https://doi.org/10.1016/j.is.2021.101804 (open access)

Erich Schubert, Peter J. Rousseeuw:
Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms.
In: 12th International Conference on Similarity Search and Applications (SISAP 2019), 171-187.
Preprint: https://arxiv.org/abs/1810.05691

- Parameters
diss (ndarray) – square numpy array of dissimilarities
medoids (int or ndarray) – number of clusters to find or existing medoids
max_iter (int) – maximum number of iterations
init (str, "random", "first" or "build") – initialization method
random_state (int, RandomState instance or None) – random seed (also used for shuffling the processing order)
n_cpu (int) – number of threads to use (-1: automatic)
- Returns
k-medoids clustering result
- Return type
KMedoidsResult
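Because the medoids parameter accepts either a cluster count or an array of starting medoids, runs can be made reproducible or warm-started. A minimal sketch, assuming diss is a precomputed square dissimilarity matrix:

import numpy as np
import kmedoids

c1 = kmedoids.fasterpam(diss, 5, random_state=42)  # reproducible seeding
start = np.array([0, 10, 20, 30, 40])              # explicit starting medoids
c2 = kmedoids.fasterpam(diss, start)               # warm start instead of k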
FastPAM1
- kmedoids.fastpam1(diss, medoids, max_iter=100, init='random', random_state=None)
FastPAM1 k-medoids clustering
This is an accelerated version of PAM clustering that performs the same swaps as the original PAM (given the same starting conditions), but finds the best swap O(k) times faster.
References:
Erich Schubert, Peter J. Rousseeuw:
Fast and Eager k-Medoids Clustering: O(k) Runtime Improvement of the PAM, CLARA, and CLARANS Algorithms.
Information Systems (101), 2021, 101804. https://doi.org/10.1016/j.is.2021.101804 (open access)

Erich Schubert, Peter J. Rousseeuw:
Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms.
In: 12th International Conference on Similarity Search and Applications (SISAP 2019), 171-187.
Preprint: https://arxiv.org/abs/1810.05691

- Parameters
diss (ndarray) – square numpy array of dissimilarities
medoids (int or ndarray) – number of clusters to find or existing medoids
max_iter (int) – maximum number of iterations
init (str, "random", "first" or "build") – initialization method
random_state (int, RandomState instance or None) – random seed if no medoids are given
- Returns
k-medoids clustering result
- Return type
KMedoidsResult
PAM
- kmedoids.pam(diss, medoids, max_iter=100, init='build', random_state=None)
PAM k-medoids clustering
This is an implementation of the original PAM (Partitioning Around Medoids) clustering algorithm. For improved versions, see the fastpam1 and fasterpam methods.
References:
Leonard Kaufman, Peter J. Rousseeuw:
Clustering by means of medoids.
In: Dodge Y (ed) Statistical Data Analysis Based on the L1 Norm and Related Methods, pp. 405-416, 1987

Leonard Kaufman, Peter J. Rousseeuw:
Finding Groups in Data: An Introduction to Cluster Analysis.
John Wiley & Sons, 1990. https://doi.org/10.1002/9780470316801

- Parameters
diss (ndarray) – square numpy array of dissimilarities
medoids (int or ndarray) – number of clusters to find or existing medoids
max_iter (int) – maximum number of iterations
init (str, "random", "first" or "build") – initialization method
random_state (int, RandomState instance or None) – random seed if no medoids are given
- Returns
k-medoids clustering result
- Return type
KMedoidsResult
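PAM defaults to the deterministic BUILD initialization; a minimal sketch comparing it to random initialization (diss again assumed to be a precomputed square dissimilarity matrix):

import kmedoids

c_build = kmedoids.pam(diss, 5)                                # init='build' (default)
c_rand = kmedoids.pam(diss, 5, init='random', random_state=0)  # random initialization
print(c_build.loss, c_rand.loss)                               # compare resulting losses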
Alternating k-medoids (k-means style)
- kmedoids.alternating(diss, medoids, max_iter=100, init='random', random_state=None)
Alternating k-medoids clustering (k-means-style algorithm)
Note: this yields substantially worse results than PAM algorithms on difficult data sets.
- Parameters
diss (ndarray) – square numpy array of dissimilarities
medoids (int or ndarray) – number of clusters to find or existing medoids
max_iter (int) – maximum number of iterations
init (str, "random", "first" or "build") – initialization method
random_state (int, RandomState instance or None) – random seed if no medoids are given
- Returns
k-medoids clustering result
- Return type
KMedoidsResult
PAM BUILD
- kmedoids.pam_build(diss, k)
PAM k-medoids clustering – BUILD only
This is an implementation of the original PAM (Partitioning Around Medoids) clustering algorithm. For improved versions, see the fastpam1 and fasterpam methods.
References:
Leonard Kaufman, Peter J. Rousseeuw:
Clustering by means of medoids.
In: Dodge Y (ed) Statistical Data Analysis Based on the L1 Norm and Related Methods, pp. 405-416, 1987

Leonard Kaufman, Peter J. Rousseeuw:
Finding Groups in Data: An Introduction to Cluster Analysis.
John Wiley & Sons, 1990. https://doi.org/10.1002/9780470316801

- Parameters
diss (ndarray) – square numpy array of dissimilarities
k (int) – number of clusters to find
- Returns
k-medoids clustering result
- Return type
KMedoidsResult
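One possible use of BUILD alone is to seed a SWAP-based method with its deterministic medoids; a minimal sketch (similar in effect to calling fasterpam with init='build'):

import kmedoids

b = kmedoids.pam_build(diss, 5)          # deterministic BUILD initialization only
c = kmedoids.fasterpam(diss, b.medoids)  # refine the BUILD medoids by swapping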
Silhouette
- kmedoids.silhouette(diss, labels, samples=False, n_cpu=-1)
Silhouette index for cluster evaluation.
The Silhouette, proposed by Peter Rousseeuw in 1987, is a popular internal evaluation measure for clusterings. Although it is defined on arbitrary metrics, it is most appropriate for evaluating “spherical” clusters, as it expects objects to be closer to all members of their own cluster than to members of other clusters.
References:
Peter J. Rousseeuw:
Silhouettes: A graphical aid to the interpretation and validation of cluster analysis.
Journal of Computational and Applied Mathematics, Volume 20, 1987

- Parameters
diss (ndarray) – square numpy array of dissimilarities
labels (ndarray of int) – cluster labels (use 0 to k-1, no negative values allowed)
samples (boolean) – whether to return individual samples or not
n_cpu (int) – number of threads to use (-1: automatic)
- Returns
tuple containing the overall silhouette and the individual samples
- Return type
(float, ndarray)
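To evaluate a clustering produced above, the labels can be passed in directly; a minimal sketch requesting the per-sample values as well:

import kmedoids

fp = kmedoids.fasterpam(diss, 5)
mean_sil, sil_samples = kmedoids.silhouette(diss, fp.labels, samples=True)
print("Mean silhouette:", mean_sil)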
k-Medoids result object
- class kmedoids.KMedoidsResult(loss, labels, medoids, n_iter=None, n_swap=None)
K-medoids clustering result
- Parameters
loss (float) – Loss of this clustering (sum of deviations)
labels (ndarray) – Cluster assignment
medoids (ndarray) – Chosen medoid indexes
n_iter (int) – Number of iterations
n_swap (int) – Number of swaps performed
sklearn-compatible API
- class kmedoids.KMedoids(n_clusters, *, metric='precomputed', metric_params=None, method='fasterpam', init='random', max_iter=300, random_state=None)
K-Medoids Clustering using PAM and FasterPAM (sklearn-compatible API).
References:
Erich Schubert, Peter J. Rousseeuw:
Fast and Eager k-Medoids Clustering: O(k) Runtime Improvement of the PAM, CLARA, and CLARANS Algorithms.
Information Systems (101), 2021, 101804. https://doi.org/10.1016/j.is.2021.101804 (open access)

Erich Schubert, Peter J. Rousseeuw:
Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms.
In: 12th International Conference on Similarity Search and Applications (SISAP 2019), 171-187.
Preprint: https://arxiv.org/abs/1810.05691

Leonard Kaufman, Peter J. Rousseeuw:
Clustering by means of medoids.
In: Dodge Y (ed) Statistical Data Analysis Based on the L1 Norm and Related Methods, pp. 405-416, 1987

Leonard Kaufman, Peter J. Rousseeuw:
Finding Groups in Data: An Introduction to Cluster Analysis.
John Wiley & Sons, 1990. https://doi.org/10.1002/9780470316801

- Parameters
n_clusters (int) – The number of clusters to form
metric (string, default: 'precomputed') – It is recommended to use ‘precomputed’, in particular when experimenting with different n_clusters. If you have sklearn installed, you may pass any metric supported by sklearn.metrics.pairwise_distances.
metric_params (dict, default=None) – Additional keyword arguments for the metric function.
method (string, "fasterpam" (default), "fastpam1", "pam" or "alternate") – Which algorithm to use
init (string, "random" (default), "first" or "build") – initialization method
max_iter (int) – Specify the maximum number of iterations when fitting
random_state (int, RandomState instance or None) – random seed if no medoids are given
- Variables
cluster_centers_ – None for ‘precomputed’
medoid_indices_ – The indices of the medoid rows in X
labels_ – Labels of each point
inertia_ – Sum of distances of samples to their closest cluster center
- fit(X, y=None)
Fit K-Medoids to the provided data.
- Parameters
X ({array-like, sparse matrix}, shape = (n_samples, n_samples)) – Dataset to cluster
y – ignored
- Returns
self
- fit_predict(X, y=None)
Predict the closest cluster for each sample in X.
- Parameters
X (array-like of shape (n_samples, n_features)) – Input data
y (Ignored) – Not used, present for API consistency by convention
- Returns
Cluster labels
- Return type
ndarray of shape (n_samples,)
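A minimal sketch, assuming diss is a precomputed square dissimilarity matrix (the default metric is 'precomputed'):

import kmedoids

labels = kmedoids.KMedoids(5, method='fasterpam').fit_predict(diss)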
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (array-like of shape (n_samples, n_features)) – Input samples
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations)
fit_params (dict) – Additional fit parameters
- Returns
X_new – Transformed array
- Return type
ndarray of shape (n_samples, n_features_new)
- get_params(deep=True)
Get parameters for this estimator.
- Parameters
deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators
- Returns
params – Parameter names mapped to their values
- Return type
dict
- predict(X)
Predict the closest cluster for each sample in X.
- Parameters
X ({array-like, sparse matrix}, shape = (n_samples, n_samples)) – New data to predict
- Returns
Index of the cluster each sample belongs to
- Return type
array, shape = (n_query,)
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
params (dict) – Estimator parameters
- Returns
self – Estimator instance
- Return type
estimator instance
- transform(X)
Transforms X to cluster-distance space.
- Parameters
X ({array-like}, shape (n_query, n_features), or (n_query, n_indexed) if metric == 'precomputed') – Data to transform
- Returns
X transformed in the new space of distances to cluster centers
- Return type
{array-like}, shape=(n_query, n_clusters)
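With the default metric='precomputed', transform takes distances to the indexed (training) points and returns the distances to the chosen medoids. A minimal sketch; the argmin equivalence is an assumption based on the documented semantics, not a guarantee:

import kmedoids

km = kmedoids.KMedoids(5, method='fasterpam').fit(diss)
d = km.transform(diss)     # shape (n_samples, n_clusters): distance to each medoid
labels = d.argmin(axis=1)  # should match km.predict(diss)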
References
For further details on the implemented algorithm FasterPAM, see:
Erich Schubert, Peter J. Rousseeuw:
Fast and Eager k-Medoids Clustering: O(k) Runtime Improvement of the PAM, CLARA, and CLARANS Algorithms.
Information Systems (101), 2021, 101804. https://doi.org/10.1016/j.is.2021.101804 (open access)
an earlier (slower, and now obsolete) version was published as:
Erich Schubert, Peter J. Rousseeuw:
Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms.
In: 12th International Conference on Similarity Search and Applications (SISAP 2019), 171-187.
Preprint: https://arxiv.org/abs/1810.05691
This is a port of the original Java code from ELKI to Rust. The Rust version is then wrapped for use with Python.
If you use this code in scientific work, please cite above papers. Thank you.
License: GPL-3 or later
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
FAQ: Why GPL and not Apache/MIT/BSD?
Because copyleft software like Linux is what built the open-source community.
Tit for tat: you get to use my code, I get to use your code.