K-Means Clustering

An unsupervised machine learning technique that groups market states or assets into clusters based on similarity, enabling regime detection and portfolio construction.

Overview

K-Means Clustering is an unsupervised machine learning algorithm that partitions data into K groups (clusters) based on similarity. In trading, it is used to identify distinct market regimes (e.g., "trending high-volatility," "quiet uptrend," "bearish crash"), to group similar assets for portfolio construction, or to classify current market conditions and select the most appropriate trading strategy for each regime.

How it looks on a chart

Illustration only — synthetic data generated for visual reference.

Beginner

Clustering differs from the other techniques here: instead of predicting where price will go, it identifies what type of market environment you are currently in. The idea is that markets go through different "moods" or regimes that repeat over time: sometimes they trend strongly, sometimes they are choppy and sideways, and sometimes they crash amid high volatility.

K-Means Clustering discovers these regimes automatically from historical data. You feed it market characteristics (such as recent volatility, trend strength, and volume) and it groups similar market states into clusters. You can then study what happened after each cluster in the past and adjust your strategy accordingly.

For example, cluster analysis might reveal that when markets are in "Regime 3" (low volatility, positive trend, normal volume), trend-following strategies historically performed very well, while in "Regime 1" (high volatility, no trend, declining volume), mean-reversion strategies worked better. This lets you switch strategies dynamically based on the detected regime.

Intermediate

The K-Means algorithm has four steps: (1) randomly initialize K centroids, (2) assign each observation to the nearest centroid by Euclidean distance, (3) update each centroid to the mean of its assigned observations, (4) repeat steps 2 and 3 until the assignments stop changing. The number of clusters K must be specified in advance; the Elbow Method or the Silhouette Score helps choose a reasonable value.

For market regime clustering, typical features per bar include normalized ATR (volatility over a rolling 20-day window), ADX (trend strength), RSI (momentum level), volume ratio, and price position relative to the 200-day MA. These features should be standardized (z-scored) before clustering so that features with large numeric ranges do not dominate the distance calculation.

A practical three-regime model might yield: Regime 1 = uptrending, low volatility (bull market); Regime 2 = ranging, moderate volatility (consolidation); Regime 3 = downtrending, high volatility (bear market or crisis). Selecting strategies by detected regime can improve a portfolio's Sharpe ratio when the regimes persist out-of-sample, mainly by avoiding trend strategies in ranging regimes and mean-reversion strategies in trending ones.
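The four algorithm steps and the z-scoring advice can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not a production implementation: the helper names (zscore_columns, kmeans), the two-feature rows (ATR% and ADX), and the sample values are all assumptions made for the example, and a library such as scikit-learn would normally be used instead.

```python
import math
import random
import statistics


def zscore_columns(rows):
    """Standardize each feature column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    # Population standard deviation; constant columns fall back to 1.0
    stds = [statistics.pstdev(c) or 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def kmeans(points, k, max_iter=300, seed=0):
    """Plain K-Means (Lloyd's algorithm). Returns (centroids, labels, inertia)."""
    rng = random.Random(seed)
    # Step 1: pick K observations at random as the initial centroids
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each observation to its nearest centroid
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        # Step 4: stop once assignments no longer change
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [statistics.fmean(c) for c in zip(*members)]
    inertia = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return centroids, labels, inertia


# Hypothetical bars: [ATR%, ADX]. Three calm trending bars, three volatile ranging bars.
rows = [[0.8, 30], [0.9, 32], [1.0, 28], [3.0, 12], [3.2, 10], [2.8, 14]]
z = zscore_columns(rows)
centroids, labels, inertia = kmeans(z, k=2, seed=1)
print(labels, round(inertia, 3))
```

The cluster ids themselves are arbitrary (which group is called 0 depends on the random initialization), so only the grouping is meaningful, not the number.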

Advanced

Gaussian Mixture Models (GMM) are a probabilistic alternative to K-Means that provide soft cluster assignments (probabilities of belonging to each regime) rather than hard boundaries. This is more appropriate for financial regimes, which rarely switch instantaneously. A Hidden Markov Model (HMM) with Gaussian emissions extends the GMM by adding transition probabilities between regimes, which lets it model regime persistence: the probability that tomorrow's regime equals today's regime is high, reflecting the observed persistence of market states.

For time series clustering (grouping similar assets rather than time periods), dynamic time warping (DTW) distance is often more appropriate than Euclidean distance because DTW tolerates phase shifts between otherwise similar patterns. This is useful for identifying baskets of assets with similar price behavior for pairs trading or sector rotation strategies.

In academic portfolio construction, clustering has also been combined with covariance shrinkage (for example, Ledoit-Wolf) with the aim of stabilizing the covariance matrix estimate used in minimum-variance portfolios. Hierarchical Risk Parity (HRP), developed by Lopez de Prado (2016), uses hierarchical clustering of asset returns to allocate risk more robustly than classical mean-variance optimization.
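To illustrate why DTW tolerates phase shifts where Euclidean distance does not, here is a small sketch of the classic dynamic-programming formulation. The function names and the three toy series are assumptions made for this example; real use would typically normalize the series first and might constrain the warping window.

```python
import math


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping with absolute-difference cost."""
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i points of a with the first j of b
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats its point
                cost[i][j - 1],      # b advances, a repeats its point
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def euclidean(a, b):
    """Point-by-point Euclidean distance between equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


base = [0, 0, 1, 2, 1, 0, 0, 0]
shifted = [0, 0, 0, 1, 2, 1, 0, 0]       # same bump, one step later
inverted = [0, 0, -1, -2, -1, 0, 0, 0]   # opposite move at the same time

print(euclidean(base, shifted), dtw_distance(base, shifted))
print(euclidean(base, inverted), dtw_distance(base, inverted))
```

Euclidean distance penalizes the one-step delay, while DTW aligns the two bumps and reports no difference; the inverted series remains far away under both measures.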

Formula

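K-Means: minimize Σₖ Σ_{x∈Cₖ} ||x − μₖ||²
where μₖ = mean (centroid) of cluster k, Cₖ = set of points in cluster k
Silhouette score: s(i) = (b(i) − a(i)) / max(a(i), b(i))
where a(i) = mean distance from point i to the other points in its own cluster, b(i) = lowest mean distance from point i to the points of any other cluster; s(i) ranges from −1 to 1, and higher values indicate better-separated clusters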
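As a worked example of the silhouette formula, the sketch below computes s(i) for four one-dimensional points under a sensible labeling and a poor one. Here a(i) is the mean distance from point i to the other members of its own cluster and b(i) is the lowest mean distance to the members of any other cluster. The function name, the toy points, and the convention of scoring singleton clusters as 0 are assumptions made for the example.

```python
import math
import statistics


def silhouette_scores(points, labels):
    """Per-point silhouette s(i) = (b - a) / max(a, b)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-member clusters
            continue
        # a: mean distance to the OTHER members (the point's distance to itself is 0)
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(
            statistics.fmean([math.dist(p, q) for q in other])
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return scores


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_scores(points, [0, 0, 1, 1])  # tight, well-separated clusters
bad = silhouette_scores(points, [0, 1, 0, 1])   # each cluster straddles both groups

print(round(statistics.fmean(good), 3), round(statistics.fmean(bad), 3))
```

For the first point under the good labeling, a = 1 and b = 10.5, so s = 9.5 / 10.5, roughly 0.905. Under the poor labeling the same point has a = 10 and b = 6, giving s = −0.4. Averaging s(i) over all points for each candidate K is one way to compare values of K.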

1. Prepare the feature matrix: normalize market state features (ATR%, ADX, RSI, RVOL) to zero mean and unit variance.
2. Choose K using the Elbow Method (within-cluster sum of squares vs. K) or the Silhouette Score.
3. Run K-Means multiple times with different random seeds; retain the solution with the lowest within-cluster sum of squares.
4. Assign each historical bar to its cluster; analyze strategy performance by cluster.
5. Apply in live trading: classify the current market state and activate the strategy suite best suited to the detected regime.
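Step 4 above (analyzing strategy performance by cluster) might look like the following sketch. It assumes the cluster labels and per-bar strategy returns are already available as aligned lists; the function name, the sample numbers, and the lag parameter are illustrative assumptions. The default lag of 1 pairs the regime seen on one bar with the return of the next bar, so the analysis does not use information that would be unavailable at decision time.

```python
import statistics
from collections import defaultdict


def performance_by_regime(labels, returns, lag=1):
    """Summarize strategy returns grouped by the regime observed `lag` bars earlier."""
    grouped = defaultdict(list)
    # zip stops at the shorter sequence, so the last `lag` labels are simply unused
    for lab, r in zip(labels, returns[lag:]):
        grouped[lab].append(r)

    summary = {}
    for lab, rs in sorted(grouped.items()):
        mean = statistics.fmean(rs)
        vol = statistics.stdev(rs) if len(rs) > 1 else 0.0
        summary[lab] = {
            "bars": len(rs),
            "mean": mean,
            "vol": vol,
            # Per-bar mean/volatility ratio (not annualized); NaN if volatility is zero
            "mean_over_vol": mean / vol if vol else float("nan"),
        }
    return summary


# Hypothetical regime labels and per-bar returns of a trend-following strategy
labels = [0, 0, 0, 1, 1, 1, 0]
trend_returns = [0.00, 0.01, 0.02, 0.01, -0.02, -0.01, -0.03]

summary = performance_by_regime(labels, trend_returns)
for regime, stats in summary.items():
    print(regime, stats["bars"], round(stats["mean"], 4))
```

In this made-up sample the strategy earns positive average returns after regime 0 bars and negative ones after regime 1 bars, which is the kind of contrast that would motivate switching strategies by regime. With only a handful of bars per regime, such differences are not statistically meaningful; real analysis needs far more data.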

Parameters

Parameter	Default	Range	Description
Number of Clusters (K)	3	2–8	Number of market regimes to identify.
Feature Window	20	5–60	Lookback window for computing regime features.
Max Iterations	300	50–1000	Maximum K-Means iterations per run.
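The parameter table can be captured as a small validated configuration object. This is only a sketch: the class and field names are hypothetical, while the defaults and allowed ranges are taken from the table above.

```python
from dataclasses import dataclass


def _check_range(name, value, low, high):
    """Raise ValueError if value falls outside the inclusive range [low, high]."""
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")


@dataclass(frozen=True)
class RegimeClusterConfig:
    n_clusters: int = 3        # Number of Clusters (K), range 2-8
    feature_window: int = 20   # Feature Window, range 5-60
    max_iterations: int = 300  # Max Iterations, range 50-1000

    def __post_init__(self):
        _check_range("n_clusters", self.n_clusters, 2, 8)
        _check_range("feature_window", self.feature_window, 5, 60)
        _check_range("max_iterations", self.max_iterations, 50, 1000)


default_cfg = RegimeClusterConfig()
custom_cfg = RegimeClusterConfig(n_clusters=4, feature_window=30)
print(default_cfg)
print(custom_cfg)
```

Validating at construction time means an out-of-range setting such as n_clusters=9 fails immediately with a ValueError rather than producing a silently different regime model.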

Trading signals

bullish: Current state assigned to high-trend, low-volatility cluster

Bull regime detected — activate trend-following strategies, increase exposure.

bearish: Current state assigned to high-volatility, negative-trend cluster

Bear/crisis regime detected — reduce exposure, activate defensive strategies.

neutral: Current state assigned to low-ADX, moderate-volatility cluster

Ranging regime detected — activate mean-reversion strategies, reduce trend exposure.

neutral: Cluster assignment changes from previous bar

Regime transition signal — reassess active strategies for new environment.
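Because K-Means cluster ids are arbitrary, the signals above first require naming each cluster from its centroid. The sketch below shows one possible way to do that and to flag regime transitions. The two-feature centroid layout (z-scored trend and volatility), the zero thresholds, and the function names are assumptions for illustration, not a prescribed rule set.

```python
def name_clusters(centroids):
    """Map cluster ids to regime names from (trend_z, vol_z) centroids.

    Assumed rule: above-average trend with below-average volatility is bullish,
    below-average trend with above-average volatility is bearish, else neutral.
    """
    names = {}
    for cluster_id, (trend_z, vol_z) in centroids.items():
        if trend_z > 0 and vol_z < 0:
            names[cluster_id] = "bullish"
        elif trend_z < 0 and vol_z > 0:
            names[cluster_id] = "bearish"
        else:
            names[cluster_id] = "neutral"
    return names


def regime_signals(labels, names):
    """Per-bar regime name plus a flag for bars where the cluster assignment changed."""
    signals = []
    previous = None
    for lab in labels:
        signals.append({
            "regime": names[lab],
            "transition": previous is not None and lab != previous,
        })
        previous = lab
    return signals


# Hypothetical centroids in z-score units: (trend, volatility)
centroids = {0: (0.9, -0.8), 1: (-1.1, 1.3), 2: (0.1, 0.0)}
names = name_clusters(centroids)
signals = regime_signals([0, 0, 2, 1, 1], names)

for s in signals:
    print(s["regime"], s["transition"])
```

A transition flag on a single bar can be noisy with hard cluster assignments, so requiring the new regime to persist for several bars before acting is a common refinement.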

Limitations

• K-Means assumes spherical, similarly sized clusters; financial regimes are rarely this well-structured.
• Results depend on random initialization and the choice of K, requiring multiple runs and careful validation.
• Clusters found in-sample may not persist out-of-sample as market dynamics evolve.
• Hard cluster boundaries do not reflect the gradual, fuzzy nature of market regime transitions.

How Gilito AI uses Clustering

Gilito uses K-Means clustering as its primary market regime detection system, classifying each day's market state across every asset in its universe using six normalized features. The detected regime determines which strategy sub-library is active, with regime-conditional backtesting showing which strategies perform best in each of the identified market states.

Related indicators

Random Forest

Logistic Regression

Average Directional Index