Vertical Consensus Inference (VCI)
High-dimensional clustering is notoriously difficult. When the number of features explodes, classical Bayesian mixture models often fail in a fundamental way—their posterior distributions collapse to trivial solutions.
In this post, we break down a new framework called Vertical Consensus Inference (VCI) that tackles this issue using ideas from:
- Consensus Monte Carlo
- Optimal transport (Wasserstein barycenters)
- Variational inference
- Generalized Bayes
🚨 The Core Problem
In Bayesian clustering (e.g., Dirichlet process mixtures), we model a random partition of data.
But in high dimensions:
- Likelihood dominates → overfitting
- Prior struggles to regularize
- Posterior degenerates into:
- ❌ One giant cluster
- ❌ All singleton clusters
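To make the object being modeled concrete, here is a small sketch of the partition prior a Dirichlet process mixture induces, the Chinese restaurant process. The helper `sample_crp` and its parameterization are illustrative, not taken from the paper:

```python
import numpy as np

def sample_crp(n, alpha, rng):
    """Draw one random partition of n items from the Chinese restaurant
    process, the partition prior induced by a Dirichlet process mixture.
    alpha controls how readily new clusters open."""
    labels = [0]          # first item starts the first cluster
    counts = [1]          # current cluster sizes
    for _ in range(1, n):
        # join an existing cluster with prob. proportional to its size,
        # or open a new cluster with prob. proportional to alpha
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        labels.append(k)
    return np.array(labels)

z = sample_crp(10, alpha=1.0, rng=np.random.default_rng(0))
print(z)  # a label vector; labels are contiguous integers 0, 1, 2, ...
```

The degeneracy described above is exactly a failure of the *posterior* over such label vectors: in high dimensions it concentrates on the one-cluster or all-singletons partitions regardless of this prior.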
💡 Key Idea: Split Features, Not Samples
Traditional Consensus Monte Carlo (CMC) splits data horizontally (by observations).
VCI instead splits vertically (by features):
- Same number of observations $n$
- Fewer dimensions $p_k$
This directly targets the curse of dimensionality.
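The vertical split itself is simple. A minimal sketch (the function name and random feature shuffle are my own choices, not the paper's):

```python
import numpy as np

def vertical_shards(X, K, rng=None):
    """Split the columns (features) of X into K roughly equal shards.

    Every shard keeps all n observations but only p_k ~ p/K features:
    the vertical split VCI uses against the curse of dimensionality.
    """
    rng = np.random.default_rng(rng)
    p = X.shape[1]
    perm = rng.permutation(p)  # shuffle features before splitting
    return [X[:, idx] for idx in np.array_split(perm, K)]

X = np.random.default_rng(0).normal(size=(100, 12))
shards = vertical_shards(X, K=3, rng=0)
print([s.shape for s in shards])  # three shards of shape (100, 4)
```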
🧠 VCI Pipeline
Step 1: Vertical Sharding
\[X_i = (X_i^{(1)}, \dots, X_i^{(K)})\]
Step 2: Local Bayesian Clustering
Each shard produces:
\[p_k(\mathbf{z} \mid X^{(k)})\]
Step 3: Wasserstein Consensus
\[\bar{p} = \arg\min_p \sum_{k=1}^K \lambda_k W_{c,\epsilon}(p, p_k)\]
👉 This is a Wasserstein barycenter over shard posteriors.
🔍 Why Wasserstein?
Partitions are discrete objects, so we need a meaningful metric.
We use Variation of Information (VoI):
\[\text{VoI}(z_1, z_2) = H(z_1) + H(z_2) - 2I(z_1, z_2)\]
This gives a proper geometry over partitions.
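The VoI formula above translates directly into code via the contingency table of the two label vectors. A self-contained sketch:

```python
import numpy as np

def variation_of_information(z1, z2):
    """VoI(z1, z2) = H(z1) + H(z2) - 2 I(z1, z2), in nats.

    z1, z2 are cluster-label arrays over the same n observations.
    VoI is a true metric on partitions: non-negative, symmetric,
    zero iff the partitions coincide, with the triangle inequality.
    """
    z1, z2 = np.asarray(z1), np.asarray(z2)
    n = len(z1)
    # joint distribution over (cluster in z1, cluster in z2)
    labels1, inv1 = np.unique(z1, return_inverse=True)
    labels2, inv2 = np.unique(z2, return_inverse=True)
    joint = np.zeros((len(labels1), len(labels2)))
    np.add.at(joint, (inv1, inv2), 1.0)
    joint /= n
    p1, p2 = joint.sum(axis=1), joint.sum(axis=0)

    def H(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    I = H(p1) + H(p2) - H(joint.ravel())
    return H(p1) + H(p2) - 2.0 * I
```

For instance, two independent two-cluster partitions of four points, `[0, 0, 1, 1]` and `[0, 1, 0, 1]`, have zero mutual information, so their VoI is $H_1 + H_2 = 2\log 2$.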
⚖️ Smart Weighting of Shards
Weights $\lambda_k$ prioritize informative shards:
- Avoid trivial partitions
- Favor balanced clusters
- Penalize uncertainty
This is crucial when many dimensions are noisy.
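One simple weighting heuristic consistent with the three criteria above (this is my illustrative stand-in, not the paper's exact rule): score each shard's sampled partitions by a balance-sensitive entropy that vanishes for both trivial solutions, then normalize.

```python
import numpy as np

def shard_weights(shard_partitions, eps=1e-12):
    """Heuristic shard weights (illustrative, not the paper's rule).

    shard_partitions: one list of sampled partitions (label arrays)
    per shard. A partition scores by its cluster-size entropy, which
    is 0 for one giant cluster, times (1 - #clusters/n), which is 0
    for all-singletons; averaging over samples rewards confident,
    non-trivial shards.
    """
    scores = []
    for samples in shard_partitions:
        vals = []
        for z in samples:
            _, counts = np.unique(z, return_counts=True)
            p = counts / counts.sum()
            n = len(z)
            vals.append(-np.sum(p * np.log(p + eps)) * (1 - len(counts) / n))
        scores.append(np.mean(vals))
    scores = np.clip(np.asarray(scores), 0.0, None)
    return scores / (scores.sum() + eps)
```

A shard that found two balanced clusters gets essentially all the weight against a one-cluster shard and an all-singletons shard.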
🔧 Computation
The barycenter becomes:
\[\min_{\alpha} \sum_k \lambda_k W_{c,\epsilon}(\alpha, \alpha^{(k)})\]
Solved via iterative Bregman projections (Sinkhorn-like updates).
In practice:
- Use MCMC samples as atoms
- Avoid full partition space
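Once MCMC samples are pooled into a shared atom set, the barycenter is a fixed-support entropic OT problem. A sketch of the iterative Bregman projection updates (a generic Sinkhorn-style barycenter solver; details such as `eps` and iteration count are my assumptions):

```python
import numpy as np

def sinkhorn_barycenter(C, b, lam, eps=0.1, iters=200):
    """Entropy-regularized Wasserstein barycenter on a fixed atom set,
    via iterative Bregman projections (Sinkhorn-like updates).

    C   : (m, m) cost matrix between atoms, e.g. VoI between a pool
          of sampled partitions used as the shared support
    b   : (K, m) shard posteriors as weight vectors over the atoms
    lam : (K,) shard weights lambda_k, summing to 1
    """
    Kmat = np.exp(-C / eps)                   # Gibbs kernel
    lam = np.asarray(lam)
    u = np.ones_like(b)
    for _ in range(iters):
        v = b / (u @ Kmat)                    # match each shard marginal b_k
        Kv = v @ Kmat.T                       # (K, m)
        a = np.exp(lam @ np.log(Kv + 1e-300)) # geometric mean across shards
        u = a / Kv                            # rescale toward common marginal
    return a / a.sum()
```

With two mirrored shard posteriors and equal weights, symmetry forces the barycenter to split its mass evenly between the two atoms.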
📊 Experimental Insights
Scenario 1: Low Dimension
- VCI ≈ full posterior
- Better than individual shards
Scenario 2: Noisy Dimensions
- Automatically downweights noise
Scenario 3: High-Dimensional Data (25k features)
- Outperforms full model
🔗 Theoretical Insight: Why VCI Works
VCI is not heuristic — it is grounded in variational inference.
We now show the key result.
📌 Proof Sketch: VCI as Variational Inference
We prove that the Wasserstein barycenter minimizes an upper bound of the NELBO.
🔧 Step 0: Hierarchical Model (Key Setup)
The entire argument relies on the following latent-variable model:
\[\mathbf{z} \sim p(\mathbf{z})\]
\[p(\mathbf{z}^{(k)} \mid \mathbf{z}) \propto \exp\big(-\zeta_k \, c(\mathbf{z}^{(k)}, \mathbf{z})\big)\]
\[X^{(k)} \mid \mathbf{z}^{(k)} \sim p_k(X^{(k)} \mid \mathbf{z}^{(k)})\]
So the joint distribution is:
\[p(\mathbf{z}, \mathbf{z}^{(1:K)}, X) = p(\mathbf{z}) \prod_{k=1}^K p(\mathbf{z}^{(k)} \mid \mathbf{z}) \; p_k(X^{(k)} \mid \mathbf{z}^{(k)})\]
👉 Key intuition:
- $\mathbf{z}$ = global consensus partition
- $\mathbf{z}^{(k)}$ = shard-specific partitions
- The exponential term enforces soft alignment via a transport cost
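A toy forward simulation of this hierarchical model makes the soft alignment concrete. The atom set, the uniform prior on it, and the pairwise co-clustering cost are all illustrative choices of mine, standing in for the paper's partition space and cost $c$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative atoms: a few candidate partitions of n = 4 points.
atoms = [np.array(a) for a in ([0, 0, 1, 1], [0, 1, 0, 1],
                               [0, 0, 0, 0], [0, 1, 2, 3])]

def pair_cost(z1, z2):
    """A label-invariant transport cost c: fraction of point pairs on
    whose co-clustering the two partitions disagree."""
    A1 = z1[:, None] == z1[None, :]
    A2 = z2[:, None] == z2[None, :]
    return np.mean(A1 != A2)

# z ~ p(z): uniform prior over the atoms (purely illustrative)
z = atoms[rng.integers(len(atoms))]

# z^(k) | z  propto  exp(-zeta_k * c(z^(k), z)), normalized over atoms
zeta = [2.0, 5.0]
for k, zk in enumerate(zeta):
    logits = np.array([-zk * pair_cost(a, z) for a in atoms])
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    z_k = atoms[rng.choice(len(atoms), p=probs)]
    print(f"shard {k}: z^(k) = {z_k}, alignment probs = {np.round(probs, 3)}")
```

Larger $\zeta_k$ concentrates the shard partition on the global $\mathbf{z}$, which is exactly the soft-alignment role the exponential term plays.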
Step 1: Lower bound the entropy
\[H_q(\mathbf{z}, \mathbf{z}^{(1:K)}) = H_q(\mathbf{z}) + H_q(\mathbf{z}^{(1:K)} \mid \mathbf{z})\]
The joint conditional entropy dominates each shard's, $H_q(\mathbf{z}^{(1:K)} \mid \mathbf{z}) \geq H_q(\mathbf{z}^{(k)} \mid \mathbf{z})$, so averaging over $k$ gives:
\[H_q(\mathbf{z}, \mathbf{z}^{(1:K)}) \geq \frac{1}{K} \sum_{k=1}^K H_q(\mathbf{z}, \mathbf{z}^{(k)})\]
Step 2: Define an upper bound
\[\overline{\mathcal{L}}(q) = \mathbb{E}_q[-\log p(\mathbf{z}, \mathbf{z}^{(1:K)}, X)] - \frac{1}{K} \sum_{k=1}^K H_q(\mathbf{z}, \mathbf{z}^{(k)})\]
Then:
\[\overline{\mathcal{L}}(q) \geq \mathcal{L}(q)\]
Step 3: Expand the model (using Step 0)
\[\log p(\mathbf{z}, \mathbf{z}^{(1:K)}, X) = \sum_{k=1}^K \left[ -\zeta_k c(\mathbf{z}^{(k)}, \mathbf{z}) + \log p_k(X^{(k)} \mid \mathbf{z}^{(k)}) \right] - C\]
Step 4: Optimize over couplings
\[\min_{q} \mathcal{L}(q) \leq \min_{q} \overline{\mathcal{L}}(q)\]
Step 5: Restrict to star structure
\[q(\mathbf{z}, \mathbf{z}^{(1:K)}) = q_0(\mathbf{z}) \prod_{k=1}^K q(\mathbf{z}^{(k)} \mid \mathbf{z})\]
This simplifies optimization.
Step 6: Decompose across shards
Under the star factorization, the transport-cost and entropy terms of $\overline{\mathcal{L}}$ split into independent per-shard problems over couplings $q_{0,k}$ of $q_0$ and $q_k$:
\[\sum_{k=1}^K \min_{q_{0,k}} \mathbb{E}_{q_{0,k}}\left[ \zeta_k c(\mathbf{z}^{(k)}, \mathbf{z}) + \frac{1}{K} \log q_{0,k}(\mathbf{z}, \mathbf{z}^{(k)}) \right]\]
Step 7: Recognize entropic OT
This becomes:
\[\sum_{k=1}^K \zeta_k W_{c, \frac{1}{K \zeta_k}}(q_0, q_k)\]
Step 8: Final bound
\[\min_{q} \mathcal{L}(q) \leq \sum_{k=1}^K \zeta_k W_{c, \frac{1}{K \zeta_k}}(q_0, q_k) - \sum_{k=1}^K \mathbb{E}_{q_k}[\log p_k(X^{(k)} \mid \mathbf{z}^{(k)})] + C\]
Fix each $q_k$ at its shard posterior, $q_k(\mathbf{z}^{(k)}) = p_k(\mathbf{z}^{(k)} \mid X^{(k)})$. The likelihood term $\mathbb{E}_{\mathbf{z}^{(k)} \sim q_k}\left[\log p_k(X^{(k)} \mid \mathbf{z}^{(k)})\right]$ is then constant with respect to $q_0$, so minimizing the right-hand side over $q_0$ is exactly the Wasserstein barycenter problem.
✅ Key Takeaway
The Wasserstein barycenter:
- Minimizes an upper bound of the negative ELBO (NELBO)
- Provides a principled variational approximation
- Connects optimal transport with Bayesian inference
🚀 Why VCI Matters
- Fixes high-dimensional failure of Bayesian clustering
- Parallel and scalable
- Works with MCMC / VI / SMC
- Theoretically grounded
🔮 Future Work
- Better weighting schemes
- Faster optimal transport
- Joint shard + consensus learning
- Extensions beyond clustering
🧾 Final Thought
VCI turns high-dimensional clustering from a failure mode into a principled, scalable inference problem by combining divide-and-conquer + optimal transport + generalized Bayes + variational inference.