Vertical Consensus Inference (VCI)
High-dimensional clustering is notoriously difficult. When the number of features explodes, classical Bayesian mixture models often fail in a fundamental way—their posterior distributions collapse to trivial solutions.
In this post, we break down a new framework called Vertical Consensus Inference (VCI) that tackles this issue using ideas from:
- Consensus Monte Carlo
- Optimal transport (Wasserstein barycenters)
- Variational inference
- Generalized Bayes
🚨 The Core Problem
In Bayesian clustering (e.g., Dirichlet process mixtures), we model a random partition of data.
But in high dimensions:
- Likelihood dominates → overfitting
- Prior struggles to regularize
- Posterior degenerates into:
- ❌ One giant cluster
- ❌ All singleton clusters
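To make the object being modeled concrete, here is a small sketch of the partition prior a Dirichlet process mixture induces, the Chinese restaurant process. The helper `sample_crp` and its parameterization are illustrative, not taken from the paper:

```python
import numpy as np

def sample_crp(n, alpha, rng):
    """Draw one random partition of n items from the Chinese restaurant
    process, the partition prior induced by a Dirichlet process mixture.
    alpha controls how readily new clusters open."""
    labels = [0]          # first item starts the first cluster
    counts = [1]          # current cluster sizes
    for _ in range(1, n):
        # join an existing cluster with prob. proportional to its size,
        # or open a new cluster with prob. proportional to alpha
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        labels.append(k)
    return np.array(labels)

z = sample_crp(10, alpha=1.0, rng=np.random.default_rng(0))
print(z)  # a label vector; labels are contiguous integers 0, 1, 2, ...
```

The degeneracy described above is exactly a failure of the *posterior* over such label vectors: in high dimensions it concentrates on the one-cluster or all-singletons partitions regardless of this prior.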
💡 Key Idea: Split Features, Not Samples
Traditional Consensus Monte Carlo (CMC) splits data horizontally (by observations).
VCI instead splits vertically (by features):
- Same number of observations $n$
- Fewer dimensions $p_k$
This directly targets the curse of dimensionality.
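The vertical split itself is simple. A minimal sketch (the function name and random feature shuffle are my own choices, not the paper's):

```python
import numpy as np

def vertical_shards(X, K, rng=None):
    """Split the columns (features) of X into K roughly equal shards.

    Every shard keeps all n observations but only p_k ~ p/K features:
    the vertical split VCI uses against the curse of dimensionality.
    """
    rng = np.random.default_rng(rng)
    p = X.shape[1]
    perm = rng.permutation(p)  # shuffle features before splitting
    return [X[:, idx] for idx in np.array_split(perm, K)]

X = np.random.default_rng(0).normal(size=(100, 12))
shards = vertical_shards(X, K=3, rng=0)
print([s.shape for s in shards])  # three shards of shape (100, 4)
```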
🧠 VCI Pipeline
Step 1: Vertical Sharding
\[X_i = (X_i^{(1)}, \dots, X_i^{(K)})\]
Step 2: Local Bayesian Clustering
Each shard produces:
\[p_k(\mathbf{z} \mid X^{(k)})\]
Step 3: Wasserstein Consensus
\[\bar{p} = \arg\min_p \sum_{k=1}^K \lambda_k W_{c,\epsilon}(p, p_k)\]
👉 This is a Wasserstein barycenter over shard posteriors.
🔍 Why Wasserstein?
Partitions are discrete objects, so we need a meaningful metric.
We use Variation of Information (VoI):
\[\text{VoI}(z_1, z_2) = H(z_1) + H(z_2) - 2I(z_1, z_2)\]
This gives a proper geometry over partitions.
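The VoI formula above translates directly into code via the contingency table of the two label vectors. A self-contained sketch:

```python
import numpy as np

def variation_of_information(z1, z2):
    """VoI(z1, z2) = H(z1) + H(z2) - 2 I(z1, z2), in nats.

    z1, z2 are cluster-label arrays over the same n observations.
    VoI is a true metric on partitions: non-negative, symmetric,
    zero iff the partitions coincide, with the triangle inequality.
    """
    z1, z2 = np.asarray(z1), np.asarray(z2)
    n = len(z1)
    # joint distribution over (cluster in z1, cluster in z2)
    labels1, inv1 = np.unique(z1, return_inverse=True)
    labels2, inv2 = np.unique(z2, return_inverse=True)
    joint = np.zeros((len(labels1), len(labels2)))
    np.add.at(joint, (inv1, inv2), 1.0)
    joint /= n
    p1, p2 = joint.sum(axis=1), joint.sum(axis=0)

    def H(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    I = H(p1) + H(p2) - H(joint.ravel())
    return H(p1) + H(p2) - 2.0 * I
```

For instance, two independent two-cluster partitions of four points, `[0, 0, 1, 1]` and `[0, 1, 0, 1]`, have zero mutual information, so their VoI is $H_1 + H_2 = 2\log 2$.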
⚖️ Smart Weighting of Shards
Weights $\lambda_k$ prioritize informative shards:
- Avoid trivial partitions
- Favor balanced clusters
- Penalize uncertainty
This is crucial when many dimensions are noisy.
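One simple weighting heuristic consistent with the three criteria above (this is my illustrative stand-in, not the paper's exact rule): score each shard's sampled partitions by a balance-sensitive entropy that vanishes for both trivial solutions, then normalize.

```python
import numpy as np

def shard_weights(shard_partitions, eps=1e-12):
    """Heuristic shard weights (illustrative, not the paper's rule).

    shard_partitions: one list of sampled partitions (label arrays)
    per shard. A partition scores by its cluster-size entropy, which
    is 0 for one giant cluster, times (1 - #clusters/n), which is 0
    for all-singletons; averaging over samples rewards confident,
    non-trivial shards.
    """
    scores = []
    for samples in shard_partitions:
        vals = []
        for z in samples:
            _, counts = np.unique(z, return_counts=True)
            p = counts / counts.sum()
            n = len(z)
            vals.append(-np.sum(p * np.log(p + eps)) * (1 - len(counts) / n))
        scores.append(np.mean(vals))
    scores = np.clip(np.asarray(scores), 0.0, None)
    return scores / (scores.sum() + eps)
```

A shard that found two balanced clusters gets essentially all the weight against a one-cluster shard and an all-singletons shard.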
🔧 Computation
The barycenter becomes:
\[\min_{\alpha} \sum_k \lambda_k W_{c,\epsilon}(\alpha, \alpha^{(k)})\]
Solved via iterative Bregman projections (Sinkhorn-like updates).
In practice:
- Use MCMC samples as atoms
- Avoid full partition space
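Once MCMC samples are pooled into a shared atom set, the barycenter is a fixed-support entropic OT problem. A sketch of the iterative Bregman projection updates (a generic Sinkhorn-style barycenter solver; details such as `eps` and iteration count are my assumptions):

```python
import numpy as np

def sinkhorn_barycenter(C, b, lam, eps=0.1, iters=200):
    """Entropy-regularized Wasserstein barycenter on a fixed atom set,
    via iterative Bregman projections (Sinkhorn-like updates).

    C   : (m, m) cost matrix between atoms, e.g. VoI between a pool
          of sampled partitions used as the shared support
    b   : (K, m) shard posteriors as weight vectors over the atoms
    lam : (K,) shard weights lambda_k, summing to 1
    """
    Kmat = np.exp(-C / eps)                   # Gibbs kernel
    lam = np.asarray(lam)
    u = np.ones_like(b)
    for _ in range(iters):
        v = b / (u @ Kmat)                    # match each shard marginal b_k
        Kv = v @ Kmat.T                       # (K, m)
        a = np.exp(lam @ np.log(Kv + 1e-300)) # geometric mean across shards
        u = a / Kv                            # rescale toward common marginal
    return a / a.sum()
```

With two mirrored shard posteriors and equal weights, symmetry forces the barycenter to split its mass evenly between the two atoms.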
📊 Experimental Insights
Scenario 1: Low Dimension
- VCI ≈ full posterior
- Better than individual shards
Scenario 2: Noisy Dimensions
- Automatically downweights noise
Scenario 3: High-Dimensional Data (25k features)
- Outperforms full model
🔗 Theoretical Insight: Why VCI Works
VCI is not heuristic — it is grounded in variational inference.
We now show the key result.
📌 Proof Sketch: VCI as Variational Inference
We prove that the Wasserstein barycenter minimizes an upper bound of the NELBO.
🔧 Step 0: Hierarchical Model (Key Setup)
The entire argument relies on the following latent-variable model:
\[\mathbf{z} \sim p(\mathbf{z})\]
\[p(\mathbf{z}^{(k)} \mid \mathbf{z}) \propto \exp\big(-\zeta_k \, c(\mathbf{z}^{(k)}, \mathbf{z})\big)\]
\[X^{(k)} \mid \mathbf{z}^{(k)} \sim p_k(X^{(k)} \mid \mathbf{z}^{(k)})\]
So the joint distribution is:
\[p(\mathbf{z}, \mathbf{z}^{(1:K)}, X) = p(\mathbf{z}) \prod_{k=1}^K p(\mathbf{z}^{(k)} \mid \mathbf{z}) \; p_k(X^{(k)} \mid \mathbf{z}^{(k)})\]
👉 Key intuition:
- $\mathbf{z}$ = global consensus partition
- $\mathbf{z}^{(k)}$ = shard-specific partitions
- The exponential term enforces soft alignment via a transport cost
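A toy forward simulation of this hierarchical model makes the soft alignment concrete. The atom set, the uniform prior on it, and the pairwise co-clustering cost are all illustrative choices of mine, standing in for the paper's partition space and cost $c$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative atoms: a few candidate partitions of n = 4 points.
atoms = [np.array(a) for a in ([0, 0, 1, 1], [0, 1, 0, 1],
                               [0, 0, 0, 0], [0, 1, 2, 3])]

def pair_cost(z1, z2):
    """A label-invariant transport cost c: fraction of point pairs on
    whose co-clustering the two partitions disagree."""
    A1 = z1[:, None] == z1[None, :]
    A2 = z2[:, None] == z2[None, :]
    return np.mean(A1 != A2)

# z ~ p(z): uniform prior over the atoms (purely illustrative)
z = atoms[rng.integers(len(atoms))]

# z^(k) | z  propto  exp(-zeta_k * c(z^(k), z)), normalized over atoms
zeta = [2.0, 5.0]
for k, zk in enumerate(zeta):
    logits = np.array([-zk * pair_cost(a, z) for a in atoms])
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    z_k = atoms[rng.choice(len(atoms), p=probs)]
    print(f"shard {k}: z^(k) = {z_k}, alignment probs = {np.round(probs, 3)}")
```

Larger $\zeta_k$ concentrates the shard partition on the global $\mathbf{z}$, which is exactly the soft-alignment role the exponential term plays.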
Step 1: Lower bound the entropy
\[H_q(\mathbf{z}, \mathbf{z}^{(1:K)}) = H_q(\mathbf{z}) + H_q(\mathbf{z}^{(1:K)} \mid \mathbf{z})\]
The joint conditional entropy dominates each shard's, $H_q(\mathbf{z}^{(1:K)} \mid \mathbf{z}) \geq H_q(\mathbf{z}^{(k)} \mid \mathbf{z})$, so averaging over $k$ gives:
\[H_q(\mathbf{z}, \mathbf{z}^{(1:K)}) \geq \frac{1}{K} \sum_{k=1}^K H_q(\mathbf{z}, \mathbf{z}^{(k)})\]
Step 2: Define an upper bound
\[\overline{\mathcal{L}}(q) = \mathbb{E}_q[-\log p(\mathbf{z}, \mathbf{z}^{(1:K)}, X)] - \frac{1}{K} \sum_{k=1}^K H_q(\mathbf{z}, \mathbf{z}^{(k)})\]
Then:
\[\overline{\mathcal{L}}(q) \geq \mathcal{L}(q)\]
Step 3: Expand the model (using Step 0)
\[\log p(\mathbf{z}, \mathbf{z}^{(1:K)}, X) = \sum_{k=1}^K \left[ -\zeta_k c(\mathbf{z}^{(k)}, \mathbf{z}) + \log p_k(X^{(k)} \mid \mathbf{z}^{(k)}) \right] - C\]
Step 4: Optimize over couplings
\[\min_{q} \mathcal{L}(q) \leq \min_{q} \overline{\mathcal{L}}(q)\]
Step 5: Restrict to star structure
\[q(\mathbf{z}, \mathbf{z}^{(1:K)}) = q_0(\mathbf{z}) \prod_{k=1}^K q(\mathbf{z}^{(k)} \mid \mathbf{z})\]
This simplifies optimization.
Step 6: Decompose across shards
Under the star factorization, the transport-cost and entropy terms of $\overline{\mathcal{L}}$ split into independent per-shard problems over couplings $q_{0,k}$ of $q_0$ and $q_k$:
\[\sum_{k=1}^K \min_{q_{0,k}} \mathbb{E}_{q_{0,k}}\left[ \zeta_k c(\mathbf{z}^{(k)}, \mathbf{z}) + \frac{1}{K} \log q_{0,k}(\mathbf{z}, \mathbf{z}^{(k)}) \right]\]
Step 7: Recognize entropic OT
This becomes:
\[\sum_{k=1}^K \zeta_k W_{c, \frac{1}{K \zeta_k}}(q_0, q_k)\]
Step 8: Final bound
\[\min_{q} \mathcal{L}(q) \leq \sum_{k=1}^K \zeta_k W_{c, \frac{1}{K \zeta_k}}(q_0, q_k) - \sum_{k=1}^K \mathbb{E}_{q_k}[\log p_k(X^{(k)} \mid \mathbf{z}^{(k)})] + C\]
Fix each $q_k$ at its shard posterior, $q_k(\mathbf{z}^{(k)}) = p_k(\mathbf{z}^{(k)} \mid X^{(k)})$. The likelihood term $\mathbb{E}_{\mathbf{z}^{(k)} \sim q_k}\left[\log p_k(X^{(k)} \mid \mathbf{z}^{(k)})\right]$ is then constant with respect to $q_0$, so minimizing the right-hand side over $q_0$ is exactly the Wasserstein barycenter problem.
✅ Key Takeaway
The Wasserstein barycenter:
- Minimizes an upper bound of the negative ELBO (NELBO)
- Provides a principled variational approximation
- Connects optimal transport with Bayesian inference
🚀 Why VCI Matters
- Fixes high-dimensional failure of Bayesian clustering
- Parallel and scalable
- Works with MCMC / VI / SMC
- Theoretically grounded
🔮 Future Work
- Better weighting schemes
- Faster optimal transport
- Joint shard + consensus learning
- Extensions beyond clustering
🧾 Final Thought
VCI turns high-dimensional clustering from a failure mode into a principled, scalable inference problem by combining divide-and-conquer + optimal transport + generalized Bayes + variational inference.