Maximum Mean Discrepancy (MMD) Measure
A Maximum Mean Discrepancy (MMD) Measure is a statistical divergence measure that can be used to create distribution comparison systems that support two-sample hypothesis testing.
- AKA: Kernel Mean Discrepancy, RKHS Distance Metric.
- Context:
- Mathematical Definition:
- Given two distributions [math]\displaystyle{ p }[/math] and [math]\displaystyle{ q }[/math] over domain [math]\displaystyle{ \mathcal{X} }[/math], and a class of functions [math]\displaystyle{ \mathcal{F} }[/math], the Maximum Mean Discrepancy is defined as:
- [math]\displaystyle{ \text{MMD}[\mathcal{F}, p, q] := \sup_{f \in \mathcal{F}} \left( \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{y \sim q}[f(y)] \right) }[/math].
- When [math]\displaystyle{ \mathcal{F} }[/math] is the unit ball in an RKHS [math]\displaystyle{ \mathcal{H} }[/math] with kernel [math]\displaystyle{ k }[/math], the MMD equals the RKHS distance [math]\displaystyle{ \|\mu_p - \mu_q\|_{\mathcal{H}} }[/math] between the mean embeddings of [math]\displaystyle{ p }[/math] and [math]\displaystyle{ q }[/math], and by the reproducing property the squared MMD has the closed form:
- [math]\displaystyle{ \text{MMD}^2(p, q) = \mathbb{E}_{x,x' \sim p}[k(x,x')] + \mathbb{E}_{y,y' \sim q}[k(y,y')] - 2\, \mathbb{E}_{x \sim p,\, y \sim q}[k(x,y)] }[/math].
- It can quantify the difference between two probability distributions by embedding each as a mean element in a reproducing kernel Hilbert space (RKHS) and measuring the RKHS distance between these mean embeddings.
- It can support two-sample hypothesis testing by serving as a non-parametric test statistic.
- It can support domain adaptation by minimizing the divergence between source and target domain feature distributions.
- It can serve as a loss function in generative modeling frameworks, such as generative moment matching networks and kernel-based GANs.
- It can assist in detecting dataset shift and concept drift in streaming and evolving data environments.
- It can be extended through variants like weighted maximum mean discrepancy (for class imbalance) and joint maximum mean discrepancy (for joint distribution alignment).
- It can be computed in closed form from kernel evaluations alone, with an exact quadratic-time empirical estimator and linear-time approximations for large-scale data (see the estimator sketch after this list).
- It can be used to compare distributions without requiring density estimation, making it suitable for high-dimensional data.
- It can be estimated empirically using kernel functions such as Gaussian or polynomial kernels.
- It can be integrated into machine learning pipelines for tasks like anomaly detection, transfer learning, and model evaluation.
- It can be extended to measure discrepancies in structured data, including time series and graphs.
- It can range from being a simple linear-time empirical estimate to being a full quadratic-time kernel-matrix computation, depending on the choice of kernel and computational considerations.
- It can range from being a general-purpose distribution comparison tool to being a specialized component in specific machine learning algorithms.
- ...
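The closed-form expression for the squared MMD above can be estimated directly from samples. Below is a minimal sketch of the unbiased U-statistic estimator from Gretton et al. (2012), using a Gaussian kernel; the function names, the bandwidth value, and the synthetic data are illustrative assumptions rather than any reference implementation.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gaussian (RBF) kernel matrix: k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-sq_dists / (2.0 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased (U-statistic) estimate of MMD^2 from samples X ~ p, Y ~ q.

    The diagonal terms k(x_i, x_i) and k(y_j, y_j) are excluded from the
    within-sample averages, following Gretton et al. (2012).
    """
    m, n = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    term_xy = 2.0 * Kxy.mean()
    return term_xx + term_yy - term_xy

# Sanity check on synthetic data: distributions with different means should
# give a clearly positive MMD^2, matched distributions a value near zero.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))
Y = rng.normal(0.5, 1.0, size=(500, 2))
X2 = rng.normal(0.0, 1.0, size=(500, 2))
print(mmd2_unbiased(X, Y))   # clearly positive: different distributions
print(mmd2_unbiased(X, X2))  # near zero: same distribution
```

The quadratic cost in the sample size comes from the full kernel matrices; the linear-time approximation mentioned in the list instead averages kernel evaluations over disjoint sample pairs.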
- Example(s):
- Two-Sample Hypothesis Testing, which uses MMD to determine if two datasets originate from the same distribution (see the permutation-test sketch after this list).
- Domain Adaptation techniques, which minimize MMD to align feature distributions across domains.
- MMD-based Generative Adversarial Networks (MMD-GANs), which employ MMD as a loss function to improve the quality of generated samples.
- Kernel Mean Matching methods, which utilize MMD for sample reweighting in covariate shift scenarios.
- Joint Maximum Mean Discrepancy in deep domain adaptation to match joint feature-label distributions.
- ...
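To make the two-sample testing example concrete, the following sketch turns an MMD² estimate into a p-value via a permutation test, the standard recipe in Gretton et al. (2012): under the null hypothesis [math]\displaystyle{ p = q }[/math] the pooled sample is exchangeable, so random re-splits approximate the null distribution of the statistic. The function signature and the permutation count are illustrative assumptions; mmd2_fn can be any MMD² estimator, such as mmd2_unbiased from the sketch above.

```python
import numpy as np

def mmd_permutation_test(X, Y, mmd2_fn, n_permutations=500, seed=0):
    """Permutation two-sample test: approximate p-value for H0: p = q.

    mmd2_fn(A, B) must return an MMD^2 estimate for two sample arrays.
    """
    rng = np.random.default_rng(seed)
    m = len(X)
    pooled = np.vstack([X, Y])
    observed = mmd2_fn(X, Y)
    exceed = 0
    for _ in range(n_permutations):
        perm = rng.permutation(len(pooled))
        X_perm, Y_perm = pooled[perm[:m]], pooled[perm[m:]]
        if mmd2_fn(X_perm, Y_perm) >= observed:
            exceed += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (exceed + 1) / (n_permutations + 1)

# Usage (with mmd2_unbiased from the earlier sketch):
#   p = mmd_permutation_test(X, Y, mmd2_unbiased)
#   Reject H0 at level alpha if p < alpha.
```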
- Counter-Example(s):
- Kullback-Leibler Divergence, which requires explicit density estimation and may diverge when supports do not overlap.
- Wasserstein Distance, which considers the geometry of the data and is computationally more expensive in high dimensions.
- Total Variation Distance, which does not leverage kernel-based embeddings and can be overly sensitive to sample noise.
- Euclidean Distance, which measures point-wise differences and does not capture distributional discrepancies.
- ...
- See: Integral Probability Metric, Reproducing Kernel Hilbert Space, Kernel Method, Domain Adaptation, Generative Moment Matching Network, Two-Sample Hypothesis Testing, Multi-Agent Reinforcement Learning System.
References
2025
- (Wikipedia, 2025) ⇒ "Kernel embedding of distributions". In: Wikipedia. Retrieved:2025-05-25.
- QUOTE: The kernel embedding of distributions (also called the kernel mean or mean map) is a nonparametric method representing a probability distribution as an element of a reproducing kernel Hilbert space (RKHS). This framework enables comparison and manipulation of distributions using Hilbert space operations, such as inner products and distances, and can preserve all statistical features of arbitrary distributions if a characteristic kernel is used. The maximum mean discrepancy (MMD) is a distance measure between distributions defined as the distance between their RKHS embeddings, and is widely used for two-sample tests and domain adaptation.
2022
- (Machine Learning Note, 2022) ⇒ Machine Learning Note. (2022). "Maximum Mean Discrepancy (MMD)".
- QUOTE: Maximum mean discrepancy (MMD) is a statistical test for measuring the difference between two distributions based on their embeddings in a reproducing kernel Hilbert space. MMD is used for two-sample testing, domain adaptation, and generative model evaluation.
2019
- (Tunali, 2019) ⇒ Onur Tunali. (2019). "Maximum Mean Discrepancy in Machine Learning".
- QUOTE: MMD computes the distance between the means of two distributions in a kernel-induced feature space. It is a nonparametric method that does not require density estimation and is widely used for domain adaptation and distribution comparison in machine learning.
2015a
- (Li et al., 2015a) ⇒ Chunyuan Li, Kevin Swersky, and Richard Zemel (2015a). "Generative Moment Matching Networks". In: Proceedings of the 32nd International Conference on Machine Learning (ICML).
- QUOTE: Generative Moment Matching Networks use maximum mean discrepancy as a training criterion to align generated data with the true data distribution.
The method enables implicit generative modeling without the need for an adversarial framework.
This approach demonstrates competitive results compared to GANs across a range of benchmark datasets.
2015b
- (Long et al., 2015b) ⇒ Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I. Jordan (2015b). "Learning Transferable Features with Deep Adaptation Networks". In: Proceedings of the 32nd International Conference on Machine Learning (ICML).
- QUOTE: The Deep Adaptation Network minimizes the maximum mean discrepancy across domain layers to enable effective feature transfer.
By embedding MMD into deep neural networks, the model reduces domain shift and improves transfer learning performance.
Empirical results show significant gains on several domain adaptation benchmarks.
2012
- (Gretton et al., 2012) ⇒ Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola (2012). "A Kernel Two-Sample Test". In: Journal of Machine Learning Research.
- QUOTE: The maximum mean discrepancy (MMD) measures the distance between probability distributions using reproducing kernel Hilbert space embeddings.
MMD provides a non-parametric method for two-sample hypothesis testing with rigorous statistical guarantees.
Its power depends on the choice of the kernel function and sample size.
2010
- (Sriperumbudur et al., 2010) ⇒ Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert R. G. Lanckriet (2010). "Hilbert Space Embeddings and Metrics on Probability Measures". In: Journal of Machine Learning Research.
- QUOTE: Hilbert space embeddings offer a general framework for defining and computing metrics on probability distributions.
The paper formalizes conditions under which MMD is a valid metric and compares it with other statistical divergence measures.
Theoretical analysis clarifies when kernel-based discrepancy measures are strictly greater than zero.
2009
- (Chen et al., 2009) ⇒ Bo Chen, Wai Lam, Ivor Tsang, and Tak-Lam Wong. (2009). “Extracting Discriminative Concepts for Domain Adaptation in Text Mining.” In: Proceedings of ACM SIGKDD Conference (KDD-2009). doi:10.1145/1557019.1557045
- … Maximum Mean Discrepancy (MMD) [5] is adopted to measure the embedded distribution difference between the source domain with sufficient but finite labeled data and the target domain with sufficient unlabeled data.
2007
- (Gretton et al., 2007) ⇒ A. Gretton, K. Borgwardt, M. Rasch, Bernhard Schölkopf, and Alexander J. Smola. (2007). “A Kernel Method for the Two-Sample Problem.” In: Advances in Neural Information Processing Systems, 19.
- … We call this statistic the Maximum Mean Discrepancy (MMD). ...
2006
- (Borgwardt et al., 2006) ⇒ Karsten M. Borgwardt, Arthur Gretton, Malte J. Rasch, Hans-Peter Kriegel, Bernhard Schölkopf, and Alexander J. Smola (2006). "Integrating Structured Biological Data by Kernel Maximum Mean Discrepancy". In: Bioinformatics.
- QUOTE: Kernel maximum mean discrepancy is applied to structured biological data to test distributional differences.
This approach shows high accuracy in identifying biological variability across sample groups.
The results support the use of MMD in practical bioinformatics applications.