Maximum Mean Discrepancy (MMD) Measure
A Maximum Mean Discrepancy (MMD) Measure is a statistical divergence measure that can be used to create distribution comparison systems that support two-sample hypothesis testing.
- AKA: Kernel Mean Discrepancy, RKHS Distance Metric.
- Context:
- Mathematical Definition:
- Given two distributions [math]\displaystyle{ p }[/math] and [math]\displaystyle{ q }[/math] over domain [math]\displaystyle{ \mathcal{X} }[/math], and a class of functions [math]\displaystyle{ \mathcal{F} }[/math], the Maximum Mean Discrepancy is defined as:
- [math]\displaystyle{ \text{MMD}[\mathcal{F}, p, q] := \sup_{f \in \mathcal{F}} \left( \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{y \sim q}[f(y)] \right) }[/math].
- When [math]\displaystyle{ \mathcal{F} }[/math] is the unit ball in an RKHS [math]\displaystyle{ \mathcal{H} }[/math] with kernel [math]\displaystyle{ k }[/math], the MMD equals the RKHS distance [math]\displaystyle{ \|\mu_p - \mu_q\|_{\mathcal{H}} }[/math] between the mean embeddings of [math]\displaystyle{ p }[/math] and [math]\displaystyle{ q }[/math], and by the reproducing property the squared MMD has the closed form:
- [math]\displaystyle{ \text{MMD}^2(p, q) = \mathbb{E}_{x,x' \sim p}[k(x,x')] + \mathbb{E}_{y,y' \sim q}[k(y,y')] - 2\, \mathbb{E}_{x \sim p,\, y \sim q}[k(x,y)] }[/math].
- It can quantify the difference between two probability distributions by embedding each as a mean element in a reproducing kernel Hilbert space (RKHS) and measuring the RKHS distance between these mean embeddings.
- It can support two-sample hypothesis testing by serving as a non-parametric test statistic.
- It can support domain adaptation by minimizing the divergence between source and target domain feature distributions.
- It can serve as a loss function in generative modeling frameworks, such as generative moment matching networks and kernel-based GANs.
- It can assist in detecting dataset shift and concept drift in streaming and evolving data environments.
- It can be extended through variants like weighted maximum mean discrepancy (for class imbalance) and joint maximum mean discrepancy (for joint distribution alignment).
- It can be computed in closed form from kernel evaluations alone, with an exact quadratic-time empirical estimator and linear-time approximations for large-scale data (see the estimator sketch after this list).
- It can be used to compare distributions without requiring density estimation, making it suitable for high-dimensional data.
- It can be estimated empirically using kernel functions such as Gaussian or polynomial kernels.
- It can be integrated into machine learning pipelines for tasks like anomaly detection, transfer learning, and model evaluation.
- It can be extended to measure discrepancies in structured data, including time series and graphs.
- It can range from being a simple linear-time empirical estimate to being a full quadratic-time kernel-matrix computation, depending on the choice of kernel and computational considerations.
- It can range from being a general-purpose distribution comparison tool to being a specialized component in specific machine learning algorithms.
- ...
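The closed-form expression for the squared MMD above can be estimated directly from samples. Below is a minimal sketch of the unbiased U-statistic estimator from Gretton et al. (2012), using a Gaussian kernel; the function names, the bandwidth value, and the synthetic data are illustrative assumptions rather than any reference implementation.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gaussian (RBF) kernel matrix: k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-sq_dists / (2.0 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased (U-statistic) estimate of MMD^2 from samples X ~ p, Y ~ q.

    The diagonal terms k(x_i, x_i) and k(y_j, y_j) are excluded from the
    within-sample averages, following Gretton et al. (2012).
    """
    m, n = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    term_xy = 2.0 * Kxy.mean()
    return term_xx + term_yy - term_xy

# Sanity check on synthetic data: distributions with different means should
# give a clearly positive MMD^2, matched distributions a value near zero.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))
Y = rng.normal(0.5, 1.0, size=(500, 2))
X2 = rng.normal(0.0, 1.0, size=(500, 2))
print(mmd2_unbiased(X, Y))   # clearly positive: different distributions
print(mmd2_unbiased(X, X2))  # near zero: same distribution
```

The quadratic cost in the sample size comes from the full kernel matrices; the linear-time approximation mentioned in the list instead averages kernel evaluations over disjoint sample pairs.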
- Example(s):
- Two-Sample Hypothesis Testing, which uses MMD to determine if two datasets originate from the same distribution (see the permutation-test sketch after this list).
- Domain Adaptation techniques, which minimize MMD to align feature distributions across domains.
- MMD-based Generative Adversarial Networks (MMD-GANs), which employ MMD as a loss function to improve the quality of generated samples.
- Kernel Mean Matching methods, which utilize MMD for sample reweighting in covariate shift scenarios.
- Joint Maximum Mean Discrepancy in deep domain adaptation to match joint feature-label distributions.
- ...
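To make the two-sample testing example concrete, the following sketch turns an MMD² estimate into a p-value via a permutation test, the standard recipe in Gretton et al. (2012): under the null hypothesis [math]\displaystyle{ p = q }[/math] the pooled sample is exchangeable, so random re-splits approximate the null distribution of the statistic. The function signature and the permutation count are illustrative assumptions; mmd2_fn can be any MMD² estimator, such as mmd2_unbiased from the sketch above.

```python
import numpy as np

def mmd_permutation_test(X, Y, mmd2_fn, n_permutations=500, seed=0):
    """Permutation two-sample test: approximate p-value for H0: p = q.

    mmd2_fn(A, B) must return an MMD^2 estimate for two sample arrays.
    """
    rng = np.random.default_rng(seed)
    m = len(X)
    pooled = np.vstack([X, Y])
    observed = mmd2_fn(X, Y)
    exceed = 0
    for _ in range(n_permutations):
        perm = rng.permutation(len(pooled))
        X_perm, Y_perm = pooled[perm[:m]], pooled[perm[m:]]
        if mmd2_fn(X_perm, Y_perm) >= observed:
            exceed += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (exceed + 1) / (n_permutations + 1)

# Usage (with mmd2_unbiased from the earlier sketch):
#   p = mmd_permutation_test(X, Y, mmd2_unbiased)
#   Reject H0 at level alpha if p < alpha.
```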
- Counter-Example(s):
- Kullback-Leibler Divergence, which requires explicit density estimation and may diverge when supports do not overlap.
- Wasserstein Distance, which considers the geometry of the data and is computationally more expensive in high dimensions.
- Total Variation Distance, which does not leverage kernel-based embeddings and can be overly sensitive to sample noise.
- Euclidean Distance, which measures point-wise differences and does not capture distributional discrepancies.
- ...
- See: Integral Probability Metric, Reproducing Kernel Hilbert Space, Kernel Method, Domain Adaptation, Generative Moment Matching Network, Two-Sample Hypothesis Testing, Multi-Agent Reinforcement Learning System.
References
2025
- (Wikipedia, 2025) ⇒ "Kernel embedding of distributions". In: Wikipedia. Retrieved:2025-05-25.
- QUOTE: The kernel embedding of distributions (also called the kernel mean or mean map) is a nonparametric method representing a probability distribution as an element of a reproducing kernel Hilbert space (RKHS). This framework enables comparison and manipulation of distributions using Hilbert space operations, such as inner products and distances, and can preserve all statistical features of arbitrary distributions if a characteristic kernel is used. The maximum mean discrepancy (MMD) is a distance measure between distributions defined as the distance between their RKHS embeddings, and is widely used for two-sample tests and domain adaptation.
2022
- (Machine Learning Note, 2022) ⇒ Machine Learning Note. (2022). "Maximum Mean Discrepancy (MMD)".
- QUOTE: Maximum mean discrepancy (MMD) is a statistical test for measuring the difference between two distributions based on their embeddings in a reproducing kernel Hilbert space. MMD is used for two-sample testing, domain adaptation, and generative model evaluation.
2019
- (Tunali, 2019) ⇒ Onur Tunali. (2019). "Maximum Mean Discrepancy in Machine Learning".
- QUOTE: MMD computes the distance between the means of two distributions in a kernel-induced feature space. It is a nonparametric method that does not require density estimation and is widely used for domain adaptation and distribution comparison in machine learning.
2015a
- (Li et al., 2015a) ⇒ Chunyuan Li, Kevin Swersky, and Richard Zemel (2015a). "Generative Moment Matching Networks". In: Proceedings of the 32nd International Conference on Machine Learning (ICML).
- QUOTE: Generative Moment Matching Networks use maximum mean discrepancy as a training criterion to align generated data with the true data distribution.
The method enables implicit generative modeling without the need for an adversarial framework.
This approach demonstrates competitive results compared to GANs across a range of benchmark datasets.
2015b
- (Long et al., 2015b) ⇒ Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I. Jordan (2015b). "Learning Transferable Features with Deep Adaptation Networks". In: Proceedings of the 32nd International Conference on Machine Learning (ICML).
- QUOTE: The Deep Adaptation Network minimizes the maximum mean discrepancy across domain layers to enable effective feature transfer.
By embedding MMD into deep neural networks, the model reduces domain shift and improves transfer learning performance.
Empirical results show significant gains on several domain adaptation benchmarks.
2012
- (Gretton et al., 2012) ⇒ Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola (2012). "A Kernel Two-Sample Test". In: Journal of Machine Learning Research.
- QUOTE: The maximum mean discrepancy (MMD) measures the distance between probability distributions using reproducing kernel Hilbert space embeddings.
MMD provides a non-parametric method for two-sample hypothesis testing with rigorous statistical guarantees.
Its power depends on the choice of the kernel function and sample size.
2010
- (Sriperumbudur et al., 2010) ⇒ Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert R. G. Lanckriet (2010). "Hilbert Space Embeddings and Metrics on Probability Measures". In: Journal of Machine Learning Research.
- QUOTE: Hilbert space embeddings offer a general framework for defining and computing metrics on probability distributions.
The paper formalizes conditions under which MMD is a valid metric and compares it with other statistical divergence measures.
Theoretical analysis clarifies when kernel-based discrepancy measures are strictly greater than zero.
2009
- (Chen et al., 2009) ⇒ Bo Chen, Wai Lam, Ivor Tsang, and Tak-Lam Wong. (2009). “Extracting Discriminative Concepts for Domain Adaptation in Text Mining.” In: Proceedings of ACM SIGKDD Conference (KDD-2009). doi:10.1145/1557019.1557045
- … Maximum Mean Discrepancy (MMD) [5] is adopted to measure the embedded distribution difference between the source domain with sufficient but finite labeled data and the target domain with sufficient unlabeled data.
2007
- (Gretton et al., 2007) ⇒ A. Gretton, K. Borgwardt, M. Rasch, Bernhard Schölkopf, and Alexander J. Smola. (2007). “A Kernel Method for the Two-Sample Problem.” In: Advances in Neural Information Processing Systems, 19.
- … We call this statistic the Maximum Mean Discrepancy (MMD). ...
2006
- (Borgwardt et al., 2006) ⇒ Karsten M. Borgwardt, Arthur Gretton, Malte J. Rasch, Hans-Peter Kriegel, Bernhard Schölkopf, and Alexander J. Smola (2006). "Integrating Structured Biological Data by Kernel Maximum Mean Discrepancy". In: Bioinformatics.
- QUOTE: Kernel maximum mean discrepancy is applied to structured biological data to test distributional differences.
This approach shows high accuracy in identifying biological variability across sample groups.
The results support the use of MMD in practical bioinformatics applications.