Fβ-Score Measure
An Fβ-Score Measure is a bounded, monotonic binary classification measure that computes a parameterized weighted harmonic mean of precision and recall, where the parameter β expresses how many times as much importance is attached to recall as to precision.
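For example, with precision 0.5 and recall 1.0, F₁ = 2 × (0.5 × 1.0) / (0.5 + 1.0) ≈ 0.667, which is below the arithmetic mean of 0.75; the harmonic mean pulls the score toward the lower of the two components.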
- AKA: F-Score, F-Measure, Van Rijsbergen's F.
- Context:
- It can typically compute F-Measure Scores through f-measure parameterized harmonic mean of f-measure precision and f-measure recall with f-measure beta weight parameter.
- It can typically weight F-Measure Precision Components through f-measure beta-squared factor in f-measure denominator term.
- It can typically weight F-Measure Recall Components through f-measure unit factor in f-measure denominator term.
- It can typically behave as a Bounded Monotonic Utility Measure through f-measure normalized range from zero to one.
- It can typically provide F-Measure Interpretable Scores through f-measure higher-is-better principle.
- It can typically handle F-Measure Class Imbalances through f-measure precision-recall focus rather than f-measure accuracy bias.
- It can typically support F-Measure Parameter Tuning through f-measure beta adjustment for f-measure application-specific requirements.
- It can typically express F-Measure User Preferences through f-measure beta parameter representing f-measure recall-to-precision importance ratio.
- It can typically achieve F-Measure Maximum Values only when both f-measure precision and f-measure recall are maximized.
- ...
- It can often penalize F-Measure Extreme Values through f-measure harmonic mean properties.
- It can often guide F-Measure Model Selections through f-measure performance comparisons across f-measure candidate models.
- It can often enable F-Measure Threshold Optimizations through f-measure score maximization over f-measure decision boundaries.
- It can often facilitate F-Measure Cross-Validations through f-measure fold evaluations in f-measure model assessment.
- It can often support F-Measure Multi-Class Extensions through f-measure averaging strategies like f-measure macro-averaging and f-measure micro-averaging.
- It can often be preferred over Accuracy Metrics when f-measure positive classes are rare or f-measure class costs differ.
- It can often be computed from F-Measure Confusion Matrix Elements: f-measure true positives, f-measure false positives, and f-measure false negatives.
- ...
- It can range from being a Precision-Focused F-Measure to being a Recall-Focused F-Measure, depending on its f-measure beta parameter value.
- It can range from being a Binary F-Measure to being a Multi-Class F-Measure, depending on its f-measure classification scope.
- It can range from being a Micro-Averaged F-Measure to being a Macro-Averaged F-Measure, depending on its f-measure aggregation strategy.
- It can range from being a Standard F-Measure to being a Weighted F-Measure, depending on its f-measure class importance.
- It can range from being a Point F-Measure to being an Interval F-Measure, depending on its f-measure temporal scope.
- It can range from being a Hard F-Measure to being a Soft F-Measure, depending on its f-measure prediction type.
- It can range from being a Single-Label F-Measure to being a Multi-Label F-Measure, depending on its f-measure label assignment.
- ...
- It can be calculated using F-Measure General Formula: F_β = (1 + β²) × (precision × recall) / (β² × precision + recall) (see the runnable sketch after this list).
- It can be derived from Van Rijsbergen's Effectiveness Measure through f-measure complementary transformation.
- It can be interpreted as F-Measure Weighted Average biased toward the f-measure lower value between f-measure precision and f-measure recall.
- It can be optimized directly as F-Measure Loss Functions in f-measure machine learning training.
- It can be approximated through F-Measure Surrogate Functions for f-measure gradient-based optimization.
- ...
- It can evaluate F-Measure Classification Models for f-measure performance assessment.
- It can optimize F-Measure Decision Thresholds for f-measure operating point selection.
- It can compare F-Measure Detection Systems for f-measure relative ranking.
- It can measure F-Measure Information Retrieval Systems for f-measure relevance evaluation.
- It can assess F-Measure Named Entity Recognition Systems for f-measure extraction quality.
- It can benchmark F-Measure Machine Translation Systems for f-measure translation accuracy.
- It can validate F-Measure Medical Diagnostic Systems for f-measure clinical performance.
- ...
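The following is a minimal Python sketch of the formulas above, assuming only the definitions on this page; the helper names f_beta and f_beta_from_counts are illustrative rather than a standard API:

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    F_beta = (1 + beta^2) * (precision * recall) / (beta^2 * precision + recall)
    Returns 0.0 when both components are 0 (the usual convention).
    """
    denominator = beta**2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / denominator


def f_beta_from_counts(tp, fp, fn, beta=1.0):
    """Equivalent form from confusion-matrix counts; true negatives are not used."""
    denominator = (1 + beta**2) * tp + beta**2 * fn + fp
    if denominator == 0:
        return 0.0
    return (1 + beta**2) * tp / denominator


# Worked check with precision = 0.5, recall = 1.0:
print(f_beta(0.5, 1.0, beta=1.0))  # ~0.667 (balanced F1)
print(f_beta(0.5, 1.0, beta=2.0))  # ~0.833 (beta=2 favors the high recall)
print(f_beta(0.5, 1.0, beta=0.5))  # ~0.556 (beta=0.5 favors the low precision)
print(f_beta_from_counts(tp=50, fp=50, fn=0))  # same scenario via counts: ~0.667
```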
- Example(s):
- Balanced F-Measures, such as:
- F1-Score Metric (β=1), which weights f-measure precision and f-measure recall equally for f-measure balanced evaluation.
- F1-Measure for f-measure equal importance of f-measure false positives and f-measure false negatives.
- Standard F1 Score used as default in most f-measure classification evaluations.
- Recall-Emphasized F-Measures, such as:
- F2-Measure (β=2), which weights f-measure recall twice as much as f-measure precision.
- F3-Measure (β=3), which strongly emphasizes f-measure recall over f-measure precision.
- F5-Measure (β=5), which heavily prioritizes f-measure recall for f-measure high-sensitivity applications.
- F10-Measure (β=10), which almost exclusively focuses on f-measure recall.
- Precision-Emphasized F-Measures, such as:
- F0.5-Measure (β=0.5), which weights f-measure precision twice as much as f-measure recall.
- F0.3-Measure (β=0.3), which strongly emphasizes f-measure precision over f-measure recall.
- F0.2-Measure (β=0.2), which heavily prioritizes f-measure precision for f-measure high-precision applications.
- F0.1-Measure (β=0.1), which almost exclusively focuses on f-measure precision.
- Domain-Specific F-Measures, such as:
- Information Retrieval F-Measures.
- Medical Diagnosis F-Measures, such as:
- Cancer Screening F2-Measure prioritizing f-measure sensitivity to avoid f-measure missed cases.
- Confirmation Test F0.5-Measure prioritizing f-measure specificity to avoid f-measure false alarms.
- Emergency Triage F3-Measure for f-measure critical case detection.
- Surgical Planning F0.3-Measure for f-measure precise diagnosis.
- Natural Language Processing F-Measures.
- Computer Vision F-Measures.
- Security System F-Measures, such as:
- Intrusion Detection F2-Measure prioritizing f-measure threat detection.
- Fraud Detection F1-Measure balancing f-measure false alarms and f-measure missed frauds.
- Spam Filter F0.5-Measure prioritizing f-measure legitimate email preservation.
- Temporally-Extended F-Measures.
- Multi-Class F-Measure Extensions (see the averaging sketch after this example list), such as:
- Micro-F1 Measure aggregating f-measure global counts across all classes.
- Macro-F1 Measure averaging f-measure class-wise scores unweighted.
- Weighted F1-Measure using f-measure class support weights.
- Per-Class F-Measures computing separate f-measure scores for each class.
- Multi-Label F-Measures.
- ...
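As a sketch of the multi-class averaging strategies listed above, the following uses scikit-learn's fbeta_score (assuming scikit-learn is installed); the toy label vectors are illustrative only:

```python
from sklearn.metrics import fbeta_score

y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 1, 2, 1, 1, 0, 2, 2]

# Micro-averaging: pool true-positive/false-positive/false-negative counts
# across all classes, then compute a single F1 from the pooled counts.
print(fbeta_score(y_true, y_pred, beta=1.0, average="micro"))

# Macro-averaging: compute one F1 per class, then take the unweighted mean.
print(fbeta_score(y_true, y_pred, beta=1.0, average="macro"))

# Weighted averaging: mean of per-class F1 scores weighted by class support.
print(fbeta_score(y_true, y_pred, beta=1.0, average="weighted"))

# average=None returns the per-class scores individually.
print(fbeta_score(y_true, y_pred, beta=1.0, average=None))

# Any beta works with any averaging strategy; beta=2 shifts scores toward recall.
print(fbeta_score(y_true, y_pred, beta=2.0, average="macro"))
```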
- Counter-Example(s):
- Accuracy Metric, which treats all classification errors equally without precision-recall balance (a small contrast sketch follows this list).
- AUC-ROC Metric, which evaluates threshold-independent performance rather than at a specific operating point.
- Mean Squared Error, which measures regression errors rather than classification performance.
- Matthews Correlation Coefficient, which uses all confusion matrix elements including true negatives.
- G-Mean, which uses geometric mean of sensitivity and specificity rather than harmonic mean.
- Precision Metric, which only considers false positives without false negatives.
- Recall Metric, which only considers false negatives without false positives.
- Logarithmic Loss, which evaluates probabilistic predictions rather than hard classifications.
- See: Bounded Monotonic Utility Measure, Classification Performance Measure, Harmonic Mean Function, Precision Metric, Recall Metric, Binary Classification Measure, F1-Score Metric, F2-Measure, F0.5-Measure, Macro-F1 Measure, Micro-F1 Measure, Van Rijsbergen's Effectiveness Measure, Confusion Matrix, True Positive, False Positive, False Negative, Multi-Class Classification Task, Information Retrieval Evaluation.
References
2011
- (Wikipedia, 2011) ⇒ http://en.wikipedia.org/wiki/F1_score
- QUOTE: In statistics, the F1 score (also F-score or F-measure) is a measure of a test's accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct results divided by the number of all returned results and r is the number of correct results divided by the number of results that should have been returned. The F1 score can be interpreted as a weighted average of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0.
The traditional F-measure or balanced F-score (F1 score) is the harmonic mean of precision and recall: $F = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$.
The general formula for positive real $\beta$ is: $F_\beta = (1 + \beta^2) \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{(\beta^2 \cdot \mathrm{precision}) + \mathrm{recall}}$.
The formula in terms of type I and type II errors: $F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{true\ positive}}{(1 + \beta^2) \cdot \mathrm{true\ positive} + \beta^2 \cdot \mathrm{false\ negative} + \mathrm{false\ positive}}$.
Two other commonly used F measures are the $F_2$ measure, which weights recall higher than precision, and the $F_{0.5}$ measure, which puts more emphasis on precision than recall.
The F-measure was derived so that $F_\beta$ "measures the effectiveness of retrieval with respect to a user who attaches $\beta$ times as much importance to recall as precision" [1]. It is based on van Rijsbergen's effectiveness measure $E = 1 - \left(\frac{\alpha}{P} + \frac{1-\alpha}{R}\right)^{-1}$.
Their relationship is $F_\beta = 1 - E$ where $\alpha = \frac{1}{1 + \beta^2}$.
2009
- (Hu et al., 2009) ⇒ Xiaohua Hu, Xiaodan Zhang, Caimei Lu, E. K. Park, and Xiaohua Zhou. (2009). “Exploiting Wikipedia as External Knowledge for Document Clustering.” In: Proceedings of ACM SIGKDD Conference (KDD-2009). doi:10.1145/1557019.1557066
- QUOTE: Cluster quality is evaluated by three metrics, purity [14], F-score [10], and normalized mutual information (NMI) [15]. … F-score combines the information of precision and recall which is extensively applied in information retrieval. … All the three metrics range from 0 to 1, and the higher their value, the better the clustering quality is.
- ↑ van Rijsbergen, C. J. (1979). Information Retrieval (2nd ed.). Butterworth.