2006 AReviewandComparisonofMethodsfo

From GM-RKB

Subject Headings: Univariate Numerical Outlier Detection.

Notes

Cited By

Quotes

Abstract

Most real-world data sets contain outliers that have unusually large or small values when compared with others in the data set. Outliers may cause a negative effect on data analyses, such as ANOVA and regression, based on distribution assumptions, or may provide useful information about data when we look into an unusual response to a given study. Thus, outlier detection is an important part of data analysis in the above two cases. Several outlier labeling methods have been developed. Some methods are sensitive to extreme values, like the SD method, and others are resistant to extreme values, like Tukey's method. Although these methods are quite powerful with large normal data, it may be problematic to apply them to non-normal data or small sample sizes without knowledge of their characteristics in these circumstances. This is because each labeling method has different measures to detect outliers, and expected outlier percentages change differently according to the sample size or distribution type of the data.

Many kinds of data regarding public health are often skewed, usually to the right, and lognormal distributions can often be applied to such skewed data, for instance, surgical procedure times, blood pressure, and assessment of toxic compounds in environmental analysis. This paper reviews and compares several common and less common outlier labeling methods and presents information that shows how the percent of outliers changes in each method according to the skewness and sample size of lognormal distributions through simulations and application to real data sets. These results may help establish guidelines for the choice of outlier detection methods in skewed data, which are often seen in the public health field.
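
The contrast the abstract draws between a labeling method that is sensitive to extreme values (the SD method) and one that is resistant to them (Tukey's method) can be illustrated with a minimal sketch. The data set, the function names `sd_method` and `tukey_method`, and the inclusive quartile rule are assumptions made for illustration only; they are not taken from the thesis, whose own program code is listed under Appendix E.

```python
import statistics


def sd_method(data, k=2.0):
    """Label values outside mean +/- k * SD (k = 2 or 3 are common choices)."""
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)  # sample standard deviation
    lower, upper = mean - k * sd, mean + k * sd
    return [x for x in data if x < lower or x > upper]


def tukey_method(data, k=1.5):
    """Label values outside [Q1 - k * IQR, Q3 + k * IQR] (boxplot fences)."""
    # Assumption: quartiles use the "inclusive" interpolation rule;
    # other quartile definitions give slightly different fences.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]


if __name__ == "__main__":
    # Hypothetical data: eight ordinary values and two extreme ones.
    sample = [1, 2, 3, 4, 5, 6, 7, 8, 80, 90]
    print("2 SD :", sd_method(sample, k=2.0))
    print("3 SD :", sd_method(sample, k=3.0))
    print("Tukey:", tukey_method(sample))
```

In this hypothetical sample the two extreme values inflate both the mean and the standard deviation, so the SD interval widens and hides at least one of them, while the quartile-based fences are barely affected.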

Table of Contents

1.0 INTRODUCTION
  1.1 BACKGROUND
  1.2 OUTLIER DETECTION METHOD
2.0 STATEMENT OF PROBLEM
3.0 OUTLIER LABELING METHOD
  3.1 STANDARD DEVIATION (SD) METHOD
  3.2 Z-SCORE
  3.3 THE MODIFIED Z-SCORE
  3.4 TUKEY’S METHOD (BOXPLOT)
  3.5 ADJUSTED BOXPLOT
  3.6 MADE METHOD
  3.7 MEDIAN RULE
4.0 SIMULATION STUDY AND RESULTS FOR THE FIVE SELECTED LABELING METHODS
5.0 APPLICATION
6.0 RECOMMENDATIONS
7.0 DISCUSSION AND CONCLUSIONS
APPENDIX A
   THE EXPECTATION, STANDARD DEVIATION AND SKEWNESS OF A LOGNORMAL DISTRIBUTION
APPENDIX B
   MAXIMUM Z SCORE
APPENDIX C
   CLASSICAL AND MEDCOUPLE (MC) SKEWNESS
APPENDIX D
   BREAKDOWN POINT
APPENDIX E
   PROGRAM CODE FOR OUTLIER LABELING METHODS

LIST OF TABLES

Table 1: Basic Statistic of a Simple Data Set.
Table 2: Basic Statistic After Changing 7 into 77 in the Simple Data Set.
Table 3: Computation and Masking Problem of the Z-Score.
Table 4: Computation of Modified Z-Score and its Comparison with the Z-Score.
Table 5: The Average Percentage of Left Outliers, Right Outliers and the Average Total Percent of Outliers for the Lognormal Distributions with the Same Mean and Different Variances (mean=0, variance=0.2², 0.4², 0.6², 0.8², 1.0²) and the Standard Normal Distribution with Different Sample Sizes.
Table 6: Interval, Left, Right, and Total Number of Outliers According to the Five Outlier Methods.

LIST OF FIGURES

Figure 1: Probability density function for a normal distribution according to the standard deviation.
Figure 2: Theoretical Change of Outliers’ Percentage According to the Skewness of the Lognormal Distributions in the SD Method and Tukey’s Method.
Figure 3: Density Plot and Dotplot of the Lognormal Distribution (sample size=50) with Mean=1 and SD=1, and its Logarithm, Y=log(x).
Figure 4: Boxplot for the Example Data Set.
Figure 5: Boxplot and Dotplot. (Note: No outlier shown in the boxplot).
Figure 6: Change of the Intervals of Two Different Boxplot Methods.
Figure 7: Standard Normal Distribution and Lognormal Distributions.
Figure 8: Change in the Outlier Percentages According to the Skewness of the Data.
Figure 9: Change in the Total Percentages of Outliers According to the Sample Size.
Figure 10: Histogram and Basic Statistics of Case 1-Case 4.
Figure 11: Flowchart of Outlier Labeling Methods.
Figure 12: Change of the Two Types of Skewness Coefficients According to the Sample Size and Data Distribution. (Note: These results came from the previous simulation. All the values are in Table 5).

APPENDIX A - THE EXPECTATION, STANDARD DEVIATION AND SKEWNESS OF A LOGNORMAL DISTRIBUTION

Let X denote a random variable having a lognormal distribution, so that its natural logarithm, Y = log(X), has a normal distribution. Aitchison and Brown (1957) note that when Y has mean E(Y) = µ and variance Var(Y) = σ², the expected value and standard deviation of the original variable X are as follows:
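
The formulas themselves are not reproduced on this page. As a hedged reconstruction, the standard closed-form results for a lognormal variable are given here, writing \sigma^2 for Var(Y); they are textbook results and are not copied from the appendix, so the thesis may present them in a different but equivalent form. The skewness, which the appendix title also names, is included for completeness.

```latex
\begin{align}
E(X) &= \exp\!\left(\mu + \frac{\sigma^{2}}{2}\right) \\
SD(X) &= \sqrt{\exp\!\left(2\mu + \sigma^{2}\right)\left(\exp(\sigma^{2}) - 1\right)}
       = \exp\!\left(\mu + \frac{\sigma^{2}}{2}\right)\sqrt{\exp(\sigma^{2}) - 1} \\
\text{Skewness}(X) &= \left(\exp(\sigma^{2}) + 2\right)\sqrt{\exp(\sigma^{2}) - 1}
\end{align}
```

For example, with µ = 0 and σ = 1 these give E(X) ≈ 1.65, SD(X) ≈ 2.16 and a skewness of about 6.18. The skewness depends only on σ, which is consistent with the list of tables describing simulations that hold the mean of Y fixed and vary its variance.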

References

Songwon Seo. (2006). "A Review and Comparison of Methods for Detecting Outliers in Univariate Data Sets."