- (Dasu et al., 2014) ⇒ Tamraparni Dasu, Ji Meng Loh, and Divesh Srivastava. (2014). “Empirical Glitch Explanations.” In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2014). ISBN:978-1-4503-2956-9 doi:10.1145/2623330.2623716
Data glitches are unusual observations that do not conform to data quality expectations, be they logical, semantic or statistical. By applying data integrity constraints, potentially large sections of data could be flagged as being noncompliant. Ignoring or repairing significant sections of the data could fundamentally bias the results and conclusions drawn from analyses. In the context of Big Data where large numbers and volumes of feeds from disparate sources are integrated, it is likely that significant portions of seemingly noncompliant data are actually legitimate usable data.
In this paper, we introduce the notion of Empirical Glitch Explanations - concise, multi-dimensional descriptions of subsets of potentially dirty data - and propose a scalable method for empirically generating such explanatory characterizations. The explanations could serve two valuable functions: (1) provide a way of identifying legitimate data and releasing it back into the pool of clean data, thereby reducing cleaning-related statistical distortion of the data; and (2) help refine existing data quality constraints and generate and formalize domain knowledge.
We conduct experiments using real and simulated data to demonstrate the scalability of our method and the robustness of explanations. In addition, we use two real world examples to demonstrate the utility of the explanations where we reclaim over 99% of the suspicious data, keeping data repair related statistical distortion close to 0.
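As a rough illustration of what a concise, multi-dimensional description of flagged data might look like, the sketch below scores conjunctions of attribute=value pairs by how strongly they concentrate in the flagged subset. It is not the method proposed in the paper: the constraint in `flag`, the scoring rule (coverage times precision), and all field names (`source`, `region`, `reading`) are assumptions made only for this example.

```python
from collections import Counter
from itertools import combinations


def flag(record):
    """Hypothetical integrity constraint: a reading must be present and non-negative."""
    value = record["reading"]
    return value is None or value < 0


def candidate_explanations(records, dims, max_size=2):
    """Rank conjunctions of attribute=value pairs over `dims`.

    coverage  = share of flagged records that match the conjunction
    precision = share of matching records that are flagged
    The ranking (coverage * precision, shorter conjunctions first on ties)
    is an illustrative assumption, not the paper's criterion.
    """
    flagged = [r for r in records if flag(r)]
    clean = [r for r in records if not flag(r)]
    if not flagged:
        return []
    results = []
    for size in range(1, max_size + 1):
        for attrs in combinations(dims, size):
            in_flagged = Counter(tuple(r[a] for a in attrs) for r in flagged)
            in_clean = Counter(tuple(r[a] for a in attrs) for r in clean)
            for values, n_flagged in in_flagged.items():
                coverage = n_flagged / len(flagged)
                precision = n_flagged / (n_flagged + in_clean[values])
                results.append((dict(zip(attrs, values)), coverage, precision))
    results.sort(key=lambda t: (t[1] * t[2], -len(t[0])), reverse=True)
    return results


# Toy data: every negative reading comes from one feed.
records = [
    {"source": "feed_a", "region": "east", "reading": 5.0},
    {"source": "feed_a", "region": "west", "reading": 7.5},
    {"source": "feed_b", "region": "east", "reading": -1.0},
    {"source": "feed_b", "region": "west", "reading": -1.0},
    {"source": "feed_b", "region": "east", "reading": -1.0},
    {"source": "feed_a", "region": "east", "reading": 6.1},
]

best = candidate_explanations(records, ["source", "region"])[0]
print(best)
```

In this toy data the top-ranked description is the single condition `source = feed_b`, which matches every flagged record and no clean one. A description like this could prompt a domain expert to check whether that feed uses negative values deliberately, in which case the records could be released back into the clean pool and the constraint refined.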
|2014 EmpiricalGlitchExplanations||Tamraparni Dasu, Ji Meng Loh, Divesh Srivastava||Empirical Glitch Explanations||10.1145/2623330.2623716||2014|