2014 EmpiricalGlitchExplanations

Jump to: navigation, search

Subject Headings:


Cited By


Author Keywords


Data glitches are unusual observations that do not conform to data quality expectations, be they logical, semantic or statistical. By applying data integrity constraints, potentially large sections of data could be flagged as being noncompliant. Ignoring or repairing significant sections of the data could fundamentally bias the results and conclusions drawn from analyses. In the context of Big Data where large numbers and volumes of feeds from disparate sources are integrated, it is likely that significant portions of seemingly noncompliant data are actually legitimate usable data.

In this paper, we introduce the notion of Empirical Glitch Explanations - concise, multi-dimensional descriptions of subsets of potentially dirty data - and propose a scalable method for empirically generating such explanatory characterizations. The explanations could serve two valuable functions: (1) Provide a way of identifying legitimate data and releasing it back into the pool of clean data. In doing so, we reduce cleaning-related statistical distortion of the data; (2) Used to refine existing data quality constraints and generate and formalize domain knowledge.

We conduct experiments using real and simulated data to demonstrate the scalability of our method and the robustness of explanations. In addition, we use two real world examples to demonstrate the utility of the explanations where we reclaim over 99% of the suspicious data, keeping data repair related statistical distortion close to 0.



 AuthorvolumeDate ValuetitletypejournaltitleUrldoinoteyear
2014 EmpiricalGlitchExplanationsTamraparni Dasu
Ji Meng Loh
Divesh Srivastava
Empirical Glitch Explanations10.1145/2623330.26237162014