Record Deduplication Algorithm

A Record Deduplication Algorithm is a Coreference Resolution Algorithm that can solve a Record Deduplication Task and be applied by a Record Deduplication System.

Context:
- It can combine a Coreferent Record Detection Algorithm with a Record Canonicalization Algorithm.
- It can range from being:
  - a non-relational algorithm that considers pairwise attribute similarities between entities (Newcombe et al., 1959; Fellegi and Sunter, 1969),
  - to being a relational algorithm that considers the relationships that exist between entities (Ananthakrishna et al., 2002; Kalashnikov et al., 2005),
  - to being a collective algorithm that considers the relationships among the matching decisions themselves (Bhattacharya and Getoor, 2007; McCallum and Wellner, 2004).
See: Record Similarity Function.
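
A minimal sketch of the non-relational case, assuming string-valued attributes: a detection step compares records pairwise on attribute similarity, and a canonicalization step merges each group of coreferent records into one record. The function names, the 0.85 threshold, the use of `difflib.SequenceMatcher` as the similarity function, and the "keep the longest value" merge rule are all illustrative assumptions; this is not the probabilistic model of Fellegi and Sunter (1969).

```python
from difflib import SequenceMatcher
from itertools import combinations


def attribute_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two attribute values (assumed string-valued)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def record_similarity(r1: dict, r2: dict, fields: list) -> float:
    """Unweighted mean of the per-attribute similarities (an illustrative choice)."""
    return sum(attribute_similarity(r1[f], r2[f]) for f in fields) / len(fields)


def find(parent: list, i: int) -> int:
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def deduplicate(records: list, fields: list, threshold: float = 0.85) -> list:
    # Detection: link every pair whose similarity reaches the threshold,
    # then take the transitive closure of the links with union-find.
    parent = list(range(len(records)))
    for i, j in combinations(range(len(records)), 2):
        if record_similarity(records[i], records[j], fields) >= threshold:
            parent[find(parent, j)] = find(parent, i)

    clusters = {}
    for i, record in enumerate(records):
        clusters.setdefault(find(parent, i), []).append(record)

    # Canonicalization: keep the longest value of each attribute per cluster
    # (a hypothetical merge rule; real systems use richer policies).
    return [
        {f: max((r[f] for r in group), key=len) for f in fields}
        for group in clusters.values()
    ]
```

Comparing all pairs is quadratic in the number of records, so this sketch only suits small inputs; larger collections usually need some way to limit the candidate pairs first.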

References

2017

(Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Data_deduplication Retrieved:2017-6-18.
- In computing, data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data. Related and somewhat synonymous terms are intelligent (data) compression and single-instance (data) storage. This technique is used to improve storage utilization and can also be applied to network data transfers to reduce the number of bytes that must be sent. In the deduplication process, unique chunks of data, or byte patterns, are identified and stored during a process of analysis. As the analysis continues, other chunks are compared to the stored copy and whenever a match occurs, the redundant chunk is replaced with a small reference that points to the stored chunk. Given that the same byte pattern may occur dozens, hundreds, or even thousands of times (the match frequency is dependent on the chunk size), the amount of data that must be stored or transferred can be greatly reduced. [1]
  This type of deduplication is different from that performed by standard file-compression tools, such as LZ77 and LZ78. Whereas these tools identify short repeated substrings inside individual files, the intent of storage-based data deduplication is to inspect large volumes of data and identify large sections – such as entire files or large sections of files – that are identical, in order to store only one copy of it. This copy may be additionally compressed by single-file compression techniques. For example, a typical email system might contain 100 instances of the same 1 MB (megabyte) file attachment. Each time the email platform is backed up, all 100 instances of the attachment are saved, requiring 100 MB storage space. With data deduplication, only one instance of the attachment is actually stored; the subsequent instances are referenced back to the saved copy for a deduplication ratio of roughly 100 to 1.
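
The chunk-and-reference process described in this passage can be sketched as follows. Fixed-size chunks, SHA-256 digests as chunk references, and the function names are assumptions made for illustration; the passage does not specify how chunks are delimited or identified.

```python
import hashlib


def deduplicate_chunks(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks and store each distinct chunk once.

    Returns (store, refs): store maps a digest to the chunk bytes, and refs
    lists the digests in their original order.
    """
    store = {}
    refs = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # keep only the first copy of a chunk
        refs.append(digest)              # every occurrence becomes a reference
    return store, refs


def reconstruct(store: dict, refs: list) -> bytes:
    """Rebuild the original byte stream by following the references."""
    return b"".join(store[digest] for digest in refs)


if __name__ == "__main__":
    attachment = bytes(range(256)) * 16   # one 4096-byte stand-in "attachment"
    backup = attachment * 100             # 100 identical copies, as in the example
    store, refs = deduplicate_chunks(backup)
    stored_bytes = sum(len(chunk) for chunk in store.values())
    print(len(backup) // stored_bytes)    # chunk-data ratio, ignoring reference overhead
```

The ratio printed here counts only chunk data; the references themselves also take space, which is why the passage speaks of a ratio of roughly 100 to 1.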

2009

(Dalvi et al., 2009) ⇒ Nilesh Dalvi, Ravi Kumar, Bo Pang, and Andrew Tomkins. (2009). “Matching Reviews to Objects Using a Language Model.” In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP 2009).
- QUOTE: Entity matching is a well-studied topic in databases. There are several approaches to entity matching: non-relational approaches, which consider pairwise attribute similarities between entities (Newcombe et al., 1959; Fellegi and Sunter, 1969), relational approaches, which exploit the relationships that exist between entities (Ananthakrishna et al., 2002; Kalashnikov et al., 2005), and collective approaches, which exploit the relationship between various matching decisions, (Bhattacharya and Getoor, 2007; McCallum and Wellner, 2004).

[1] "Understanding Data Deduplication" Druva, 2009. Retrieved 2013-2-13
