Consistent Weighted Sampling (CWS) Algorithm

From GM-RKB
Jump to navigation Jump to search

A Consistent Weighted Sampling (CWS) Algorithm is a sampling algorithm that is based on Jaccard Distance.



References

2017a

[math]\displaystyle{ generalized\;J\left(\mathcal{S},\mathcal{T}\right) =\dfrac{\sum_k \mathrm{min}\left(S_k, T_k\right)}{\sum_k \mathrm{max}\left(S_k, T_k\right)} }[/math] (1)
In order to efficiently compute the generalized Jaccard similarity, the Consistent Weighted Sampling (CWS) scheme has been proposed in Manasse et al. (2010).

Definition 2 (Consistent Weighted Sampling Manasse et al., 2010)). Given a weighted set $S = \{S_1, \cdots , S_n\}$, where $S_k \geq 0$ for $k \in \{1, \cdots , n\}$, Consistent Weighted Sampling (CWS) generates a sample $\left(k, y_k\right) : 0 \leq y_k \leq S_k$, which is uniform and consistent.

CWS has the following property

$Pr\left[CWS(\mathcal{S}) = CWS(\mathcal{T})\right] = generalized\;J\left(\mathcal{S}, \mathcal{T} \right)$

2017b

2016

2015

2010a

2010b


  1. T. H. Haveliwala, A. Gionis, and P. Indyk, “Scalable Techniques for Clustering the Web,” in WebDB, 2000, pp. 129–134