2012 StratifiedKMeansClusteringovera

Subject Headings:

Notes

Cited By

Quotes

Author Keywords

Abstract

This paper focuses on the problem of clustering data from a hidden or a deep web data source. A key characteristic of deep web data sources is that data can only be accessed through the limited query interface they support. Because the underlying data set cannot be accessed directly, data mining must be performed based on sampling of the datasets. The samples, in turn, can only be obtained by querying the deep web databases with specific inputs.

We have developed a new stratified clustering method addressing this problem for a deep web data source. Specifically, we have developed a stratified k-means clustering method. In our approach, the space of input attributes of a deep web data source is stratified for capturing the relationship between the input and the output attributes. The space of output attributes of a deep web data source is partitioned into sub-spaces. Three representative sampling methods are developed in this paper, with the goal of achieving a good estimation of the statistics, including proportions and centers, within the sub-spaces of the output attributes.
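The general idea of stratifying on an input attribute and then clustering a weighted sample of the output attributes can be illustrated with a minimal sketch. This is not the authors' algorithm, and it does not reproduce the paper's three representative sampling methods. It assumes that each input value defines one stratum, that stratum sizes are known in advance, and that the query interface can return a random sample from a stratum. The names `toy_query`, `stratified_sample`, and `weighted_kmeans` are hypothetical, the data are synthetic, and equal allocation across strata is used only to keep the example short.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def stratified_sample(query, stratum_sizes, budget):
    """Equal-allocation stratified sample, returned as (point, weight) pairs.

    Assumption: stratum sizes are known. Each sampled record is weighted by
    the number of records it stands for in its stratum.
    """
    per_stratum = budget // len(stratum_sizes)
    sample = []
    for stratum, size in stratum_sizes.items():
        points = query(stratum, per_stratum)
        sample.extend((p, size / len(points)) for p in points)
    return sample

def weighted_kmeans(sample, k, iters=20):
    """Weighted Lloyd iterations; returns estimated centers and proportions."""
    # Farthest-first initialisation (a simple deterministic choice).
    centers = [sample[0][0]]
    while len(centers) < k:
        centers.append(max((p for p, _ in sample),
                           key=lambda p: min(sq_dist(p, c) for c in centers)))
    dim = len(centers[0])
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        mass = [0.0] * k
        for point, w in sample:
            j = min(range(k), key=lambda c: sq_dist(point, centers[c]))
            mass[j] += w
            for d, value in enumerate(point):
                sums[j][d] += w * value
        # Keep the old center if a cluster received no weight.
        centers = [[s / mass[j] for s in sums[j]] if mass[j] > 0 else list(centers[j])
                   for j in range(k)]
    # Proportions are taken from the last assignment pass.
    total = sum(mass)
    return centers, [m / total for m in mass]

# Synthetic stand-in for a deep web source: input value "A" or "B" selects a
# stratum, and each query returns random 2-D output records from it.
_rng = random.Random(42)

def toy_query(stratum, n):
    mean = {"A": 0.0, "B": 10.0}[stratum]
    return [(_rng.gauss(mean, 1.0), _rng.gauss(mean, 1.0)) for _ in range(n)]

sizes = {"A": 9000, "B": 1000}  # assumed known stratum sizes
sample = stratified_sample(toy_query, sizes, budget=200)
centers, proportions = weighted_kmeans(sample, k=2)
print(centers, proportions)
```

Because both strata receive the same number of sampled records here, the weights are what keep the estimated proportions close to the 9:1 ratio of the underlying strata rather than to the 1:1 ratio of the sample.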

We have evaluated our methods using two synthetic and two real datasets. Our comparison shows significant gains in estimation accuracy from both of the novel aspects of our work, i.e., the use of stratification (5%-55%) and our representative sampling methods (up to 54%).

References

Author: Gagan Agrawal, Tantan Liu
Title: Stratified K-means Clustering over a Deep Web Data Source
DOI: 10.1145/2339530.2339705
Year: 2012