2009 IncorporatingSiteLevelKnowledge


Subject Headings:

Notes

Cited By

Quotes

Author Keywords

Web Forum, Sitemap, Incremental Crawling

Abstract

We study in this paper the problem of incremental crawling of web forums, which is a fundamental yet challenging step in many web applications. Traditional approaches mainly focus on scheduling the revisiting strategy of each individual page. However, simply assigning different weights to different individual pages is usually inefficient in crawling forum sites because forum sites and general websites have different characteristics. Instead of treating each individual page independently, we propose a list-wise strategy that takes site-level knowledge into account. Such site-level knowledge is mined by reconstructing the linking structure, called the sitemap, of a given forum site. With the sitemap, posts from the same thread but distributed over various pages can be concatenated according to their timestamps. After that, for each thread, we employ a regression model to predict the time when the next post arrives. Based on this model, we develop an efficient crawler which is 260% faster than some state-of-the-art methods in terms of fetching newly generated content; meanwhile, our crawler also ensures a high coverage ratio. Experimental results show promising performance in Coverage, Bandwidth utilization, and Timeliness of our crawler on 18 different forums.
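The thread-level idea in the abstract (concatenate a thread's posts by timestamp, then regress to predict the next arrival) can be illustrated with a minimal sketch. This is not the authors' model: the abstract does not specify the regression features, so the sketch assumes a plain least-squares line fitted to timestamp versus post index. The names `Post`, `concatenate_thread`, `predict_next_post_time`, and `revisit_order` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Sequence


@dataclass(frozen=True)
class Post:
    thread_id: str
    page_url: str     # page of the thread on which the post was found
    timestamp: float  # posting time, in seconds since the epoch


def concatenate_thread(pages: Iterable[Sequence[Post]]) -> List[Post]:
    """Merge posts spread over several pages of one thread, ordered by timestamp."""
    return sorted((p for page in pages for p in page), key=lambda p: p.timestamp)


def predict_next_post_time(posts: Sequence[Post]) -> float:
    """Extrapolate the next arrival time.

    Assumption: timestamp is modeled as a linear function of post index,
    fitted by ordinary least squares. The paper's actual model may differ.
    """
    times = [p.timestamp for p in posts]
    n = len(times)
    if n == 0:
        raise ValueError("need at least one post")
    if n == 1:
        return times[0]  # no interval information yet
    mean_x = (n - 1) / 2
    mean_t = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxt = sum((x - mean_x) * (t - mean_t) for x, t in enumerate(times))
    slope = sxt / sxx
    intercept = mean_t - slope * mean_x
    return intercept + slope * n  # predicted timestamp of post number n


def revisit_order(threads: Dict[str, List[Post]]) -> List[str]:
    """Rank threads so that those expected to receive a post soonest come first."""
    return sorted(threads, key=lambda tid: predict_next_post_time(threads[tid]))
```

Ranking whole threads by predicted arrival time, rather than weighting pages one by one, is one simple way to read the "list-wise" framing; a real crawler would also need to account for bandwidth limits and coverage.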



References

Jiang-Ming Yang, Rui Cai, Chunsong Wang, Hua Huang, Lei Zhang, and Wei-Ying Ma. (2009). "Incorporating Site-level Knowledge for Incremental Crawling of Web Forums: A List-wise Strategy." In: KDD-2009 Proceedings. doi:10.1145/1557019.1557166