2009 PervasiveParallelisminDataMining

From GM-RKB

Subject Headings:

Notes

Cited By

Quotes

Author Keywords

Co-Clustering, Dataflow, Predictive Modeling, Scalability

Abstract

All Netflix Prize algorithms proposed so far are prohibitively costly for large-scale production systems. In this paper, we describe an efficient dataflow implementation of a collaborative filtering (CF) solution to the Netflix Prize problem [1] based on weighted co-clustering [5]. The dataflow library we use facilitates the development of sophisticated parallel programs designed to fully utilize commodity multicore hardware, while hiding traditional difficulties such as queuing, threading, memory management, and deadlocks.
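The dataflow style described above can be illustrated with a minimal sketch: independent processing stages connected only by queues, so the stage code itself never handles locks or shared state. This toy is purely illustrative and assumes nothing about the actual library used in the paper; the stage functions and sentinel convention here are hypothetical.

```python
import queue
import threading

def stage(fn, inbox, outbox):
    """Run fn on each item from inbox; forward results; propagate the sentinel."""
    while True:
        item = inbox.get()
        if item is None:            # None is our end-of-stream sentinel
            if outbox is not None:
                outbox.put(None)
            break
        result = fn(item)
        if outbox is not None:
            outbox.put(result)

# Two-stage pipeline: square each input, then collect the results.
q1, q2 = queue.Queue(), queue.Queue()
results = []

t1 = threading.Thread(target=stage, args=(lambda x: x * x, q1, q2))
t2 = threading.Thread(target=stage, args=(results.append, q2, None))
t1.start(); t2.start()

for x in range(5):
    q1.put(x)
q1.put(None)                        # close the stream
t1.join(); t2.join()
print(results)                      # squares of 0..4, in order
```

Each stage runs on its own thread, so a real dataflow runtime can scale the pattern across cores; the programmer only writes the per-item functions.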

The dataflow CF implementation first compresses the large, sparse training dataset into co-clusters. Then it generates recommendations by combining the average ratings of the co-clusters with the biases of the users and movies. When configured to identify 20x20 co-clusters in the Netflix training dataset, the implementation predicted over 100 million ratings in 16.31 minutes and achieved an RMSE of 0.88846 without any fine-tuning or domain knowledge. This is an effective real-time prediction runtime of 9.7 μs per rating, which is far superior to previously reported results. Moreover, the implemented co-clustering framework supports a wide variety of other large-scale data mining applications and forms the basis for predictive modeling on large, dyadic datasets [4, 7].
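The prediction rule sketched in the abstract — a co-cluster average corrected by user and movie biases — can be written out on a toy ratings matrix. This is a sketch of the standard co-clustering CF prediction scheme, assuming fixed co-cluster assignments; the matrix, assignments, and function names are illustrative, not taken from the paper.

```python
import numpy as np

# Toy dense ratings matrix (0 = unrated); values are illustrative only.
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
    [1, 0, 4, 5],
], dtype=float)
mask = R > 0                            # observed entries

row_cluster = np.array([0, 0, 1, 1])    # assumed user-cluster assignments
col_cluster = np.array([0, 0, 1, 1])    # assumed movie-cluster assignments

def predict(u, m):
    """Co-cluster average plus user and movie biases."""
    rc, cc = row_cluster[u], col_cluster[m]
    in_block = (row_cluster[:, None] == rc) & (col_cluster[None, :] == cc) & mask
    block_avg = R[in_block].mean()
    user_avg  = R[u, mask[u]].mean()
    movie_avg = R[mask[:, m], m].mean()
    # Averages over the user's row cluster and the movie's column cluster,
    # used to turn the raw user/movie averages into biases.
    rowc_avg = R[(row_cluster[:, None] == rc) & mask].mean()
    colc_avg = R[(col_cluster[None, :] == cc) & mask].mean()
    return block_avg + (user_avg - rowc_avg) + (movie_avg - colc_avg)

print(predict(0, 2))   # prediction for user 0, movie 2
```

At prediction time only a handful of precomputed averages are combined per rating, which is what makes per-rating latencies in the microsecond range plausible.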

References

Srivatsava Daruru, Matt Walker, Joydeep Ghosh, and Nena M. Marin. (2009). "Pervasive Parallelism in Data Mining: Dataflow Solution to Co-clustering Large and Sparse Netflix Data." In: Proceedings of KDD-2009. doi:10.1145/1557019.1557140