Data Stream Processing 3rd-Party Platform

From GM-RKB
Jump to navigation Jump to search

A Data Stream Processing 3rd-Party Platform is a data processing platform that can be used to build data stream processing systems (to solve data stream processing task)s.



References

2016

  • https://databricks.com/blog/2016/07/28/continuous-applications-evolving-streaming-in-apache-spark-2-0.html
    • QUOTE: Most streaming engines focus on performing computations on a stream: for example, one can map a stream to run a function on each record, reduce it to aggregate events by time, etc. However, as we worked with users, we found that virtually no use case of streaming engines only involved performing computations on a stream. Instead, stream processing happens as part of a larger application, which we’ll call a continuous application. Here are some examples:
      • Updating data that will be served in real-time. For instance, developers might want to update a summary table that users will query through a web application. In this case, much of the complexity is in the interaction between the streaming engine and the serving system: for example, can you run queries on the table while the streaming engine is updating it? The “complete” application is a real-time serving system, not a map or reduce on a stream.
      • Extract, transform and load (ETL). One common use case is continuously moving and transforming data from one storage system to another (e.g. JSON logs to an Apache Hive table). This requires careful interaction with both storage systems to ensure no data is duplicated or lost — much of the logic is in this coordination work.
      • Creating a real-time version of an existing batch job. This is hard because many streaming systems don’t guarantee their result will match a batch job. For example, we’ve seen companies that built live dashboards using a streaming engine and daily reporting using batch jobs, only to have customers complain that their daily report (or worse, their bill!) did not match the live metrics.
      • Online machine learning. These continuous applications often combine large static datasets, processed using batch jobs, with real-time data and live prediction serving.