Apache Flink Data Processing Framework
- See: HPC, Apache Hadoop MapReduce.
- (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/Apache_Flink Retrieved:2017-4-25.
- Apache Flink is an open source stream processing framework developed by the Apache Software Foundation. The core of Apache Flink is a distributed streaming dataflow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner.  Flink's pipelined runtime system enables the execution of bulk/batch and stream processing programs. Furthermore, Flink's runtime supports the execution of iterative algorithms natively.  Flink provides a high-throughput, low-latency streaming engine as well as support for event-time processing and state management. Flink applications are fault-tolerant in the event of machine failure and support exactly-once semantics. Programs can be written in Java, Scala, Python, and SQL and are automatically compiled and optimized  into dataflow programs that are executed in a cluster or cloud environment.  Flink does not provide its own data storage system and provides data source and sink connectors to systems such as Amazon Kinesis, Apache Kafka, HDFS, Apache Cassandra, and ElasticSearch.
- QUOTE: … At first glance, Flink and Spark would appear to be the same. The main difference is Flink was built from the ground up as a streaming product. Spark added Streaming onto their product later. ...
- Fast: State-of-the art performance exploiting in-memory processing and data streaming.
- Reliable: Flink is designed to perform very well even when the cluster's memory runs out.
- Expressive: Write beautiful, type-safe code in Java and Scala. Execute it on a cluster.
- Easy to use: Few configuration parameters required. Cost-based optimizer built in.
- Scalable: Tested on clusters of 100s of machines, Google Compute Engine, and Amazon EC2.
- Hadoop-compatible: Flink runs on YARN and HDFS and has a Hadoop compatibility package.
- Flink is a data processing system and an alternative to Hadoop’s MapReduce component. It comes with its own runtime, rather than building on top of MapReduce. As such, it can work completely independently of the Hadoop ecosystem. However, Flink can also access Hadoop’s distributed file system (HDFS) to read and write data, and Hadoop’s next-generation resource manager (YARN) to provision cluster resources. Since most Flink users are using Hadoop HDFS to store their data, we ship already the required libraries to access HDFS.
- Alexander Alexandrov, Rico Bergmann, Stephan Ewen, Johann-Christoph Freytag, Fabian Hueske, Arvid Heise, Odej Kao, Marcus Leich, Ulf Leser, Volker Markl, Felix Naumann, Mathias Peters, Astrid Rheinländer, Matthias J. Sax, Sebastian Schelter, Mareike Höger, Kostas Tzoumas, and Daniel Warneke. 2014. The Stratosphere platform for big data analytics. The VLDB Journal 23, 6 (December 2014), 939-964. DOI
- Stephan Ewen, Kostas Tzoumas, Moritz Kaufmann, and Volker Markl. 2012. Spinning fast iterative data flows. Proc. VLDB Endow. 5, 11 (July 2012), 1268-1279. DOI
- Cite error: Invalid
<ref>tag; no text was provided for refs named
- Fabian Hueske, Mathias Peters, Matthias J. Sax, Astrid Rheinländer, Rico Bergmann, Aljoscha Krettek, and Kostas Tzoumas. 2012. Opening the black boxes in data flow optimization. Proc. VLDB Endow. 5, 11 (July 2012), 1256-1267. DOI
- Daniel Warneke and Odej Kao. 2009. Nephele: efficient parallel data processing in the cloud. In: Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS '09). ACM, New York, NY, USA, , Article 8 , 10 pages. DOI