Data Processing Pipeline
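A Data Processing Pipeline is a computing system that processes data through a series of connected processing elements, each of which consumes the output of the element before it.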
- Context:
- It can range from being a Batch Data Processing Pipeline to being a Real-Time Data Processing Pipeline.
- …
- Example(s):
- Counter-Example(s):
- a Computing Platform.
- a RISC Pipeline.
- an HTTP Pipeline.
- See: Computing Buffer, Message Queue.
References
2021
- (Densmore, 2021) ⇒ James Densmore. (2021). “Data Pipelines Pocket Reference.” O'Reilly Media.
- QUOTE: ... Data pipelines are sets of processes that move and transform data from various sources to a destination where new value can be derived. They are the foundation of analytics, reporting, and machine learning capabilities. The complexity of a data pipeline depends on the size, state, and structure of the source data as well as the needs of the analytics project. In their simplest form, pipelines may extract only data from one source such as a REST API and load to a destination such as a SQL table in a data warehouse. In practice, however, pipelines typically consist of multiple steps including data extraction, data preprocessing, data validation, and at times training or running a machine learning model before delivering data to its final destination. Pipelines often contain tasks from multiple systems and programming languages. What’s more, data teams typically own and maintain numerous data pipelines that share dependencies and must be coordinated. Figure 1-1 illustrates a simple pipeline.
SERVER LOGS ==> S3 BUCKETS ==> PROCESS AND STRUCTURE ==> Amazon Redshift
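The simplest form described in the quote above (extract from one source, load into a SQL table) can be sketched as follows. This is a minimal illustration, not code from the book: the `extract` function is a hypothetical stand-in that returns hard-coded records instead of calling a real REST API, the `requests` table and its columns are invented for the example, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3


def extract():
    """Hypothetical stand-in for a REST API call; returns sample records."""
    return [
        {"id": 1, "path": "/home", "status": 200},
        {"id": 2, "path": "/login", "status": 500},
        {"id": 3, "path": "/home", "status": 200},
    ]


def validate(records):
    """Keep only records that carry every expected field."""
    required = {"id", "path", "status"}
    return [r for r in records if required <= r.keys()]


def load(conn, records):
    """Write the validated records to the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS requests "
        "(id INTEGER PRIMARY KEY, path TEXT, status INTEGER)"
    )
    conn.executemany(
        "INSERT INTO requests (id, path, status) VALUES (:id, :path, :status)",
        records,
    )
    conn.commit()


def run_pipeline(conn):
    """Run extract -> validate -> load and return the number of rows loaded."""
    load(conn, validate(extract()))
    return conn.execute("SELECT COUNT(*) FROM requests").fetchone()[0]


if __name__ == "__main__":
    # An in-memory database is assumed here in place of a real warehouse.
    print(run_pipeline(sqlite3.connect(":memory:")))
```

Keeping each step as its own function mirrors the multi-step structure the quote describes, so a preprocessing or model-scoring step could be slotted in between `validate` and `load`.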
2019
- (Wikipedia, 2019) ⇒ https://en.wikipedia.org/wiki/Pipeline_(computing) Retrieved:2019-4-8.
- In computing, a pipeline, also known as a data pipeline, [1] is a set of data processing elements connected in series, where the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion. Some amount of buffer storage is often inserted between elements.
Computer-related pipelines include:
- Instruction pipelines, such as the classic RISC pipeline, which are used in central processing units (CPUs) and other microprocessors to allow overlapping execution of multiple instructions with the same circuitry. The circuitry is usually divided up into stages and each stage processes a specific part of one instruction at a time, passing the partial results to the next stage. Examples of stages are instruction decode, arithmetic/logic and register fetch. They are related to the technologies of superscalar execution, operand forwarding, speculative execution and out-of-order execution.
- Graphics pipelines, found in most graphics processing units (GPUs), which consist of multiple arithmetic units, or complete CPUs, that implement the various stages of common rendering operations (perspective projection, window clipping, color and light calculation, rendering, etc.).
- Software pipelines, which consist of a sequence of computing processes (commands, program runs, tasks, threads, procedures, etc.), conceptually executed in parallel, with the output stream of one process being automatically fed as the input stream of the next one. The Unix system call pipe is a classic example of this concept.
- HTTP pipelining, the technique of issuing multiple HTTP requests through the same TCP connection, without waiting for the previous one to finish before issuing a new one.
- Some operating systems may provide UNIX-like syntax to string several program runs in a pipeline, but implement the latter as simple serial execution, rather than true pipelining — namely, by waiting for each program to finish before starting the next one.
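The general definition above (elements connected in series, run in parallel, with buffer storage between them) can be sketched with threads and bounded queues. This is one possible illustration under stated assumptions: the names `stage`, `run_pipeline`, and `buffer_size` are invented for the example, each stage runs in its own thread, and error handling is omitted.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the stream


def stage(func, inbox, outbox):
    """Apply func to each item from inbox and pass the result to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # propagate end-of-stream to the next stage
            return
        outbox.put(func(item))


def run_pipeline(items, funcs, buffer_size=2):
    """Chain funcs in series, one thread per stage, with bounded buffers."""
    # One queue in front of each stage plus one for the final output.
    queues = [queue.Queue(maxsize=buffer_size) for _ in range(len(funcs) + 1)]

    def feed():
        for item in items:
            queues[0].put(item)  # blocks while the first buffer is full
        queues[0].put(_DONE)

    threads = [threading.Thread(target=feed)]
    threads += [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()

    # Drain the last queue in the calling thread so no stage stays blocked.
    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(range(5), [lambda x: x + 1, lambda x: x * 2]))
```

Because every stage starts before any input arrives, a later stage can work on early items while an earlier stage is still producing, which is the contrast with the serial execution mentioned in the last list item. The bounded queues play the role of the buffer storage between elements.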
2016
- https://medium.com/the-data-experience/building-a-data-pipeline-from-scratch-32b712cfb1db
- QUOTE: ... a data pipeline is a set of actions that extract data (or directly analytics and visualization) from various sources. It is an automated process: take these columns from this database, merge them with these columns from this API, subset rows according to a value, substitute NAs with the median and load them in this other database. This is known as a “job”, and pipelines are made of many jobs. ... On the internet you’ll find countless resources about pipeline and warehouse infrastructure possibilities. You won’t find as many resources on the process to follow or on best practices. …
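The "job" described in the quote above (merge columns from two sources, subset rows by a value, substitute NAs with the median) can be sketched in plain Python. This is a hypothetical illustration: the field names `id`, `country`, and `age`, the function name `run_job`, and the use of `None` to represent an NA are all assumptions made for the example, and the final load into another database is left out.

```python
from statistics import median


def run_job(db_rows, api_rows, country):
    """Merge two sources on "id", keep one country, and impute missing ages."""
    # Merge: combine database columns with API columns that share an id.
    api_by_id = {row["id"]: row for row in api_rows}
    merged = [
        {**row, **api_by_id[row["id"]]}
        for row in db_rows
        if row["id"] in api_by_id
    ]

    # Subset rows according to a value.
    subset = [row for row in merged if row["country"] == country]

    # Substitute missing values (None) with the median of the observed ones.
    observed = [row["age"] for row in subset if row["age"] is not None]
    fill = median(observed) if observed else None
    return [
        {**row, "age": fill if row["age"] is None else row["age"]}
        for row in subset
    ]


if __name__ == "__main__":
    db_rows = [
        {"id": 1, "country": "FR"},
        {"id": 2, "country": "FR"},
        {"id": 3, "country": "US"},
    ]
    api_rows = [
        {"id": 1, "age": 30},
        {"id": 2, "age": None},
        {"id": 3, "age": 50},
    ]
    print(run_job(db_rows, api_rows, "FR"))
```

A pipeline in the sense of the quote would chain several such jobs, each reading what an earlier one produced.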
- ↑ Data Pipeline Development. Published by Dativa, retrieved 24 May 2018.