Data Processing Pipeline

From GM-RKB

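A Data Processing Pipeline is a computing system that moves and transforms data through a sequence of connected processing steps, from one or more sources to a destination.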



References

2021

  • (Densmore, 2021) ⇒ James Densmore. (2021). “Data Pipelines Pocket Reference.” O'Reilly Media.
    • QUOTE: ... Data pipelines are sets of processes that move and transform data from various sources to a destination where new value can be derived. They are the foundation of analytics, reporting, and machine learning capabilities. The complexity of a data pipeline depends on the size, state, and structure of the source data as well as the needs of the analytics project. In their simplest form, pipelines may extract only data from one source such as a REST API and load to a destination such as a SQL table in a data warehouse. In practice, however, pipelines typically consist of multiple steps including data extraction, data preprocessing, data validation, and at times training or running a machine learning model before delivering data to its final destination. Pipelines often contain tasks from multiple systems and programming languages. What’s more, data teams typically own and maintain numerous data pipelines that share dependencies and must be coordinated. Figure 1-1 illustrates a simple pipeline.

      SERVER LOGS ==> S3 BUCKETS ==> PROCESS AND STRUCTURE ==> Amazon Redshift
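
The stages in the figure above can be sketched as a chain of small functions. This is a minimal illustration under stated assumptions, not code from the quoted book: the log format, the names RAW_LOGS, extract, structure, validate, load and run_pipeline are all hypothetical, an in-memory list stands in for the S3 bucket, and an in-memory SQLite database stands in for Amazon Redshift.

```python
import sqlite3

# Hypothetical raw log lines, standing in for server logs landed in a bucket.
RAW_LOGS = [
    "2021-03-01T10:00:00 GET /home 200",
    "2021-03-01T10:00:05 GET /missing 404",
    "malformed line",
]


def extract(lines):
    """Yield raw lines from the source (here, an in-memory list)."""
    yield from lines


def structure(lines):
    """Parse each line into (timestamp, method, path, status); skip unparseable lines."""
    for line in lines:
        parts = line.split()
        if len(parts) == 4 and parts[3].isdigit():
            ts, method, path, status = parts
            yield (ts, method, path, int(status))


def validate(rows):
    """Keep only rows whose status is a plausible HTTP status code."""
    for row in rows:
        if 100 <= row[3] <= 599:
            yield row


def load(rows, conn):
    """Write the structured rows to the destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS requests "
        "(ts TEXT, method TEXT, path TEXT, status INTEGER)"
    )
    conn.executemany("INSERT INTO requests VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def run_pipeline(lines, conn):
    """Run extract -> structure -> validate -> load and return the row count."""
    load(validate(structure(extract(lines))), conn)
    return conn.execute("SELECT COUNT(*) FROM requests").fetchone()[0]


# SQLite in memory is an assumption used only to keep the sketch self-contained.
conn = sqlite3.connect(":memory:")
loaded = run_pipeline(RAW_LOGS, conn)
```

Each stage is a generator, so rows flow through one at a time and a stage can be swapped or tested in isolation. A production pipeline would typically add scheduling, retries and logging of rejected rows, which this sketch leaves out.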

2016

  • https://medium.com/the-data-experience/building-a-data-pipeline-from-scratch-32b712cfb1db
    • QUOTE: ... a data pipeline is a set of actions that extract data (or directly analytics and visualization) from various sources. It is an automated process: take these columns from this database, merge them with these columns from this API, subset rows according to a value, substitute NAs with the median and load them in this other database. This is known as a “job”, and pipelines are made of many jobs. ... On the internet you’ll find countless resources about pipeline and warehouse infrastructure possibilities. You won’t find as many resources on the process to follow or on best practices. …
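
The "job" described in the quote above (take columns from a database, merge with columns from an API, subset rows, substitute NAs with the median, load into another database) might look roughly like the following. This is a sketch under stated assumptions, not the article's own code: the tables users and user_spend, the columns user_id, country and spend, and the function run_job are hypothetical, and a list of dictionaries stands in for an API response.

```python
import sqlite3
from statistics import median


def run_job(source_conn, api_records, target_conn, country="FR"):
    """Run one hypothetical job and return the rows it loaded."""
    # Take these columns from this database.
    db_rows = source_conn.execute(
        "SELECT user_id, country FROM users ORDER BY user_id"
    ).fetchall()

    # Merge them with these columns from this API (joined on user_id).
    spend_by_user = {rec["user_id"]: rec["spend"] for rec in api_records}
    merged = [(uid, ctry, spend_by_user.get(uid)) for uid, ctry in db_rows]

    # Subset rows according to a value.
    subset = [row for row in merged if row[1] == country]

    # Substitute NAs (None here) with the median of the observed values.
    observed = [spend for _, _, spend in subset if spend is not None]
    fill = median(observed) if observed else None
    cleaned = [
        (uid, ctry, fill if spend is None else spend)
        for uid, ctry, spend in subset
    ]

    # Load them into this other database.
    target_conn.execute(
        "CREATE TABLE IF NOT EXISTS user_spend "
        "(user_id INTEGER, country TEXT, spend REAL)"
    )
    target_conn.executemany("INSERT INTO user_spend VALUES (?, ?, ?)", cleaned)
    target_conn.commit()
    return cleaned


# Hypothetical source database and API payload, for illustration only.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE users (user_id INTEGER, country TEXT)")
source.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "FR"), (2, "FR"), (3, "US"), (4, "FR")],
)
api_payload = [
    {"user_id": 1, "spend": 10.0},
    {"user_id": 2, "spend": None},
    {"user_id": 3, "spend": 99.0},
    {"user_id": 4, "spend": 30.0},
]
target = sqlite3.connect(":memory:")
result = run_job(source, api_payload, target)
```

Here the median is computed after subsetting, so the fill value reflects only the selected rows. Computing it before subsetting is an equally plausible reading of the quote and would give a different fill value.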

  1. Data Pipeline Development. Published by Dativa, retrieved 24 May 2018.