Data Engineering Task

A Data Engineering Task is a software engineering task that creates or maintains a data-intensive system.

Context:
- It can (typically) be performed by a Data Engineer.
- It can involve:
  - Developing data pipelines and ETL processes that draw on heterogeneous sources and ingestion technologies, such as S3, Kafka, Flume, Sqoop, and Spark Streaming.
  - Developing ETL Applications using technologies such as Spark and AWS services.
  - Developing Batch Data Pipelines and Real-Time Data Streaming Pipelines.
  - Developing scalable, resilient, or always-on data systems.
  - Developing solutions that enable auditing and data usage monitoring on cloud platforms.
  - Scaling Data Pipelines.
  - Increasing the resiliency or availability of Data Pipelines.
  - Documenting Data Engineering Practice Guidelines.
  - Reviewing Data Engineering Pipelines.
- …
Example(s):
- ETL Design and ETL Development (of ETL systems).
- …
Counter-Example(s):
- Data Science Task.
- ML Engineering Task.
- …
See: Data Steward Tasks, Data Custodianship Tasks.
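
To make the ETL development example concrete, the following is a minimal sketch of a batch extract-transform-load pipeline. It is an illustration only: the input data, the `events` table, and the function names (`extract`, `transform`, `load`, `run_pipeline`) are hypothetical, and an in-memory SQLite database stands in for a real warehouse or for sources such as S3 or Kafka.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this might arrive from S3, Kafka, or a database export.
RAW_CSV = """user_id,event,amount
1,purchase,19.99
2,purchase,
1,refund,-5.00
3,purchase,42.50
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing amount and cast fields to typed values."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records instead of loading bad data
        cleaned.append((int(row["user_id"]), row["event"], float(row["amount"])))
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into the (hypothetical) events table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (user_id INTEGER, event TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", records)
    conn.commit()


def run_pipeline(text, conn):
    """Run the three stages in order and return the number of rows in the table."""
    load(transform(extract(text)), conn)
    return conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]


conn = sqlite3.connect(":memory:")
loaded = run_pipeline(RAW_CSV, conn)
print(loaded)  # three of the four input rows survive the transform step
```

Keeping the three stages as separate functions is one common design choice: each stage can be tested on its own, and the source or the sink can be swapped without touching the transformation logic.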

References

2017

https://medium.freecodecamp.org/the-rise-of-the-data-engineer-91be18f1e603
- QUOTE: … Unlike data scientists — and inspired by our more mature parent, software engineering — data engineers build tools, infrastructure, frameworks, and services. In fact, it’s arguable that data engineering is much closer to software engineering than it is to a data science. …
  … Data integration, the practice behind integrating businesses and systems through the exchange of data, is as important and as challenging as its ever been. As Software as a Service (SaaS) becomes the new standard way for companies to operate, the need to synchronize referential data across these systems becomes increasingly critical. …
  … Data engineers are operating at a higher level of abstraction and in some cases that means providing services and tooling to automate the type of work that data engineers, data scientists or analysts may do manually. …
  … Here are a few examples of services that data engineers and data infrastructure engineer may build and operate.
  - data ingestion: services and tooling around “scraping” databases, loading logs, fetching data from external stores or APIs, …
  - metric computation: frameworks to compute and summarize engagement, growth or segmentation related metrics
  - anomaly detection: automating data consumption to alert people anomalous events occur or when trends are changing significantly
  - metadata management: tooling around allowing generation and consumption of metadata, making it easy to find information in and around the data warehouse
  - experimentation: A/B testing and experimentation frameworks is often a critical piece of company’s analytics with a significant data engineering component to it
  - instrumentation: analytics starts with logging events and attributes related to those events, data engineers have vested interests in making sure that high quality data is captured upstream
  - sessionization: pipelines that are specialized in understand series of actions in time, allowing analysts to understand user behaviors
- …
  … Just like software engineers, data engineers should be constantly looking to automate their workloads and building abstraction that allow them to climb the complexity ladder. While the nature of the workflows that can be automated differs depending on the environment, the need to automate them is common across the board.
  ...
- Required Skills
  - SQL mastery: if english is the language of business, SQL is the language of data. How successful of a business man can you be if you don’t speak good english? While generations of technologies age and fade, SQL is still standing strong as the lingua franca of data. A data engineer should be able to express any degree of complexity in SQL using techniques like “correlated subqueries” and window functions. SQL/DML/DDL primitives are simple enough that it should hold no secrets to a data engineer. Beyond the declarative nature of SQL, she/he should be able to read and understand database execution plans, and have an understanding of what all the steps are, how indices work, the different join algorithm and the distributed dimension within the plan.
  - Data modeling techniques: for a data engineer, entity-relationship modeling should be a cognitive reflex, along with a clear understanding of normalization, and have a sharp intuition around denormalization tradeoffs. The data engineer should be familiar with dimensional modeling and the related concepts and lexical field.
  - ETL design: writing efficient, resilient and “evolvable” ETL is key. ...
  - Architectural projections: like any professional in any given field of expertise, the data engineer needs to have a high level understanding of most of the tools, platforms, libraries and other resources at its disposal. The properties, use-cases and subtleties behind the different flavors of databases, computation engines, stream processors, message queues, workflow orchestrators, serialization formats and other related technologies. When designing solutions, she/he should be able to make good choices as to which technologies to use and have a vision as to how to make them work together.
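
As an illustration of the SQL window functions and the sessionization pipelines mentioned in the quote above, the following sketch assigns session numbers to user events. It is not taken from the cited article: the `events` table, the sample rows, and the 30-minute inactivity threshold are assumptions chosen for the example, and it relies on an SQLite build with window-function support (version 3.25 or later).

```python
import sqlite3

SESSION_GAP_SECONDS = 1800  # assumed inactivity threshold (30 minutes)

# Hypothetical events: (user_id, timestamp in seconds)
EVENTS = [
    ("alice", 0),
    ("alice", 600),
    ("alice", 4000),
    ("bob", 100),
    ("bob", 200),
]

# Step 1: LAG() looks at the previous event of the same user; a gap larger
#         than the threshold (or no previous event) starts a new session.
# Step 2: a running SUM() over those flags yields a per-user session number.
SESSIONIZE_SQL = """
WITH flagged AS (
    SELECT user_id,
           ts,
           CASE
               WHEN ts - LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) <= :gap
               THEN 0
               ELSE 1
           END AS is_new_session
    FROM events
)
SELECT user_id,
       ts,
       SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY ts) AS session_no
FROM flagged
ORDER BY user_id, ts
"""


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Return (user_id, ts, session_no) rows for the given events."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id TEXT, ts INTEGER)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", events)
    rows = conn.execute(SESSIONIZE_SQL, {"gap": gap}).fetchall()
    conn.close()
    return rows


sessions = sessionize(EVENTS)
for row in sessions:
    print(row)
```

With the sample data, the third `alice` event falls more than 30 minutes after the second, so it opens a second session, while both `bob` events stay in a single session.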
