Data Lake Architecture Layers

Posted by Scarlett Dean

Data processing in a data lake architecture can be loosely organized into the following conceptual layers:

Ingestion Layer
The Ingestion Layer brings raw data into the Data Lake. Raw data is stored exactly as received and must not be modified. It can be ingested in batches or in real time, and it is organized in a logical folder structure. The Ingestion Layer can accommodate data from different external sources, such as:

  1. Social networks
  2. IoT devices
  3. Wearable devices
  4. Data streaming devices

One advantage of the Ingestion Layer is that it can quickly ingest almost any type of data from almost any system, including (but not limited to):

  1. Real-time data from connected health monitoring devices
  2. Video streams from security cameras
  3. Videos, photographs or geolocation data from mobile phones
  4. All types of telemetry data
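As a rough illustration of the "logical folder structure" and the rule that raw data is never modified, here is a minimal sketch in Python. The layout (raw/<source>/<year>/<month>/<day>/<file>) and the function names raw_path and ingest are assumptions made for this example; real data lakes usually write to object storage rather than a local file system.

```python
from datetime import datetime
from pathlib import Path


def raw_path(root: Path, source: str, received_at: datetime, name: str) -> Path:
    """Build a date-partitioned path, e.g. raw/iot/2024/01/15/reading.json."""
    # Hypothetical layout: one folder per source, then year/month/day.
    return root / "raw" / source / received_at.strftime("%Y/%m/%d") / name


def ingest(root: Path, source: str, received_at: datetime, name: str, payload: bytes) -> Path:
    """Store the payload byte-for-byte and refuse to overwrite existing raw data."""
    target = raw_path(root, source, received_at, name)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Mode "xb" raises FileExistsError if the file already exists,
    # which keeps the raw zone write-once.
    with open(target, "xb") as handle:
        handle.write(payload)
    return target
```

Calling ingest twice with the same source, date, and file name raises FileExistsError on the second call, so an ingested file cannot be silently replaced.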

Distillation Layer
The Distillation Layer converts the data stored by the Ingestion Layer into structured data for further analysis. In this layer, raw data is interpreted and transformed into structured data sets, which are then stored as files or tables. At this stage the data is cleansed and denormalized, and derived values are computed, so that it becomes uniform in terms of encoding, format, and data type.
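To make the idea of cleansing and type normalization more concrete, here is a small Python sketch. The raw field names (device, temp, ts) and the output schema are invented for this example; an actual distillation job would typically run in a distributed engine and follow the schema of its own sources.

```python
from datetime import datetime, timezone
from typing import Optional


def distill(raw: dict) -> Optional[dict]:
    """Return a record with uniform names and types, or None if it is unusable."""
    try:
        # Normalize identifiers: trim whitespace and lower-case.
        device_id = str(raw["device"]).strip().lower()
        # Temperatures may arrive as strings or numbers; store them as floats.
        temperature_c = float(raw["temp"])
        # Epoch seconds become an ISO 8601 timestamp in UTC.
        recorded_at = datetime.fromtimestamp(int(raw["ts"]), tz=timezone.utc)
    except (KeyError, TypeError, ValueError):
        # Cleansing step: drop records that are incomplete or malformed.
        return None
    return {
        "device_id": device_id,
        "temperature_c": temperature_c,
        "recorded_at": recorded_at.isoformat(),
    }
```

Records that cannot be interpreted are dropped here for brevity; a production pipeline might instead route them to a quarantine area for inspection.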

Processing Layer
The Processing Layer runs user queries and advanced analytical tools on structured data. Processes can be run in real time, as a batch, or interactively. Business logic is applied in this layer, and the data is consumed by analytical applications. This layer is also known as the trusted, gold, or production-ready layer.
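As a toy example of a batch process that applies business logic to structured data, the Python sketch below averages readings per device and flags devices above a threshold. The record fields (device_id, temperature_c), the function name, and the 30-degree default limit are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean


def summarize(records: list, limit_c: float = 30.0) -> dict:
    """Average temperature per device, plus an 'overheated' flag (hypothetical rule)."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["device_id"]].append(record["temperature_c"])
    summary = {}
    for device_id, temperatures in grouped.items():
        average = mean(temperatures)
        # Business rule assumed for this sketch: flag averages above the limit.
        summary[device_id] = {"avg_c": average, "overheated": average > limit_c}
    return summary
```

The same logic could run on a schedule as a batch job or be invoked interactively; only the way it is triggered would change.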

Insights Layer
The Insights Layer is the output interface, or query interface, of the Data Lake. It uses SQL or NoSQL queries to request data and present it in reports or dashboards.
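A SQL query feeding a report might look like the following Python sketch, which uses the standard library's sqlite3 module as a stand-in for the lake's real query engine. The readings table and its columns are hypothetical.

```python
import sqlite3


def device_report(conn: sqlite3.Connection) -> list:
    """Return (device_id, reading_count, average_temperature) rows for a report."""
    query = """
        SELECT device_id, COUNT(*) AS readings, AVG(temperature_c) AS avg_c
        FROM readings
        GROUP BY device_id
        ORDER BY device_id
    """
    return conn.execute(query).fetchall()


def demo_connection() -> sqlite3.Connection:
    """Create an in-memory database with a few sample rows (illustrative data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (device_id TEXT, temperature_c REAL)")
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?)",
        [("a", 20.0), ("a", 22.0), ("b", 35.0)],
    )
    return conn
```

A dashboard or reporting tool would issue a query like the one in device_report and render the returned rows as a table or chart.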

Unified Operations Layer
The Unified Operations Layer monitors and manages the system through workflow management, auditing, and proficiency management.

Some Data Lake implementations also include a Sandbox Layer. As the name suggests, this layer is a place for data exploration by data scientists and advanced analysts. The Sandbox Layer is also referred to as the Exploration Layer or Data Science Layer.