Delta Lake: High-Performance ACID Transactions for Your Data Lake

Michael Armbrust, Tathagata Das, Matei Zaharia et al. (Databricks) · 2020 · 5 sections · 12 sentences

Introduction

Data lakes have become popular because they allow storing vast amounts of raw data at low cost using commodity object storage systems such as Amazon S3. However, data lakes suffer from poor reliability and performance for data management workloads.

Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. Delta Lake sits on top of your existing data lake and is fully compatible with Apache Spark APIs.

Delta Lake solves common data reliability problems in data lakes: it supports ACID transactions and scalable metadata handling, and it unifies streaming and batch data processing.

Delta Lake Design

Delta Lake stores data as Parquet files in cloud object storage, and maintains a transaction log (Delta Log) as a series of JSON files. The transaction log records every change made to the table.
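
The log-as-JSON-files idea can be illustrated with a small sketch. It is a simplification, not Delta Lake's actual implementation: the `_delta_log` directory name, the zero-padded file names, and the `add`/`remove` action shapes are assumptions chosen for illustration, and `write_commit` and `live_files` are hypothetical helpers.

```python
import json
import tempfile
from pathlib import Path


def write_commit(log_dir: Path, version: int, actions: list[dict]) -> Path:
    """Write one commit as a newline-delimited JSON file named by its version."""
    path = log_dir / f"{version:020d}.json"
    path.write_text("\n".join(json.dumps(a) for a in actions) + "\n")
    return path


def live_files(log_dir: Path) -> set[str]:
    """Replay every commit in version order to find the table's current data files."""
    files: set[str] = set()
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


# Assumed layout: a log directory that sits next to the Parquet data files.
log_dir = Path(tempfile.mkdtemp()) / "_delta_log"
log_dir.mkdir()
write_commit(log_dir, 0, [{"add": {"path": "part-0000.parquet"}}])
write_commit(log_dir, 1, [{"add": {"path": "part-0001.parquet"}}])
# Version 2 replaces part-0000 with part-0002 (for example, after an update).
write_commit(log_dir, 2, [
    {"remove": {"path": "part-0000.parquet"}},
    {"add": {"path": "part-0002.parquet"}},
])
current = live_files(log_dir)
print(sorted(current))
```

Because readers derive the table state purely from the log, a data file that was written but never referenced by a commit is simply invisible to queries.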

Delta Lake uses optimistic concurrency control to handle concurrent writes. When two writers try to modify the same data simultaneously, Delta Lake detects the conflict using the transaction log: only one writer's commit succeeds, and the other must retry or fail, so the table is never left in a corrupted state.
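
A minimal sketch of optimistic concurrency over a log directory follows. It assumes that creating the next log file is an atomic "create only if absent" operation, which is modelled here with Python's exclusive-create mode on a local file system. `CommitConflict`, `latest_version`, and `try_commit` are hypothetical names, and the retry logic is deliberately simplified.

```python
import json
import tempfile
from pathlib import Path


class CommitConflict(Exception):
    """Raised when another writer already claimed the same table version."""


def latest_version(log_dir: Path) -> int:
    versions = [int(p.stem) for p in log_dir.glob("*.json")]
    return max(versions, default=-1)


def try_commit(log_dir: Path, read_version: int, actions: list[dict]) -> int:
    """Attempt to publish read_version + 1; only one writer can create that file."""
    target = log_dir / f"{read_version + 1:020d}.json"
    try:
        # Mode "x" fails if the file exists, standing in for an atomic put-if-absent.
        with open(target, "x") as f:
            f.write("\n".join(json.dumps(a) for a in actions) + "\n")
    except FileExistsError as exc:
        raise CommitConflict(f"version {read_version + 1} already committed") from exc
    return read_version + 1


log_dir = Path(tempfile.mkdtemp())
try_commit(log_dir, -1, [{"add": {"path": "part-0000.parquet"}}])

# Two writers both read version 0 and then race to write version 1.
snapshot_a = latest_version(log_dir)
snapshot_b = latest_version(log_dir)
version_a = try_commit(log_dir, snapshot_a, [{"add": {"path": "a.parquet"}}])

try:
    try_commit(log_dir, snapshot_b, [{"add": {"path": "b.parquet"}}])
    b_conflicted = False
except CommitConflict:
    b_conflicted = True
    # Writer B re-reads the log. A pure append does not overlap with A's
    # change in this sketch, so B simply retries on top of the new version.
    version_b = try_commit(log_dir, latest_version(log_dir), [{"add": {"path": "b.parquet"}}])
```

A fuller version would compare the files the losing writer read against the files the winning commit changed, and abort instead of retrying when they overlap.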

Delta Lake supports time travel: users can query a previous version of a table by specifying a timestamp or version number. This is useful for auditing, debugging, and reproducing machine learning experiments.
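
Time travel follows naturally from replaying only a prefix of the log. The sketch below keeps the commits in memory; the `COMMITS` data, the timestamps, and the two helper functions are illustrative assumptions rather than Delta Lake's API.

```python
from bisect import bisect_right

# Each commit: (version, commit timestamp in epoch seconds, actions). All values are made up.
COMMITS = [
    (0, 1_000, [("add", "part-0000.parquet")]),
    (1, 2_000, [("add", "part-0001.parquet")]),
    (2, 3_000, [("remove", "part-0000.parquet"), ("add", "part-0002.parquet")]),
]


def snapshot_at_version(version: int) -> set[str]:
    """Replay commits up to and including the requested version."""
    files: set[str] = set()
    for v, _ts, actions in COMMITS:
        if v > version:
            break
        for kind, path in actions:
            if kind == "add":
                files.add(path)
            else:
                files.discard(path)
    return files


def version_at_timestamp(ts: int) -> int:
    """Return the latest version committed at or before the given timestamp."""
    timestamps = [commit_ts for _v, commit_ts, _a in COMMITS]
    idx = bisect_right(timestamps, ts) - 1
    if idx < 0:
        raise ValueError("timestamp is earlier than the first commit")
    return COMMITS[idx][0]


old_files = snapshot_at_version(version_at_timestamp(2_500))
print(sorted(old_files))
```

The sketch assumes that files removed by later commits are still present in storage; once such files are physically deleted, the older versions that reference them can no longer be read.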

Query Optimization

Delta Lake provides a data-skipping optimization that reads only the files relevant to a query, using per-file min/max statistics stored in the transaction log. This can significantly reduce the amount of data read during query execution.
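
The pruning rule behind data skipping can be sketched in a few lines: a file can be skipped when its [min, max] interval for the filtered column cannot overlap the query range. `FileStats` and `files_for_range` are hypothetical names, and the statistics are made-up values for a single integer column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    path: str
    min_value: int
    max_value: int


def files_for_range(files: list[FileStats], lo: int, hi: int) -> list[str]:
    """Keep only files whose [min, max] interval overlaps the query range [lo, hi]."""
    return [f.path for f in files if f.max_value >= lo and f.min_value <= hi]


stats = [
    FileStats("part-0000.parquet", 1, 100),
    FileStats("part-0001.parquet", 101, 200),
    FileStats("part-0002.parquet", 150, 300),
]
# Query: WHERE value BETWEEN 120 AND 160
selected = files_for_range(stats, 120, 160)
print(selected)
```

Skipping is conservative: a selected file may still contain no matching rows, but a skipped file is guaranteed to contain none.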

Delta Lake supports Z-Ordering, a technique to co-locate related data in the same set of files. Z-Ordering on frequently queried columns can dramatically improve query performance by reducing the amount of data read.
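
To see why Z-ordering helps min/max pruning on more than one column, the sketch below sorts a toy 4x4 grid of (x, y) rows by a Morton code (bit interleaving), which is the usual way a Z-order curve is computed. Whether Delta Lake computes it exactly this way is not stated in the text, so treat the details as assumptions; `interleave_bits`, `chunk`, and `files_touched` are hypothetical helpers.

```python
def interleave_bits(x: int, y: int, bits: int = 8) -> int:
    """Morton (Z-order) code: bit i of x goes to position 2i, bit i of y to 2i + 1."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def chunk(rows: list[tuple[int, int]], size: int) -> list[list[tuple[int, int]]]:
    """Split sorted rows into fixed-size 'files'."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def files_touched(files: list[list[tuple[int, int]]], column: int, value: int) -> int:
    """Count files whose min/max range on the column (0 = x, 1 = y) contains value."""
    return sum(
        1 for rows in files
        if min(r[column] for r in rows) <= value <= max(r[column] for r in rows)
    )


points = [(x, y) for x in range(4) for y in range(4)]
linear_files = chunk(sorted(points), 4)  # sorted by x, then y
z_files = chunk(sorted(points, key=lambda p: interleave_bits(*p)), 4)

print("linear:", files_touched(linear_files, 0, 0), files_touched(linear_files, 1, 0))
print("z-order:", files_touched(z_files, 0, 0), files_touched(z_files, 1, 0))
```

In this toy layout, sorting by x alone makes a filter on x touch one file but a filter on y touch all four, while the Z-ordered layout touches two files for either filter.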

Use Cases at Databricks

Delta Lake supports UPSERT (UPDATE + INSERT) operations via the MERGE INTO SQL statement. This is essential for Change Data Capture (CDC) pipelines, where changes from operational databases need to be reflected in the data lake.
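
The semantics of an upsert in a CDC pipeline can be sketched with plain dictionaries. This models what a merge does to the rows, not how Delta Lake executes it; the record shape (`id`, `op`, `row`) and the `merge_changes` helper are assumptions made for illustration.

```python
def merge_changes(target: dict[int, dict], changes: list[dict]) -> dict[int, dict]:
    """Apply CDC records to a copy of target, matching rows on their 'id' key."""
    result = dict(target)
    for change in changes:
        key = change["id"]
        if change["op"] == "delete":
            result.pop(key, None)        # matched row with a delete marker: remove it
        elif key in result:
            result[key] = change["row"]  # WHEN MATCHED: update the existing row
        else:
            result[key] = change["row"]  # WHEN NOT MATCHED: insert a new row
    return result


target = {
    1: {"name": "Ada", "city": "London"},
    2: {"name": "Linus", "city": "Helsinki"},
}
changes = [
    {"id": 2, "op": "update", "row": {"name": "Linus", "city": "Portland"}},
    {"id": 3, "op": "insert", "row": {"name": "Grace", "city": "New York"}},
    {"id": 1, "op": "delete"},
]
merged = merge_changes(target, changes)
print(merged)
```

The function returns a new mapping and leaves `target` untouched, which mirrors the idea that a merge produces a new table version rather than editing the old one in place.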

Delta Lake has been deployed at Databricks and is used by thousands of customers. At the largest deployments, Delta Lake manages tables with exabytes of data and billions of files.

Conclusion

Delta Lake provides a solution to one of the most pressing challenges in modern data engineering: how to bring the reliability of traditional data warehouses to cost-effective cloud-based data lakes.

By providing ACID transactions, time travel, schema enforcement, and unified batch and streaming processing on top of existing cloud storage, Delta Lake enables a new class of data architectures that are both reliable and cost-effective.

Original source: Delta Lake: High-Performance ACID Transactions for Your Data Lake by Michael Armbrust, Tathagata Das, Matei Zaharia et al. (Databricks) (2020)

This content has been restructured for study purposes.