Apache Kafka

Kafka: a Distributed Messaging System for Log Processing

Jay Kreps, Neha Narkhede, Jun Rao · 2011 · 7 sections · 18 sentences

Introduction

Log processing has become a critical component of the data pipeline for any internet-scale company.

At LinkedIn, we found that we needed a low-latency messaging system that could serve real-time applications and also load massive volumes of log data into Hadoop and data warehouse systems for offline analysis.

We designed Kafka to have the following key properties: high throughput to support high-volume event feeds, a built-in partitioning mechanism that lets the system scale out, data replication for fault tolerance, and a consumer model that allows each message to be consumed by multiple subscribers.

Architecture and Design Principles

A stream of messages of a particular type is defined as a topic. A producer can publish messages to a topic. The published messages are then stored at a set of servers called brokers. A consumer can subscribe to one or more topics from the brokers, and consume the subscribed messages by pulling data from the brokers.

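The topic, producer, broker, and consumer roles described above can be sketched in a few lines of Python. This is a toy in-memory model, not Kafka's client API: the `Broker` class and its `publish` and `fetch` methods are hypothetical names chosen for illustration, and the position passed to `fetch` is a simple list index.

```python
from collections import defaultdict


class Broker:
    """Toy in-memory broker that stores published messages per topic."""

    def __init__(self):
        self._topics = defaultdict(list)  # topic name -> stored messages

    def publish(self, topic, message):
        # Producers push messages to the broker, which stores them.
        self._topics[topic].append(message)

    def fetch(self, topic, position, max_messages=10):
        # Consumers pull: they ask for messages starting at a position.
        return self._topics[topic][position:position + max_messages]


broker = Broker()
broker.publish("page_views", "user=1 page=/home")
broker.publish("page_views", "user=2 page=/jobs")

# The consumer decides when to pull and how much to ask for.
first_batch = broker.fetch("page_views", position=0, max_messages=1)
second_batch = broker.fetch("page_views", position=len(first_batch))
print(first_batch, second_batch)
```

Because the consumer pulls, it controls its own read rate; the broker never has to guess how fast a subscriber can accept data.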

Each partition of a topic corresponds to a logical log. Physically, a log is implemented as a set of segment files of approximately the same size. Every time a producer publishes a message to a partition, the broker simply appends the message to the last segment file.

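A minimal sketch of the segmented log follows, assuming in-memory `bytearray` objects stand in for segment files. The class name `SegmentedLog` and the `max_segment_bytes` parameter are hypothetical; the point is only that every append goes to the last segment and a new segment is started once the current one is full.

```python
class SegmentedLog:
    """Toy partition log split into size-bounded in-memory segments."""

    def __init__(self, max_segment_bytes=1024):
        self.max_segment_bytes = max_segment_bytes
        self.segments = [bytearray()]  # stand-ins for segment files

    def append(self, message: bytes) -> None:
        last = self.segments[-1]
        # Roll over to a fresh segment once the current one would overflow.
        if last and len(last) + len(message) > self.max_segment_bytes:
            last = bytearray()
            self.segments.append(last)
        # Appending is the only write operation the log supports.
        last.extend(message)


log = SegmentedLog(max_segment_bytes=10)
for _ in range(3):
    log.append(b"aaaa")
print([len(segment) for segment in log.segments])
```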

Unlike typical messaging systems, a message stored in Kafka doesn't have an explicit message id. Instead, each message is addressed by its logical offset in the log. This avoids the overhead of maintaining an index structure that maps the message ids to the actual message locations.

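One way to picture offset addressing is to treat the offset as the byte position of a message in the log, so that the next offset is the current offset plus the size of the current message. The sketch below assumes a 4-byte length prefix per message; that framing and the `append` and `read` helpers are illustrative assumptions, not Kafka's on-disk format.

```python
import struct

# Assumed framing: a 4-byte big-endian length prefix before each payload.
HEADER = struct.Struct(">I")


def append(log: bytearray, payload: bytes) -> int:
    """Append a length-prefixed message and return its offset."""
    offset = len(log)
    log.extend(HEADER.pack(len(payload)) + payload)
    return offset


def read(log: bytearray, offset: int):
    """Return (payload, next_offset) for the message stored at offset."""
    (length,) = HEADER.unpack_from(log, offset)
    start = offset + HEADER.size
    payload = bytes(log[start:start + length])
    return payload, start + length


log = bytearray()
append(log, b"hello")
append(log, b"kafka!")

# Walk the log from offset 0 without any id-to-location index.
offset = 0
while offset < len(log):
    payload, offset = read(log, offset)
    print(payload)
```

No lookup table is needed: the offset itself says where the message lives, and reading one message yields the offset of the next.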

We rely on the underlying file system page cache. The main benefit is much lower cache-management overhead, since the OS already handles this effectively, and the cache avoids the VM garbage collection issues present in in-process caches.

Kafka uses the sendfile API available in Linux to optimize network transfer. Typically, transferring data from a file to a socket takes four data copies and two system calls. With sendfile, the transfer takes only two copies and one system call.

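The same operating-system facility can be tried from Python, whose `socket.sendfile` method uses `os.sendfile` where the platform provides it and otherwise falls back to ordinary `send` calls. This is only an illustration of the idea, assuming a Unix-like system; it is not Kafka's code, and the helper name `send_file_zero_copy` is hypothetical.

```python
import socket
import tempfile


def send_file_zero_copy(path: str, sock: socket.socket) -> int:
    """Send a whole file over a connected stream socket; return bytes sent."""
    with open(path, "rb") as f:
        # The file contents are not read into a user-space buffer here;
        # the kernel moves them to the socket when os.sendfile is available.
        return sock.sendfile(f)


with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(b"log segment bytes")
    tmp.flush()
    # A connected socket pair stands in for a broker-to-consumer connection.
    sender, receiver = socket.socketpair()
    with sender, receiver:
        sent = send_file_zero_copy(tmp.name, sender)
        data = receiver.recv(1024)
        print(sent, data)
```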

Consumer Design

Kafka has the concept of consumer groups. Each consumer group consists of one or more consumers that jointly consume a set of subscribed topics, i.e., each message is delivered to only one of the consumers within the group.

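A simple way to get "only one consumer per message" inside a group is to give each partition to exactly one group member. The sketch below uses round-robin assignment over sorted names; that policy and the function name `assign_partitions` are assumptions made for illustration, not necessarily the assignment scheme Kafka uses.

```python
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer in the group."""
    members = sorted(consumers)
    if not members:
        raise ValueError("a consumer group needs at least one consumer")
    assignment = {member: [] for member in members}
    # Sorting both sides makes every member compute the same result.
    for index, partition in enumerate(sorted(partitions)):
        owner = members[index % len(members)]
        assignment[owner].append(partition)
    return assignment


print(assign_partitions(range(5), ["consumer-b", "consumer-a"]))
```

Since a partition has a single owner within the group, every message in it reaches only that owner, while a different group can read the same partitions independently.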

Unlike most other messaging systems, in Kafka, the information about how much each consumer has consumed is not maintained by the broker, but by the consumer itself. This design choice has the benefit of significantly reducing the bookkeeping needed at the broker.

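The division of bookkeeping can be sketched as follows, assuming a toy log where an offset is just a list index. The class names `PartitionLog` and `OffsetTrackingConsumer` are hypothetical; note that the broker-side class holds no per-consumer fields at all.

```python
class PartitionLog:
    """Broker side: an append-only list with no per-consumer state."""

    def __init__(self):
        self.messages = []

    def append(self, message):
        self.messages.append(message)

    def fetch(self, offset, max_messages=100):
        return self.messages[offset:offset + max_messages]


class OffsetTrackingConsumer:
    """Consumer side: remembers how far it has read."""

    def __init__(self, log):
        self.log = log
        self.offset = 0  # owned by the consumer, not the broker

    def poll(self, max_messages=100):
        batch = self.log.fetch(self.offset, max_messages)
        self.offset += len(batch)
        return batch


log = PartitionLog()
for event in ["e0", "e1", "e2"]:
    log.append(event)

fast = OffsetTrackingConsumer(log)
slow = OffsetTrackingConsumer(log)
fast.poll()                # reads everything
slow.poll(max_messages=1)  # reads only the first message
print(fast.offset, slow.offset)
```

Adding a consumer costs the broker nothing here, because each consumer carries its own position.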

Kafka guarantees that messages from a single partition are delivered to a consumer in order. However, there is no guarantee on the ordering of messages from different partitions.

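The ordering guarantee is easiest to use when related messages land in the same partition. The sketch below assumes a producer that routes by a key with a CRC32 hash; the key-based routing and the names `partition_for` and `publish_all` are illustrative assumptions, not something the passage above prescribes.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def publish_all(events, num_partitions):
    """Append each (key, value) event to the partition chosen by its key."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in events:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


events = [("user-a", 1), ("user-b", 1), ("user-a", 2), ("user-b", 2)]
for number, partition in enumerate(publish_all(events, num_partitions=3)):
    print(number, partition)
```

Events for one key stay in publish order because they share a partition; nothing is promised about how events in different partitions interleave when they are consumed.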

Message Delivery Semantics

Kafka guarantees at-least-once delivery. When a consumer process crashes, the new process may consume some messages that have already been processed. In this case, the consumer can use idempotent operations to handle duplicate messages.

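The crash-and-replay scenario can be simulated in a few lines. This sketch assumes the consumer commits its offset only after a whole batch succeeds and that the handler is a keyed write, which is naturally idempotent; the function `consume` and its `crash_after` switch are hypothetical test scaffolding.

```python
def consume(log, committed_offset, store, delivered, crash_after=None):
    """Process messages from committed_offset; return the new offset.

    The caller commits the returned offset, so a crash mid-batch leaves
    the old offset in place and the batch is delivered again.
    """
    handled = 0
    for offset in range(committed_offset, len(log)):
        delivered.append(offset)             # record every delivery
        store[offset] = log[offset].upper()  # idempotent: keyed overwrite
        handled += 1
        if crash_after is not None and handled == crash_after:
            raise RuntimeError("simulated consumer crash")
    return len(log)


log = ["a", "b", "c"]
store, delivered = {}, []
committed = 0
try:
    committed = consume(log, committed, store, delivered, crash_after=2)
except RuntimeError:
    pass  # the offset was never committed, so it is still 0

# The replacement process restarts from the last committed offset.
committed = consume(log, committed, store, delivered)
print(delivered, store)
```

Offsets 0 and 1 are delivered twice, yet the store ends up the same as if each message had been handled once, which is what an idempotent operation buys.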

Kafka at LinkedIn

Kafka has been in production at LinkedIn since early 2010. We currently run over 2,000 brokers, and they facilitate over 1.4 trillion messages per day, with peak rates of over 20 million messages per second.

Kafka is used for both offline analysis (feeding data into Hadoop clusters for batch processing) and online applications (feeding data to real-time stream processing applications). This unified messaging platform significantly simplifies LinkedIn's overall data infrastructure.

Experimental Results

We compare Kafka with Apache ActiveMQ v5.4 and RabbitMQ v2.4. We set up a single producer and vary the number of messages and message sizes. The producer publishes a total of 10 million messages, each of 200 bytes.

The Kafka producer achieves a throughput of 50,000 messages/second for small messages (200 bytes). This is 2x higher than RabbitMQ and 3-4x higher than ActiveMQ. The primary reason is that Kafka batches multiple small messages together and sends them in a single request.

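The effect of batching on request count can be shown with a small buffer in front of the network call. This is a sketch under assumptions: `BatchingProducer`, its `batch_size` parameter, and the `send_request` callback are hypothetical names, and the callback stands in for one network round trip.

```python
class BatchingProducer:
    """Buffers messages and sends them to the broker in groups."""

    def __init__(self, send_request, batch_size=50):
        self._send_request = send_request  # called once per batch
        self._batch_size = batch_size
        self._buffer = []

    def send(self, message):
        self._buffer.append(message)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self):
        # Send whatever is buffered as a single request.
        if self._buffer:
            self._send_request(self._buffer)
            self._buffer = []


requests = []
producer = BatchingProducer(requests.append, batch_size=50)
for number in range(120):
    producer.send(f"event-{number}")
producer.flush()  # push out the final partial batch
print(len(requests), [len(batch) for batch in requests])
```

Here 120 messages travel in three requests instead of 120, so the fixed per-request cost is spread over many small messages.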

Conclusion

In this paper, we presented the design and implementation of Kafka, a distributed messaging system for collecting and delivering high volumes of log data with low latency. Kafka adopts a fundamentally different design from traditional messaging systems by relying on OS page cache, zero-copy transfer, and append-only logs.

Kafka has been successfully deployed at LinkedIn as the backbone of its data pipeline. It handles over 1.4 trillion messages per day, and has proven to be a reliable, scalable, and extensible platform for real-time data processing.

Original source: Kafka: a Distributed Messaging System for Log Processing by Jay Kreps, Neha Narkhede, Jun Rao (2011)

This content has been restructured for study purposes.