class: middle, center # Agenda Get started with [Apache Kafka.](https://kafka.apache.org) --- layout: true ####
Get started with Apache Kafka --- ## What IS apache Kafka Apache Kafka is * open-source * commit log based * stream-processing * fault tolerant storage * highly distributed * support of pub/sub and/or queue mechanisms * horizontally scalable * written in [scala and java](https://github.com/apache/kafka) --- ## Kafka Use Cases * messaging * user activity/behavior tracking * stream processing * event sourcing * commit log for distributed systems See [Use case](https://kafka.apache.org/uses) --- ## Who uses * Pinterest * Aidbnb * Cisco * Cloudflare * LinkedIn See [Powered by](https://kafka.apache.org/powered-by) --- ## History * Open source since 2011 under Apache 2 license * Originally developed by LinkedIn * Currently maintained by Apache Software Foundation --- ## Basic terminology **Broker** * A Kafka cluster consists of one or more servers (Kafka brokers) running Kafka **Record** * Consists of key, value and timestamp **Topic** * Category/feed name to which records are stored and published **Producer** * Processes that push records into Kafka topics within the broker **Consumer** * Pulls records off a Kafka topic --- ## Architecture  ## Basic concepts --- ## Broker Running a cluster with single Kafka broker is possible, but is not the prefered way. The main way to replicate your records across the cluster
Management of the brokers in the cluster is performed by Zookeeper.
--- ## Zookeeper [ZooKeeper](https://zookeeper.apache.org/) is a centralized service for * naming * maintaining configuration information * providing distributed synchronization * distributed process coordination * distributed lock * distributed consensus There may be multiple Zookeepers in a cluster.
Recommendation is three to five.
Keeping an odd number so that there is always a majority.
--- ## Topic * Kafka topic is the holder and organizer of the records * Topic has it own configurations to manipulate with records lifecycle * For instance, you can setup configurable retention period and retention policy * Any records within the topic has its offset --- ## Topic partition * Kafka topic can be divided into multiple partitions * The partition itself keeps its own offset sequence * In fact, the partitioning is the primary way of Kafka's parallelism * In Kafka, replication is implemented at partition level --- ## Low level Consumer * Consumer reads the messages from topic * Records can be consumed from arbitrary offset, but the common scenarios are "earliest" and "latest" * Configure isolation level with `isolation.level` property. --- ## High level consumer * Handling current consumer offset in low level consumers are hard, so Kafka has a concept called Consumer Group. * Multiple consumers can be grouped with Consumer Group. * Kafka keeps that of the consumer group, like start offset, end offset, messages behind etc. * Each time a new consumer joins the group, the consumptions is rebalanced. * The main way of configuring consumption parallelism is the number of consumers itself. * The topics are not pushing any notification to consumers, consumers are pulling images for their own. * The Kafka consumer may have it's own configurations like isolation level, offset reset, auto commit, pool interval, timeout, batch size, etc. --- ## Consumer constraints * Within the Consumer Group, at the given moment, only one consumer will be able to consume from the given partition * With this in mind, the max number of parallel consumers is the number of partitions * This way the consumer group will behave like queue Consumers in different consumer groups will behave like pub/sub. --- ## Producer Producer is the main way of pushing records to the Kafka topic. When the record published via producer, first it will be placed in leader partition, and then replicated with slave partitions.
There several durability options for producer acknowledge. * **acks=0** when set, producer not waits for the acknowledge. The record will be immediately added to the socket buffer and considered sent.the retries configuration will not take effect * **acks=1** this will ensure the record is placed in leader partition * **acks=all** this is the strongest durability option, which means the record is placed in all of the replicas in cluster. Records with the same key will be placed within the same partition. _`murmurhash(key) % number_of_partitions`_ function will be used for partition assignment. --- ## Kafka performance Kafka relies on the OS kernel * Uses zero-copy * Batch data records into chunks which allows for more efficient data compression and reduces I/O latency * Immutable sequential commit log, avoiding random disk access and fastens disk seeking * Horizontally scalable with sharding, it shards a topic log into partitions * Caching with Sequential I/O * Low latency native binary protocol, with less overhead * Not uses traditional tree data structures like BTree or LSM tree, it is just sequential data which is more convenient for queue systems --- ## Producer code snippet Here is a simple example of using the producer to send records with strings containing sequential numbers as the key/value pairs. ```java Properties props = new Properties(); props.put("bootstrap.servers", "localhost:9092"); props.put("acks", "all"); props.put("retries", 0); props.put("batch.size", 16384); props.put("buffer.memory", 33554432); props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer"); props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer"); Producer
producer = new KafkaProducer<>(props); for (int i = 0; i < 100; i++) producer.send(new ProducerRecord
("my-topic", Integer.toString(i), Integer.toString(i))); producer.close(); ``` --- ## Consumer code snippet This example demonstrates a simple usage of Kafka's consumer api that relies on automatic offset committing. ```java Properties props = new Properties(); props.setProperty("bootstrap.servers", "localhost:9092"); props.setProperty("group.id", "test"); props.setProperty("enable.auto.commit", "true"); props.setProperty("auto.commit.interval.ms", "1000"); props.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer"); props.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer"); KafkaConsumer
consumer = new KafkaConsumer<>(props); consumer.subscribe(Arrays.asList("foo", "bar")); while (true) { ConsumerRecords
records = consumer.poll(Duration.ofMillis(100)); for (ConsumerRecord
record : records) System.out.printf("offset = %d, key = %s, value = %s%n", record.offset(), record.key(), record.value()); } ``` --- layout: true ####
Get started with Kafka --- ## Distributions Besides to Apache Kafka distribution, there is others with some minor/major additional components. * Confluent Platform * Cloudera Kafka * IBM Event Streams * Hortonworks * etc. --- layout: false class: middle, center ## Thanks :)