
Splunking Kafka At Scale


At Splunk, we love data and we’re not picky about how you get it to us. We’re all about being open, flexible, and able to scale to meet your needs. We realize that not everybody has the need or desire to install the Universal Forwarder to send data to Splunk. That’s why we created the HTTP Event Collector. It has opened the door to getting a cornucopia of new data sources into Splunk, reliably and at scale.

We’re seeing more customers in Major Accounts looking to integrate their Pub/Sub message brokers with Splunk. Kafka is the most popular message broker that we’re seeing out there but Google Cloud Pub/Sub is starting to make some noise. I’ve been asked multiple times for guidance on the best way to consume data from Kafka.

In the past I’ve just directed people to our officially supported technology add-on for Kafka on Splunkbase. It works well for simple Kafka deployments, but if you have a large Kafka cluster composed of high-throughput topics with tens to hundreds of partitions, it has its limitations. The first is that management is cumbersome. It has multiple configuration topologies and requires multiple collection nodes to facilitate data collection for the given topics. The second is that each data collection node is a simple consumer (single process) with no ability to auto-balance across the other ingest nodes. If you point it at a topic, it takes ownership of all partitions on that topic and consumes round-robin across them. If your busy topic has many partitions, this won’t scale well and you’ll lag behind reading the data. You can scale by creating a dedicated input for each partition in the topic and manually assigning a partition number to each input, but that’s not ideal and creates a configuration burden. Another issue is that if any worker process dies, data won’t be read from its assigned partition until the process is restarted. Lastly, it requires a full Splunk instance or Splunk Heavy Forwarder to collect the data and forward it to your indexers.

Due to the limitations stated above, a handful of customers have created their own integrations. Unfortunately, nobody has shared what they’ve built or what drivers they’re using. I’ve created an integration in Python using PyKafka, Requests and the Splunk HTTP Event Collector. I wanted to share the code so anybody can use it as a starting point for their Kafka integrations with Splunk. Use it as is or fork it and modify it to suit your needs.

Why should you consider using this integration over the Splunk TA? The first reason is scalability and availability. The code uses a PyKafka balanced consumer. The balanced consumer coordinates state for several consumers that share a single topic by talking to the Kafka brokers and directly to Zookeeper. It registers a consumer group id that is shared by several consumer processes to balance consumption across the topic. If any consumer dies, a rebalance across the remaining available consumers takes place, which guarantees you will always consume 100% of your pipeline given enough available consumers. This allows you to scale, giving you parallelism and high availability in consumption. The code also takes advantage of multiple CPU cores using Python multiprocessing: you can spawn as many consumers as there are available cores to distribute the workload efficiently. If a single collection node can’t keep up with your topic, you can scale horizontally by adding more collection nodes and assigning them the same consumer group id.
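To illustrate the pattern (this is a minimal sketch, not the exact code in the repo), here’s what a PyKafka balanced consumer driven by Python multiprocessing looks like. The broker, Zookeeper, topic, and consumer group names are placeholders you’d replace with your own:

```python
import multiprocessing

from pykafka import KafkaClient


def consume(worker_id):
    """Each worker joins the same consumer group; PyKafka's balanced
    consumer splits the topic's partitions across all group members."""
    client = KafkaClient(hosts="broker1:9092,broker2:9092")   # placeholder brokers
    topic = client.topics[b"my.topic"]                        # placeholder topic
    consumer = topic.get_balanced_consumer(
        consumer_group=b"splunk-hec-ingest",                  # shared group id
        zookeeper_connect="zk1:2181,zk2:2181",                # placeholder Zookeeper
        auto_commit_enable=True,
    )
    for message in consumer:
        if message is not None:
            handle(worker_id, message.value)


def handle(worker_id, raw_bytes):
    # Placeholder: batch and forward to the HTTP Event Collector here.
    print(worker_id, raw_bytes[:80])


if __name__ == "__main__":
    # One consumer process per core; if a process dies, the group
    # rebalances its partitions across the survivors.
    workers = [multiprocessing.Process(target=consume, args=(i,))
               for i in range(multiprocessing.cpu_count())]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```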

The second reason you should consider using it is the simplified configuration. The code uses a YAML config file that is well documented and easy to understand. Once you have a base config for your topic, you can lay it over all the collection nodes using your favorite configuration management tool (Chef, Puppet, Ansible, etc.) and adjust the number of workers according to the number of cores you want to allocate to data collection (or set it to auto to use all available cores).
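As an illustration only (the keys below are hypothetical and may not match the project’s actual YAML file), this is roughly how a worker-count setting with an “auto” option can be loaded and resolved in Python:

```python
import multiprocessing

import yaml  # PyYAML

# Hypothetical config for illustration; the real project's keys may differ.
EXAMPLE_CONFIG = """
kafka:
  brokers: broker1:9092,broker2:9092
  zookeeper: zk1:2181,zk2:2181
  topic: my.topic
  consumer_group: splunk-hec-ingest
hec:
  url: https://hec-lb.example.com:8088
  token: 00000000-0000-0000-0000-000000000000
workers: auto   # or an integer up to the number of cores
"""

config = yaml.safe_load(EXAMPLE_CONFIG)

workers = config["workers"]
if workers == "auto":
    workers = multiprocessing.cpu_count()

print("spawning %d consumer processes for topic %s"
      % (workers, config["kafka"]["topic"]))
```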

The other piece you’ll need is a highly available HTTP Event Collector tier to receive the data and forward it on to your Splunk indexers. I’d recommend scenario 3 outlined in the distributed deployment guide for the HEC. It consists of a load balancer and a tier of N HTTP Event Collector instances, which are managed by the deployment server.

[Figure: HEC distributed deployment, scenario 3]

The code uses the new HEC raw endpoint, so anything that passes through it goes through the Splunk event pipeline (props and transforms). This requires Splunk version 6.4.0 or later.
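For reference, posting a payload to the raw endpoint with Requests looks roughly like this. The URL and token are placeholders, and the real integration’s batching and error handling will differ:

```python
import requests

# Placeholder values; use your own HEC load balancer URL and token.
HEC_URL = "https://hec-lb.example.com:8088/services/collector/raw"
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"


def send_raw(payload, sourcetype="kafka:my.topic"):
    """POST raw bytes to the HEC raw endpoint (Splunk 6.4+). Events pass
    through the normal parsing pipeline, so props/transforms apply."""
    resp = requests.post(
        HEC_URL,
        params={"sourcetype": sourcetype},
        headers={"Authorization": "Splunk %s" % HEC_TOKEN},
        data=payload,
        verify=False,  # assumes a self-signed cert on the HEC tier
        timeout=10,
    )
    resp.raise_for_status()


send_raw(b"2016-05-01T12:00:00 my kafka event\n")
```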

Once you’ve got your HEC tier configured, inputs created, and your Kafka pipeline flowing with data, you’re all set. Just fire up as many instances as necessary for the topics you want to Splunk and you’re off to the races! Feel free to contribute to the code or raise issues and make feature requests on the GitHub page.

Get the code

