
Hunk: Splunk Analytics for Hadoop Intro – Part 2


Now that you know the basic technology behind Hunk, let's take a look at some of its features and how they unlock the value of the data resting in Hadoop.

Defining the problem
More and more enterprises these days are storing massive amounts of data in Hadoop, with the goal that someday they will be able to analyze it, gain insight from it, and ultimately see a positive ROI. Since HDFS is a generic filesystem, it can easily store all kinds of data: machine data, images, videos, documents, etc. If you can put it in a file, it can reside in HDFS. However, while storing data in HDFS is relatively straightforward, getting value out of that data is proving to be a daunting task for many. Unlocking the value of the data resting in Hadoop is the primary goal of Hunk.

What customers love about Splunk
So, let's start with a few things that people love about Splunk. While I don't claim to have a complete list, here are a few that our customers boast about (in no particular order):

  • Immediate search feedback
  • Ability to process all kinds of data – i.e. late binding schema
  • Ease of setup and rapid time to value

When designing Hunk we wanted to make sure that we preserved as many of the things that people love about Splunk as possible, and even added a few more. So, let's take a look at how we were able to achieve each of those goals.

Immediate feedback

Hadoop was designed to be a batch job processing system, i.e. you start a job and have no expectation of seeing any results back (except maybe some status reports) for a long time, ranging from tens of minutes to days. I am not going to argue the merits of immediate feedback, but we knew for a fact that anything "batch" was not going to fly with customers already accustomed to Splunk's immediate feedback. Our first challenge: how can we provide immediate feedback to users when building on top of a system that was designed for the exact opposite?

Data processing modes

There are two widely used computation models for data processing:

1. Move data to the computation unit – yes, this goes completely against what Hadoop stands for, but bear with me. The key disadvantage of this model is low throughput, because of the large network bandwidth required. However, it also has a very important property: low latency.

2. Move computation to the data – this model is at the core of MapReduce and is almost exclusively the only computation model used on Hadoop. Its major advantage is data locality, leading to high throughput. However, the increase in throughput comes at the cost of latency – thus the batch nature of Hadoop. A MapReduce job (and anything built on top of it: Pig jobs, Hive queries, etc.) can take tens of seconds to minutes just to set up, let alone get scheduled and executed.

So, above I've described the two ends of the spectrum: low latency/low throughput and high latency/high throughput. What we're actually looking for in solving our challenge is low latency, but without giving up on throughput – i.e. we need low latency and high throughput.

Now, there's nothing that says that one and only one of the above models can be used at a time. Do you see where I am going? Maybe you've already thought of a solution, but here's ours. In order to give users immediate feedback, we start moving data to the compute unit, also known as a search head (we call this streaming mode), and concurrently we start moving computation to the data (a MapReduce job). While the MR job is setting up, starting and finally producing some results, we display results from the streaming component; then, as soon as some MapReduce tasks complete, we stop streaming and consume the MR job results. Thus we achieve low latency (via streaming) and high throughput (via MapReduce) – who said you can't have it all? (I'll leave the costs of this method as an exercise for the reader.)
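
To make the pattern concrete, here is a minimal, self-contained Java sketch of the idea – purely illustrative, not Hunk's actual implementation. A low-latency streaming path keeps printing preview results while a slower, high-throughput "batch" task (standing in for the MapReduce job) runs concurrently, and the consumer switches over as soon as the batch results are ready. The class name, timings and the trivial "count events" computation are all made up for illustration.

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;

    // Toy model of mixed-mode execution: stream for immediate feedback,
    // then switch to the batch job's results once they become available.
    public class MixedModeSketch {

        // Stand-in for raw records resting in HDFS.
        private static final List<String> RAW_EVENTS =
                IntStream.range(0, 1_000).mapToObj(i -> "event-" + i).collect(Collectors.toList());

        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newSingleThreadExecutor();

            // "Move computation to the data": high throughput, but results
            // arrive only after a long setup delay (simulated with sleep).
            Future<Long> batchJob = pool.submit(() -> {
                Thread.sleep(2_000);
                return (long) RAW_EVENTS.size();
            });

            // "Move data to the computation": stream records one by one and keep
            // a running partial count so the user sees feedback right away.
            long partialCount = 0;
            for (String event : RAW_EVENTS) {
                if (batchJob.isDone()) {
                    break;                       // batch results ready - stop streaming
                }
                partialCount++;
                if (partialCount % 100 == 0) {
                    System.out.println("preview (streaming): count=" + partialCount);
                    Thread.sleep(500);           // simulate the cost of moving data over the network
                }
            }

            // Prefer the complete batch results once available.
            long finalCount = batchJob.isDone() ? batchJob.get() : partialCount;
            System.out.println("final: count=" + finalCount);
            pool.shutdownNow();
        }
    }

In Hunk the preview comes from the search head streaming data straight out of HDFS, and the switchover happens as individual MapReduce tasks complete rather than in one step, but the trade-off is the same: streaming buys latency, MapReduce buys throughput.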

Late binding schema

Splunk uses a combination of early and late binding schema. Even though most users care about the flexibility of our search-time schema binding, they're usually unaware that there's also some minimal index-time schema applied to the data. When Splunk ingests data, it first breaks the data stream into events, performs timestamp extraction, source typing, etc. Both of these schema applications are important and necessary to allow maximum flexibility in the type of data that can be processed by Hunk. However, Hunk could be asked to analyze data that did not necessarily end up in Hadoop via Splunk (or Hunk) – i.e. it's either already resting in HDFS or getting there via some other mechanism, e.g. Flume, a custom application, etc. So, in Hunk we've implemented a truly late binding schema – i.e. all the index-time processing as well as all the search-time processing is applied at search time. However, this does not mean that we are creating an index in HDFS; we apply just the index-time processing. We treat the HDFS data placed in virtual indexes in Hunk as a read-only data source. For those already familiar with Splunk's index-time processing pipeline, the following picture depicts the data flows in Hunk:
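
As a rough, code-level illustration of that flow – hypothetical, and nothing like Hunk's actual pipeline internals – the sketch below treats a file as a read-only data source and applies event breaking, timestamp extraction and source typing only while a "search" is reading it, rather than at ingest time. The input path, timestamp format and source typing rule are all assumptions made up for the example.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.time.LocalDateTime;
    import java.time.format.DateTimeFormatter;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Index-time style processing (event breaking, timestamp extraction,
    // source typing) applied lazily at search time; nothing is ever written back.
    public class LateBindingSketch {

        // Assumed record format, e.g. "2013-10-01 12:34:56 GET /index.html 200"
        private static final Pattern TS = Pattern.compile("^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})");
        private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

        public static void main(String[] args) throws IOException {
            // Stand-in for reading a read-only file out of HDFS.
            List<String> rawLines = Files.readAllLines(Paths.get(args[0]));

            for (String line : rawLines) {                       // event breaking: one event per line
                Matcher m = TS.matcher(line);
                LocalDateTime ts = m.find()
                        ? LocalDateTime.parse(m.group(1), FMT)   // timestamp extraction
                        : null;
                String sourcetype = line.contains("GET ") || line.contains("POST ")
                        ? "access_log" : "generic";              // naive source typing
                System.out.printf("time=%s sourcetype=%s raw=%s%n", ts, sourcetype, line);
            }
        }
    }

In Hunk the equivalent rules are of course driven by configuration rather than hard-coded, but the point stands: the raw data in HDFS is never modified or re-indexed.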

I mentioned that we wanted to preserve all the things that people love about Splunk and maybe even add a few more. The data processing pipeline is one place where we've added something: before data is even processed by Hunk, we allow you to plug in your own data preprocessor. Preprocessors have to be written in Java and get a chance to transform the data in some way before Hunk does – they can vary in complexity from simple translators (say, Avro to JSON) to full image/video/document processing.
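
To make that concrete, here is a tiny, hypothetical translator in that spirit. The single transform(...) hook shown here is invented for illustration – the real preprocessor plug-in point has its own Java interface – but the role is the same: reshape each record before Hunk's search-time processing sees it.

    // Hypothetical record preprocessor: translates simple CSV rows into JSON
    // before Hunk processes them. The transform(...) hook is made up for this sketch.
    public class CsvToJsonPreprocessor {

        public String transform(String rawRecord) {
            String[] fields = rawRecord.split(",", 3);   // expected: time,host,status
            if (fields.length < 3) {
                return rawRecord;                        // pass unrecognized records through untouched
            }
            return String.format("{\"time\":\"%s\",\"host\":\"%s\",\"status\":\"%s\"}",
                    fields[0].trim(), fields[1].trim(), fields[2].trim());
        }

        public static void main(String[] args) {
            CsvToJsonPreprocessor p = new CsvToJsonPreprocessor();
            System.out.println(p.transform("2013-10-01 12:34:56,web-01,200"));
            // -> {"time":"2013-10-01 12:34:56","host":"web-01","status":"200"}
        }
    }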

Ease of setup and rapid time to value
As I mentioned at the beginning of this post, most enterprises are having a hard time getting value out of the data stored in Hadoop. So in Hunk we aimed to make the setup/installation and getting-started experience as easy as possible. To this end, the setup is as simple as telling us (a) some key info about the Hadoop cluster, such as the NameNode/JobTracker host and port, the Hadoop client libraries to use when submitting MR jobs, Kerberos credentials, etc., and (b) creating virtual indexes that correspond to data in Hadoop.
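
In configuration terms, both steps boil down to a couple of stanzas: one describing the Hadoop cluster (the provider) and one defining a virtual index that points at a path in HDFS. The sketch below is illustrative only – host names, ports and paths are placeholders, and the exact setting names can vary between versions, so treat it as a rough shape rather than copy-paste configuration:

    # indexes.conf (illustrative sketch - values are placeholders)
    [provider:my-hadoop-cluster]
    vix.family             = hadoop
    vix.env.JAVA_HOME      = /usr/lib/jvm/java
    vix.env.HADOOP_HOME    = /usr/lib/hadoop
    vix.fs.default.name    = hdfs://namenode.example.com:8020
    vix.mapred.job.tracker = jobtracker.example.com:8021
    vix.splunk.home.hdfs   = /user/hunk/workdir

    # A virtual index backed by files under /data/weblogs in HDFS
    [weblogs_hdfs]
    vix.provider           = my-hadoop-cluster
    vix.input.1.path       = /data/weblogs/...

Once a virtual index like weblogs_hdfs exists, searches can reference it just like any other index, running directly against the files in HDFS.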

In terms of providing fast time to value, we chose to allow users to run analytics searches against the data resting in Hadoop without Hunk ever having seen or preprocessed the data! The reason for this is that we don't want you to have to wait, potentially for days, for Hunk to preprocess the data before you can execute your first search. Some of our Hunk beta customers were able to set up Hunk and start running analytics searches against their Hadoop data within minutes of starting the setup process – yes, I said minutes!

Stay tuned for the third part, in which I will walk you through an example …

