As you might have already seen, we recently announced the beta availability of our latest product, Hunk: Splunk Analytics for Hadoop. In this post I will cover some of the basic technology aspects of this new product and how they enable Hunk to perform analytics on top of raw data residing in Hadoop.
Introduction to Native Indexes
For those of you new to Splunk, please read this section to get a quick understanding of native indexes, as it will help differentiate them from virtual indexes. If you are already a Splunk guru, please feel free to skip this section.
Whenever a Splunk indexer ingests raw data from any source (file, script, network, etc.), it performs some processing on that data and stores it in a format optimized for efficient keyword and time-based searches. We call the collection of data and associated metadata files that we write to disk an “index.” Generally, a data source goes to one and only one index; however, you can create as many indexes as you need and search many of them concurrently in a single search.
To summarize, native indexes are data containers optimized for keyword and time-based searches. Additionally, indexes provide a natural way of implementing access controls, e.g. allow only users of group “ops” to access an index called “os”. Another important feature of native indexes is data retention policies, e.g. age out data that is older than 30 days, or age out the oldest data when an index grows beyond a certain size.
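To make those three properties concrete, here is a minimal toy model in Python. It is only a sketch: the class and method names (ToyIndex, can_search, age_out) are made up for illustration and do not reflect how Splunk actually stores data or exposes its API.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    timestamp: float  # seconds since the epoch
    raw: str          # original event text


@dataclass
class ToyIndex:
    """Hypothetical stand-in for a native index; not Splunk's real format."""
    name: str
    allowed_groups: set
    max_age_seconds: float
    events: list = field(default_factory=list)

    def can_search(self, user_groups):
        # Access control: the user needs at least one allowed group.
        return bool(self.allowed_groups & set(user_groups))

    def search(self, keyword, earliest, latest):
        # Keyword plus time-range search over the stored events.
        return [
            e for e in self.events
            if earliest <= e.timestamp <= latest and keyword in e.raw
        ]

    def age_out(self, now):
        # Retention: drop events older than max_age_seconds; return the count removed.
        cutoff = now - self.max_age_seconds
        before = len(self.events)
        self.events = [e for e in self.events if e.timestamp >= cutoff]
        return before - len(self.events)
```

With a 30-day retention window, calling age_out once a day would keep the index bounded by age; a size-based policy would follow the same shape with a different cutoff rule.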
Introduction to Virtual Indexes
Now, wouldn’t it be cool if Splunk’s Search Processing Language (SPL) could address data sources anywhere, not just in its native indexes?
Well, this is exactly what virtual indexes allow Hunk to do. A virtual index, just like a native index, behaves as an addressable data container that can be referenced by a search. Just like native indexes, you can reference as many virtual indexes as you desire in a search and you can also mix native and virtual indexes together. This gives you the ability to correlate data no matter where it resides.
There are a few key differences between native and virtual indexes. For example, since the data that resides in the external system is not under the direct management of Splunk, retention policies cannot be applied to the datasets that make up virtual indexes. Another equally important difference is the efficiency of certain searches: data in external systems may be optimized for search patterns different from those Splunk users are accustomed to, or may not be optimized at all.
More formally, a virtual index is a search-time concept that allows a Splunk search to access data in external systems and optionally push computation to them. This brings us to the next topic, external result providers (ERPs).
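The idea of one search spanning both kinds of index can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not Hunk's implementation: NativeIndex, VirtualIndex, the scan method, and the search function are all hypothetical names, and the virtual index simply delegates to a callable that stands in for the external system.

```python
class NativeIndex:
    """Hypothetical index whose events are held locally."""

    def __init__(self, name, events):
        self.name = name
        self._events = list(events)

    def scan(self):
        yield from self._events


class VirtualIndex:
    """Hypothetical index that fetches events from elsewhere at search time."""

    def __init__(self, name, fetch):
        self.name = name
        self._fetch = fetch  # callable returning an iterable of event dicts

    def scan(self):
        # Nothing is stored here; data is pulled only when a search runs.
        yield from self._fetch()


def search(indexes, predicate):
    """Run one predicate over any mix of indexes and merge results by time."""
    results = []
    for index in indexes:
        for event in index.scan():
            if predicate(event):
                # Tag each hit with the index it came from.
                results.append({**event, "index": index.name})
    return sorted(results, key=lambda e: e["time"])
```

Because both classes expose the same scan method, the search function does not need to know which kind of index it is reading, which is what makes correlating the two straightforward.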
Introduction to External Result Providers (ERPs)
In order for a search process to access data and push computation out to external systems, it needs to know specific details about the external system in question. One way to achieve this would be to add support for all the known systems where data resides. However, even enumerating the different flavors of external systems would be a daunting task.
We opted for the next best thing: an interface that a search process can use to communicate with a helper process that handles all the intricacies of interacting with the external system. We call this helper process an External Result Provider.
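As a rough picture of that split, the Python sketch below has a "search process" spawn a helper process, send it a request, and read back results. Everything about the exchange is an assumption made for illustration: the JSON-lines protocol, the request shape, the query_provider function, and the hard-coded records are hypothetical and are not Hunk's actual ERP interface.

```python
import json
import subprocess
import sys

# Hypothetical helper program standing in for an ERP. It reads one JSON
# request from stdin and writes matching records to stdout, one per line.
HELPER_SOURCE = r'''
import json, sys
request = json.loads(sys.stdin.readline())
# Pretend these records live in an external system.
records = [
    {"time": 1, "raw": "error disk full"},
    {"time": 2, "raw": "info started"},
    {"time": 3, "raw": "error timeout"},
]
for record in records:
    if request["keyword"] in record["raw"]:
        print(json.dumps(record))
'''


def query_provider(keyword):
    """Send one request to the helper process and collect its results."""
    request = json.dumps({"keyword": keyword}) + "\n"
    completed = subprocess.run(
        [sys.executable, "-c", HELPER_SOURCE],
        input=request,
        capture_output=True,
        text=True,
        check=True,  # raise if the helper exits with an error
    )
    return [
        json.loads(line)
        for line in completed.stdout.splitlines()
        if line.strip()
    ]
```

The point of the design is that the caller only knows the request and response format; all knowledge of where the records live, and how to filter them there, stays inside the helper.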
So to recap, Hunk is able to provide access to, and perform analytics on, data that resides in external systems by encapsulating the data into addressable units using virtual indexes, while utilizing ERPs to handle the details of pushing down computations to the external system.
Learn more at www.splunk.com/bigdata
Continue reading the second part for more details …