**Please note: The following is an example approach outlining a functional Splunk Analytics for Hadoop environment running on Amazon EMR. Please talk to your local Splunk team to determine the best architecture for you.
Using Amazon EMR and Splunk Analytics for Hadoop to explore, analyze and visualize machine data
Machine data can take many forms and comes from a variety of sources: system logs, application logs, service and system metrics, sensor data, and so on. In this step-by-step guide, you will learn how to build a big data solution for fast, interactive analysis of data stored in Amazon S3 or Hadoop. This hands-on guide is useful for solution architects, data analysts, and developers.
You will need:
- An Amazon EMR Cluster
- A Splunk Analytics for Hadoop Instance
- An Amazon S3 bucket with your data
- Alternatively, data can be stored in the Hadoop Distributed File System (HDFS)
To get started, go into Amazon EMR from the AWS management console page:
From here, you can manage your existing clusters, or create a new cluster. Click on ‘Create Cluster’:
This will take you to the configuration page. Set a meaningful cluster name, enable logging (if required) to an existing Amazon S3 bucket, and set the launch mode to ‘Cluster’:
Under software configuration, choose Amazon EMR 5.x as per the following:
Several of the applications included are not required to run Splunk Analytics for Hadoop; however, they may make managing your environment easier.
Choose the appropriate instance types, and number of instances according to your requirements:
**Please note that Splunk recommends Hadoop nodes with 8 cores / 16 vCPUs. The m3.xlarge instances were used here for demonstration purposes only.
For security and access settings, choose those appropriate to your deployment scenario. Using the defaults here can be an appropriate option:
Click ‘Create Cluster’.
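If you prefer scripting this step, an equivalent cluster can be launched with the AWS CLI. The sketch below is illustrative only; the cluster name, release label, key pair, log bucket, and instance sizing are placeholders to adjust to your environment:

aws emr create-cluster \
  --name "splunk-analytics-demo" \
  --release-label emr-5.4.0 \
  --applications Name=Hadoop Name=Hue \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair \
  --log-uri s3://my-log-bucket/emr-logs/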
This process may take some time. Keep an eye on the Cluster list for status changes:
When the cluster is deployed and ready:
Clicking on the cluster name will provide the details of the setup:
At this point, browse around the platform and get familiar with the operation of the EMR cluster. Hue is a good option for managing the file system and the data that will be analyzed through Splunk Analytics for Hadoop.
Configure Splunk Analytics for Hadoop on AWS AMI instance to connect to EMR Cluster
Installing Splunk Analytics for Hadoop on a separate Amazon EC2 instance, removed from your Amazon EMR cluster, is the Splunk-recommended architectural approach. To configure this setup, we launch a Splunk 6.5 AMI from the AWS Marketplace and then add the necessary Hadoop, Amazon S3, and Java libraries. This last step is further outlined in the Splunk docs at http://docs.splunk.com/Documentation/HadoopConnect/1.2.3/DeployHadoopConnect/HadoopCLI
To kick off, launch a new Amazon EC2 instance from the AWS Management Console:
Search the AWS Marketplace for Splunk and select the Splunk Enterprise 6.5 AMI:
Choose an instance size to suit your environment and requirements:
**Please note that Splunk recommends minimum hardware specifications for a production deployment. More details at http://docs.splunk.com/Documentation/Splunk/6.5.0/Installation/Systemrequirements
From here you can choose to further customize the instance (should you want more storage, or to add custom tags), or just review and launch:
Now, you’ll need to add the Hadoop, Amazon S3, and Java client libraries to the newly deployed Splunk AMI. To do this, first grab the version of each from the Amazon EMR master node, to ensure that the libraries on your Splunk server match. Once you have them, install them on the Splunk AMI:
Move the Hadoop package to /usr/bin and unpack it there.
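The commands below sketch one way to do this on the Splunk instance, assuming the master node reports Hadoop 2.7.2 and Java 1.8 and that the stock Apache client libraries are acceptable; substitute the versions and download URL that match your own cluster:

# On the EMR master node, note the versions you need to match
hadoop version
java -version

# On the Splunk instance, install a matching JDK and download the matching Hadoop release
sudo yum install -y java-1.8.0-openjdk
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz

# Move the package to /usr/bin and unpack it
sudo mv hadoop-2.7.2.tar.gz /usr/bin/
cd /usr/bin
sudo tar -xzf hadoop-2.7.2.tar.gz
sudo mv hadoop-2.7.2 hadoop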
In order to search the Amazon S3 data, we need to ensure that we have access to the S3 toolset. Add the following line to /usr/bin/hadoop/etc/hadoop/hadoop-env.sh:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*
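As a sketch, you could append the line and then confirm that the S3 tool libraries are visible on the client classpath (the /usr/bin/hadoop path assumes the layout described above):

echo 'export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*' | sudo tee -a /usr/bin/hadoop/etc/hadoop/hadoop-env.sh
# Verify that the tools jars now appear on the classpath
/usr/bin/hadoop/bin/hadoop classpath | tr ':' '\n' | grep tools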
Finally, we need to set up the necessary authentication to access Amazon S3 via our new virtual index connection. You’ll need an access key ID and secret access key from your AWS Identity and Access Management (IAM) setup. In this instance, we have set up these credentials for an individual AWS user:
Ensure that when you create the access key, you record the details. You then need to include these in the file located at /usr/bin/hadoop/etc/hadoop/hdfs-site.xml. Include the following within the <configuration> tag:
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>xxxx</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>xxxx</value>
</property>
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>xxxx</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>xxxx</value>
</property>
You need to include the s3n keys, as that is the mechanism we will use to connect to the Amazon S3 dataset.
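You can quickly test the connection from the Splunk instance at this point; the bucket name below is a placeholder for your own bucket:

/usr/bin/hadoop/bin/hadoop fs -ls s3n://your-bucket-name/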
Create data to analyze with Splunk Analytics for Hadoop
We have multiple options for connecting to data for investigation within Splunk Analytics for Hadoop. In this guide, we will explore adding files to HDFS via Hue, and connecting to an existing Amazon S3 bucket to explore data.
Option 1 – S3
From the AWS Management Console, go into Amazon S3, and create a new bucket:
Give the bucket a meaningful name, and specify the region in which you would like it to exist:
Click ‘Create’, and add some files to this new bucket as appropriate. You can choose to add the files at the top level, or create a directory structure:
The files or folders that you create within the Amazon S3 bucket need to have appropriate permissions to allow the Splunk Analytics for Hadoop user to connect and view them. Set these to allow ‘everyone’ read access for initial testing, and then reduce the scope to appropriate users or roles afterwards.
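If you prefer the AWS CLI, a rough equivalent looks like the following; the bucket name, region, and local data path are placeholders:

aws s3 mb s3://my-splunk-hadoop-data --region us-east-1
# Upload sample data and grant read access for initial testing (tighten this after testing)
aws s3 cp ./sample-data/ s3://my-splunk-hadoop-data/data/ --recursive --acl public-read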
Option 2 – HDFS
**This option is only relevant if you DO NOT want to use Amazon S3 for data storage. You’ll need to ensure that you have assigned appropriate disk space on the Hadoop nodes to use this method.
Let’s create or upload some data in HDFS. First, we will need a user in HDFS. We will use root; however, this may not be the appropriate user in your environment. From the master node:
hadoop fs -mkdir hdfs://masternodeaddress:8020/user/root
hadoop fs -chown root:root hdfs://masternodeaddress:8020/user/root
Now, use Hue to upload data to this new directory. Log in to Hue:
Log in, or create a new user if appropriate.
Select the file browser, navigate to the /user/root directory, and create a ‘data’ directory. Navigate into this directory, and then upload some files for use.
This should result in the data being available in HDFS:
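If you’d rather skip Hue, the same result can be achieved from the master node command line; the sample file name is a placeholder:

hadoop fs -mkdir hdfs://masternodeaddress:8020/user/root/data
hadoop fs -put ./sample.log hdfs://masternodeaddress:8020/user/root/data/
hadoop fs -ls hdfs://masternodeaddress:8020/user/root/data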
Set up Splunk Analytics for Hadoop for data analysis
To proceed, first you’ll need to grab some parameters from the Hadoop nodes:
Collect the Hadoop and YARN variables (a command-line sketch for retrieving these follows this list):
- Java Home = type ‘which java’ = /usr/bin/java
- Hadoop home = type ‘which hadoop’ = /usr/bin/hadoop
- Hadoop version = type ‘hadoop version’ = Hadoop 2.7.2-amzn-3
- Name node port = In a browser, go to http://masternodeaddress:50070 (or click on HDFS name node in the EMR management console screen)
- YARN resource manager scheduler address = In a browser, go to http://masternodeaddress:8088/conf (or click on ‘Resource Manager’ in the EMR management console screen) and look for ‘yarn.resourcemanager.scheduler.address’ = x.x.x:8030
- YARN resource manager address = In a browser, go to http://masternodeaddress:8088/conf (or click on ‘Resource Manager’ in the EMR management console screen) and look for ‘yarn.resourcemanager.address’ = x.x.x:8050
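As a convenience, most of these values can also be pulled from the master node command line; the configuration path below assumes the default EMR location of /etc/hadoop/conf:

which java
which hadoop
hadoop version
hdfs getconf -confKey fs.defaultFS
# Look up the YARN resource manager addresses in yarn-site.xml
grep -A1 'yarn.resourcemanager.scheduler.address' /etc/hadoop/conf/yarn-site.xml
grep -A1 'yarn.resourcemanager.address' /etc/hadoop/conf/yarn-site.xml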
Now, we need to verify that the name node is correct. You can do this by executing this command:
hadoop fs -ls hdfs://masternodeaddress:8020/user/root/data
Now we can configure our Virtual Provider in Splunk. To do this, go to Settings, and then Virtual Indexes:
Then choose to create a new provider:
Using the parameters that we gathered earlier, fill this section out:
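For reference, Splunk stores these provider settings in indexes.conf on the Splunk instance. A rough sketch of the resulting stanza is shown below; the provider name and values are placeholders based on the parameters gathered earlier, so treat it as illustrative rather than a definitive configuration:

[provider:EMR-Provider]
vix.family = hadoop
vix.env.JAVA_HOME = /usr/bin/java
vix.env.HADOOP_HOME = /usr/bin/hadoop
vix.fs.default.name = hdfs://masternodeaddress:8020
vix.mapreduce.framework.name = yarn
vix.yarn.resourcemanager.address = masternodeaddress:8050
vix.yarn.resourcemanager.scheduler.address = masternodeaddress:8030
vix.splunk.home.hdfs = /user/root/splunkmr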
Save this setup, and go to set up a new Virtual Index:
Here you can specify the path in HDFS that was set up in an earlier step, or choose to point to the S3 bucket that was created (a configuration sketch covering both options follows below):
Option 1 – S3:
Ensure that you use the s3n prefix here.
Option 2 – HDFS:
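For reference, the corresponding virtual index stanzas in indexes.conf look roughly like the following; the index names, bucket name, and paths are placeholders (the trailing ‘...’ tells Splunk to recurse into subdirectories):

[s3_vix]
vix.provider = EMR-Provider
vix.input.1.path = s3n://my-splunk-hadoop-data/data/...

[hdfs_vix]
vix.provider = EMR-Provider
vix.input.1.path = /user/root/data/...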
Save this setup, and you should now be able to search the data within Amazon S3 (or HDFS) using Splunk Analytics for Hadoop!
Click search on the virtual index config:
This will take you to the Splunk search interface. You should see something like the following:
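From here, you can search the virtual index much like any other Splunk index. For example, assuming a virtual index named s3_vix as in the earlier sketch:

index=s3_vix | head 100
index=s3_vix | stats count by source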