This is the first of a two-part series on implementing Box Plots in Splunk for security use cases.
Analyzing complex data is difficult, which is why people use Splunk. Sometimes patterns in data are not obvious, so it takes various ways of looking at aggregate reports and multiple charts to ascertain the important information buried in the data. A common tool in a data analyst’s arsenal is a box plot. A box plot, also called a box and whisker plot, is a visual method to quickly ascertain the variability and skew of data, as well as the median. For more about using and reading box plots, read the excellent and succinct post by Nathan Yau of the Flowing Data blog “How to Read and Use a Box-and-Whisker Plot.”
Splunk Enterprise 6.4 introduced a new framework for custom visualizations. For anyone interested in building their own, the extensive and well-written docs on building custom visualizations serve as excellent tutorials and reference material.
The most difficult part of building visualizations is not creating the Splunk app, especially with the excellent documentation and great community support from Answers, IRC, and Slack (signup online). The basic steps are (largely distilled from http://docs.splunk.com/Documentation/Splunk/6.4.0/AdvancedDev/CustomVizTutorial):
- Create a working visualization outside the Splunk framework. This usually is as simple as an HTML file that calls the JavaScript, the JavaScript code itself, and some example input.
- Download and install a fresh, unused Splunk Enterprise 6.4 (or newer) instance on a test machine or a workstation. Do not install any other apps or add-ons. This provides a clean and uncluttered environment for testing the app without potential conflicts or problems from a production environment. It also allows for restarting Splunk, reloading configs, or removing and reinstalling the app or Splunk itself at any time during the development process.
- Download the example app from the tutorial.
- Install the Viz_tutorial_app in the Splunk test instance.
- Rename the app directory to match the new app being developed.
- Edit the config files as directed in the tutorial for the new app name and other settings.
- Perform JavaScript magic to create the working visualization within Splunk. This post will help with this process.
- Optionally (and preferably) add user definable options.
- Test and package the app.
The difficult part of building a visualization app is the JavaScript code drawing the chart mentioned in steps one and seven above. Most people start with pre-written libraries to save the arduous work of writing the code from a blank screen, which sometimes makes step one easier. However, even when these libraries work in their native form perfectly well, most of them require some massaging before they work correctly within the Splunk Custom Visualization framework.
The most common visualizations are built on top of the Data Driven Documents (D3) library D3.js. The bulk of existing D3 applications are designed to work with some raw, unprocessed data, often supplied by a CSV file or other static source. Because of this, the data input usually must be altered within the JavaScript when building a Splunk custom visualization. In addition, without an analytics engine supplying the data, most D3 applications are written to perform all the mathematical calculations on the static data sources. With Splunk’s superior ability to perform a wide range of calculations on vast amounts of data, it behooves a visualization developer to alter the JavaScript to accept pre-processed data.
Following this paradigm, the D3 Box Plot application started with Jens Grubert’s D3.js Boxplot with Axes and Labels code. Grubert’s code runs well using the static format for a CSV as data input, but the JavaScript is hardcoded for a specific number of columnar inputs from the file. Altering the code is required to change the number of columns and, therefore, the number of box plots displayed in a single chart. Also, the source app performs all the calculations within the JavaScript using raw number inputs from the CSV data file.
Splunk supplies the pre-calculated data for an arbitrary number of box plots and uses the D3 Box Plot app only for display. Therefore, the original code required significant changes to remove the calculations, to alter the inputs to accept only the pre-calculated data needed to draw the visual elements, and to work within the Splunk Custom Visualization framework.
Kevin Kuchta provided significant assistance in reworking the JavaScript from Grubert’s original code into something meeting the data input requirements and removing the mathematical functions to operate as a standalone app. This was needed to ensure the application can perform all the needed functions before it was converted to work within Splunk. Some of the original code is commented out in case it becomes useful in future editions of the published app and some has been removed entirely.
Grubert’s code uses two scripts. One is embedded in an HTML file that is used to call the code with a browser, and the other is a standalone file called box.js. During the development phases of altering the code to run outside of Splunk, the embedded script in the HTML was moved to an outside file called boxparse.js and sourced within the HTML.
Setting up the Development Environment
The original code in the HTML file that calls the visualization library (d3.min.js) and the box.js file looks like:
<script src="http://d3js.org/d3.v3.min.js"></script>
<script src="d3.v3.min.js"></script>
<script src="box.js"></script>
The code between the <script></script> tags immediately following those three lines was moved into the boxparse.js file, which was then sourced by adding:
<script src="boxparse.js"></script>
To test this locally without cross-domain errors in Chrome (the browser of choice for debugging JavaScript today), an in-place web server was run using port 9000 (to not interfere with Splunk running locally on 8000) on the local machine from the directory holding the box plot code using:
python -m SimpleHTTPServer 9000
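This allows for rapid testing using Chrome pointed at http://localhost:9000/. (SimpleHTTPServer is a Python 2 module; on Python 3 the equivalent command is python3 -m http.server 9000.)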
Changing Inputs and Removing Data Calculations
The next step was to remove the calculation code and alter the inputs so that they are dynamic in the number of data sets (allowing a variable number of box plots to display) and accept pre-calculated values for the final data required to create a box plot.
The required values to create a box plot are:
- median
- min
- max
- lower quartile (used for lower bound of the box)
- upper quartile (used for upper bound of the box)
- interquartile range (the difference between the upper and lower quartiles and called iqr in the app)
- list of outlier values (not used in the initial version of the Box Plot Viz app)
- category name or label
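To make these values concrete, here is a minimal sketch that derives them from a raw list of numbers in plain JavaScript. In the final app Splunk computes them in the search, so this is for illustration only. The quantile and summarize helpers are hypothetical names, and the linear-interpolation quantile used here is an assumption that may not match Splunk's percentile functions exactly.

```javascript
// Quantile by linear interpolation between the two nearest ranks.
// Assumes `sorted` is a non-empty array of numbers in ascending order.
function quantile(sorted, p) {
  var pos = (sorted.length - 1) * p;
  var lo = Math.floor(pos);
  var hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Build the summary values a box plot needs for one category.
// `summarize` is a hypothetical helper, not part of the published app.
function summarize(label, values) {
  var sorted = values.slice().sort(function (a, b) { return a - b; });
  var lower = quantile(sorted, 0.25);
  var upper = quantile(sorted, 0.75);
  return {
    label: label,
    median: quantile(sorted, 0.5),
    min: sorted[0],
    max: sorted[sorted.length - 1],
    lowerquartile: lower,
    upperquartile: upper,
    iqr: upper - lower
  };
}

var summary = summarize("somedata", [7, 1, 9, 3, 5, 2, 8, 4, 6]);
console.log(summary);
```

For the nine values above, the sketch reports a median of 5, quartiles of 3 and 7, and an iqr of 4.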
The data parsing is in the boxparse.js code taken from the HTML file, which made it simple to remove the lines starting with:
// parse in the data
d3.csv("data.csv", function(error, csv) {
and ending with:
if (rowMax > max) max = rowMax;
if (rowMin < min) min = rowMin;
});
This section of the original code both reads the input CSV file and performs calculations on the data to find min and max values for each set. All of this code was removed and min and max are now set using:
var yMinValues = [];
for (var minind = 0; minind < data.length; minind++) {
  yMinValues.push(data[minind][2][0]);
}
var yMin = Math.min(...yMinValues);

var yMaxValues = [];
for (var maxind = 0; maxind < data.length; maxind++) {
  yMaxValues.push(data[maxind][2][1]);
}
var yMax = Math.max(...yMaxValues);
This sets yMin and yMax as new variables for clarity in naming, rather than using the original code’s min and max variable names. This required changing the y-axis from using:
.domain([min, max])
to using:
.domain([yMin, yMax])
The iqr() function that calculated the interquartile range was removed entirely, and references to it were replaced with the iqr variable supplied by the external data (to prepare for conversion to a Splunk Custom Visualization).
Another notable change was to pass the yMin and yMax variables to the d3.box() function thusly:
var chart = d3.box({"yMax": yMax, "yMin": yMin})
This sends the data as part of the config object sent to d3.box() in box.js. To use these in box.js, the following was added to the bottom of the d3.box() function:
yMin = config.yMin,
yMax = config.yMax;
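The pattern of handing pre-computed bounds to a chart factory through a config object can be sketched in isolation. The makeBox function below is a hypothetical stand-in for d3.box(), written without D3 so it runs on its own; it is not the code from box.js, only an illustration of how the config values end up as local variables inside the factory.

```javascript
// Hypothetical stand-in for d3.box(): a factory that reads its y-axis
// bounds from a config object instead of computing them from raw data.
function makeBox(config) {
  var yMin = config.yMin,
      yMax = config.yMax;

  // The returned function plays the role of the chart; here it only
  // reports what fraction of the y range a given value covers.
  function box(value) {
    return (value - yMin) / (yMax - yMin);
  }

  // Expose the bounds so callers can confirm what was passed in.
  box.bounds = function () {
    return [yMin, yMax];
  };

  return box;
}

var chart = makeBox({ yMin: 5, yMax: 65 });
console.log(chart.bounds()); // [ 5, 65 ]
console.log(chart(35));      // 0.5
```

Because yMin and yMax are captured in the closure, every drawing routine inside the factory can use them without recalculating anything from the data.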
During testing, an array was created in boxparse.js to hold test data. This was better than using an external file because it simulates how the data will come from Splunk in the variable named data. The decision was made, somewhat arbitrarily, to use an ordered, indexed array like:
var newFakeData = [
  // Column      Quartiles      Whiskers  Outliers  Min  Max
  ["somedata",  [10, 20, 30],  [5, 45],  [1, 2],   0,   200],
  ["otherdata", [15, 25, 30],  [5, 65],  [],       2,   150],
];
Outlier support was not included in the initial version because the Splunk searches needed to produce the outliers would be too complex for the average user. The ability to read and draw them is still in the code; they are merely set to null on input.
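Because an indexed array carries no field names, a small sanity check can catch rows whose elements are in the wrong position or order before they reach the drawing code. The following is a minimal sketch, assuming the six-element layout shown in the test array; validateRow is a hypothetical helper and is not part of the published app.

```javascript
// Check one row of the indexed array format:
// [label, [q1, median, q3], [lowWhisker, highWhisker], outliers, min, max]
// Returns a list of problems; an empty list means the row looks usable.
function validateRow(row) {
  var problems = [];
  if (!Array.isArray(row) || row.length !== 6) {
    return ["row must be an array of six elements"];
  }
  var quartiles = row[1], whiskers = row[2], outliers = row[3];
  if (typeof row[0] !== "string") {
    problems.push("label must be a string");
  }
  if (!Array.isArray(quartiles) || quartiles.length !== 3 ||
      quartiles[0] > quartiles[1] || quartiles[1] > quartiles[2]) {
    problems.push("quartiles must be three ascending numbers");
  }
  if (!Array.isArray(whiskers) || whiskers.length !== 2 ||
      whiskers[0] > whiskers[1]) {
    problems.push("whiskers must be [low, high]");
  }
  if (!Array.isArray(outliers)) {
    problems.push("outliers must be an array (possibly empty)");
  }
  if (!(Number(row[4]) <= Number(row[5]))) {
    problems.push("min and max must be numbers with min <= max");
  }
  return problems;
}

var goodRow = ["somedata", [10, 20, 30], [5, 45], [1, 2], 0, 200];
var badRow = ["otherdata", [25, 15, 30], [5, 65], [], 2, 150];
console.log(validateRow(goodRow)); // []
console.log(validateRow(badRow));  // one problem: quartiles out of order
```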
The last part, converting box.js to use the new data inputs rather than the internally calculated values, was a fairly lengthy but not difficult process. It required careful review of the code to see where values were passed to the various drawing functions or where variables were set from calculations. Wherever there were calculations, a simple variable assignment replaced the original code.
For example, the original box.js set min and max with:
var g = d3.select(this),
n = d.length,
min = d[0],
max = d[n - 1];
However, the new box.js simply does:
var g = d3.select(this),
min = data[4],
max = data[5];
In the cases where values were calculated in separate functions, those functions were completely replaced with variable assignments.
For example, the original box.js set whiskerData using:
// Compute whiskers. Must return exactly 2 elements, or null.
var whiskerIndices = whiskers && whiskers.call(this, d, i),
whiskerData = whiskerIndices && whiskerIndices.map(function(i) { return d[i]; });
Yet, the new box.js uses the array of whisker values supplied in the input data with:
whiskerData = data[2];
This method was used on the rest of the required variables needed to build a box plot.
After these changes, the box plot loaded via the HTML file and the local test HTTP server.
Conversion to a Splunk App
The next step was to convert the standalone application to work in Splunk as a Custom Visualization. Kyle Smith, a developer for the Splunk partner Aplura and author of Splunk Developer’s Guide, provided excellent and thorough guidance in this process. His personal advice and assistance, combined with his book, were instrumental in the success of this conversion. There were numerous display issues once the app was built in Splunk. Resolving them required many iterations of tweaking the code, running the build command, running the search, and tweaking the code again.
This process took a fair amount of experimentation with fits and starts down the wrong paths, much as any development process. The final changes are roughly outlined below.
The first thing to do was pull the CSS formatting code from the HTML file and place it into the file at:
$SPLUNK_HOME/etc/apps/viz_boxplot_app/appserver/static/visualizations/boxplot/visualization.css
Next, based on suggestions in the tutorial and from Smith, the boxparse.js file was pasted directly into the updateView() function in the supplied source file found at:
$SPLUNK_HOME/etc/apps/viz_boxplot_app/appserver/static/visualizations/boxplot/src/visualization_source.js
An immediate problem to tackle was converting the data supplied by the Splunk search into the format described above. In future versions this will entail recoding all the data references to pull directly from the Splunk-supplied data structure, but for now the code converts the results into the indexed array format shown above. Splunk can send the data in three formats, which are documented in the Custom visualization API reference. The default is Row-major, where a JavaScript object is returned with an array containing field names and the row values from the raw JSON results object. There is a Column-major option, which does the same for column values. The third option is Raw, which returns the full JSON object. The approach used here is Raw. This is set in the visualization_source.js file in the getInitialDataParams function by changing outputMode from:
outputMode: SplunkVisualizationBase.ROW_MAJOR_OUTPUT_MODE,
to:
outputMode: SplunkVisualizationBase.RAW_OUTPUT_MODE,
This makes it possible to pull all the field value pairs into the app for quick conversion into the index array format in the formatData function using:
var newData = _.map(data.results, function(d) {
  return [
    d[data.fields[0].name],
    [ Number(d.lowerquartile), Number(d.median), Number(d.upperquartile) ],
    [ Number(d.lowerwhisker), Number(d.upperwhisker) ],
    [],
    Number(d.min),
    Number(d.max)
  ];
});
This also forces evaluation of the values to numbers for all but the first category field, which should be a string holding the values of the split-by field.
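The conversion can be exercised outside Splunk with a hand-built object. The sketch below is a standalone variant that uses the built-in Array.prototype.map in place of Underscore's _.map so it runs without any libraries. The toBoxRows name and the sample object are hypothetical, and the sample only assumes the fields and results properties referenced in the conversion code; the values are written as strings on the assumption that this is why the Number() coercion is needed.

```javascript
// Convert a raw-mode style result object into the indexed array format.
// Plain Array.prototype.map is used here in place of Underscore's _.map.
function toBoxRows(data) {
  var categoryField = data.fields[0].name; // first field is the split-by field
  return data.results.map(function (d) {
    return [
      d[categoryField],
      [Number(d.lowerquartile), Number(d.median), Number(d.upperquartile)],
      [Number(d.lowerwhisker), Number(d.upperwhisker)],
      [], // outliers are not supplied by the search
      Number(d.min),
      Number(d.max)
    ];
  });
}

// Hypothetical sample shaped like the fields/results pair used above.
// Values are strings to illustrate the Number() coercion.
var sample = {
  fields: [{ name: "Service" }, { name: "median" }],
  results: [{
    Service: "web", median: "20", min: "0", max: "200",
    lowerquartile: "10", upperquartile: "30",
    lowerwhisker: "5", upperwhisker: "45"
  }]
};

var rows = toBoxRows(sample);
console.log(JSON.stringify(rows)); // [["web",[10,20,30],[5,45],[],0,200]]
```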
The most difficult part at that point was tweaking the display so that it correctly drew a dynamic Y axis with box plots positioned and sized correctly relative to that scale. After much experimentation, padding was added at the top of the graph with a programmable label offset using:
var labeloffset = 75;
var height = 400 + labeloffset - margin.top - margin.bottom;
in visualization_source.js within the updateView code to provide some room at the top of the graph. In addition, the height for the box plots scale was changed from:
.range([yMax, min]);
to:
.range([height, yMin]);
This allowed the Y axis and the box plots to be drawn using the same values, so the positioning and sizing of the two are relative to each other: 50 on the axis lines up with 50 for each of the box plots drawn.
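Why a shared scale matters can be shown with a hand-rolled linear scale. This is a sketch of the general idea only, not the app's code: linearScale is a hypothetical helper standing in for the scale object D3 provides, and the domain and pixel ranges are illustrative numbers rather than the values used in the app.

```javascript
// Hand-rolled linear scale: maps a value in [d0, d1] to a pixel in [r0, r1].
function linearScale(domain, range) {
  var d0 = domain[0], d1 = domain[1];
  var r0 = range[0], r1 = range[1];
  return function (value) {
    return r0 + (value - d0) * (r1 - r0) / (d1 - d0);
  };
}

// Illustrative numbers only: data from 0 to 100 drawn on 400 pixels.
// The range is inverted because SVG y coordinates grow downward.
var shared = linearScale([0, 100], [400, 0]);

var axisTick = shared(50);  // where the axis draws its "50" tick
var boxMedian = shared(50); // where a box with median 50 draws its line
console.log(axisTick, boxMedian); // 200 200

// If the boxes used a different pixel range, the two would drift apart.
var mismatched = linearScale([0, 100], [200, 0]);
console.log(mismatched(50)); // 100
```

With one scale, the axis tick and the box element for the same value land on the same pixel; with two different ranges they do not, which is the kind of misalignment the range change addressed.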
At this point, all that remained was completing the steps needed to meet Splunkbase standards, packaging the app directory into a tar.gz file, and renaming it to .spl.
Final Results
The ultimate result is a Box Plot app with results such as the image below (taken from the Box Plot App example screenshot):
The app is available on Splunkbase at https://splunkbase.splunk.com/app/3157/.
The field used must be numeric and be split by another field. The search is longer than those for many of the normal visualizations, but the density of information displayed and the requirement to pre-compute every value make the extra detail necessary.
The search used in the graph example (which comes with the app) uses a lookup with a series of values for the categories shown to provide numeric data and split by categories.
The specific search is:
| inputlookup boxplotexample.csv
| stats median(Cost) AS median, min(Cost) AS min, max(Cost) AS max, p25(Cost) AS lowerquartile, p75(Cost) AS upperquartile by Service
| fields - count
| where isnotnull(median)
| eval iqr=upperquartile-lowerquartile
| eval lowerwhisker=median-(1.5*iqr)
| eval upperwhisker=median+(1.5*iqr)
If the search starting at | stats … is copied and the Cost and Service field names are changed, the box plot should draw. If the number range for one of the split-by field values is in a totally different order of magnitude from the others, the display will likely not be useful for comparisons. In those situations it may be useful to separate the split-by field values by the general range of the numeric field (min and max) into different searches, each with its own box plot. The ranges can be quickly determined by running | stats min(field) AS min, max(field) AS max and then sorting as needed to find common groupings.
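The three eval steps at the end of the example search are simple arithmetic on each row of the stats output. As a rough illustration, the same steps are mirrored here in JavaScript; addWhiskers is a hypothetical helper for checking expected values by hand and is not part of the app.

```javascript
// Mirror the eval steps of the search for one row of stats output.
function addWhiskers(row) {
  var iqr = row.upperquartile - row.lowerquartile;
  return {
    median: row.median,
    lowerquartile: row.lowerquartile,
    upperquartile: row.upperquartile,
    iqr: iqr,
    lowerwhisker: row.median - 1.5 * iqr,
    upperwhisker: row.median + 1.5 * iqr
  };
}

var result = addWhiskers({ median: 20, lowerquartile: 10, upperquartile: 30 });
console.log(result.iqr, result.lowerwhisker, result.upperwhisker); // 20 -10 50
```

The search measures the whiskers from the median. Another common convention measures 1.5 times the iqr from the lower and upper quartiles instead, which only requires changing the two whisker eval expressions.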
As an aside for those new to Splunk app development: to speed up the process of reloading app content and configuration file changes, the Splunk docs section "Customization options and caching: Clear client and server assets caches after customization" suggests using http://<host:mport>/debug/refresh and/or http://<host:mport>/<locale_string>/_bump (e.g. http://localhost:8000/en-US/_bump). Use the debug/refresh endpoint to reload XML, configuration, and HTML files, and _bump to reload changes to ../appserver/static directories. Both were used many times during the development of the Box Plot app.
The next installment will show the Box Plot App leveraged for a security use case.