Splunk and MapReduce
Splunk Inc. | 250 Brannan Street | San Francisco, CA 94107 | www.splunk.com | info@splunk.com
Copyright 2011 Splunk Inc. All rights reserved.
Background:
Splunk is a general-purpose search, analysis and reporting engine for time-series text data, typically machine data. Splunk software is deployed to address one or more core IT functions: application management, security, compliance, IT operations management and providing analytics for the business.

Splunk versions 1 and 2 were optimized to deliver a typical Web search experience, where a search is run and the first 10 to 100 results are displayed quickly. Any results past a fixed limit were irretrievable unless additional search criteria were added. Under these conditions, Splunk's scalability had its limitations. Version 3 added a statistical capability, enhancing the ability to perform analytics; with this addition, searches could be run against the index to deliver more context.

In response to input from Splunk users, and to address the challenges of large-scale data processing, Splunk 4 (released in July 2009) changed this model by adding the facility to retrieve and analyze massive datasets using a "divide and conquer" approach: the MapReduce mechanism. With version 4, Splunk optimized the execution of its search language using the MapReduce model, where parallelism is key.

Of course, parallelizing analytics via MapReduce is not unique to Splunk. However, the Splunk implementation of MapReduce on top of an indexed datastore, with its expressive search language, provides a simpler, faster way to analyze huge volumes of machine data. Beyond this, the Splunk MapReduce implementation is part of a complete solution for collecting, storing and processing machine data in real time, thereby eliminating the need to write or maintain code, a requisite of more generic MapReduce implementations. This paper describes the Splunk approach to machine data processing on a large scale, based on the MapReduce model.
What is MapReduce?
MapReduce is the model of distributed data processing introduced by Google in 2004. The fundamental concept of MapReduce is to divide problems into two parts: a map function that processes source data into sufficient statistics and a reduce function that merges all sufficient statistics into a final answer. By definition, any number of concurrent map functions can run at the same time without intercommunication. Once all of the data has had the map function applied to it, the reduce function can be run to combine the results of the map phases.

For large-scale batch processing and high-speed data retrieval, common in Web search scenarios, MapReduce provides the fastest, most cost-effective and most scalable mechanism for returning results. Today, most of the leading technologies for managing "big data" are built on MapReduce. With MapReduce there are few scalability limitations, but leveraging it directly does require writing and maintaining a lot of code.

This paper assumes moderate knowledge of the MapReduce framework. For further reference, Jeffrey Dean and Sanjay Ghemawat of Google have written an introductory paper on the subject: http://labs.google.com/papers/mapreduce.html. Additionally, the Google paper on Sawzall (http://labs.google.com/papers/sawzall.html) and the Yahoo project page for Pig (http://research.yahoo.com/project/90) provide good introductions to alternative abstraction layers built on top of MapReduce infrastructures for large-scale data processing tasks.
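To make the division of labor concrete, the following is a minimal sketch in Python, not tied to Splunk or to any particular MapReduce framework, of the two-part model described above: a map function that turns one slice of source data into sufficient statistics, and a reduce function that merges those statistics into a final answer. The sample events and the "status" field are illustrative assumptions only.

from collections import Counter

def map_phase(events):
    """Runs independently on each slice of the data; no intercommunication."""
    counts = Counter()
    for event in events:
        counts[event["status"]] += 1
    return counts  # sufficient statistics for this slice

def reduce_phase(partial_counts):
    """Runs once, after every slice has been mapped, to merge the partial results."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

# Two slices that could be mapped on different machines at the same time.
slice_a = [{"status": "200"}, {"status": "404"}, {"status": "200"}]
slice_b = [{"status": "500"}, {"status": "200"}]

partials = [map_phase(slice_a), map_phase(slice_b)]  # parallel map step
print(reduce_phase(partials))                        # Counter({'200': 3, '404': 1, '500': 1})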
join-like lookups. Fields are also constructed on the fly in search using commands like eval for evaluating expressions, rex for running regular expression-based extractions and lookup for applying a lookup table.

Below are examples of some common and useful searches in the Splunk Search Language:

search googlebot

This search simply retrieves all events that contain the string googlebot. It is very similar to running grep -iw googlebot on the data set, though in this case the search benefits from the index. This is useful in the context of Web access logs, where a common task is understanding which common search bots are causing hits and which resources they are hitting.

search sourcetype=access_combined | timechart count by status

This search retrieves all data marked with the sourcetype "access_combined," which is typically associated with web access log data. The data is then divided into uniform time-based groups. For each time group, the count of events is calculated for every distinct HTTP "status" code. This search is commonly used to determine the general health of a web server or service. It is especially helpful for noticing an increase in errors, either 404s, which may be associated with a new content push, or 500s, which could be related to server issues.

search sourcetype=access_combined | transaction clientip maxpause=10m | timechart median(eventcount) perc95(eventcount)

This search again retrieves all web access log data. The events are then gathered into transactions identified by the client IP address, where events more than 10 minutes apart are considered to be in separate transactions. Finally, the data is summarized by time and, for each discrete time range, the median and 95th percentile session lengths are calculated.
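As a rough illustration of what the transaction step above does, the Python below sketches only the grouping logic, not Splunk's implementation: events are ordered by time and a new session starts whenever the gap to the previous event from the same client IP exceeds the maxpause value. The event structure and sample timestamps are hypothetical.

from collections import defaultdict

def sessionize(events, maxpause=600):
    """events: iterable of dicts with 'clientip' and 'time' (epoch seconds)."""
    by_client = defaultdict(list)
    for event in sorted(events, key=lambda e: e["time"]):
        sessions = by_client[event["clientip"]]
        if sessions and event["time"] - sessions[-1][-1]["time"] <= maxpause:
            sessions[-1].append(event)   # continue the current session
        else:
            sessions.append([event])     # gap too large (or first event): new session
    # flatten to a list of sessions; each session's length is its event count
    return [s for sessions in by_client.values() for s in sessions]

sessions = sessionize([
    {"clientip": "10.1.1.1", "time": 0},
    {"clientip": "10.1.1.1", "time": 300},
    {"clientip": "10.1.1.1", "time": 2000},   # more than 10 minutes later: new session
])
print([len(s) for s in sessions])             # [2, 1]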
Consider again the search:

search sourcetype=access_combined | timechart count by status

This search first retrieves all events from the sourcetype access_combined. Next, the events are discretized temporally and the number of events for each distinct status code is reported for each time period. In this case, search sourcetype=access_combined would be run in parallel on all nodes of the cluster that contained suitable data. The timechart count by status part would be the reduce step, run on the searching node of the cluster. For efficiency, the timechart command would automatically contribute a summarizing phase to the map step, which would be translated to search sourcetype=access_combined | pretimechart count by status. This allows the nodes of the distributed cluster to send only sufficient statistics (as opposed to the whole events that come from the search command). The network communications would be tables containing tuples of the form (timestamp, count, status).

eventtype=pageview | eval ua=mvfilter(eventtype LIKE "ua-browser-%") | timechart dc(clientip) by ua

Events that represent pageviews are first retrieved. Next, a field called ua is calculated from the eventtype field, which may contain many other event classifications. Finally, the events are discretized temporally and the number of distinct client IPs for each browser is reported for each time period. In this case, search eventtype=pageview | eval ua=mvfilter(eventtype LIKE "ua-browser-%") would be run in parallel on all nodes of the cluster that contained suitable data. The timechart dc(clientip) by ua part would be the reduce step, run on the searching node of the cluster. As before, the timechart command would automatically contribute a summarizing phase to the map step, which would be translated to search eventtype=pageview | eval ua=mvfilter(eventtype LIKE "ua-browser-%") | pretimechart dc(clientip) by ua. The network communications would be tables containing tuples of the form (timestamp, clientip, ua, count).
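The division of labor described above can be sketched in ordinary code. The Python below is an illustrative approximation, not Splunk's implementation: each indexer pre-summarizes its own events into (time bucket, status, count) tuples, and the searching node merges those partial tables into the final time chart. The one-minute bucket size and the event fields are assumptions for the example.

from collections import Counter

BUCKET = 60  # one-minute buckets (assumption; timechart chooses its own span)

def map_step(events):
    """Runs on each indexer: summarize local events into sufficient statistics."""
    partial = Counter()
    for e in events:
        bucket = e["_time"] - (e["_time"] % BUCKET)
        partial[(bucket, e["status"])] += 1
    return partial  # (timestamp, status) -> count, far smaller than the raw events

def reduce_step(partials):
    """Runs on the searching node: merge the per-indexer tables."""
    final = Counter()
    for p in partials:
        final.update(p)
    return final

indexer_a = [{"_time": 5, "status": "200"}, {"_time": 70, "status": "404"}]
indexer_b = [{"_time": 10, "status": "200"}]
print(reduce_step([map_step(indexer_a), map_step(indexer_b)]))
# Counter({(0, '200'): 2, (60, '404'): 1})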
Searches benefit dramatically from the distribution of work. Those dominated by I/O costs (needle-in-a-haystack type searches) and those dominated by decompression costs (retrieving many records well distributed in time and not dense in arrival) both gain from Distributed Search: in the first case, the effective I/O bandwidth of the cluster scales linearly with the number of machines in the cluster; in the second case, the CPU resources scale linearly. For the majority of use cases (i.e., needle-in-a-haystack troubleshooting, or reporting and statistical aggregation), scalability is limited only by the number of indexers dedicated to the processing task.
Achieving this same task using the Splunk Search Language is far more efficient:
source=<filename> | stats count sum(_raw) sumsq(_raw)
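For comparison, the underlying divide-and-conquer computation that such a statistics search distributes can be written out by hand. The Python below is a minimal, illustrative sketch (not Splunk's implementation, with hypothetical numeric input values) showing why count, sum and sum of squares are sufficient statistics: the partial triples from each slice of the data can simply be added together, so the map steps can run in parallel.

def map_step(values):
    """Summarize one slice of records into (count, sum, sum of squares)."""
    n = s = sq = 0.0
    for v in values:
        n += 1
        s += v
        sq += v * v
    return (n, s, sq)

def reduce_step(partials):
    """Merge the per-slice triples into the final answer by columnwise addition."""
    return tuple(sum(col) for col in zip(*partials))

slice_a = [1.0, 2.0, 3.0]
slice_b = [4.0, 5.0]
count, total, total_sq = reduce_step([map_step(slice_a), map_step(slice_b)])
print(count, total, total_sq)   # 5.0 15.0 55.0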
Note: Published docs from Google tout Sawzall as far simpler to use than a MapReduce framework (like Hadoop) directly. According to Google, typical Sawzall programs are 10-20 times shorter than equivalent MapReduce programs in C++.
The efficiency and ease with which Splunk delivers scalability over massive datasets is not confined to this comparison. The contrast becomes sharper if one must properly extract value(s) from each line of the file. Splunk provides field extractions from files by delimiters or regular expressions, either automatically for all files of a certain type or on a search-by-search basis. Additionally, Splunk enables users to learn regular expressions interactively based on examples from within a data set, so the user and product build knowledge over time.