Google sent me a nice message to start the year – “Your inbox is reaching its limit”.
Looking at my GMail inbox I have well over 70k emails, taking up just under 15GB of space. I’m interested in how this number is made up – who emails me the most, who I email, what time I’m most productive, etc.
I decided to download my GMail archive using Google Takeout to analyse the data. Here’s how I did it.
Download Your Inbox
First, use Google Takeout to download your GMail mailbox. Depending on the number of emails you have accumulated, this might take a while. My ~15GB took about an hour.
Once complete, Google will give you a .zip file. Download and unzip it. You should see a file named something like “<my_gmail_inbox>.mbox”.
Upload the .mbox file to Splunk
If you're confident with editing props.conf directly, ignore the next paragraph.
Using the file uploader in Splunk, select your .mbox file using the option “Preview Data Before Indexing”. We will use the data preview to teach Splunk what .mbox events look like so that they are indexed correctly.
Using the “Advanced Mode” tab you can create the props.conf in the GUI. To get data indexing correctly, I suggest a props.conf structure similar to the following:
# Remove the [gmail-mbox] line below if using the Splunk GUI "Advanced Tab"
[gmail-mbox]
MAX_EVENTS = 100000
BREAK_ONLY_BEFORE = From\s.+?@
MAX_TIMESTAMP_LOOKAHEAD = 150
NO_BINARY_CHECK = 1
TRUNCATE = 100000
MAX_DAYS_AGO = 3652
Let me describe what is being set here:
- MAX_EVENTS = Specifies the maximum number of input lines to add to any event. Example=“100000”. Default=“256”. Some of my messages were over 1000 lines, so I went for 100 times that figure.
- BREAK_ONLY_BEFORE = Splunk creates a new event if it encounters a new line that matches the regular expression set. Example=“From\s.+?@”. This breaks the GMail events in the correct place (before the line starting: “From xxxx@…”).
- TRUNCATE = The default maximum line length (in bytes). Example=“100000”. Default=“10000”. The 100000 used in this example is unlikely to be exceeded unless a really messy message turns up.
- MAX_DAYS_AGO = Specifies the maximum number of days past, from the current date, that an extracted date can be valid. Example=“3652”. Default=“2000”. Given that I had messages older than 5 years (1826 days), I increased this to 10 years (3652 days).
More information can be found in the Splunk documentation for props.conf, which is worth reading.
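If you want to sanity-check the event-breaking pattern before uploading anything, here is a rough Python sketch that applies the same regular expression line by line. It is not how Splunk works internally. The helper name `split_events` and the sample messages are made up for illustration, and it uses `re.search` on the assumption that Splunk applies the pattern unanchored.

```python
import re

# Same pattern as BREAK_ONLY_BEFORE in the props.conf above.
BREAK_ONLY_BEFORE = re.compile(r"From\s.+?@")


def split_events(text):
    """Start a new event at every line that matches the break pattern."""
    events = []
    for line in text.splitlines():
        # Assumption: the pattern may match anywhere in the line (unanchored).
        if BREAK_ONLY_BEFORE.search(line) or not events:
            events.append([line])
        else:
            events[-1].append(line)
    return ["\n".join(event) for event in events]


# Made-up sample in the general shape of an .mbox file.
SAMPLE = """From 123@xxx Mon Jan 06 09:15:02 2014
X-Gmail-Labels: Inbox,Important
From: Alice Example <alice@example.com>
To: me@example.com
Subject: Hello

Body of the first message.
From 456@xxx Tue Jan 07 18:40:11 2014
X-Gmail-Labels: Sent
From: me@example.com
To: bob@example.org

Body of the second message.
"""

events = split_events(SAMPLE)
print(len(events))  # 2
```

Note that the “From:” header does not trigger a break, because the pattern requires whitespace straight after “From”. A body line containing something like “From bob@…” would, though, so anchoring the pattern with ^ may be worth trying if you see messages split in odd places.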
Indexing the data
Now all you need to do is set this input as a new sourcetype (in the props.conf above I’ve used “gmail-mbox”) and then upload the file into Splunk.
A simple search for “sourcetype=gmail-mbox” should show all your events indexed and broken apart nicely.
As you can see from the screenshot above, the events can vary quite drastically in length, e.g. from a 21-line event to an 821-line event. I have a number of events which are thousands of lines long (mainly the result of email bodies filled with HTML).
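To get a feel for how large MAX_EVENTS and TRUNCATE need to be for your own mailbox, a small Python sketch like the following can scan the raw .mbox file. The function name `mbox_line_stats` is hypothetical, and it assumes each message starts at a line beginning with “From ”, which is a simplification of the mbox format.

```python
def mbox_line_stats(path):
    """Return (max_lines_per_message, longest_line_in_bytes) for an mbox file."""
    max_lines = 0
    longest_line = 0
    current = 0
    with open(path, "rb") as handle:
        for raw in handle:
            # Assumption: a line starting with "From " begins a new message.
            if raw.startswith(b"From ") and current:
                max_lines = max(max_lines, current)
                current = 0
            current += 1
            # Measure the line without its trailing newline characters.
            longest_line = max(longest_line, len(raw.rstrip(b"\r\n")))
    # Account for the final message in the file.
    return max(max_lines, current), longest_line
```

Calling `mbox_line_stats("my_gmail_inbox.mbox")` (substitute your own file name) gives two numbers: compare the first against MAX_EVENTS and the second against TRUNCATE.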
The histogram returned immediately gives us a good indication of month-on-month message volume. Note, this search shows both sent and received messages from your GMail account.
Field extraction
You’ll see that fields will not have been extracted correctly from your events, so we need to teach Splunk what this new .mbox format looks like.
For this first exercise I am only interested in the “labels”, “to”, and “from” fields. Here are the extractions I used in my props.conf:
# Remove the [gmail-mbox] line below if using the Splunk GUI "Advanced Tab"
[gmail-mbox]
# ... variables set earlier ...
EXTRACT-gmail-mbox-labels = X-Gmail-Labels\s*:\s*(?P<gmail_labels>[\w]+,[\w]+)
EXTRACT-gmail-mbox-from = From\s*:\s*(.*?)(?P<gmail_from>[\w]+@[\w]+\.[\w]+)
EXTRACT-gmail-mbox-to = To\s*:\s*(.*?)(?P<gmail_to>[\w]+@[\w]+\.[\w]+)
As you can see my regular expression skills are weak and I’m sure you can improve upon these extractions. I needed a fair bit of help just to get this far.
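A quick way to see where these extractions fall short is to try them on a few sample headers outside Splunk. The sketch below uses Python's `re` module, which accepts the same `(?P<name>...)` named-group syntax. The patterns are modelled on the extractions above, with the dot before the top-level domain escaped and the labels group named gmail_labels to match the searches in this post. The sample headers and the `extract` helper are made up for illustration.

```python
import re

# Patterns modelled on the EXTRACT- settings above.
LABELS_RE = re.compile(r"X-Gmail-Labels\s*:\s*(?P<gmail_labels>[\w]+,[\w]+)")
FROM_RE = re.compile(r"From\s*:\s*(.*?)(?P<gmail_from>[\w]+@[\w]+\.[\w]+)")
TO_RE = re.compile(r"To\s*:\s*(.*?)(?P<gmail_to>[\w]+@[\w]+\.[\w]+)")


def extract(pattern, field, text):
    """Return the named group from the first match, or None if no match."""
    match = pattern.search(text)
    return match.group(field) if match else None


# Made-up headers for illustration.
HEADERS = (
    "X-Gmail-Labels: Inbox,Important\n"
    "From: Alice Example <alice@example.com>\n"
    "To: me@example.com\n"
)

print(extract(LABELS_RE, "gmail_labels", HEADERS))
print(extract(FROM_RE, "gmail_from", HEADERS))
print(extract(TO_RE, "gmail_to", HEADERS))

# Two cases that show the limits of these patterns.
print(extract(LABELS_RE, "gmail_labels", "X-Gmail-Labels: Sent\n"))
print(extract(FROM_RE, "gmail_from", "From: john.smith@mail.example.co.uk\n"))
```

The first three calls return the values you would hope for. The last two show the weak spots: a message with a single label yields no labels field at all (the pattern insists on a comma), and an address with dots in the local part or several dots in the domain comes back truncated as “smith@mail.example”.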
If anyone wants to share how they would pull out fields from GMail’s .mbox file format (or a similar email format for that matter), join the conversation over on Splunk Answers or leave a comment on the post. Lots of kudos on offer.
Search on
Here are some example searches to get you started.
Number of emails you’ve sent:
sourcetype="gmail-mbox" gmail_labels=*Sent* | stats count
People you’ve received the most emails from:
sourcetype="gmail-mbox" NOT gmail_from=my@email.com | top limit=10 gmail_from
People you’ve sent the most emails to:
sourcetype="gmail-mbox" NOT gmail_to=my@email.com | top limit=10 gmail_to
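If you want something to compare the Splunk results against, the same questions can be asked of the raw .mbox file with Python's standard `mailbox` module. This is only a cross-check sketch, not a replacement for the searches above: the function name `top_correspondents` is hypothetical, and it decides whether a message was sent by you by looking at the From header rather than at labels.

```python
import mailbox
from collections import Counter
from email.utils import getaddresses, parseaddr


def top_correspondents(path, my_address, limit=10):
    """Return (top senders to me, top recipients of my mail) as (address, count) lists."""
    received_from = Counter()
    sent_to = Counter()
    me = my_address.lower()
    # create=False avoids creating an empty file if the path is wrong.
    box = mailbox.mbox(path, create=False)
    try:
        for message in box:
            sender = parseaddr(str(message.get("From", "")))[1].lower()
            to_headers = [str(value) for value in message.get_all("To", [])]
            recipients = [addr.lower() for _, addr in getaddresses(to_headers)]
            if sender == me:
                # Mail I sent: count every recipient in the To header.
                sent_to.update(addr for addr in recipients if addr)
            elif sender:
                # Mail from someone else: count the sender.
                received_from[sender] += 1
    finally:
        box.close()
    return received_from.most_common(limit), sent_to.most_common(limit)
```

For example, `top_correspondents("my_gmail_inbox.mbox", "my@email.com")` returns two lists of (address, count) pairs. Expect the numbers to differ a little from Splunk's, since the field extractions above only capture one address per header.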
To do
- More interesting queries
- Fine tune existing extractions
- Add more extractions