Unstructured log file parsing
In this tutorial, I will show you some basics of processing text files (such as log files) in Python.
Text files can be roughly classified into two categories: structured and non-structured. Structured text files are relatively straightforward to process because they follow known conventions. Non-structured text files are a bit more complex: you need to find the structure yourself before you can process them automatically. Let's start with the non-structured kind.
Non-structured text file:
There are always some queer fishes......
-----------snippet from a simulation ------
[5/15/2013 2:17:26 PM] Session Start
[5/15/2013 2:17:26 PM] Leaving sequence: loadXML, moving forward.
[5/15/2013 2:17:30 PM] Player submitted name: Carl
[5/15/2013 2:17:30 PM] Leaving sequence: InputNameScreen, moving forward.
[5/15/2013 2:17:31 PM] Player submitted name: Carl
[5/15/2013 2:17:31 PM] Leaving sequence: startScreen, moving forward.
[5/15/2013 2:17:50 PM] Player submitted name: Carl
[5/15/2013 2:17:50 PM] Leaving sequence: slide2, moving forward.
[5/15/2013 2:17:55 PM] Player submitted name: Carl
[5/15/2013 2:17:55 PM] Leaving sequence: slide2b, moving forward.
[5/15/2013 2:18:34 PM] Player submitted name: Carl
[5/15/2013 2:18:34 PM] Leaving sequence: slide2c, moving forward.
[5/15/2013 2:20:09 PM] Player submitted name: Carl
[5/15/2013 2:20:09 PM] Leaving sequence: slide3, moving forward.
[5/15/2013 2:20:13 PM] Player submitted name: Carl
[5/15/2013 2:20:13 PM] Leaving sequence: slide4, moving forward.
How to approach it? Python is your friend!
1. Choose the necessary packages
import pandas as pd
import numpy as np
from datetime import datetime
2. Read the file
with open('unstructured_example_log.txt') as f:
    txt = f.readlines()
3. Check the length and contents of the txt
len(txt)
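To check the contents as well as the length, it helps to print a few lines with repr(), which makes invisible characters such as the trailing newline visible. The sketch below uses a small hard-coded list (the first lines of the snippet above) as a stand-in for the result of f.readlines(), so it runs without the log file.

```python
# Stand-in for the list returned by f.readlines(); each line still ends with '\n'
txt = [
    "[5/15/2013 2:17:26 PM] Session Start\n",
    "[5/15/2013 2:17:26 PM] Leaving sequence: loadXML, moving forward.\n",
    "[5/15/2013 2:17:30 PM] Player submitted name: Carl\n",
]

print(len(txt))          # number of lines read
for line in txt[:3]:
    print(repr(line))    # repr() shows the trailing '\n' explicitly
```

The visible '\n' at the end of each line is what the next step removes.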
4. Clean the list
n = len(txt)
for i in range(n):
    txt[i] = txt[i].strip()
Or, more elegantly, with a list comprehension:
txt = [t.strip() for t in txt]
5. Chunking
txt[0].split(']')
txt[0].split(']')[1]
txt[0].split(']')[1].upper()
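As an alternative to chained split() calls, the same chunks can be pulled out with one regular expression. This is only a sketch: the pattern and the helper name split_line are my own, and they assume every line looks like "[timestamp] event" or "[timestamp] event: result", as in the snippet above.

```python
import re

# Assumed line layout: "[timestamp] event" or "[timestamp] event: result"
LINE_RE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\s*(?P<event>[^:]+?)(?::\s*(?P<result>.*))?$"
)

def split_line(line):
    """Hypothetical helper: return (timestamp, event, result) or None if no match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return m.group("timestamp"), m.group("event"), m.group("result")

print(split_line("[5/15/2013 2:17:26 PM] Session Start"))
print(split_line("[5/15/2013 2:17:30 PM] Player submitted name: Carl"))
```

The result is None when a line has no ": result" part, and the whole return value is None for lines that do not match the assumed layout, which makes odd lines easy to spot.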
6. Datetime
s = txt[0].split(']')[0].strip('[')
# %H -> 24-hour clock, %I -> 12-hour clock (used together with %p for AM/PM)
# For ISO 8601 timestamps, use: %Y-%m-%dT%H:%M:%S.%f%z
dtfmt = '%m/%d/%Y %I:%M:%S %p'
dt = datetime.strptime(s, dtfmt)
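To see what the format codes do, here is a small sketch that parses the same instant written three ways. Only the first string comes from the log; the 24-hour and ISO 8601 strings are made up for illustration, and the AM/PM parsing assumes an English locale.

```python
from datetime import datetime

# 12-hour clock with AM/PM, as in the log above
dt_12h = datetime.strptime("5/15/2013 2:17:26 PM", "%m/%d/%Y %I:%M:%S %p")

# Same instant on a 24-hour clock (illustrative string, not from the log)
dt_24h = datetime.strptime("5/15/2013 14:17:26", "%m/%d/%Y %H:%M:%S")

# ISO 8601 style with microseconds and a UTC offset (illustrative string)
dt_iso = datetime.strptime("2013-05-15T14:17:26.000000+0000", "%Y-%m-%dT%H:%M:%S.%f%z")

print(dt_12h, dt_24h, dt_iso)
print(dt_12h == dt_24h)
```

Note that %z produces a timezone-aware datetime, while the first two are naive, so dt_iso cannot be compared or subtracted with them directly.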
7. Put into a data frame
First, we create an empty list for each column:
col1 = []
col2 = []
col3 = []
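Then, we fill the columns line by line:
for line in txt:
    # Timestamp: the text between '[' and the first ']'
    s1 = line.split(']')[0].strip('[')
    dt = datetime.strptime(s1, dtfmt)
    col1.append(dt)
    # Message: split on the first ':' only, so a result containing ':' stays intact
    s = line.split(']', 1)[1].strip().split(':', 1)
    col2.append(s[0].strip())
    if len(s) == 2:
        col3.append(s[1].strip())
    else:
        # No ':' in the message (e.g. "Session Start"), so there is no result
        col3.append(np.nan)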
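Finally, we create the data frame:
# Building from a dict gives each column its own dtype (datetime column included);
# transposing a list of lists would leave every column as generic 'object'.
df = pd.DataFrame({'datetime': col1, 'event_name': col2, 'event_result': col3})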
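The steps so far can also be wrapped into one small function that turns a line into a record, which keeps the loop short. This is a sketch under the same assumptions about the line layout; parse_line and DTFMT are names I made up, and the sample lines are taken from the snippet above.

```python
from datetime import datetime

import numpy as np
import pandas as pd

DTFMT = "%m/%d/%Y %I:%M:%S %p"

def parse_line(line):
    """Hypothetical helper: turn one log line into a dict (one row)."""
    stamp, _, rest = line.strip().partition("]")
    name, sep, result = rest.strip().partition(":")
    return {
        "datetime": datetime.strptime(stamp.lstrip("["), DTFMT),
        "event_name": name.strip(),
        # sep is '' when the message has no ':', i.e. no result part
        "event_result": result.strip() if sep else np.nan,
    }

sample = [
    "[5/15/2013 2:17:26 PM] Session Start",
    "[5/15/2013 2:17:26 PM] Leaving sequence: loadXML, moving forward.",
    "[5/15/2013 2:17:30 PM] Player submitted name: Carl",
]

df = pd.DataFrame([parse_line(line) for line in sample])
print(df.dtypes)
```

A list of dicts lets pandas infer the dtype of each column separately, so the datetime column is ready for arithmetic.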
8. Normalize datetime to seconds
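# Elapsed time since the first log entry
t = pd.to_datetime(df['datetime'])
df['delta_t'] = t - t.iloc[0]
Then convert that to seconds:
# .ix was removed in pandas 1.0; the .dt accessor does the conversion without a loop.
# total_seconds() also counts whole days, which the .seconds attribute would drop.
df['delta_t_seconds'] = pd.to_timedelta(df['delta_t']).dt.total_seconds()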
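One detail worth knowing here is the difference between the .seconds attribute and total_seconds(): .seconds is only the seconds component within a day, so it silently wraps around for logs that run past 24 hours. The sketch below shows this with three timestamps; the third one (a day later) is hypothetical and not from the log above.

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime([
    "2013-05-15 14:17:26",
    "2013-05-15 14:17:30",
    "2013-05-16 14:17:31",  # hypothetical entry one day later
]))

delta = stamps - stamps.iloc[0]

component = delta.dt.seconds       # seconds within the day only
total = delta.dt.total_seconds()   # full elapsed time in seconds

print(pd.DataFrame({"delta": delta, "seconds": component, "total_seconds": total}))
```

For a session that stays within one day, as in this tutorial, both give the same numbers.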
9. Save to csv file
df.to_csv('test_log.csv', index=False)
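When you load the CSV again later, the datetime column comes back as plain text unless you ask pandas to parse it. A minimal sketch of the round trip, using an in-memory buffer instead of a real file and a two-row frame built from the first lines of the log:

```python
import io

import pandas as pd

df = pd.DataFrame({
    "datetime": pd.to_datetime(["2013-05-15 14:17:26", "2013-05-15 14:17:26"]),
    "event_name": ["Session Start", "Leaving sequence"],
    "event_result": [None, "loadXML, moving forward."],
})

# Write to an in-memory buffer (stands in for 'test_log.csv')
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates turns the column back into datetimes on load
restored = pd.read_csv(buffer, parse_dates=["datetime"])
print(restored.dtypes)
```

The missing result is written as an empty field and read back as NaN, and the comma inside the result text is quoted automatically.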