Unstructured log file parsing

In this tutorial, I will show you the basics of processing text files (such as log files) in Python.

Text files fall roughly into two categories: structured and unstructured. Structured text files are relatively straightforward to process because they follow known conventions. Unstructured text files are more complex: you have to work out the underlying patterns yourself before you can process them automatically. Let's start with the unstructured kind.

Unstructured text file:

There are always some odd fish...

-----------snippet from a simulation ------

[5/15/2013 2:17:26 PM] Session Start

[5/15/2013 2:17:26 PM] Leaving sequence: loadXML, moving forward.

[5/15/2013 2:17:30 PM] Player submitted name: Carl

[5/15/2013 2:17:30 PM] Leaving sequence: InputNameScreen, moving forward.

[5/15/2013 2:17:31 PM] Player submitted name: Carl

[5/15/2013 2:17:31 PM] Leaving sequence: startScreen, moving forward.

[5/15/2013 2:17:50 PM] Player submitted name: Carl

[5/15/2013 2:17:50 PM] Leaving sequence: slide2, moving forward.

[5/15/2013 2:17:55 PM] Player submitted name: Carl

[5/15/2013 2:17:55 PM] Leaving sequence: slide2b, moving forward.

[5/15/2013 2:18:34 PM] Player submitted name: Carl

[5/15/2013 2:18:34 PM] Leaving sequence: slide2c, moving forward.

[5/15/2013 2:20:09 PM] Player submitted name: Carl

[5/15/2013 2:20:09 PM] Leaving sequence: slide3, moving forward.

[5/15/2013 2:20:13 PM] Player submitted name: Carl

[5/15/2013 2:20:13 PM] Leaving sequence: slide4, moving forward.

How to approach it? Python is your friend!

1. Import the necessary packages

import pandas as pd

import numpy as np

from datetime import datetime

2. Read the file

with open('unstructured_example_log.txt') as f:

    txt = f.readlines()

3. Check the length and contents of txt

len(txt)

txt[:5]   # peek at the first five lines

4. Clean the list

n = len(txt)

for i in range(n):

    txt[i] = txt[i].strip()

or, more elegantly, with a list comprehension:

txt = [t.strip() for t in txt]
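Steps 2 to 4 can also be combined into a single pass over the file. The following is only a sketch: the helper name `read_clean_lines` is made up for illustration, and dropping empty lines is an assumption that helps only if your log actually contains them.

```python
import os
import tempfile


def read_clean_lines(path):
    """Return the stripped, non-empty lines of a text file."""
    with open(path, encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]


# Demo on a throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('[5/15/2013 2:17:26 PM] Session Start\n\n'
              '[5/15/2013 2:17:30 PM] Player submitted name: Carl\n')
    demo_path = tmp.name

demo_lines = read_clean_lines(demo_path)
os.remove(demo_path)
print(demo_lines)
```

Iterating over the file object directly avoids holding both the raw and the cleaned list in memory, which starts to matter for large logs.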

5. Chunking

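# split the first line at ']' into the timestamp part and the message part
txt[0].split(']')

# keep only the message part
txt[0].split(']')[1]

# string methods can be chained, e.g. to upper-case the message
txt[0].split(']')[1].upper()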
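As an alternative to chained `split` calls, `str.partition` always returns exactly three parts, so a line without a `:` needs no special handling. This is a sketch, not the only way to do it; `split_line` is a hypothetical helper name.

```python
def split_line(line):
    """Split '[timestamp] event: detail' into (timestamp, event, detail).

    detail is None when the message contains no ':'.
    """
    stamp, _, rest = line.strip().partition(']')
    event, _, detail = rest.strip().partition(':')
    return stamp.lstrip('['), event.strip(), detail.strip() or None


print(split_line('[5/15/2013 2:17:26 PM] Session Start'))
print(split_line('[5/15/2013 2:17:30 PM] Player submitted name: Carl'))
```

Because `partition` splits at the first occurrence only, the colons inside the timestamp are untouched as long as the line is split at `]` first.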

6. Datetime

s = txt[0].split(']')[0].strip('[')

# %H -> 24-hour clock, %I -> 12-hour clock (used together with %p for AM/PM)

# for ISO 8601 timestamps, use '%Y-%m-%dT%H:%M:%S.%f%z' instead

dtfmt = '%m/%d/%Y %I:%M:%S %p'

dt = datetime.strptime(s, dtfmt)
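To see what the format string does, here is a small worked example. The wrapper `parse_timestamp` is a hypothetical helper; returning `None` for unparseable text is one possible design choice, and the AM/PM matching of `%p` assumes an English (or C) locale.

```python
from datetime import datetime

DTFMT = '%m/%d/%Y %I:%M:%S %p'   # same pattern as dtfmt above


def parse_timestamp(text, fmt=DTFMT):
    """Return a datetime, or None if text does not match fmt."""
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        return None


afternoon = parse_timestamp('5/15/2013 2:17:26 PM')
print(afternoon.isoformat())               # 2013-05-15T14:17:26

midnight = parse_timestamp('5/16/2013 12:00:05 AM')
print(midnight.hour)                       # 0

print(parse_timestamp('not a timestamp'))  # None
```

Note that `strptime` accepts month, day, and hour values without a leading zero, which is why `5/15/2013 2:17:26 PM` parses with `%m`, `%d`, and `%I`.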

7. Put into a data frame

First, we create an empty list for each column:

col1 = []

col2 = []

col3 = []

Then we fill the columns:

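for line in txt:

    if not line:

        continue   # skip blank lines, if any

    # timestamp: the text between '[' and ']'
    s1 = line.split(']')[0].strip('[')

    dt = datetime.strptime(s1, dtfmt)

    col1.append(dt)

    # message: split at the first ':' only, so results that contain ':' stay intact
    s = line.split(']')[1].strip().split(':', 1)

    col2.append(s[0].strip())

    if len(s) == 2:

        col3.append(s[1].strip())

    else:

        # no ':' in the message, so there is no event result
        col3.append(np.nan)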

Finally, we create the DataFrame:

# building from a dict keeps each column's own dtype (datetime64 for 'datetime')
df = pd.DataFrame({'datetime': col1, 'event_name': col2, 'event_result': col3})
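The whole of step 7 can also be wrapped in one function that collects a dict per line and lets pandas build the columns. This is a sketch under the assumption that every non-empty line follows the `[timestamp] event: result` pattern; `parse_lines` and `sample` are illustrative names, not part of the original script.

```python
from datetime import datetime

import pandas as pd

DTFMT = '%m/%d/%Y %I:%M:%S %p'


def parse_lines(lines):
    """Turn raw log lines into a DataFrame with one row per event."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        stamp, _, rest = line.partition(']')
        event, _, result = rest.strip().partition(':')
        records.append({
            'datetime': datetime.strptime(stamp.lstrip('['), DTFMT),
            'event_name': event.strip(),
            'event_result': result.strip() or None,  # None when there is no ':'
        })
    return pd.DataFrame(records,
                        columns=['datetime', 'event_name', 'event_result'])


sample = [
    '[5/15/2013 2:17:26 PM] Session Start',
    '',
    '[5/15/2013 2:17:26 PM] Leaving sequence: loadXML, moving forward.',
    '[5/15/2013 2:17:30 PM] Player submitted name: Carl',
]
df_sketch = parse_lines(sample)
print(df_sketch)
print(df_sketch.dtypes)
```

Building the frame from a list of dicts means the column names are stated once, next to the values they describe, and no transpose is needed.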

8. Normalize datetime to seconds

df['delta_t'] = df['datetime'] - df['datetime'].iloc[0]

Then we convert that to seconds:

# total_seconds() counts whole days too, whereas .seconds only holds the seconds component
df['delta_t_seconds'] = pd.to_timedelta(df['delta_t']).dt.total_seconds()
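A note on converting time differences to seconds: the `.seconds` attribute of a timedelta holds only the seconds component (0 to 86399) and ignores whole days, while `.total_seconds()` returns the full duration. For a short session like the one above the two agree, but they diverge once a log spans more than a day. The values below are made up purely to illustrate the difference.

```python
from datetime import timedelta

gap = timedelta(days=1, seconds=5)   # hypothetical gap between two log entries

print(gap.seconds)          # 5
print(gap.total_seconds())  # 86405.0
```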

9. Save to csv file

df.to_csv('test_log.csv', index=False)
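When the CSV file is read back later, the datetime column comes in as plain text unless pandas is asked to parse it. The sketch below shows a round trip with `parse_dates`; it uses an in-memory buffer and a tiny made-up frame (`df_out`) instead of `test_log.csv` so that it runs on its own.

```python
import io

import pandas as pd

# Small stand-in for the frame built above (illustrative values only).
df_out = pd.DataFrame({
    'datetime': pd.to_datetime(['2013-05-15 14:17:26', '2013-05-15 14:17:30']),
    'event_name': ['Session Start', 'Player submitted name'],
})

buf = io.StringIO()
df_out.to_csv(buf, index=False)
buf.seek(0)

# parse_dates converts the named column back to datetime64 on load.
df_in = pd.read_csv(buf, parse_dates=['datetime'])
print(df_in.dtypes)
```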