Convert key-value to flat file
Post date: Oct 24, 2015 4:41:27 AM
We want to convert key-value file into a flat format file. Each key-value record is in the format of key:value and the key-value records are separated by a delimiter ','
. Each line of the file will be map to one record in the output and will be rearranged according to the column sequence defined by the user. The code still works even when the number of records in each row not equal or the order of the keys are different by each row. For example,
Input file: test_example.csv
user:123,age:23,sex:F,city:Boston
user:562,age:45,sex:M,state:WA
user:7843,sex:M,state:FL,city:Miami
user:507,state:WA,age:45,city:Seattle,sex:M
city:Woodinville,sex:F,user:95402,state:,age:15
desired output
user,age,sex,state,city
123,23,F,None,Boston
562,45,M,WA,None
7843,None,M,FL,Miami
507,45,M,WA,Seattle
95402,15,F,None,Woodinville
Follow steps here:
- Determine the header variables. First, we list all the unique variables in the input file
test_example.csv
using this command:
sed 's/\,/\n/g' test_example.csv | perl -pe 's/:(.*?)$//g' | sort | uniq > test_example_header
- The output would look like this:
age
city
sex
state
user
- But, you might want to rearrange or even exclude some variables in the output, so you will do it in this step. I use
vi
to edit the header variables and here is the editedtest_example_header
. This header file will be used in the next step to populate the variables from the original input file with the same order as determined by the user.- user
age
sex
state
city
- Run this program to convert from key-value format to flat file.
python3.3 create_flat_file.py -h test_example_header -c test_example.csv -o output -b 10000 -d , -k : -m 'None'
- The python code is included in Appendix A. The arguments are as follow:
- -h <header_file> -c <content_file> -o <output_file> -b <buffer_size> # 10000 is default and recommended -d <delim_col> # user:507,state:WA,age:45 delim_col is comma (,) -k <delim_key> # user:507,state:WA,age:45 delim_key is colon (:) -m <miss_fill> # what to add in case of missing value
- Concatenate header file (output.sig) and the content file (output)
Appendix A: create_flat_file.py
Appendix A: create_flat_file.py
#!/apollo/env/MLEnvImprovement/bin/python3.3 import sys, getopt def create_header_list(path_to_header): """ This is a helper function to convert a header file (one variable for each row) to a list of header. """ # make header list var_list = [] for line in open(path_to_header): var = line.strip() print("var: %s, len: %d" % (var, len(var)) ) if len(var) > 0: var_list.append(var) # display the variable list print(var_list) return var_list def convert_to_flat(path_to_header, path_to_content, path_to_output, buffer_size=10000, delim_col=',', delim_key=':', missing_fill='None'): """ This program converts key:value file into csv flat file. Given the header (path_to_header), the program read and convert the content file (path_to_content) line by line. I ran the program on a 51GB (27M rows x >30 columns) and it took 2 hours. Example: Input: user:123,age:23,sex:F,city:Boston user:562,age:45,sex:M,state:WA user:7843,sex:M,state:FL,city:Miami [intentionally blanked] user:507,state:WA,age:45,city:Seattle,sex:M city:Woodinville,sex:F,user:95402,state:,age:15 [intentionally blanked] [intentionally blanked] Header output: user,age,sex,state,city Content output: 123,23,F,None,Boston 562,45,M,WA,None 7843,None,M,FL,Miami 507,45,M,WA,Seattle 95402,15,F,None,Woodinville kittipat@gmail.com 2015-10-23 """ # create header list var_list = create_header_list(path_to_header) # output the header header_file = open(path_to_output + '.sig', 'a') header_file.write(','.join(var_list)) header_file.close() # open the output file out_file = open(path_to_output, 'a') str_buffer = '' # initialize dict d = {v: None for v in var_list} # For loop to fill each variable, read file line by line cnt = 1 for line in open(path_to_content): # reset hash values d = d.fromkeys(d, None) # loop to hash key-value if (len(line) > 0) & (delim_key in line): # put into hash for chunk in line.split(delim_col): s = chunk.split(delim_key) d[s[0]] = s[1].strip() # loop to write each variable into a line my_line = [] for var in var_list: if d[var]: my_line.append(d[var]) else: my_line.append(missing_fill) # write to string buffer str_buffer = str_buffer + ','.join(my_line) + '\n' # write to a file when buffer is full if ( cnt % buffer_size) == 0: print('@line: %d' % cnt) out_file.write(str_buffer) str_buffer = '' # reset the string buffer #else: print("Warning 'Invalid line' -- line: %d, len: %d, content: \"%s\"" % (cnt, len(line), line)) # increment cnt = cnt + 1 # if buffer is not written yet if len(str_buffer) > 0: out_file.write(str_buffer) # close the file handle out_file.close() return 1 def main(argv): # Set default values path_to_header = None path_to_content = None path_to_output = None buffer_size = 10000 delim_col = ',' delim_key = ':' missing_fill = 'None' try: opts, args = getopt.getopt(argv,"h:c:o:b:d:k:m:",["hfile=","cfile=","ofile=","buffer=","delim_col=","delim_key=","miss_fill="]) except getopt.GetoptError: print('create_flat_file.py -h <header_file> -c <content_file> -o <output_file> -b <buffer_size> -d <delim_col> -k <delim_key> -m <miss_fill>') sys.exit(2) # loop to receive value from arg for opt, arg in opts: if opt in ("-h", "--hfile"): path_to_header = arg elif opt in ("-c", "--cfile"): path_to_content = arg elif opt in ("-o", "--ofile"): path_to_output = arg elif opt in ("-b", "--buffer"): buffer_size = int(arg) elif opt in ("-d", "--delim_col"): delim_col = arg elif opt in ("-k", "--delim_key"): delim_key = arg elif opt in ("-m", "--miss_fill"): missing_fill = arg # display the argument values print('hfile: %s' % path_to_header) print('cfile: %s' % path_to_content) print('ofile: %s' % path_to_output) print('buffer: %s' % buffer_size) print('delim_col: %s' % delim_col) print('delim_key: %s' % delim_key) print('miss_fill: %s' % missing_fill) # create the flat file convert_to_flat(path_to_header , path_to_content , path_to_output , buffer_size , delim_col , delim_key , missing_fill) if __name__ == "__main__": main(sys.argv[1:])