
07-Files




File Processing


• A text file can be thought of as a sequence of lines


Opening a File


• Before we can read the contents of the file, we must tell Python which file we are going to work with and what we will be doing with the file
• This is done with the open() function
• open() returns a “file handle” - a variable used to perform operations on the file
• Similar to “File -> Open” in a Word Processor


Using open()


• handle = open(filename, mode)
> returns a handle used to manipulate the file
> filename is a string
> mode is optional and should be 'r' if we are planning to read the file and 'w' if we are going to write to the file
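
The two modes can be sketched as follows. This is a minimal illustration written for Python 3 (print as a function); the file name demo.txt and the temporary directory are assumptions chosen so that no real file is overwritten.

```python
import os
import tempfile

# Hypothetical file name inside a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# Mode 'w' opens the file for writing, creating it if it does not exist.
handle = open(path, 'w')
handle.write('first line\nsecond line\n')
handle.close()

# Mode 'r' opens the file for reading; it is also the default when mode is omitted.
handle = open(path, 'r')
text = handle.read()
handle.close()

print(text)
```

Closing the handle after writing matters here: the data is only guaranteed to be on disk once the file has been closed.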


What is a Handle


>>> fhand = open('mbox.txt')
>>> print fhand
<open file 'mbox.txt', mode 'r' at 0x1005088b0>


When Files are Missing


>>> fhand = open('stuff.txt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory: 'stuff.txt'


The newline Character


• We use a special character called the “newline” to indicate when a line ends
• We represent it as \n in strings
• Newline is still one character - not two

>>> stuff = 'Hello\nWorld!'
>>> stuff
'Hello\nWorld!'
>>> print stuff
Hello
World!
>>> stuff = 'X\nY'
>>> print stuff
X
Y
>>> len(stuff)
3

File Processing


• A text file has newlines at the end of each line
From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008\n
Return-Path: <postmaster@collab.sakaiproject.org>\n
Date: Sat, 5 Jan 2008 09:12:18 -0500\n
To: source@collab.sakaiproject.org\n
From: stephen.marquard@uct.ac.za\n
Subject: [sakai] svn commit: r39772 - content/branches/\n
\n
Details: http://source.sakaiproject.org/viewsvn/?view=rev&rev=39772\n


File Handle as a Sequence


• A file handle open for read can be treated as a sequence of strings where each line in the file is a string in the sequence
• We can use the for statement to iterate through a sequence
• Remember - a sequence is an ordered set

xfile = open('mbox.txt')
for cheese in xfile:
    print cheese
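
As a small illustration of treating a handle as a sequence, the sketch below uses io.StringIO as a stand-in for an open text file. That stand-in is an assumption made to keep the example self-contained; the two lines are copied from the mbox excerpt above. It is written for Python 3, where print is a function.

```python
import io

# io.StringIO acts like a text file that is already open for reading.
xfile = io.StringIO(
    'From: stephen.marquard@uct.ac.za\n'
    'To: source@collab.sakaiproject.org\n'
)

lines = []
for cheese in xfile:       # each pass through the loop yields one line
    lines.append(cheese)
    print(repr(cheese))    # repr() makes the trailing \n visible
```

Each string handed to the loop still carries its newline, which is why printing the lines directly produces extra blank lines.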

Counting Lines in a File


• Open a file read-only
• Use a for loop to read each line
• Count the lines and print out the number of lines

fhand = open('mbox.txt')
count = 0
for line in fhand:
    count = count + 1
print 'Line Count:', count

$ python open.py
Line Count: 132045
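
The same count can be written more compactly. The sketch below is one possible variant, not the version from the slides: it uses a with block so the file is closed automatically, and a three-line sample file (a hypothetical stand-in for mbox.txt) so it runs anywhere. It assumes Python 3.

```python
import os
import tempfile

# Build a small three-line sample file in a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w') as out:
    out.write('one\ntwo\nthree\n')

# The with statement closes the file when the block ends.
with open(path) as fhand:
    count = sum(1 for line in fhand)   # add 1 for every line in the file

print('Line Count: ' + str(count))
```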


Reading the *Whole* file


• We can read the whole file (newlines and all) into a single string

>>> fhand = open('mbox-short.txt')
>>> inp = fhand.read()
>>> print len(inp)
94626
>>> print inp[:20]
From stephen.marquar
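
A minimal sketch of what read() returns, assuming Python 3 and a tiny sample file created just for the example (hello.txt is a hypothetical name). It also shows that a second read() on the same handle returns an empty string, because the handle is already at the end of the file.

```python
import os
import tempfile

# Write a two-line sample file in a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), 'hello.txt')
with open(path, 'w') as out:
    out.write('Hello\nWorld!\n')

fhand = open(path)
inp = fhand.read()      # the whole file as one string, newlines included
again = fhand.read()    # nothing is left to read, so this is ''
fhand.close()

print(len(inp))         # 13: 11 visible characters plus 2 newlines
print(inp[:5])          # Hello
```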


Searching Through a File


• We can put an if statement in our for loop to only print lines that meet some criteria

fhand = open('mbox-short.txt')
for line in fhand:
    if line.startswith('From:') :
        print line

OOPS!

What are all these blank lines doing here?

From: stephen.marquard@uct.ac.za

From: louis@media.berkeley.edu

From: zqian@umich.edu

From: rjlowe@iupui.edu

...

• Each line from the file has a newline at the end
• The print statement adds a newline to each line

From: stephen.marquard@uct.ac.za\n
\n
From: louis@media.berkeley.edu\n
\n
From: zqian@umich.edu\n
\n
From: rjlowe@iupui.edu\n
\n
...


• We can strip the whitespace from the right-hand side of the string using the string method rstrip()
• The newline is considered “white space” and is stripped

fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    if line.startswith('From:') :
        print line

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
....
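
To see exactly what rstrip() removes, here is a short sketch comparing it with two related string methods. The sample line, with its extra spaces, is made up for illustration. Written for Python 3.

```python
line = '  From: zqian@umich.edu \n'   # sample line with extra spaces

right_stripped = line.rstrip()        # removes trailing spaces and the newline
newline_only = line.rstrip('\n')      # removes only the trailing newline
both_sides = line.strip()             # removes whitespace at both ends

print(repr(right_stripped))
print(repr(newline_only))
print(repr(both_sides))
```

Passing '\n' to rstrip() is useful when trailing spaces are meaningful and only the newline should go.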


Skipping with continue


• We can conveniently skip a line by using the continue statement

fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    if not line.startswith('From:') :
        continue
    print line


Using in to select lines


• We can look for a string anywhere in a line as our selection criteria

fhand = open('mbox-short.txt')
for line in fhand:
    line = line.rstrip()
    if '@uct.ac.za' not in line :
        continue
    print line

From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008
X-Authentication-Warning: set sender to stephen.marquard@uct.ac.za using -f
From: stephen.marquard@uct.ac.za
Author: stephen.marquard@uct.ac.za
From david.horwitz@uct.ac.za Fri Jan 4 07:02:32 2008
X-Authentication-Warning: set sender to david.horwitz@uct.ac.za using -f.


Bad File Names


Enter the file name: mbox.txt
There were 1797 subject lines in mbox.txt

Enter the file name: na na boo boo
File cannot be opened: na na boo boo

fname = raw_input('Enter the file name: ')
try:
    fhand = open(fname)
except:
    print 'File cannot be opened:', fname
    exit()

count = 0
for line in fhand:
    if line.startswith('Subject:') :
        count = count + 1
print 'There were', count, 'subject lines in', fname
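
The same idea can be packaged as a function. This is a sketch under stated assumptions rather than the version from the slides: the function name count_subject_lines and the sample file are hypothetical, and it catches IOError specifically instead of using a bare except. In Python 3, IOError is an alias of OSError, so a missing file (FileNotFoundError) is caught as well.

```python
import os
import tempfile


def count_subject_lines(fname):
    """Count lines starting with 'Subject:'; return None if the file cannot be opened."""
    try:
        fhand = open(fname)
    except IOError:            # in Python 3 this also covers FileNotFoundError
        return None
    with fhand:                # close the file when counting is done
        return sum(1 for line in fhand if line.startswith('Subject:'))


# Hypothetical sample file with two subject lines.
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'tiny-mbox.txt')
with open(path, 'w') as out:
    out.write('From: a@example.com\nSubject: one\n\nSubject: two\n')

found = count_subject_lines(path)
missing = count_subject_lines(os.path.join(folder, 'na na boo boo'))
print(found)
print(missing)
```

Returning None instead of calling exit() leaves the decision about what to do next to the caller.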


CSV File



Let's import our datafile mpg.csv, which contains fuel economy data for 234 cars.

  • mpg : miles per gallon
  • class : car classification
  • cty : city mpg
  • cyl : # of cylinders
  • displ : engine displacement in liters
  • drv : f = front-wheel drive, r = rear-wheel drive, 4 = 4wd
  • fl : fuel (e = ethanol E85, d = diesel, r = regular, p = premium, c = CNG)
  • hwy : highway mpg
  • manufacturer : automobile manufacturer
  • model : model of car
  • trans : type of transmission
  • year : model year



import csv

%precision 2

with open('mpg.csv') as csvfile:
    mpg = list(csv.DictReader(csvfile))
    
mpg[:3] # The first three dictionaries in our list.


csv.DictReader has read in each row of our csv file as a dictionary. len shows that our list is made up of 234 dictionaries.

len(mpg)
234

keys gives us the column names of our csv.

mpg[0].keys()
odict_keys(['', 'manufacturer', 'model', 'displ', 'year', 'cyl', 'trans', 'drv', 'cty', 'hwy', 'fl', 'class'])

This is how to find the average cty fuel economy across all cars. All values in the dictionaries are strings, so we need to convert to float.

sum(float(d['cty']) for d in mpg) / len(mpg)
16.86

Similarly this is how to find the average hwy fuel economy across all cars.

sum(float(d['hwy']) for d in mpg) / len(mpg)
23.44
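
Since both averages follow the same pattern, they can be wrapped in a helper. The sketch below is an illustration only: column_mean is a hypothetical name, and the three rows are invented sample values (not taken from mpg.csv), read from an in-memory string so the example runs without the data file.

```python
import csv
import io

# Invented sample data standing in for mpg.csv.
sample = io.StringIO(
    'manufacturer,cyl,cty,hwy\n'
    'audi,4,18,29\n'
    'audi,6,16,26\n'
    'ford,8,11,17\n'
)
rows = list(csv.DictReader(sample))


def column_mean(rows, column):
    """Average a column whose values are stored as strings."""
    return sum(float(d[column]) for d in rows) / len(rows)


cty_mean = column_mean(rows, 'cty')   # (18 + 16 + 11) / 3
hwy_mean = column_mean(rows, 'hwy')   # (29 + 26 + 17) / 3
print(cty_mean)
print(hwy_mean)
```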

Use set to return the unique values for the number of cylinders the cars in our dataset have.

cylinders = set(d['cyl'] for d in mpg)
cylinders
{'4', '5', '6', '8'}

Here's a more complex example where we are grouping the cars by number of cylinders, and finding the average cty mpg for each group.

CtyMpgByCyl = []

for c in cylinders: # iterate over all the cylinder levels
    summpg = 0
    cyltypecount = 0
    for d in mpg: # iterate over all dictionaries
        if d['cyl'] == c: # if the cylinder level type matches,
            summpg += float(d['cty']) # add the cty mpg
            cyltypecount += 1 # increment the count
    CtyMpgByCyl.append((c, summpg / cyltypecount)) # append the tuple ('cylinder', 'avg mpg')

CtyMpgByCyl.sort(key=lambda x: x[0])
CtyMpgByCyl
[('4', 21.01), ('5', 20.50), ('6', 16.22), ('8', 12.57)]
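
The nested loops above scan the whole list once per cylinder value. An alternative, shown here as a sketch and not as the notebook's method, accumulates totals and counts in dictionaries during a single pass. The rows are invented sample values, and the name cty_mpg_by_cyl is hypothetical.

```python
from collections import defaultdict

# Invented sample rows; values are strings, as csv.DictReader would produce.
rows = [
    {'cyl': '4', 'cty': '18'},
    {'cyl': '4', 'cty': '22'},
    {'cyl': '6', 'cty': '16'},
    {'cyl': '8', 'cty': '11'},
    {'cyl': '8', 'cty': '13'},
]

totals = defaultdict(float)   # sum of cty mpg per cylinder value
counts = defaultdict(int)     # number of cars per cylinder value
for d in rows:
    totals[d['cyl']] += float(d['cty'])
    counts[d['cyl']] += 1

# Build sorted (cylinders, average cty mpg) tuples.
cty_mpg_by_cyl = sorted((c, totals[c] / counts[c]) for c in totals)
print(cty_mpg_by_cyl)
```

The keys are sorted as strings, which works for single-digit cylinder counts; converting them with int() would be needed if values such as '10' or '12' appeared.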

Use set to return the unique values for the class types in our dataset.

vehicleclass = set(d['class'] for d in mpg) # what are the class types
vehicleclass
{'2seater', 'compact', 'midsize', 'minivan', 'pickup', 'subcompact', 'suv'}

And here's an example of how to find the average hwy mpg for each class of vehicle in our dataset.

HwyMpgByClass = []

for t in vehicleclass: # iterate over all the vehicle classes
    summpg = 0
    vclasscount = 0
    for d in mpg: # iterate over all dictionaries
        if d['class'] == t: # if the vehicle class matches,
            summpg += float(d['hwy']) # add the hwy mpg
            vclasscount += 1 # increment the count
    HwyMpgByClass.append((t, summpg / vclasscount)) # append the tuple ('class', 'avg mpg')

HwyMpgByClass.sort(key=lambda x: x[1])
HwyMpgByClass
[('pickup', 16.88),
 ('suv', 18.13),
 ('minivan', 22.36),
 ('2seater', 24.80),
 ('midsize', 27.29),
 ('subcompact', 28.14),
 ('compact', 28.30)]



