UD-MiMeS - Arrays of Filenames

Tutorial on the creation of arrays of filenames

For notebooks 01 (Calibrations) and 02 (ScienceData), it is necessary to create arrays that contain the filenames of all the files to be considered.

This tutorial will teach you multiple ways this can be achieved, depending on your specific situation.

The simple but long way

Let's consider for this example that you need an array of filenames that has all of the flats.

All of the data is in a folder called /Users/vero/SARA/data/flats. The files are named Flats-001.fit, Flats-002.fit, etc., and let's say there are 10 of them.

The easy (but long) solution is to manually create the array:

filenames = [ '/Users/vero/SARA/data/flats/Flats-001.fit',
'/Users/vero/SARA/data/flats/Flats-002.fit',
'/Users/vero/SARA/data/flats/Flats-003.fit',
...
'/Users/vero/SARA/data/flats/Flats-010.fit' ]

Conceptually, this is by far the simplest way of doing this, but it of course requires a lot of copy-paste. For example, we repeat /Users/vero/SARA/data every time.

In Python, it is possible to add two strings together (this is called concatenation). For example, if a='123' and b='456', then print(a+b) will display 123456. Thus we could save ourselves some trouble by storing '/Users/vero/SARA/data' in a variable.

path = '/Users/vero/SARA/data'

filenames = [ path + '/flats/Flats-001.fit',
path + '/flats/Flats-002.fit',
path + '/flats/Flats-003.fit',
...
path + '/flats/Flats-010.fit' ]

Of course, this also has the disadvantage that we need to write 'path + ' 10 times. This is because an operation like:

a = 'path'
b = [ '1' , '2' , '3' ]

print( a + b )

will not print [ 'path1' , 'path2' , 'path3' ], but will instead raise a TypeError (for example, TypeError: Can't convert 'list' object to str implicitly; the exact wording of the message depends on the Python version).
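To see this failure without stopping the script, the addition can be wrapped in a try/except block. This is only a small sketch using the same a and b as above; the variable failed is added here purely for illustration, and no error text is shown because it varies between Python versions.

```python
a = 'path'
b = ['1', '2', '3']

try:
    result = a + b              # adding a str and a list is not defined
    failed = False
except TypeError as error:
    failed = True               # we end up here: Python refuses the addition
    print('TypeError:', error)
```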

The error occurs because Python does not know how to add a single string (a) to a list (b). The way around this is to create a list on the fly:

path = '/Users/vero/SARA/data'

filenames_temp = [ '/flats/Flats-001.fit',
'/flats/Flats-002.fit',
'/flats/Flats-003.fit',
...
'/flats/Flats-010.fit' ]

filenames = [ path + item for item in filenames_temp ]

Let's break down what is happening in the last line of code above.

'for item in filenames_temp' is an iteration. This means that the variable item will take the value of each element in the variable filenames_temp in turn.

The [ ] means that each time the value of item changes, a new element will be added to the list called filenames.

The 'path + item' part means that for each value of item, the value of path will be added in front of it before the result is stored as a new element in the list.

Here's another example, this time numerical, to illustrate this type of 'on the fly' list construction (known in Python as a list comprehension). Let's say that instead of strings, we have a list that contains numbers:

a = [ 1, 2, 3, 4 ]

Note that this is a list object, not a numpy array, which would be defined as e.g.

import numpy as np
b = np.array([1,2,3,4])

Arithmetic operations on lists do not behave the way you might expect. For example, suppose I would like to multiply each element of the list by 2.

With the numpy array, I can do:

print( b * 2 )
>> [2 4 6 8]

which gives the expected result.

But if you do:

print( a * 2 )
>> [1, 2, 3, 4, 1, 2, 3, 4]

You will obtain a list that contains the initial list twice!

If you would like a list with each element multiplied by 2 (usually one would use numpy arrays, but let's say you don't want to):

print( [ x*2 for x in a ] )
>> [2, 4, 6, 8]

which will return the desired result. Note here that I used x instead of item. Because the variable used for the iteration is created inside the command, I can use any variable name I want. For example, both of these give the same result:
[ i*2 for i in a ]
[ banana*2 for banana in a ]
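To connect this back to the filenames, the 'on the fly' list can be read as shorthand for an explicit loop. The sketch below builds the same list both ways; filenames_temp is shortened to three entries only to keep the example small, and filenames_loop is a name introduced here for illustration.

```python
path = '/Users/vero/SARA/data'
filenames_temp = ['/flats/Flats-001.fit',
                  '/flats/Flats-002.fit',
                  '/flats/Flats-003.fit']

# Explicit loop: start with an empty list and append one element per iteration
filenames_loop = []
for item in filenames_temp:
    filenames_loop.append(path + item)

# Same result, built on the fly in a single line
filenames = [path + item for item in filenames_temp]

print(filenames == filenames_loop)   # True
```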

The slightly more complex but shorter way (if the filenames only differ by a number)

In the last example above, we still had to manually create an array with the filenames, to which we prepended the path. But if the names differ only by, e.g., a number, we can do better by using the format method that can be applied to strings.

Here's an example:

name = 'Vero'
print( 'Hello, my name is {}'.format(name) )
>> Hello, my name is Vero

The format method inserts its parameter (here the content of the variable name) at the position of the {}.

You can see how this could be useful if, instead of a single name, you had a list of names:

names = [ 'Vero', 'Jamie', 'Claude' ]
greetings = [ 'Hello, my name is {}'.format( item ) for item in names ]

The content of greetings is now:
[ 'Hello, my name is Vero', 'Hello, my name is Jamie', 'Hello, my name is Claude']

Thus we could use this to our advantage:

path = '/Users/vero/SARA/data'

number = [ '001', '002', '003', '004', ... '010' ]

filenames = [ '{}/flats/Flats-{}.fit'.format(path, item) for item in number ]

You will notice above that instead of using path+, I simply added a {} and included path into the parameters of the format method.

If the numbers are consecutive (here between 1 and 10, with a +1 increment), we can simplify things further. Instead of entering all of the numbers by hand ( number = [ '001', '002', '003', '004', ... '010' ] ), we can generate them 'on-the-fly' by using the range function.

item in range(1,11) creates an iterable object for which item takes, in turn, each integer value from 1 to 10. Note how the second parameter of the function is one number above the last value we want in our list: the range function keeps going only while item < 11. Because the default increment is 1, the last value is item=10.

You can add a 3rd parameter to the function to create a different increment: item in range(1,11,2) will create an iterator such that item = 1, 3, 5 ...
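A quick way to check what range produces is to turn it into a list. A small sketch (the variable names are chosen only for illustration):

```python
one_to_ten = list(range(1, 11))       # stops before 11
odd_numbers = list(range(1, 11, 2))   # same limits, increment of 2

print(one_to_ten)    # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(odd_numbers)   # [1, 3, 5, 7, 9]
```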

So great. We could do:

filenames = [ '{}/flats/Flats-{}.fit'.format(path, item) for item in range(1,11) ]

The result however is:

[ '/Users/vero/SARA/data/flats/Flats-1.fit',
'/Users/vero/SARA/data/flats/Flats-2.fit',
....

Oops, we are missing the 00 in front of the numbers.

OK, so we could instead do:

filenames = [ '{}/flats/Flats-00{}.fit'.format(path, item) for item in range(1,11) ]

The result will be good for the first 9 files, but the 10th file in the list would look like this:

'/Users/vero/SARA/data/flats/Flats-0010.fit'

How can we tell our code to always have 3 digits in the name, padding to zero as needed?

The trick is to use format codes:

path = '/Users/vero/SARA/data'

filenames = [ '{}/flats/Flats-{:03}.fit'.format(path, item) for item in range(1,11) ]

In the code above, the :03 inside the {} tells Python how to convert a number (the range function yields numbers, not strings) into a string. There are different ways that a number could be formatted for display.

For example if a=10, the computer could display the content of the variable as 10, 10.0, 10.000, 1e1, etc. There is a set of internal rules that guides the computer on which format the user is likely to find most useful (e.g. use scientific notation if the number is very large or small). But in our case, we want the number to be formatted in a strict way, to match our filename.

The : tells the computer that what follows is the format code. 03 tells the computer that we want the string to have at least 3 characters in it, and that any blank space should be filled with zeros. Thus {:03} converts 10 to '010' and converts 1 to '001'.

Just using {:3} would convert 10 to ' 10' (note the blank space).
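The effect of these format codes can be checked directly on a few numbers. A short sketch (the variable names are only for illustration); the last line shows that the width is a minimum, so longer numbers are not cut:

```python
padded_one = '{:03}'.format(1)       # zero-padded to 3 characters
padded_ten = '{:03}'.format(10)
spaced_ten = '{:3}'.format(10)       # padded with a blank space instead
long_number = '{:03}'.format(1234)   # already longer than 3 characters

print(padded_one)          # 001
print(padded_ten)          # 010
print(repr(spaced_ten))    # ' 10'
print(long_number)         # 1234
```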

What if the filenames differ by more than a number?

Let's imagine a case where half of the flats are named Flats-001.fit, Flats-002.fit, etc. But for some reason, the second half of the flats follows a different name pattern: Flats-spec-0006.fit, Flats-spec-0007.fit, etc.

In this case, we could create two separate lists using the method above, and concatenate the two lists together.

path = '/Users/vero/SARA/data'

filenames = [ '{}/flats/Flats-{:03}.fit'.format(path, item) for item in range(1,6) ] + [ '{}/flats/Flats-spec-{:04}.fit'.format(path, item) for item in range(6,11) ]
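Written over several lines, and assuming (as in the description above) that files 1 to 5 follow the first pattern and files 6 to 10 the second, the same idea could look like the sketch below. The intermediate names first_half and second_half are introduced here only for readability.

```python
path = '/Users/vero/SARA/data'

# Files 1 to 5: three-digit numbers
first_half = ['{}/flats/Flats-{:03}.fit'.format(path, item)
              for item in range(1, 6)]

# Files 6 to 10: different prefix and four-digit numbers
second_half = ['{}/flats/Flats-spec-{:04}.fit'.format(path, item)
               for item in range(6, 11)]

# Adding two lists joins them end to end
filenames = first_half + second_half

print(len(filenames))   # 10
print(filenames[4])     # /Users/vero/SARA/data/flats/Flats-005.fit
print(filenames[5])     # /Users/vero/SARA/data/flats/Flats-spec-0006.fit
```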

What if the numbers are not sequential?

Let's imagine that, after looking at the flat files, you find that a few of them are not good (perhaps somebody turned off the flat lamp by mistake for a few of the exposures).

In this case, you could always revert to defining the numbers by hand, like we did previously:

path = '/Users/vero/SARA/data'

numbers = [ 1, 2, 3, 5, 8, 10 ] # using only the good files

filenames = [ '{}/flats/Flats-{:03}.fit'.format(path, item) for item in numbers ]

Note here that I kept the formatting code :03, because I created a list of numbers, instead of a list of strings (which would look like: [ '001', '002', ... ]).
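If there are many good files and only a few bad ones, it may be shorter to list the numbers to skip instead. A minimal sketch, where the name bad_numbers is hypothetical and its content is simply the numbers between 1 and 10 that are missing from the list above:

```python
path = '/Users/vero/SARA/data'

bad_numbers = [4, 6, 7, 9]   # hypothetical: the exposures we want to skip

# Keep every number from 1 to 10 that is not in bad_numbers
numbers = [item for item in range(1, 11) if item not in bad_numbers]

filenames = ['{}/flats/Flats-{:03}.fit'.format(path, item) for item in numbers]

print(numbers)   # [1, 2, 3, 5, 8, 10]
```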
