Now that we know, from a system programmer's point of view, the basics of process and I/O manipulation in UNIX, we first need to take a step backward and see how the system manages file descriptors, and what exactly is in the file table that the descriptors index. From there we'll be able to understand some useful file descriptor shuffling games that are commonplace in UNIX programming, and then take a step forward to deal with particular file descriptors, the pipes, that make processes "talk" with each other. This will also shed light on an issue that we have deliberately overlooked so far, namely the reason for separating the process creation step from the execution: after all, if the main reason for forking processes is to execute programs in them, what's the point of having two system calls, fork() and exec(), and not just one that does the whole job, as is the case in M$-DOG?
The UNIX kernel uses a three-table data structure to manage open files. This apparently complicated structure is actually very flexible and elegant, and allows open files to be shared among processes in a simple way.
Figure 6. Kernel data structures for open files.
With reference to Figure 6, we can see that each process entry in the process table contains a table of open file descriptors. The file descriptors index the entries in this table, and each entry contains a pointer to an entry in a table that the kernel maintains for all open files. Each entry in this open file table contains the file status flags (read, write, append, etc.), the current file offset, and a pointer to the entry for this file in the so-called v-node table. This table (or part of it) is stored on the physical device; we won't concern ourselves with it for now: you can think of its entries as the "real" files on disk, with associated information like file location on disk, size, name, owner, etc.
Now, what if two processes want to independently open the same physical file, i.e. with each process maintaining its own access mode (read, write or append) and its own file offset as well? Easy: move the lower right arrow to point at the upper right box, and you get the situation depicted in Figure 7: here the open file table has two independent entries for the same file, each associated with one of the processes.
Figure 7. Two independent processes with the same file open.
Now, what happens with this arrangement when I/O operations are performed?
After each write() is complete, the offset in the file table entry is incremented by the number of bytes written. If this causes the file size to increase, this information is updated in the v-node table entry.
If a file is opened with O_APPEND, a flag is set in the file status flags of its file table entry. Each time a write() is performed, the offset in the file table entry for this file is first set to the current file size. This forces the data to be appended at the end of the file.
A call to lseek() only reads or modifies the offset in the file table entry: no I/O takes place.
Every read() is performed at the current offset.
Now it's easy to see that this scenario, though useful, is the source of various kinds of trouble collectively referred to as race conditions, which arise precisely because different processes can access the same file concurrently.
Consider, for example, the following situation. Both processes in Figure 7 have opened the file "bozo" in O_WRONLY mode; process #13 decides to append some data at the end, so it first moves to the end using lseek(), and ... right before it can write() it's suspended, and process #22 sneaks in and wants to do the same thing. So it moves to the end too, then writes and goes off. Now the real end of the file has moved. Process #13 wakes up and resumes writing at what it thinks is still the end, actually overwriting #22's data.
Again, as with the race condition that we previously observed with creat(), the problem lies in the use of two system calls and in the possibility of concurrent access between them. The solution is to use a single call and to make the append operation atomic, i.e. indivisible with respect to process switches. This is the reason for having the O_APPEND mode for open(). If this mode is set, the system guarantees that every write() moves to the end of the file and writes its data as one atomic operation.
Now, what if we need two file descriptors in one process open on the same file? Easy again: go back to Figure 6 and move the lower left arrow to the upper center box, and you get the situation depicted in Figure 8.
Figure 8. Duplication of file descriptors.
Fine, this means that the kernel data structure can support this arrangement, but why would you want to do this? We'll see a sound reason in a moment. Let's first see how we achieve this result. Note that something like
...
fd1=open("samefile", O_RDONLY);
fd2=open("samefile", O_RDONLY);
...
doesn't work, since open() always creates a new entry in the open file table, and so you'd end up with, for example, two independent offsets. What we want instead is a way to "duplicate" one open file descriptor, getting a new one that refers to the same entry in the file table. The dup() and dup2() system calls do exactly this job. They are declared as:
int dup(int fildes);
int dup2(int fildes, int fildes2);
The former duplicates the passed file descriptor, returning the lowest-numbered descriptor available. The latter allows you to specify which file descriptor you want the copy in, since it duplicates fildes into fildes2, returning it as well. If fildes2 happens to be already open, dup2() closes it before duplication, unless it's equal to fildes: in this case no duplication occurs, and the file remains open. Both functions, needless to say, return -1 in case of trouble.
Now, back to the reasons for bothering with this xeroxing of descriptors. Suppose that you have a file open for reading and writing, that you have already written some data in it, and that you now want to feed it, via fork-exec, to one of those little "filter" commands of UNIX, like sed or awk, that read a stream of characters from their standard input (stdin, for short), do something useful with it, and write the result on their standard output. If your file were the stdin there would be no problem, since a child inherits it, but that's obviously not the case (you can't write on stdin). Then how do you let the child have the file in its stdin?
One solution is to close stdin first, then re-open the file immediately after. This makes use of a property of open(): it always returns the lowest-numbered unopened file descriptor, so if you open a file immediately after closing stdin, you get stdin's file descriptor back. The resulting code would look like the following:
...
datafd=open("datafile", O_RDWR|O_CREAT, 0644);
...
/* Some data are written in the file */
...
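close(STDIN_FILENO);
open("datafile", O_RDONLY); /* Now STDIN_FILENO -> "datafile" */
if (fork()==0)
{
    execlp("sed", "sed", (char *)0);
    perror("sed"); /* reached only if exec failed */
    _exit(127);    /* don't let the child fall through into the parent's code */
}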
...
However, this prevents the parent from using its stdin any further, so it's probably better to do the close-and-reopen trick in the child, since it inherits the data file as well. Hence a refined solution would be:
...
datafd=open("datafile", O_RDWR|O_CREAT, 0644);
...
/* Some data are written in the file */
...
if (fork()==0)
{
    close(STDIN_FILENO); /* Only child's stdin is closed */
    open("datafile", O_RDONLY);
    execlp("sed", "sed", (char *)0);
    perror("sed"); /* reached only if exec failed */
    _exit(127);    /* don't let the child fall through into the parent's code */
}
This is the first example of the fact that separating process creation from execution (i.e. fork from exec) is not a bad idea after all: it allows us to rearrange file descriptors in the child differently from the parent's before executing. We'll see that this feature is a great asset in lots of cases.
Fine, but what if the child doesn't know the name of the file? What is it supposed to open in its stdin then? Suppose, for instance, that the parent of the above example doesn't open "datafile" itself, but just inherits its descriptor from the parent's parent. The child could go through the pain of poking in the kernel tables, looking for the blessed name, but it's much easier to duplicate the descriptor as follows:
...
if (fork()==0)
{
    close(STDIN_FILENO);
    dup(datafd); /* datafd duplicated in stdin */
    execlp("sed", "sed", (char *)0);
    perror("sed"); /* reached only if exec failed */
    _exit(127);    /* don't let the child fall through into the parent's code */
}
Fine again, except that there are two possible flaws in this scheme. The first is that before closing stdin we should, as a defensive measure, make sure that datafd does not happen to be the same as stdin, otherwise after the close it's gone for good, and the subsequent dup() fails. The second flaw is subtle but, by Murphy's Law, bound to happen sometime. The close-and-duplicate sequence above is not atomic. A signal handler, a peculiar UNIX dwarf that we'll meet later, may then show up between close() and dup(), open a file for the child and go away. Then dup() finds stdin already occupied, returns another file descriptor, and poor exec-ed sed ends up in a royal mess.
The use of dup2() solves both problems: it takes care of the closing, does nothing if its arguments are the same, and its closing-duplicating sequence is guaranteed to be atomic: a real piece of cake. So, a better solution is:
...
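if (fork()==0)
{
    dup2(datafd, STDIN_FILENO);
    execlp("sed", "sed", (char *)0);
    perror("sed"); /* reached only if exec failed */
    _exit(127);    /* don't let the child fall through into the parent's code */
}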
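Note that these "duplicate and go" approaches assume that the mode the file is open in is the right one, since it remains unchanged through a duplication. The same holds for the file offset, which the duplicate shares with the original: if the parent has just written the data, the offset sits at the end of the file, and the parent must lseek() back to the beginning before the child starts reading.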