I was trying to help my girlfriend figure out where all her disk space had gone when I realized that Windows doesn't provide any tool that really shows where the space is being used up. In particular, there's no way to glance at a directory listing (perhaps for the root of the C:\ drive) and see how much disk space each of the subdirectories consumes. You can query the size of each directory by right-clicking its name and selecting Properties, but there's nothing that gives you an overview.
(Of course there are third-party tools to do this, but that's no fun.)
I decided to see how quickly I could write up code that could answer the question: how much space is each subdirectory using up?
The thought of using Unix tools for this is a bit contrived because... well, because we're analyzing Windows drives here. And of course if you have Cygwin installed (which is how I am doing this) then you have the du command at your disposal. But at any rate, the point is to come up with a challenge and overcome it. The fact that I could do this in a matter of minutes speaks to the versatility of Unix filters.
The first step in all of this is to gather the information about the files on my system. For this I used the good old DIR command that we all know and love from the DOS days. In particular, I got a command line window on Windows XP (Start->Run->CMD), changed to the root directory of Drive C (cd c:\) and ran the following command:
DIR /S > dirlist
The /S switch makes DIR recurse into each subdirectory and print its contents as well as the contents of any of its subdirectories. The whole thing is stored in a file called dirlist, which, when DIR is done, looks like this:
Volume in drive C has no label.
Directory of C:\MATLAB6p5
and so on... You can see here that each directory is printed, then each of its subdirectories is entered and printed, then their subdirectories, and so on. The point to note is that the name of the directory being processed is printed at the top of each section (e.g. Directory of C:\MATLAB6p5\bin) and that the total size of the files in that directory (but not of files in its subdirectories) is printed at the bottom of the section. For example, the one file which lives in the bin directory above is 1.2K, while the 2 files one level up add up to 48K. The reason the file count for C:\ says 4 but you only see 2 is that I truncated that section ;)
From this point on, we're going to switch to Unix tools. I did that by running the subsequent commands in Cygwin. Another alternative would be to transfer the dirlist file to a Unix machine. Yet another approach would be to download a Windows version of the Unix tools.
The first thing I do is realize that the only two pieces of data that really matter from the above output are the name of the directory and the size of the files in that directory. For example, with respect to the last section above, I only care that C:\MATLAB6p5\bin takes up 1,230 bytes. How do I cut down on my output so it only contains that data?
The thing I noticed is that both the line containing the directory name (" Directory of C:\MATLAB6p5\bin") and the line which reports the size of the directory's contents begin with a space, while the lines between them do not (they begin with a date). To get rid of all the lines which do not begin with a space, I ran the following command:
grep "^ " dirlist > pruned1
The above command scans the file dirlist and keeps only the lines which match the regular expression "^ ". ^ anchors the match to the beginning of the line, so "^ " means "lines whose first character is a space." We store the result in a file called pruned1.
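To see the filter in isolation, here is a minimal sketch that runs the same grep on a tiny made-up excerpt. The file names sample_dirlist and sample_pruned1, the directory name, and the sizes are all invented for illustration; only the layout (directory and summary lines start with a space, file lines start with a date) follows the description above.

```shell
# Hypothetical excerpt in the DIR layout; names and sizes are invented.
cat > sample_dirlist <<'EOF'
 Directory of C:\example

01/01/2004  12:00 PM             1,230 notes.txt
               1 File(s)          1,230 bytes
EOF

# Keep only the lines whose first character is a space.
# The dated file line and the blank line are both dropped.
grep "^ " sample_dirlist > sample_pruned1
cat sample_pruned1
```

Only the " Directory of" line and the "File(s)" line should survive.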
On close inspection of pruned1, we can now see that the original output had a few lines of header above and footer below the listing, the footer giving some grand totals. For our purposes we don't need those, so I just used a text editor to cut them away, such that the very first line of pruned1 is now " Directory of C:\" and the last line gives the file count and total size for the very last directory in the output. In general, pruned1 looks like this (taking a random section of the file as an example):
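If you would rather not open a text editor, the trimming could probably be scripted with sed. This is only a sketch under assumptions: it assumes the footer starts with a line containing "Total Files Listed:" (which is what an English-language DIR /S prints as far as I know), and the sample file contents below are invented.

```shell
# Hypothetical pruned1 with header and footer still attached; all values invented.
cat > sample_pruned1 <<'EOF'
 Volume in drive C has no label.
 Volume Serial Number is 0000-0000
 Directory of C:\
               2 File(s)         48,000 bytes
 Directory of C:\example
               1 File(s)          1,230 bytes
     Total Files Listed:
               3 File(s)         49,230 bytes
               2 Dir(s)   1,000,000,000 bytes free
EOF

# First sed: print from the first " Directory of" line to the end (drops the header).
# Second sed: delete from the assumed footer marker to the end (drops the totals).
sed -n '/^ Directory of /,$p' sample_pruned1 |
    sed '/Total Files Listed:/,$d' > sample_trimmed
cat sample_trimmed
```

On a localized Windows the marker text would differ, so check the tail of your own listing before relying on this.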
Directory of C:\cygwin\usr\doc\ctags-5.5
We want to make the output even more concise, for which we will use the 'paste' command later. First, I want to make a file containing just the directory names, and one containing just the sizes.
fgrep Directory pruned1 | cut -c15- > pruned_dir
This does three things: first, fgrep finds each line in the file pruned1 which contains the "Directory" heading (which happens to be every other line). Then those lines are piped to the cut command, where the -c15- parameter means "print each line starting at character 15". I use 15 because that's where each directory name actually begins (i.e. C:\...). So the effect of the cut command is to convert lines like " Directory of C:\cygwin\usr\i686-pc-mingw32" into lines like "C:\cygwin\usr\i686-pc-mingw32". Finally, the output is stored in a file called pruned_dir.
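A quick way to convince yourself that 15 is the right offset is to feed the example line through cut by hand. The sketch below also shows a sed one-liner that strips the label by its text instead of by position; that variant is my own alternative, not part of the original pipeline, and the output file names are made up.

```shell
line=' Directory of C:\cygwin\usr\i686-pc-mingw32'

# Position-based: keep everything from character 15 onward.
printf '%s\n' "$line" | cut -c15- > sample_dir_cut

# Text-based alternative: print only lines where the label was found and removed.
printf '%s\n' "$line" | sed -n 's/^ Directory of //p' > sample_dir_sed

cat sample_dir_cut sample_dir_sed
```

Both files should contain the bare path. The sed form does the job of fgrep and cut in one step, at the cost of hard-coding the label text.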
Now we do something similar but more complex to the other half of the lines in pruned1:
fgrep bytes pruned1 | cut -c25-40 | tr -d ', ' > pruned_size
The first part (the fgrep) finds all the lines which contain the size description of the directory contents, which should be all of the lines which our previous fgrep skipped. Then it pipes them to cut, which uses the fixed-width nature of the file size printout to isolate the part of the line where the sizes are. Finally the tr command with the -d switch (notice the argument in single quotes is a comma AND a space) deletes all the commas and spaces from the line. This changes lines like "3 File(s) 1,102 bytes" into lines like "1102". The output is redirected to a file called pruned_size.
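The column range 25-40 depends on the exact widths DIR uses, so if it does not line up on your system, a field-based variant may be easier to adapt. The following is a sketch of that alternative using awk rather than the cut and tr pair above; the input file and its values are invented, and it assumes the summary line has the shape "count File(s) size bytes".

```shell
# Hypothetical input; the directory name and numbers are invented.
cat > sample_pruned1 <<'EOF'
 Directory of C:\example
               3 File(s)          1,102 bytes
EOF

# On summary lines the second whitespace-separated field is "File(s)"
# and the third is the size; strip the commas from it and print it.
awk '$2 == "File(s)" { gsub(/,/, "", $3); print $3 }' sample_pruned1 > sample_size
cat sample_size
```

The result is the same bare number the cut and tr pipeline produces, without counting columns.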
So to recap where we are: we have one file containing the cleaned-up directory names and one file containing the cleaned-up content sizes. The two files contain the exact same number of lines (say n lines) and the Xth line in the sizes file pertains to the Xth line in the directory file. Now we want to merge the two in a nice way, which is precisely the purpose of the paste command:
paste pruned_size pruned_dir > merged
As hinted above, paste takes a line from one file and the corresponding line from the second file, and prints them on one line, separated by a tab. The output is saved in a file called merged, and looks like this:
So each line of the output contains the size and the name of one directory (note that the size is still only that of the files directly in that directory).
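Since paste simply pairs lines by position, it is worth checking that both files have the same number of lines before merging. Here is a small sketch with made-up data; the sample_ file names and the two directories are hypothetical.

```shell
# Hypothetical inputs: two sizes and the two directories they belong to.
printf '%s\n' 48000 1230 > sample_size
printf '%s\n' 'C:\' 'C:\example' > sample_dir

# paste pairs lines by position, so refuse to merge if the counts differ.
if [ "$(wc -l < sample_size)" -ne "$(wc -l < sample_dir)" ]; then
    echo "line counts differ, not merging" >&2
    exit 1
fi

# Join line X of the sizes with line X of the names, separated by a tab.
paste sample_size sample_dir > sample_merged
cat sample_merged
```

Each output line is a size, a tab, and the matching directory name.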
We can do some interesting things here:
sort -n merged | tail
sort -n sorts the contents of the file merged numerically. That means it outputs the directories in ascending order of size. tail prints out the last few lines (10 by default). So the command above prints out the few biggest directories on your machine.
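As a small illustration, the sketch below runs the same idea on four invented entries and also shows the equivalent "largest first" form with sort -rn and head. The file names and data are hypothetical.

```shell
# Hypothetical merged file: size, tab, directory name.
printf '%s\t%s\n' \
    1230 'C:\example\bin' \
    0 'C:\example\empty' \
    48000 'C:\example' \
    512 'C:\tmp' > sample_merged

# Ascending numeric sort, then keep the last two lines (the two biggest).
sort -n sample_merged | tail -n 2 > sample_biggest

# Same two entries, but with the largest first.
sort -rn sample_merged | head -n 2 > sample_biggest_desc

cat sample_biggest sample_biggest_desc
```

Which form you prefer is a matter of taste: tail leaves the biggest entry at the bottom of the screen, head puts it at the top.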
grep '^0' merged | wc -l
grep -v '^0' merged | wc -l
The first command finds all the lines which begin with a zero (i.e. directories of zero size); its second part, wc -l, returns the number of lines produced by that operation.
The second command is the same except that -v causes grep to output the lines which do NOT begin with 0, so it prints out the number of non-zero-size directories. On my machine there are 2042 zero-size directories and 7401 non-zero-size ones. Of course the zero-size directories could contain other directories which actually do have files.
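Because each size only covers the files directly inside a directory, one possible next step is to roll the sizes up to the top-level directories, which gets closer to the original question. The following awk sketch is my own extension, not something from the steps above. It assumes every path is on a single drive with a three-character root such as C:\, and the sample data is invented.

```shell
# Hypothetical merged file: size, tab, directory name.
printf '%s\t%s\n' \
    100 'C:\' \
    0 'C:\example' \
    1230 'C:\example\bin' \
    500 'C:\example\bin\tools' \
    512 'C:\tmp' > sample_merged

awk -F '\t' '
{
    path = $2
    rest = substr(path, 4)        # text after the three-character drive root
    i = index(rest, "\\")         # position of the next backslash, 0 if none
    # Cut the path back to its first component below the root, if it is deeper.
    top = (i > 0) ? substr(path, 1, i + 2) : path
    total[top] += $1
}
END {
    # %.0f avoids integer overflow in some awk versions for totals above 2 GB.
    for (d in total) printf "%.0f\t%s\n", total[d], d
}' sample_merged | sort -n > sample_rollup
cat sample_rollup
```

Here C:\example ends up with 1730 bytes (0 + 1230 + 500), even though it holds no files of its own. Files sitting directly in the root are reported under the root itself.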
This page quickly demonstrates how to parse a Windows directory listing using Unix commands in order to extract some useful data. While in my case there was a reason to do it this way, in most cases there are command-line and graphical tools that do this, and do it much better. The main value of this page is to someone trying to learn the use of Unix tools: it provides a 'live' example which is neither too trivial nor so obscure that it can't be made sense of.