Using Unix Tools to Find Biggest Subdirectories on your Windows Hard Drive
 

I was trying to help my girlfriend figure out where all her disk space had gone when I realized that Windows doesn't provide any built-in tool that shows where the space is being used. In particular, there's no way to glance at a directory listing (perhaps for the root of the C:\ drive) and see how much disk space each of the subdirectories consumes. You can query the size of each directory by right-clicking its name and selecting Properties, but nothing gives you an overview.

(Of course there are third-party tools that do this, but that's no fun.)

I decided to see how quickly I could write up code that could answer the question: how much space is each subdirectory using up?

The thought of using Unix tools for this is a bit contrived because... well, because we're analyzing Windows drives here. And of course if you have Cygwin installed (which is how I am doing this) then you have the du command at your disposal. But at any rate, the point is to come up with a challenge and overcome it. The fact that I could do this in a matter of minutes speaks to the versatility of Unix filters.
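
For comparison, here is roughly what the du shortcut looks like. This is only a sketch: it builds a small throwaway tree (the names small, big and data.bin are made up for illustration) instead of touching a real drive. Under Cygwin the C: drive would normally be reachable as /cygdrive/c.

```shell
# Stand-in tree for the demo; on Cygwin the real target would be
# something like /cygdrive/c (assumption, adjust to your setup).
root=$(mktemp -d)
mkdir -p "$root/small" "$root/big/nested"
printf 'x' > "$root/small/note.txt"
head -c 300000 /dev/urandom > "$root/big/nested/data.bin"

# -s prints one total per argument, -k reports sizes in kilobytes.
# The trailing */ makes the glob match directories only.
biggest=$(du -sk "$root"/*/ | sort -n | tail -n 1)
printf '%s\n' "$biggest"
```

Unlike the DIR-based approach in the rest of this page, du already includes the contents of nested subdirectories in each total.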

The first step in all of this is to gather the information about the files on my system. For this I used the good old DIR command that we all know and love from the DOS days. In particular, I got a command line window on Windows XP (Start->Run->CMD), changed to the root directory of Drive C (cd c:\) and ran the following command:

DIR /S > dirlist

The /S switch makes DIR recurse into each subdirectory and print its contents, along with those of any of its subdirectories. The whole listing is stored in a file called dirlist, which, when DIR is done, looks like this:

 Volume in drive C has no label.
 Volume Serial Number is ACDC-5495

 Directory of C:\

06/27/2002  09:28 PM                 0 CONFIG.SYS
09/10/2006  10:51 AM                81 CTX.DAT
04/22/2006  04:44 PM    <DIR>          cygwin
07/22/2006  02:03 PM    <DIR>          MATLAB6p5
10/16/2006  08:52 PM    <DIR>          Program Files
10/15/2006  08:56 PM    <DIR>          WINDOWS
               4 File(s)             81 bytes

 Directory of C:\MATLAB6p5

07/22/2006  02:03 PM    <DIR>          .
07/22/2006  02:03 PM    <DIR>          ..
07/22/2006  01:44 PM    <DIR>          bin
07/22/2006  01:44 PM    <DIR>          demos
07/22/2006  01:44 PM    <DIR>          extern
07/22/2006  01:49 PM    <DIR>          help
07/22/2006  01:44 PM    <DIR>          ja
07/22/2006  01:44 PM    <DIR>          java
06/20/2002  08:03 PM            47,565 license.txt
07/22/2006  01:50 PM               625 MATLAB 6.5.lnk
07/22/2006  01:44 PM    <DIR>          notebook
07/22/2006  01:48 PM    <DIR>          rtw
07/22/2006  01:45 PM    <DIR>          simulink
07/22/2006  01:45 PM    <DIR>          stateflow
07/22/2006  01:47 PM    <DIR>          sys
07/22/2006  01:50 PM    <DIR>          toolbox
07/22/2006  01:50 PM    <DIR>          uninstall
07/22/2006  01:47 PM    <DIR>          webserver
08/11/2006  05:36 PM    <DIR>          work
               2 File(s)         48,190 bytes

 Directory of C:\MATLAB6p5\bin

07/22/2006  01:44 PM    <DIR>          .
07/22/2006  01:44 PM    <DIR>          ..
04/02/2002  08:29 PM             1,230 matlab.bat
08/08/2006  08:14 PM    <DIR>          win32
               1 File(s)          1,230 bytes

 

...and so on. You can see that each directory is printed, then each of its subdirectories is entered and printed, and so on down the tree. The point to note is that the name of the directory being processed is printed at the top of each section (e.g. " Directory of C:\MATLAB6p5\bin") and that the total size of the files in that directory (but not of files in its subdirectories) is printed at the bottom of the section. For example, the one file that lives in the bin directory above is 1.2K, while the 2 files one level up add up to 48K. The file count for C:\ says 4 but you only see 2 because I truncated that section ;)

From this point on, we're going to switch to Unix tools. I did that by running the subsequent commands in Cygwin. Another alternative would be to transfer the dirlist file to a Unix machine. Yet another approach would be to download a Windows version of the Unix tools. 

The first thing I do is realize that the only two pieces of data that really matter from the above output are the name of the directory and the size of the files in that directory. For example, with respect to the last section above, I only care that C:\MATLAB6p5\bin takes up 1,230 bytes. How do I cut down on my output so it only contains that data?

The thing I noticed is that both the line containing the directory name (" Directory of C:\MATLAB6p5\bin") and the line that reports the size of the directory's contents begin with a space, while the lines between them do not (they begin with a date). To get rid of all the lines that do not begin with a space, I ran the following command:

grep "^ " dirlist > pruned1

The above command scans the file dirlist and keeps only the lines that match the regular expression "^ ". ^ anchors the match to the beginning of the line, so "^ " means "lines whose first character is a space." Blank lines are dropped as well, since they have no first character to match. We store the result in a file called pruned1.
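
To see the filter in isolation, here is a small sketch that runs it on a hand-made stand-in for dirlist. The temporary directory and the variable name workdir are assumptions for the demo, not part of my original session.

```shell
# Build a tiny stand-in for dirlist (hand-made sample, not real output).
workdir=$(mktemp -d)
cat > "$workdir/dirlist" <<'EOF'
 Directory of C:\MATLAB6p5\bin

04/02/2002  08:29 PM             1,230 matlab.bat
08/08/2006  08:14 PM    <DIR>          win32
               1 File(s)          1,230 bytes
EOF

# Keep only the lines whose first character is a space.
grep "^ " "$workdir/dirlist" > "$workdir/pruned1"
cat "$workdir/pruned1"
```

Only the " Directory of" line and the "File(s)" line should survive; the blank line and the two dated entries are filtered out.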

On close inspection of pruned1, we can see that the original output had a few lines of header above and a footer below the entire listing, which gave some grand totals. For our purposes we don't need those, so I just used a text editor to cut those lines away, such that the very first line of pruned1 is now " Directory of C:\" and the last line gives the file count and total size for the very last directory in the output. In general, pruned1 looks like this (taking a random section of the file as an example):
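
The hand-editing step could probably be scripted too. The sketch below assumes the footer starts with a line containing "Total Files Listed:" (check your own dirlist, since the wording may differ by Windows version and locale) and that only the "Directory of" and "File(s)" lines are wanted. The sample input is made up for the demo.

```shell
workdir=$(mktemp -d)
cat > "$workdir/dirlist" <<'EOF'
 Volume in drive C has no label.
 Volume Serial Number is ACDC-5495

 Directory of C:\MATLAB6p5\bin

04/02/2002  08:29 PM             1,230 matlab.bat
               1 File(s)          1,230 bytes

     Total Files Listed:
               1 File(s)          1,230 bytes
               2 Dir(s)  10,000,000,000 bytes free
EOF

# Drop everything from the assumed footer marker to the end of the file,
# then keep only the "Directory of" and per-directory "File(s)" lines.
sed '/Total Files Listed:/,$d' "$workdir/dirlist" \
  | grep -E '^ Directory of |^ +[0-9]+ File\(s\)' > "$workdir/pruned1"
cat "$workdir/pruned1"
```

Deleting the footer first matters here, because its own "File(s)" total has the same shape as the per-directory lines and would otherwise slip through the grep.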

 Directory of C:\cygwin\usr\doc\ctags-5.5
               8 File(s)        159,140 bytes
 Directory of C:\cygwin\usr\doc\Cygwin
               2 File(s)          7,079 bytes
 Directory of C:\cygwin\usr\i686-pc-cygwin
               0 File(s)              0 bytes
 Directory of C:\cygwin\usr\i686-pc-cygwin\bi
               0 File(s)              0 bytes
 Directory of C:\cygwin\usr\i686-pc-mingw32
               3 File(s)          1,102 bytes

We want to make the output even more concise, for which we will use the 'paste' command later. First, I want to make a file containing just the directory names, and one containing just the sizes.

fgrep Directory pruned1 | cut -c15- > pruned_dir

This does three things. First, it finds each line that contains the "Directory" heading (which happens to be every other line) in the file pruned1. Then it pipes those lines to the cut command, which takes the -c15- parameter to mean "print each line starting at character 15". I use 15 because that's where each directory name actually begins (i.e. C:\...). So the effect of the cut command is to convert lines like " Directory of C:\cygwin\usr\i686-pc-mingw32" into lines like "C:\cygwin\usr\i686-pc-mingw32". Finally, the output is stored in a file called pruned_dir.
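
A quick way to convince yourself that 15 is the right column is to run cut on a single line. This is just a sketch; the variable names line and dir_name are made up for the demo.

```shell
# One heading line as it appears in pruned1 (note the leading space).
line=' Directory of C:\cygwin\usr\i686-pc-mingw32'

# " Directory of " is 14 characters long, so the path starts at column 15.
dir_name=$(printf '%s\n' "$line" | cut -c15-)
printf '%s\n' "$dir_name"
```

If dirlist still carries Windows line endings, each name will end in an invisible carriage return; adding tr -d '\r' to the pipeline should strip it.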

Now we do something similar but more complex to the other half of the lines in pruned1:

fgrep bytes pruned1 | cut -c25-40  | tr -d ', ' > pruned_size

The first part (the fgrep) finds all the lines that contain the size description of the directory contents, which should be all of the lines that our previous fgrep skipped. Then it pipes them to cut, which uses the fixed-width nature of the file size printout to isolate the part of the line where the size is. Finally, the tr command with the -d switch (notice that the argument in single quotes is a comma AND a space) deletes all the commas and spaces from the line. This changes lines like "3 File(s)          1,102 bytes" into lines like "1102". The output is redirected to a file called pruned_size.
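
Here is the same extraction on a single line, as a sketch. The printf call only rebuilds the column layout seen in the sample above (count right-aligned to column 16, size ending at column 39). The awk variant is an alternative I'm adding for comparison, not what I originally ran; it picks the third whitespace-separated field and so does not depend on exact column positions.

```shell
# Rebuild a "File(s)" line with the same column layout as the sample.
line=$(printf '%16s File(s) %14s bytes' 3 '1,102')

# Fixed-width approach: columns 25-40 hold the size, padded with spaces.
size=$(printf '%s\n' "$line" | cut -c25-40 | tr -d ', ')

# Field-based alternative: the size is the third field on the line.
size_awk=$(printf '%s\n' "$line" | awk '{print $3}' | tr -d ',')

printf '%s %s\n' "$size" "$size_awk"
```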

So to recap where we are: we have one file containing the cleaned-up directory names and one file containing the cleaned-up content sizes. The two files contain the exact same number of lines (say n lines), and the Xth line in the sizes file pertains to the Xth line in the directory file. Now we want to merge the two in a nice way, which is precisely the purpose of the paste command:

paste pruned_size pruned_dir > merged

As hinted above, paste takes a line from the first file and the corresponding line from the second file, and prints them on one line separated by a tab. The output is saved in a file called merged, and looks like this:

0       C:\MATLAB6p5\java
0       C:\MATLAB6p5\java\extern
7593    C:\MATLAB6p5\java\extern\EmacsLink
280541  C:\MATLAB6p5\java\extern\EmacsLink\lisp
4729019 C:\MATLAB6p5\java\jar
3043139 C:\MATLAB6p5\java\jar\toolbox
4600573 C:\MATLAB6p5\java\jarext
56824   C:\MATLAB6p5\java\jarext\commapi

So: each line in the output contains the size and name of one directory (note that the size still covers only the files directly in that directory).
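
Since paste matches lines purely by position, it is worth checking that the two files really have the same number of lines before merging. A minimal sketch, using two made-up input files in a temporary directory:

```shell
workdir=$(mktemp -d)
printf '%s\n' 7593 280541 > "$workdir/pruned_size"
printf '%s\n' 'C:\MATLAB6p5\java\extern\EmacsLink' \
              'C:\MATLAB6p5\java\extern\EmacsLink\lisp' > "$workdir/pruned_dir"

# Count lines in each file (tr strips the padding some wc versions add).
size_lines=$(wc -l < "$workdir/pruned_size" | tr -d ' ')
dir_lines=$(wc -l < "$workdir/pruned_dir" | tr -d ' ')

if [ "$size_lines" -eq "$dir_lines" ]; then
  # Line N of the sizes file is joined, by a tab, to line N of the names file.
  paste "$workdir/pruned_size" "$workdir/pruned_dir" > "$workdir/merged"
  cat "$workdir/merged"
else
  echo "line counts differ: $size_lines sizes vs $dir_lines directories" >&2
fi
```

A mismatch usually means one of the earlier filters caught an extra line, for example a leftover header or footer line.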

We can do some interesting things here:

sort -n merged | tail

sort -n will sort the contents of the file merged numerically. That means it will output the directories in ascending order of file size. tail will print out the last few lines (ten by default). So the command above prints out the few biggest directories on your machine.
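
The -n matters: a plain sort compares the lines as text, so "56824" would land after "4729019". The sketch below shows a biggest-first variant on a few made-up lines in the same layout as merged; sort -rn with head is simply the mirror image of sort -n with tail.

```shell
workdir=$(mktemp -d)
printf '%s\t%s\n' \
  4729019 'C:\MATLAB6p5\java\jar' \
  56824   'C:\MATLAB6p5\java\jarext\commapi' \
  4600573 'C:\MATLAB6p5\java\jarext' \
  0       'C:\MATLAB6p5\java' > "$workdir/merged"

# Reverse numeric sort puts the biggest directory first; keep the top three.
top3=$(sort -rn "$workdir/merged" | head -n 3)
printf '%s\n' "$top3"
```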

grep '^0' merged | wc -l

grep -v '^0' merged | wc -l

The first line will find all the lines which begin with zero (ie have zero size) and the second part, wc -l, will return the number of lines produced by that operation.

The second line is the same except that the -v causes grep to output lines which do NOT begin with 0. So that line prints out the number of non-zero-size directories. On my machine there are 2042 zero-size directories and 7401 non-zero-size ones. Of course, the zero-size directories could contain other directories which actually do have files.
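
One way to account for nested directories is to roll every size up into its top-level directory, which gets closer to the original question. The following is only a sketch of that idea, not something from my original session: it assumes merged is tab-separated as produced by paste, converts backslashes to forward slashes so awk can split on them easily, and uses made-up sample data.

```shell
workdir=$(mktemp -d)
printf '%s\t%s\n' \
  48190  'C:\MATLAB6p5' \
  1230   'C:\MATLAB6p5\bin' \
  159140 'C:\cygwin\usr\doc\ctags-5.5' \
  7079   'C:\cygwin\usr\doc\Cygwin' > "$workdir/merged"

# Turn "\" into "/", then add each size to the total of the first two
# path components (drive plus top-level directory).
totals=$(tr '\\' '/' < "$workdir/merged" \
  | awk -F'\t' '{
      split($2, parts, "/")
      top = parts[1] "/" parts[2]
      total[top] += $1
    }
    END { for (d in total) printf "%.0f\t%s\n", total[d], d }' \
  | sort -n)
printf '%s\n' "$totals"
```

With this sample, C:/MATLAB6p5 should come out at 49420 bytes and C:/cygwin at 166219 bytes. Files sitting directly in C:\ would be grouped under "C:/".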

 (tbc)




This page quickly demonstrates how to parse a Windows directory listing using Unix commands in order to extract some useful data. While in my case there was a reason to do it this way, in most cases there are command-line and graphical tools that do this, and do it much better. Instead, the main value of this is to someone trying to learn the use of Unix tools: it provides a 'live' example which is neither too trivial nor so obscure that it can't be made sense of.