Bugs happen no matter how hard we try to prevent them. Once our software is "out there" in the wild, we want to know when things go wrong so that we can fix them more easily. Even when users tell us about problems, they usually don't have enough context to let us debug efficiently. For example, it is common to hear from a user that a piece of software "crashed", but did it present an error message first? Did it actually crash, or freeze? Did the entire application crash, or just one piece of it? Users often can't answer these questions, and they certainly can't tell us anything about the internal state of the application when it crashed (they are, after all, users, not developers).
So what can we do to improve this situation? Quite a lot, as it turns out. We will take a brief look at just a couple of strategies we can use to give ourselves more information when things go wrong, and we will work through some examples in class to augment the information found here.
Most people who learn computer programming discover very quickly that "print" statements are a pretty useful way to debug their software. Even professional developers frequently debug using print statements because it can be quicker and easier than more advanced techniques. But print statements don't help us when our software is run on a user's computer or on a server somewhere out there in the "cloud". Print statements also don't tell us when the problem happened, or provide any additional context. They're also kind of annoying, so most people delete them before running the program "for real".
This is where logging comes in.
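For example, here is a minimal sketch using Python's standard logging module (the file name "app.log" and the logger name "myapp" are made up for illustration). Unlike print statements, every message gets a timestamp and a severity level, and the messages end up in a file that sticks around after the program exits.

import logging

# Configure the root logger: write to a file, and stamp each message
# with the time, the level, and the logger name.
logging.basicConfig(
    filename="app.log",              # hypothetical log file name
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)

logger = logging.getLogger("myapp")  # example logger name

logger.info("application started")
try:
    result = 1 / 0
except ZeroDivisionError:
    # exception() logs at ERROR level and includes the traceback automatically.
    logger.exception("calculation failed")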
There are plenty of logs available on your system already, and quite a few different tools you can use to look at them. First, let's look at a tool called "dmesg". You may have been told to use this before if your computer crashed or a piece of hardware failed to operate properly. It can do many things, but one of the most common uses is to view kernel messages, which are usually generated when device drivers have things to say.
[16207.135216] wlp58s0: disconnect from AP d8:c7:c8:f6:8a:98 for new auth to d8:c7:c8:f6:8a:90
[16207.144301] wlp58s0: authenticate with d8:c7:c8:f6:8a:90
[16207.151415] wlp58s0: send auth to d8:c7:c8:f6:8a:90 (try 1/3)
[16207.219174] wlp58s0: authenticated
[16207.220250] wlp58s0: associate with d8:c7:c8:f6:8a:90 (try 1/3)
[16207.247856] wlp58s0: RX ReassocResp from d8:c7:c8:f6:8a:90 (capab=0x431 status=0 aid=1)
[16207.254566] wlp58s0: associated
For example, the messages above occurred when I opened my laptop in the UC Commons and it connected to UM's wireless network. If something had gone wrong, it is likely that a message would have appeared here and given me some hint as to what happened.
But why does this matter for software developers? Well, one of the most common things that might go "wrong" with a piece of software is that it runs into a bug. As developers, we want to know about it when this happens, and we want to know as much information as we can about what was going on at the time so that we can fix our software.
Logs like the ones above can help users (or other people running our software) provide us with the detailed information we need without first becoming experts on how our software works. Logs also persist: even if our software recovers from a crash automatically, the log still records that the crash happened, so we don't miss it.
Let's look at another example. There is a file on many Linux systems called "/var/log/auth.log" that is used to keep track of authentication attempts on the machine. Here's an example of what happens when I try to log into my laptop as the root user (which is not allowed for security reasons).
Nov 29 12:54:18 piketty gdm-password]: pam_unix(gdm-password:auth): authentication failure; logname= uid=0 euid=0 tty=/dev/tty3 ruser= rhost= user=root
Nov 29 12:54:22 piketty gdm-password]: pam_succeed_if(gdm-password:auth): requirement "user != root" not met by user "root"
We can see when the login attempt occurred (Nov 29 at 12:54pm). We can also see some information about where the attempt came from (GDM is the GNOME Display Manager, which handles graphical login sessions). We see what went wrong ("authentication failure") and we see some context about the user account (UID 0 is the root user's ID, and the name is, of course, "root"). Finally, we see some information about why the attempt failed: the system is configured to reject login attempts for the root account.
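If we wanted to watch for this kind of failure programmatically, a rough sketch in Python might look like the following. It assumes the Debian-style path and "authentication failure" message shown above; other systems put these messages elsewhere (such as /var/log/secure) or format them differently.

from pathlib import Path

# Scan auth.log and report any failed authentication attempts.
for line in Path("/var/log/auth.log").read_text().splitlines():
    if "authentication failure" in line:
        timestamp = " ".join(line.split()[:3])   # e.g. "Nov 29 12:54:18"
        print(timestamp, "-", line)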
Logs are great for telling us what happened when things go wrong. But sometimes we might not even realize that something went wrong, or we might not realize it until our users have already been harmed in one way or another. For this reason, it is common for modern software systems to be actively monitored. This can take many, many forms, but the basic idea is that a piece of software actively watches for problems and notifies the developers when they occur.
One way to implement monitoring is to watch an application's logs. It is common for logs to have different "levels". For example, a common set of levels is "info", "warning", "error", and "severe". Each log message then has a level associated with it to aid in filtering and searching. A monitoring system can pay attention to the logs as they are generated and could, for instance, ignore everything other than "severe" messages. Or it could take different actions depending on the level detected. Perhaps it could send an email to the developers when there is an "error" and send an email and text message when there is a "severe" message.
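As a rough illustration, here is how that idea might look with Python's logging module, which calls its most serious level "critical" rather than "severe". The send_email and send_text_message helpers are stand-ins for whatever notification mechanism a real system would use.

import logging

def send_email(message):
    print("EMAIL to developers:", message)        # stand-in for a real email integration

def send_text_message(message):
    print("TEXT to developers:", message)         # stand-in for a real SMS integration

class NotifyHandler(logging.Handler):
    # Decide what to do based on the severity of each log record.
    def emit(self, record):
        if record.levelno >= logging.CRITICAL:    # Python's equivalent of "severe"
            send_email(record.getMessage())
            send_text_message(record.getMessage())
        elif record.levelno >= logging.ERROR:
            send_email(record.getMessage())

logger = logging.getLogger("myapp")               # example logger name
logger.addHandler(NotifyHandler())
logger.propagate = False                          # keep the example output tidy

logger.error("database connection lost")          # email only
logger.critical("data corruption detected")       # email and text message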
Another way to monitor server software is to attempt to talk to it every now and then to see if it is still working. This is sometimes called a "health check". This strategy is very common for web services and can even take on some aspects of functional testing if the response from the server is analyzed for correctness as well.
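A bare-bones health check might look something like this sketch. The URL and the expected "ok" response body are invented for illustration, and a real monitor would notify someone instead of just printing.

import time
import urllib.request
import urllib.error

URL = "https://example.com/health"    # hypothetical health-check endpoint

while True:
    try:
        with urllib.request.urlopen(URL, timeout=5) as response:
            body = response.read().decode()
            # Check not just that the server answered, but that it answered correctly.
            if response.status == 200 and "ok" in body.lower():
                print("service healthy")
            else:
                print("unexpected response:", response.status, body[:80])
    except (urllib.error.URLError, OSError) as err:
        print("health check failed:", err)
    time.sleep(60)   # check once a minute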