Bugs and debugging
Comments and questions to John Rowe.
In this course, and in the exercises, we have emphasised the
importance of writing a little code at a time and checking it thoroughly before moving on to the next
piece.
Work in stages, writing a little
code at a time and checking it thoroughly
before moving on to the next piece.
Just because we hope our code is bug-free
doesn't mean that it really is!
In practice, however, it's very easy to slip into doing
the minimum amount necessary for us to be able to tell ourselves
that it's working properly, rather than doing everything we can
to find any bugs. This is usually due to a combination of lack
of time, laziness and wishful thinking.
The programs that take a lot of
time to get working are the ones where we have let a bug slip
through and have to find it later, not the ones where we have
spent time checking for bugs as we go.
In countries where scorpions and poisonous spiders are common,
we are advised to carefully shake out shoes and other clothes in
the morning before putting them on, particularly when camping or
living out in the country.
Warm up discussion
- If you were living in fairly rough accommodation in an area
with large numbers of poisonous spiders and scorpions, how
careful would you be shaking out your shoes, underwear etc.
before putting them on in the morning? Why?
- If somebody brought you such an item to put on but were
rather vague about whether they had checked for scorpions
("Yeah, well, I think so..."), how would you feel about putting
that item on?
Remember
It may be unpleasant to shake out a piece of clothing we were
about to put on next to our skin and to see a scorpion fall out,
but it's nothing compared to what would have happened had we not
bothered!
In the case of physical poisonous bugs, the psychology
described above is fairly clear. But for logical bugs
in our code the psychology tends to be reversed: we can tend not
to carefully check for bugs in the hope that if we don't look
for them somehow they won't be there.
Nobody would make that mistake when checking for a poisonous
spider in their underwear so don't make it when checking for
bugs in your code!
Debug thoroughly as you go
It's not the scorpion we shake out of our underwear that we
need to worry about, but the one we don't
shake out of our underwear.
It's not the bug we find as we go
along that takes the time, but the one we don't
find as we go along.
Put checks into your program
Check every class of possibility
It's seldom possible to check every possible combination of
input.
Let's imagine we are programming a generalised, NxN noughts
and crosses program, for example a 5x5 game rather than the
usual 3x3. This has around 25 factorial possible games and we
can't check them all.
Even worse, the quadratic equation a*x*x + b*x + c = 0
has an infinite number of possible coefficients.
It's important to consider every possible class
of cases and to test at least one of every class.
For a quadratic we have the following possibilities:
- Two real roots.
- Two complex roots.
- One repeated root.
- Any or all of a, b and c equal to zero. (a equals zero is
particularly important of course.)
We should make sure we test them all.
Check every class of possibility.
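As a minimal sketch of testing one case from every class, the function below sorts a quadratic into the classes listed above and is then tried on one hand-checked example of each. The names root_class, classify_quadratic() and test_every_class() are invented for this illustration, and the coefficients are chosen so that the discriminant is computed exactly.

```c
#include <assert.h>

/* Hypothetical labels for the classes of quadratic listed above. */
enum root_class {
    NOT_QUADRATIC,   /* a == 0: the equation is linear or degenerate */
    TWO_REAL,        /* discriminant > 0 */
    ONE_REPEATED,    /* discriminant == 0 */
    TWO_COMPLEX      /* discriminant < 0 */
};

/* Classify a*x*x + b*x + c = 0 by the sign of its discriminant. */
enum root_class classify_quadratic(double a, double b, double c) {
    if (a == 0.0)
        return NOT_QUADRATIC;
    double disc = b*b - 4.0*a*c;
    if (disc > 0.0)
        return TWO_REAL;
    if (disc == 0.0)
        return ONE_REPEATED;
    return TWO_COMPLEX;
}

/* One test per class, using coefficients whose roots we know by hand. */
void test_every_class(void) {
    assert(classify_quadratic(1.0, -3.0, 2.0) == TWO_REAL);      /* roots 1 and 2 */
    assert(classify_quadratic(1.0, 2.0, 5.0) == TWO_COMPLEX);    /* roots -1 +/- 2i */
    assert(classify_quadratic(1.0, -2.0, 1.0) == ONE_REPEATED);  /* root 1, twice */
    assert(classify_quadratic(0.0, 2.0, 1.0) == NOT_QUADRATIC);  /* a is zero */
    assert(classify_quadratic(2.0, 0.0, -8.0) == TWO_REAL);      /* b is zero */
    assert(classify_quadratic(3.0, 6.0, 0.0) == TWO_REAL);       /* c is zero */
}
```

Calling test_every_class() from main() stops the program at the first class that is handled wrongly.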
Check that tests fail as well as pass
If we are checking, say, that a number is divisible by four we
must first check that the condition triggers when the number
is divisible by four and second that it does not
trigger when the number is not
divisible by four.
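The divisible-by-four example might be checked along the following lines. This is only a sketch: divisible_by_four() and its test function are names made up for the illustration.

```c
#include <assert.h>

/* Returns 1 if n is divisible by four and 0 otherwise. */
int divisible_by_four(int n) {
    return n % 4 == 0;
}

void test_divisible_by_four(void) {
    /* The condition must trigger for multiples of four... */
    assert(divisible_by_four(8));
    assert(divisible_by_four(0));
    assert(divisible_by_four(-12));
    /* ...and must not trigger for anything else. */
    assert(!divisible_by_four(6));   /* even, but not a multiple of four */
    assert(!divisible_by_four(7));
    assert(!divisible_by_four(-3));
}
```

The value 6 is a useful "should fail" case because a mistaken test for evenness (n % 2) would let it through.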
Put in checks
If we are dealing with a triangle, check that the angles add
up to pi radians (180 degrees), to within a margin of error. (But see below.)
Check that your checks do actually catch the
errors.
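A sketch of such a check for the triangle, together with a check that the check itself catches a bad triangle, could look like this. The function names, the constant PI and the tolerance of 1e-9 are assumptions chosen for the example.

```c
#include <assert.h>
#include <math.h>

static const double PI = 3.14159265358979323846;

/* Returns 1 if the three angles (in radians) sum to pi to within tol. */
int angles_are_consistent(double a, double b, double c, double tol) {
    return fabs(a + b + c - PI) < tol;
}

void test_angle_check(void) {
    /* A genuine triangle (90, 60 and 30 degrees) should pass. */
    assert(angles_are_consistent(PI/2.0, PI/3.0, PI/6.0, 1e-9));
    /* Deliberately wrong angles must be caught, or the check is useless. */
    assert(!angles_are_consistent(PI/2.0, PI/2.0, PI/2.0, 1e-9));
    assert(!angles_are_consistent(PI, PI/2.0, PI/2.0, 1e-9));
}
```

The margin of error matters: comparing the sum to PI with == would reject perfectly good triangles because of rounding.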
Mistakes to avoid
Sadly, these are all things I have seen (and probably done!).
1. Printing out the result without checking it's correct
If we write a factorial function and print out 11 factorial as
a test, we will need some evidence that it's correct - nobody
can work out the value of 11 factorial in their head.
Check that the result is actually correct, not just
that it's a number!
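One way to get that evidence, sketched below, is to compare against values we can work out by hand, against the relation n! = n * (n-1)!, and against an independently known value (11! is 39916800). The recursive factorial() here is only a stand-in so that the example is self-contained.

```c
#include <assert.h>

/* Stand-in factorial for illustration only. */
long factorial(int n) {
    return n <= 1 ? 1 : n * factorial(n - 1);
}

void test_factorial(void) {
    /* Small values we can work out in our heads. */
    assert(factorial(0) == 1);
    assert(factorial(1) == 1);
    assert(factorial(5) == 120);
    /* Each value must be n times the previous one. */
    for (int n = 2; n <= 12; ++n)
        assert(factorial(n) / factorial(n - 1) == n);
    /* An independently known larger value. */
    assert(factorial(11) == 39916800L);
}
```

The loop stops at 12 because 13 factorial no longer fits in a 32-bit long.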
2. Always using the same test data
It's easy to get into the habit of typing a certain set of
numbers into the terminal until that particular input data gets
handled properly and then moving on. It is even easier when the
test data is in a file.
Try not to always use the same test data.
3. Checking only special cases
For example, typing in only integers when reading in a
floating-point number. Or, when solving the quadratic equation a*x*x
+ b*x + c = 0, always making b or c equal to zero, or a equal to
one. There are a number of errors that won't get caught in this
case, for example forgetting the initial minus sign in "-b +
sqrt(b*b - 4.0*a*c)".
4. Failing to check special cases
We should make sure that our program can handle a equals zero
for example.
Check the special cases but not just the special
cases.
Using a piece of code which we've only vaguely debugged is
like putting on underwear in the Australian outback which we've
only vaguely checked for poisonous spiders or scorpions: it's
going to come back and bite us somewhere very painful indeed!
Finding bugs
Despite our best intentions, bugs do occur. When they do there
are two main priorities, which often operate in parallel:
Successively partition the problem
If we were playing a game where we choose a (whole) number
between one and a hundred and the other person has to guess it,
a small child might ask "Is it one? Is it two?...". A more
numerate person would start "Is it less than fifty?", thus going
from an "order N" problem to an "order log(N)" problem.
Similarly, if an incorrect value or algorithm has two or three
components, try to work out which of these is wrong, rather than
starting at the first sub-component of the first component, then
the second sub-component and so on.
Narrow down where the bug is,
preferably by successive partitioning
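The number-guessing game can be sketched in code to show how few questions halving needs. The function names are invented for this illustration; for numbers between one and a hundred the worst case works out at seven questions, compared with up to about a hundred when asking one number at a time.

```c
#include <assert.h>

/* Count the yes/no questions needed to find `secret` in [lo, hi]
   by repeatedly asking "is it greater than the midpoint?". */
int questions_needed(int secret, int lo, int hi) {
    int questions = 0;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        ++questions;
        if (secret > mid)
            lo = mid + 1;   /* the answer lies in the upper half */
        else
            hi = mid;       /* the answer lies in the lower half */
    }
    return questions;
}

/* Worst case over every possible secret between 1 and 100. */
int worst_case_questions(void) {
    int worst = 0;
    for (int secret = 1; secret <= 100; ++secret) {
        int q = questions_needed(secret, 1, 100);
        if (q > worst)
            worst = q;
    }
    return worst;
}

void test_bisection(void) {
    assert(worst_case_questions() == 7);
}
```

The same halving idea applies to a bug: check a value roughly half-way through the calculation, then look only in the half that is wrong.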
Also be aware that it can be the code that puts the components
together that's in error, or even the debugging code itself.
Check the error isn't in your debugging code.
It's important to realise that every piece of code has two stories:
- What we want it to do
- What we have told it to do.
When debugging a short piece of code we basically go down comparing
these two stories and seeing where they differ.
Do not make the unconscious mistake of assuming
that what you want your code to do has any effect on what it will do!
Step through the code comparing what you want
the program to do with what you have told it to do.
Example: a function calling other functions
Here's a familiar situation. The following function always
returns zero whenever r is greater than two:
long ncr(int n, int r) {
return npr(n,r)*(1/factorial(r));
}
We can easily see what it's meant to do. If we can't see the
problem by inspection we move on to:
The 80/20 rule is more properly known as the
Pareto principle.
Bearing in mind the 80/20 rule, there are two debugging tools
that will form the majority of what you do. The first is for
programs that run but produce the wrong answer. It's very
simple: Print stuff out.
Get more
information by printing stuff out.
The first thing is to put in some diagnostics:
long ncr(int n, int r) {
long fact, nprval, result;
fact = factorial(r);
nprval = npr(n,r);
result = npr(n,r)*(1/factorial(r));
printf("FACT: n %d r %d fact %ld npr %ld result %ld\n",
n, r, fact, nprval, result);
return result;
}
Print out information that
tells you the progress of the program.
(Another choice would be to have factorial() and npr()
print out their arguments and results; we shall illustrate this
below.) The program produces:
FACT: n 5 r 3 fact 2 npr 60 result 0
This probably isn't what we expect: at first sight both npr and
factorial seem to be doing their job but the result is zero!
Careful thought about the rules of integer arithmetic makes us
realise that (1/2) is zero, not 0.5, so we fix our code
and remove the diagnostics:
long ncr(int n, int r) {
return npr(n,r)/factorial(r);
}
and move on (a little too soon as we shall discover).
Print out information that tells
you whether the program is correct so far (and check it!)
Example continued: a bad function containing a bad loop
A little later on we realise that although ncr() is
no longer giving zero, it's not giving the right answer either!
Our code includes the following factorial function:
long factorial(int n) {
long res = 1;
for (int i = 2; i < n; ++i)
res *= i;
return res;
}
The first thing would be to see if both it and the npr()
function were behaving properly. This time we put all the
diagnostics inside factorial() and npr()
so that each function prints out its arguments and what it
returns:
long factorial(int n) {
long res = 1;
printf("FACTORIAL: n is %d\n", n);
for (int i = 2; i < n; ++i)
res *= i;
printf("FACTORIAL returning %ld\n", res);
return res;
}
One point to note here is that it's unwise just to print out a
number without saying what it is (printf("%d\n",n);),
as we will end up with a string of anonymous numbers.
Print out what the number,
etc, is, not just its value, and make sure the message
identifies where it comes from in the program
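One possible way to make every diagnostic say what it is and where it comes from is a small macro built on the standard __FILE__, __LINE__ and __func__ identifiers. The macro SHOW_LONG and the function sum_of_squares() are hypothetical, written only for this sketch.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper macro: prints the variable's name and value,
   plus the file, line and function the message comes from. */
#define SHOW_LONG(var) \
    fprintf(stderr, "%s:%d %s(): %s = %ld\n", \
            __FILE__, __LINE__, __func__, #var, (long)(var))

/* Example function under investigation: 1*1 + 2*2 + ... + n*n. */
long sum_of_squares(int n) {
    long sum = 0;
    assert(n >= 0);   /* a negative argument would be a bug in the caller */
    SHOW_LONG(n);
    for (int i = 1; i <= n; ++i) {
        sum += (long)i * i;
        SHOW_LONG(i);     /* the variable controlling the loop */
        SHOW_LONG(sum);   /* what the body of the loop is doing */
    }
    SHOW_LONG(sum);
    return sum;
}
```

Calling sum_of_squares(3) returns 14 and writes one labelled line to stderr for each SHOW_LONG, so no number appears without its name and location.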
We give our program data which will call this function with a
sensible value (say 6) and find it returns 120, not 720. That's
probably enough to find the problem, but if we wished to debug
the loop we typically want to print out two things:
- The variables controlling the loop
- What the body of the loop is doing
We therefore write:
long factorial(int n) {
long res = 1;
printf("FACTORIAL: n is %d\n", n);
for (int i = 2; i < n; ++i) {
res *= i;
printf("FACTORIAL: i %d res %ld\n", i, res);
}
printf("FACTORIAL %d returning %ld\n", n, res);
return res;
}
And see:
FACTORIAL: n is 6
FACTORIAL: i 2 res 2
FACTORIAL: i 3 res 6
FACTORIAL: i 4 res 24
FACTORIAL: i 5 res 120
FACTORIAL 6 returning 120
Looking at the numbers we see we have calculated 5 factorial not 6
factorial.
We then look at the loop condition and
see that we have instinctively written i < n
(perhaps because we are so used to looping over array indices)
and fix it to read:
long factorial(int n) {
long res = 1;
for (int i = 2; i <= n; ++i)
res *= i;
return res;
}
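Having fixed both bugs, it is worth checking the whole chain against values we know. The sketch below does this; the notes do not show npr(), so the version here is an assumption, and ncr_bad() reproduces the original integer-division mistake to confirm that the test would have caught it.

```c
#include <assert.h>

long factorial(int n) {
    long res = 1;
    for (int i = 2; i <= n; ++i)
        res *= i;
    return res;
}

/* Assumed implementation of npr(): n * (n-1) * ... * (n-r+1). */
long npr(int n, int r) {
    long res = 1;
    for (int i = n - r + 1; i <= n; ++i)
        res *= i;
    return res;
}

/* The original, buggy version: 1/factorial(r) is integer division. */
long ncr_bad(int n, int r) {
    return npr(n, r) * (1 / factorial(r));
}

long ncr(int n, int r) {
    return npr(n, r) / factorial(r);
}

void test_ncr(void) {
    assert(1 / 2 == 0);          /* the rule that caused the first bug */
    assert(ncr_bad(5, 3) == 0);  /* the test does catch the bad version */
    assert(factorial(6) == 720); /* not 120, as the second bug gave */
    assert(ncr(5, 3) == 10);
    assert(ncr(6, 2) == 15);
    assert(ncr(4, 0) == 1);      /* special cases at both ends */
    assert(ncr(4, 4) == 1);
}
```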
Sometimes we try to print out a number and get Nan
instead! This is short for Not a Number and
happens when we do something invalid such as sqrt(-1).
Nans propagate through calculations:
x = sqrt(-1);
y = x + 1;
causes both x and y to be Nan. A
similar problem occurs when we divide a non-zero floating-point number by zero, which gives inf.
(NB Windows often uses #IND or
another three letter abbreviation instead.)
A macro is a more
advanced use of #define which is allowed to have
arguments.
One useful tip, which we also use below, is that the #include
file math.h also defines the macro isfinite(),
which returns zero if its parameter is infinity or Not a Number
and a non-zero value otherwise.
For example, suppose we have an array and for some reason we
think that some of its values are Nan or infinity. We might
write the following loop:
for(i = 0; i < N; ++i)
if (isfinite(x[i]) == 0)
printf("x[%d] is %f\n", i, x[i]);
This will print out i followed by inf, Nan,
etc. for any bad array elements. If we were to use the debugger,
below, we would have a choice of a few other things to do.
See also the section on assert() below.
isfinite(x) checks that x is not
infinity or Not A Number.
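A small helper built on isfinite() can tell us at a glance whether an array contains any bad values. The sketch below assumes the standard NAN and INFINITY macros from math.h to build some test data; count_non_finite() is a made-up name.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Count the elements of x[0..n-1] that are infinite or Not a Number. */
size_t count_non_finite(const double *x, size_t n) {
    size_t bad = 0;
    for (size_t i = 0; i < n; ++i)
        if (!isfinite(x[i]))
            ++bad;
    return bad;
}

void test_count_non_finite(void) {
    double x[] = { 1.0, NAN, -2.5, INFINITY, 0.0 };
    assert(count_non_finite(x, 5) == 2);
    /* Not a Number propagates: arithmetic on it gives Not a Number again. */
    double y = x[1] + 1.0;
    assert(isnan(y));
    /* An array of ordinary values has nothing to report. */
    double ok[] = { 0.5, -3.0, 1e300 };
    assert(count_non_finite(ok, 3) == 0);
}
```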
Recently I had a problem where a program
sometimes couldn't find a file. I printed out just the filename
but not the directory name because I "knew" what directory it
was looking in. But when I printed out the full file path
("directory/filename") I found it was looking in the wrong
directory.
The executive summary of this section is that we should not
assume that we know what the problem is. After all, if we knew
where the problem was it wouldn't exist!
Check your assumptions
Some things we "know" to be true aren't
When a program is running properly there are all sorts of things
we "know". And sometimes instead of printing out the actual thing
we want to know the value of we print out something we "know" is
the same as the thing we want.
It may not always be the last thing we changed
Most of the time the bug will be in the bit of code we are
working on. But if we can't find it there we need to look at the
code we had previously assumed was bug-free. This is an
unpleasant occurrence (there's lots of code to look through!),
but then again we've been saying this for the whole lecture so
it should not be a surprise by now.
Interface errors: inconsistent assumptions
The worst bugs can be where we make one assumption when writing
a function and another when using it - referred to as interface errors.
These are hard to find because each part of the program seems
to be correct. Get rid of those ambiguities and where that's not
possible say which one you've chosen in the comment at the top
of the function.
Remember the differences tend to be subtle: there are two
different "heat capacities" for example. Or a function may
assume a certain piece of set-up work has already been done
before we have called it but when we call it we may assume that
the function itself does it for us. (In this latter case,
wherever possible the function should check it has been done,
and in general functions that do a complete job are better than
ones that only do part of it.)
Interface errors, where we make one assumption when
writing a function and another, slightly different one, when
using it are particularly hard to spot.
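As a sketch of a function that does the complete job, the look-up function below states its assumptions in the comment at the top and checks for itself whether the set-up work has been done, rather than relying on the caller. The table of squares and all the names are invented for this illustration.

```c
#include <assert.h>

#define TABLE_SIZE 10

static long squares[TABLE_SIZE];
static int table_ready = 0;   /* has init_squares() been called yet? */

/* Set-up work: fill in the table. */
void init_squares(void) {
    for (int i = 0; i < TABLE_SIZE; ++i)
        squares[i] = (long)i * i;
    table_ready = 1;
}

/* Look up i*i.
   Assumes: 0 <= i < TABLE_SIZE (checked).
   Does NOT assume init_squares() has already been called:
   it does the set-up itself if necessary. */
long lookup_square(int i) {
    assert(i >= 0 && i < TABLE_SIZE);
    if (!table_ready)
        init_squares();
    return squares[i];
}
```

With this design it no longer matters whether the caller assumed the function does the set-up or not: both assumptions give the right answer.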
Check the diagnostics
We mentioned this above, but sometimes the code is actually
correct, it's our diagnostic that's wrong.
Using the debugger
The printf() technique is not very helpful when a
program crashes with a segmentation fault or other memory error.
All we get is a message saying it has crashed, with no idea
where or why.
I have seen bugs so bad that the debugging
information itself has been destroyed but these cases are rare.
Fortunately the debugger can tell
us:
- Where the program was when it crashed
- What values the variables had at the time.
For the debugger to be useful we must have
enabled the appropriate options when we compiled the program.
The debugger can tell
us where the program was when it crashed and what values the
variables had at the time.
Showing variable values ("watches")
By default the debugger will show us where a program was when
it crashed but not the variable values. This is easily fixed,
and also needs to be done just once, from the Debug
-> Debugging windows -> Watches menu item:
Enabling the "Watches" window.
Viewing the Call stack
- If the "Call stack" entry is not checked in the above
picture select that too. This is the window that shows what
functions were active at the time.
Enable the Call stack and Watches windows.
Moving windows
Note: Code::Blocks has quite a sophisticated windowing system
and free-floating windows can be dragged into the "main"
display. The images below will show the "watches" window having
been dragged to the left of the source-code window.
Starting the program under the debugger
This can be done either from the Debug ->
Start/Continue menu or just by pressing F8:
Running under the debugger
The purpose of the debugger is to provide us with a window
like the one above. The above example is for a program that
has crashed due to trying to modify a read-only string. It has
three main windows (NB: if either or both of the "Watches" or
"Call stack" windows do not appear, enable them from the "Debug -> Debugging windows" menu).
filename.c: source code
This shows the statement where the program crashed.
Watches: variables and their values
The main items of interest to us are the arguments, local
variables and global variables.
When viewing the values of arrays passed as
arguments it can be better to go up to the function where they
were first declared by double-clicking the function name
inside the call stack window (see example below).
Call stack: function list
This shows us which function we are in, badfun(),
the function that called it, and so on up to main().
In our case there are just two functions.
The function names are double-click-able: if we double-click on the word main
inside the call-stack window the other windows will show us
the state of main() not badfun(). The
ability to go up and down the function list, examining the
variables in each function can be extremely useful.
In the following picture we have double-clicked on main()
inside the call-stack window. This could be because we wanted
to know where badfun() was called from or because we
wanted to view an array declared inside main() whose
address was passed to badfun().
Use the Call stack and Watches windows to move up
and down the list of active functions and view their variables.
Viewing arrays
Arrays are only truly defined in the function that declared them
(subfunctions are only passed their addresses) so the debugger may
only show them when viewing their original function.
The assert() macro, which is defined inside assert.h,
has a simple and brutal function: if the expression in
parentheses is false (zero) the program is killed there and
then via the abort signal. For example, in the above
case where we thought that one or more of the elements of an
array were infinity or Not a Number, we could have written:
for(i = 0; i < N; ++i)
assert(isfinite(x[i]));
assert(expression)
kills the program if expression is false
(zero). It is only really useful when running under the debugger.
For obvious reasons, assert() is only really useful
if we are running under the debugger. For equally obvious
reasons we don't want to leave it enabled when we release the
program to the people who are going to use it. Therefore assert()
can be turned off by defining NDEBUG before
including assert.h:
/* Turn off assert */
#define NDEBUG
#include <assert.h>
In this case the pre-processor replaces assert()
with an expression that does nothing.