Tuesday, February 24, 2009

Forward Declarations in C++ and runtime Bug/Error

A forward declaration is a pretty simple concept to understand. It declares a class name ahead of a piece of code that uses it, without
including the header that defines the class, so that compilation succeeds and the compiler does not complain about a missing class name.
Why would someone prefer a forward declaration over including the header file? It is especially helpful when multiple header files carry
conflicting definitions (not of the class itself but of, say, a macro like MAX_PATH) and a given cpp file can include only one of them.
If that cpp file still needs to handle objects of the class whose header was left out, it needs at least a declaration of the name so
that COMPILATION succeeds. With only that declaration the class is an incomplete type: the compiler can pass pointers and references to
it around, but it knows nothing about the class's size, members, or base classes, and the linker does not fill that information in later.
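
To make the idea concrete, here is a minimal sketch of what a forward declaration does and does not allow. The names Widget, Holder and readValue are hypothetical and exist only for this illustration; they are not part of the code discussed in this post.

```cpp
#include <cassert>

class Widget;  // forward declaration: Widget is an incomplete type from here on

// Storing and passing a pointer needs no knowledge of Widget's layout,
// so this compiles with only the forward declaration in view.
struct Holder {
    Widget* item = nullptr;
    void set(Widget* w) { item = w; }
    Widget* get() const { return item; }
};

// The full definition normally lives in Widget's own header.
class Widget {
public:
    int value = 42;
};

// Only code that sees the full definition may touch members or use sizeof.
int readValue(const Holder& h) {
    assert(h.get() != nullptr);
    return h.get()->value;
}
```

Holder compiles before Widget is defined, while readValue has to come after the definition because it reads a member.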
There is a very grave pitfall here that can go unnoticed and produce a hard-to-find bug, one that can cost you a lot of valuable time
staring at memory and guessing. It took me about an hour to figure this out when I happened to work with someone who had written such a
piece of code.
First, the scenario. Let us look at the declarations of our problem candidates:
=====
//In A.h
class A
{
public:// some func
int array1[5];
};

//In Ace.h
#pragma once
class Ace
{
public:
int var;
};

//In newder.h
#pragma once
class Ace;
class Der;
class newDer
{
public:
void * membervar; //points to an instance of Der as supposed by the programmer
void setMembervar(void*);
void * getMembervar();
Ace * getDerVar();
};
//In newder.cpp
#include "A.h"
#include "newder.h"
void newDer::setMembervar(void * setvar)
{
this->membervar = setvar;
}

void * newDer::getMembervar()
{
return this->membervar;
}
Ace * newDer::getDerVar()
{
//Uncomment this to understand what could be happening inside the pointer conversions
//Ace * someptr = (Ace*) this->membervar;
return (Ace*) (Der*)this->membervar;
}

//In Der.h
#pragma once
class Der: public A, public Ace
{
public:
int * pointer;
void execute();
};
//In der.cpp
#include "A.h"
#include "Ace.h"
#include "Der.h"
#include "newder.h"
#include "stdio.h"
void Der::execute()
{
newDer n;
n.setMembervar((void *)this);
printf("Before: Der* 0x%08x\n",(Der*)this);
Der* d = (Der*) n.getDerVar();
printf("After: 0x%08x\n",d);
printf("Pointer value diff %d\n",(unsigned int)this-(unsigned int)d);
}
int main()
{
// sizeof yields size_t, so cast to int to match the %d format specifier
printf("size of int = %d\n", (int)sizeof(int));
printf("size of long = %d\n", (int)sizeof(long));
void * someptr;
printf("size of pointer = %d\n", (int)sizeof(someptr));
printf("size of class A = %d\n", (int)sizeof(A));
printf("size of class Ace = %d\n", (int)sizeof(Ace));
printf("size of class Der = %d\n", (int)sizeof(Der));
Der d;
d.execute();
}

====

This innocent-looking code has a very subtle bug that can make the program crash, and the bug is very difficult to track down in
production-like code (surrounded by a lot of other code). For the sake of simplicity the code above revolves only around the problem and
contains no business logic. If you run it (as a 32-bit build, as the sizes show) you will see the following output:
size of int = 4
size of long = 4
size of pointer = 4
size of class A = 20
size of class Ace = 4
size of class Der = 28
Before: Der* 0x001bfae8
After: 0x001bfad4
Pointer value diff 20

Notice the Before and After pointer values. Interesting, isn't it? I would expect the pointer value to be the same, but it has moved.
Imagine using d inside the execute function to access the pointer member variable of the Der class, as in d->pointer, and then using that
pointer to call some function, and so on.
How could the address of the Der object change between the call and the return? You might have figured it out from the extra print statements showing the class sizes. Yes, the bug is in the function "Ace * newDer::getDerVar()". When we assigned the Der* to the void * membervar in setMembervar everything was fine: membervar points to location 0x001bfae8. The problem surfaces when getDerVar casts it to Ace* and returns it. Look carefully at the header file newder.h. The classes Ace and Der are only FORWARD DECLARED there. Rings a bell? When newder.cpp was compiled, Ace and Der were incomplete types: the compiler knew neither their sizes nor that Der derives from Ace. So it did not emit any offsetting instructions for the cast and simply reinterpreted the address. Normally, when a pointer is cast up or down a class hierarchy, the compiler adjusts it by the offset of the base-class subobject; here the Ace part of Der sits 20 bytes (the size of A) into the object. That adjustment is missing from newder.obj, so at run time the address membervar points to (0x001bfae8) is handed out unchanged as an Ace*.

The bug continues in the function "void Der::execute()". When the statement "Der* d = (Der*) n.getDerVar();" is compiled, the compiler knows that getDerVar() returns Ace*, and we are telling it that this Ace is really part of a Der, which contains A and then Ace laid out in that order. So it emits offsetting code that moves the pointer back by 20 bytes to reach the start of the Der object, and there we have our horrible bug: d ends up at 0x001bfad4, 20 bytes before the real object.
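
The two halves of the bug can be reproduced in a single file. The sketch below is only an illustration under assumptions: the helper names (addr, aceOffset, castWithoutDefinition, castBackWithDefinition, roundTripError) are hypothetical, and reinterpret_cast stands in for what the C-style cast in newder.cpp degrades to when Ace and Der are incomplete. The downcast of a pointer that does not really point at an Ace subobject is formally undefined behaviour, which is exactly the bug; the result is only inspected as a number here, never dereferenced.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct A   { int array1[5]; };
struct Ace { int var; };
struct Der : A, Ace { int* pointer; };

// Numeric value of an address, for comparing pointers as integers.
std::uintptr_t addr(const void* p) { return reinterpret_cast<std::uintptr_t>(p); }

// Offset of the Ace subobject inside Der, as the compiler computes it
// when the full class definitions are visible.
std::ptrdiff_t aceOffset(Der* d) {
    assert(d != nullptr);
    Ace* adjusted = static_cast<Ace*>(d);  // pointer is adjusted here
    return static_cast<std::ptrdiff_t>(addr(adjusted) - addr(d));
}

// What newder.cpp effectively does: same address, new type, no adjustment.
Ace* castWithoutDefinition(void* p) { return reinterpret_cast<Ace*>(p); }

// What der.cpp does with the result: a real downcast, which subtracts
// the Ace offset. Undefined behaviour for this input; shown for the bug.
Der* castBackWithDefinition(Ace* a) { return static_cast<Der*>(a); }

// How far the round trip lands from the real object.
std::ptrdiff_t roundTripError(Der* d) {
    Der* back = castBackWithDefinition(castWithoutDefinition(d));
    return static_cast<std::ptrdiff_t>(addr(d) - addr(back));
}
```

On the common compilers I would expect roundTripError to return the same value as aceOffset, which matches the 20-byte difference in the output above, but the exact layout is up to the implementation.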

This type of code is not healthy for large code bases. I usually avoid forward declarations as much as possible. If newder.cpp includes Ace.h and Der.h instead of relying on the forward declarations, we will not see this problem.
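
Here is a sketch of that fix, collapsed into one file so it can be compiled on its own. The class definitions stand in for the included headers, newDer is trimmed to the members that matter, and checkRoundTrip is a hypothetical helper added only for the illustration. Using static_cast instead of the C-style cast is my suggestion rather than part of the original code: with only forward declarations in view, the static_cast from Der* to Ace* does not compile, so the mistake is caught at build time.

```cpp
#include <cassert>

// Full definitions, standing in for #include "A.h", "Ace.h" and "Der.h".
class A   { public: int array1[5]; };
class Ace { public: int var; };
class Der : public A, public Ace { public: int* pointer; };

class newDer {
public:
    void* membervar = nullptr;  // points to an instance of Der
    void setMembervar(void* p) { membervar = p; }
    Ace* getDerVar() {
        // Der is a complete type here, so the compiler adds the offset
        // of the Ace subobject. With incomplete types this line is a
        // compile error instead of a silent reinterpretation.
        return static_cast<Ace*>(static_cast<Der*>(membervar));
    }
};

// Same round trip as Der::execute(): the Der address now survives.
void checkRoundTrip(Der& d) {
    newDer n;
    n.setMembervar(&d);
    Der* back = static_cast<Der*>(n.getDerVar());
    assert(back == &d);
}
```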

Saturday, February 21, 2009

Delay Loading using Visual Studio - Pitfall

This is a small post on delay loading using Visual Studio. As everyone knows, we can load a dll into a process dynamically. Let us say binary B1 depends on binary B2. One way of setting this up is to put a hard dependency on B2 in the project settings of B1. This makes B2 load at process startup into the process where B1 is loaded (or, if B1 is the exe, B1 is the process itself). You can see this by starting your process in the debugger (Windbg) with gflags set and "Show loader snaps" checked. It shows all the binaries that are loaded when the process comes up.

The second option is to call LoadLibrary() in the code of B1 to load B2 and then call GetProcAddress() for each function. This is the best delay loading method. In my experience it wards off all failure-to-start errors in the field (once your solution is installed in customer environments). But it is tedious if a large number of functions have to be imported from B2 and used in B1: you end up writing a function pointer definition in B1 for every function that you need from B2.
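
A minimal sketch of this second option follows. LoadLibraryA, GetProcAddress and FreeLibrary are the real Win32 calls; everything else is an assumption made for the illustration: the export name "Add", its signature AddFn, and the wrapper callAddFromB2 are hypothetical. The non-Windows branch exists only so the file compiles elsewhere.

```cpp
#include <cassert>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical signature of a function exported by B2.dll.
typedef int (*AddFn)(int, int);

// Loads B2.dll on demand and calls its (assumed) export "Add".
// Returns false if the dll or the export cannot be found, so a missing
// B2.dll becomes a handled condition rather than a startup failure.
bool callAddFromB2(int a, int b, int* result) {
    assert(result != nullptr);
#ifdef _WIN32
    HMODULE b2 = LoadLibraryA("B2.dll");
    if (b2 == nullptr) {
        return false;  // B2.dll is not installed or failed to load
    }
    // One function pointer per imported function: the tedious part.
    AddFn add = reinterpret_cast<AddFn>(GetProcAddress(b2, "Add"));
    bool found = (add != nullptr);
    if (found) {
        *result = add(a, b);
    }
    FreeLibrary(b2);
    return found;
#else
    (void)a;
    (void)b;
    return false;  // LoadLibrary is Windows-only
#endif
}
```

The caller checks the return value and can fall back gracefully when B2 is absent, which is what makes this approach robust in the field.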

The third option is a middle way between the two approaches. It is to delay-load B2 in B1 using the "Delay Loaded Dlls" setting in the project settings of B1 (under Linker options). But there is a specific way of doing this.

You will have to put B2.lib in the "Additional Dependencies" section as if you were hard depending on it (the 1st case), and in addition you will have to put B2.dll in the "Delay Loaded Dlls" entry in the same tab of the project settings to make it delay loaded. Once you do this, B2 will be loaded only when a code path in B1 that uses B2's functions or classes is hit. Until then B2 will not be loaded into the process space of B1.

Pitfall:
Here most developers who have multiple delay loaded dlls, let us say B2 and B3 to be delay loaded in B1, type the entries as 'B2.dll B3.dll' (of course without the single quotes). They do this because they usually copy/paste, or assume the delimiter is the same as in the "Additional Dependencies" section. But doing so actually tells the linker that there is a single binary called "B2.dll B3.dll" to be delay-loaded into the process of B1. Since no matching entry is found in the "Additional Dependencies" section, the linker ignores this directive and happily links anyway. The consequence is that B2.dll and B3.dll are linked as hard dependencies of B1. And you wonder why they are getting loaded at startup (let us say that, because of packaging differences, B2.dll and B3.dll are not always installed along with B1.dll, which results in load failures of B1.dll when you think everything is right).

The delimiter in the delay-loaded section is a semi-colon ";". If we put the entry as 'B2.dll;B3.dll' in the "Delay Loaded Dlls" section then everything goes fine as expected.

The second pitfall is that if you have a delay loaded dll B2 in B1 and, in the course of time, the code in B1.dll that refers to B2.dll is yanked out (code changes for any reason), then the delay load directive you have put in for B2.dll will be ignored during linking. The consequence is that your binary B1.dll now hard depends on B2.dll!