In a prior lesson we developed a function that allowed us to print out integers in a specified number base. One of the bases we could use was base two - or binary. The question is then, does this function allow us to view the internal status of the bits in a data object?
The answer is a resounding...sometimes.
That function prints out the binary representation of the value stored in a variable - there is a very large distinction between the value and how that value is represented in memory. For instance, we might have the same value, 42, stored in memory in four places. First it might appear as a 16-bit integer, then it might appear elsewhere as a 32-bit integer. Someplace else it might appear as a 32-bit single-precision floating point value and, finally, it might appear in the fourth location as a 64-bit double-precision floating point value. While the value, namely 42, stored in all four locations is the same, the patterns of 1's and 0's stored in those segments of memory are very different - and not just in how many bits are used to store the value.
Consider a situation where we were asked to calculate how much water we would have if we added 6 gallons to a tank presently containing 103 pints. Whether we express the amount of water in the tank in gallons or pints or liters or any other unit of fluid measure does not change the value - which is the amount of water in the tank - it merely changes how that value is expressed, or represented.
In order to perform the addition we would need to express both values - the water presently in the tank and the water to be added to the tank - using the same representation. After doing this, we can add the two values together and obtain a new value expressed in a particular way, in this case that might be 151 pints. We can now, if necessary, convert that to a different representation, say 18.875 gallons, but in doing so we have not changed the value - the amount of water in the tank has not changed, merely its representation.
Consider the following code fragment:
int k, m;
double x, y;
k = 15;
x = 15;
m = k + x;
y = k + x;
Although k and x have the same value, the bits actually stored in memory are very different because they are stored using different representations.
When we add k and x, the compiler has to translate the representation of at least one of the values so that they are compatible. Once they are in the same form, the operation can be carried out and the result is in a particular representation - or data type. In this case the result of the addition operation would be a value represented as a double. When we then store that value in the variable m, it is converted to the representation of that value as an int. When we store it in the variable y, it can be stored directly using the same representation it is already in. But, in both cases, the values stored in the two variables are the same (because both variables use a representation capable of representing the value 30).
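Here is a minimal, self-contained version of that fragment - a sketch that uses the standard printf() rather than the course output routines - confirming that both variables end up holding the value 30:

#include <stdio.h>

int main(void)
{
    int k, m;
    double x, y;

    k = 15;
    x = 15;        /* same value as k, but stored as a double                    */

    m = k + x;     /* double result converted back to an int (compiler may warn) */
    y = k + x;     /* double result stored directly, no conversion needed        */

    printf("m = %d\n", m);   /* prints: m = 30        */
    printf("y = %f\n", y);   /* prints: y = 30.000000 */

    return 0;
}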
Most of the time, we leave it up to the compiler and the processor to deal with issues associated with representation. We don't want to deal with those issues - that's why we spent good money on a compiler. For instance, the technique that we used to print out a value in a particular number base worked with the value - we didn't care how that value was actually represented in memory. But, as a result, we would get the same results - the same pattern of zeros and ones on the screen - regardless of what the bit pattern actually was in memory.
The C Language Standard recognizes that, in almost all cases, we are concerned with working with values, not their representations, and therefore it places relatively few restrictions on how values must be represented internally. Those issues are left up to the discretion of the team developing the compiler. If they can use a different representation and get better performance, then they have the option to do that. With almost all data types and operators the compiler designer is expected to perform whatever translations and manipulations are necessary to implement the expected behavior on the values that are being used. If we try to rely on how those translations and manipulations affect the internal representations, we do so at our own peril.
But all is not lost. The C Language Standard recognizes that programmers do, at times, need or want to deal with the internal representations directly. To make this possible, a limited set of data types and operators are required to have well-defined effects on the internal representations stored in memory. Because the need to use these capabilities is rather rare, and because supporting these behaviors is an added burden on the compiler designer which could potentially limit their ability to generate well-optimized executable code, the elements that are defined at this level are kept to a minimum. We must be careful not to stray outside of those elements and assume that what works for them works for other data types and operators; otherwise we quickly find ourselves taking a walk through the world of undefined behavior.
Many C programmers have a tendency to make wrong assumptions about how data is represented internally or how we can interact with that data. Some of these assumptions are quite reasonable - but that doesn't change the fact that they are wrong. In many cases it is a matter of the text and/or course that the programmer learned from neglecting to discuss these details. This is understandable since C is a language that can be worked with at many different levels of detail and any treatment must decide what levels to focus on and what levels to bypass.
It is recommended that you examine the more common fallacies that people have about C to make sure that assumptions you have made do not get in the way of understanding how to work with the internal representations.
So, after dispelling all of these fallacies regarding tricks we can use to get at the internal data representation, what's left? What's left is a small set of highly reliable tools. The very fact that the set only contains a few elements works in our favor because we only have to learn how to use those elements.
The unsigned char Data Type

The only - let's rephrase that, the ONLY - data type whose behavior is completely defined for all bit positions is the unsigned char. The characteristics we can count on with this data type are that it is stored using exactly CHAR_BIT bits, that every one of those bits participates in the value, and that the value is represented in pure binary - the least significant bit has a weight of one, with each successive bit increasing the exponent by one.

Any other data type has at least one characteristic that is implementation-defined - larger integer types, for instance, leave it up to the implementation how their individual bytes are ordered in memory.

The bitwise operations are carried out independently for each bit position within an integer with no interaction with any other bits in the value. The operations available are:

~        Bitwise NOT
&  &=    Bitwise AND
|  |=    Bitwise OR
^  ^=    Bitwise XOR
<<       Shift Left
>>       Shift Right
&        Address of object
*        Value at address

In addition, we can always use any of the other operators - particularly the arithmetic, relational, and logical operators - because we know the required characteristics of the internal representation of values in an unsigned char data type, namely that they are represented using pure binary.
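As a quick, self-contained illustration of these operators - a sketch that uses the standard printf() with hexadecimal output instead of the course output routines, and that assumes CHAR_BIT is 8 so the bit patterns shown in the comments hold - consider:

#include <stdio.h>

int main(void)
{
    unsigned char a = 0xC5;   /* bit pattern 1100 0101 */
    unsigned char b = 0x2F;   /* bit pattern 0010 1111 */
    unsigned char r;

    r = ~a;      printf("~a     = %02X\n", (unsigned) r);   /* 3A : 0011 1010 */
    r = a & b;   printf("a & b  = %02X\n", (unsigned) r);   /* 05 : 0000 0101 */
    r = a | b;   printf("a | b  = %02X\n", (unsigned) r);   /* EF : 1110 1111 */
    r = a ^ b;   printf("a ^ b  = %02X\n", (unsigned) r);   /* EA : 1110 1010 */
    r = a << 1;  printf("a << 1 = %02X\n", (unsigned) r);   /* 8A : 1000 1010 */
    r = a >> 1;  printf("a >> 1 = %02X\n", (unsigned) r);   /* 62 : 0110 0010 */

    return 0;
}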
Printing the Bits of an unsigned char
Let's say that we have a value stored in an unsigned char and we want to print out the bit pattern that is stored there. Since we know it is pure binary, we could use the functions we have already developed to print out the value stored there in binary. This would work. But let's see how to do it using the tools in our Bit Banging Bag.
Top Level Pseudocode:

Create a bit mask that has only the most significant bit set.
While the mask is nonzero, print a '1' if the bit of the byte selected by the mask is set and a '0' if it is not, then shift the mask one position to the right.

In C, this becomes:
unsigned char byte, mask;

/* Create bit mask for MSB */
mask = 1 << (CHAR_BIT - 1);

/* Print out bits one-by-one starting with MSB */
while (mask)
{
    PutC((byte & mask) ? '1' : '0');
    mask >>= 1;
}
If we look at this code, we see that it actually lends itself to a very compact representation using a for() loop:
unsigned char byte, mask;

for (mask = 1 << (CHAR_BIT - 1); mask; mask >>= 1)
    PutC((byte & mask) ? '1' : '0');
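If you want to try this fragment as it stands, one way to wrap it in a complete program is sketched below; it substitutes the standard putchar() for the course's PutC() and, on a machine where CHAR_BIT is 8, prints 00101010 for the value 42:

#include <limits.h>   /* for CHAR_BIT */
#include <stdio.h>    /* for putchar  */

int main(void)
{
    unsigned char byte = 42;   /* the value whose bit pattern we want to see */
    unsigned char mask;

    for (mask = 1 << (CHAR_BIT - 1); mask; mask >>= 1)
        putchar((byte & mask) ? '1' : '0');
    putchar('\n');

    return 0;
}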
Printing the Bits of a signed char
The code above works fine if the data is of type unsigned char - but what if it is of type signed char?
Before getting into how we can do it, let's be sure we understand the two tempting ways that we cannot use.
We cannot do something like the following:
signed char data;
unsigned char byte, mask;
byte = data;
for( mask = 1 << (CHAR_BIT - 1); mask; mask >>= 1)
PutC((byte & mask)? '1':'0');
The assignment operator works with values and makes whatever conversions in the representation are necessary to maintain the same value (to the degree possible) under the new representation. The above code performs an implicit conversion from a signed char to an unsigned char, and so the data in byte may or may not have the same bit pattern as the data in data.
Another way that will not work:
signed char data;
unsigned char mask;

for (mask = 1 << (CHAR_BIT - 1); mask; mask >>= 1)
    PutC(((unsigned char) data & mask) ? '1' : '0');
This results in an explicit conversion and, unlike the previous case where the compiler will probably issue a warning regarding possible loss of data, explicit conversions like this generally suppress such warnings. Other than that, the two code fragments will produce the same, potentially wrong, output.
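To see that these conversions work on values rather than bit patterns, consider the following sketch. The assignment converts the value -1 into the value 255 (assuming CHAR_BIT is 8); on a two's complement machine the bit pattern happens to come through unchanged, but that is a property of that particular representation, not something the language promises:

#include <stdio.h>

int main(void)
{
    signed char data = -1;
    unsigned char byte;

    byte = data;   /* a value conversion: -1 becomes 255 when CHAR_BIT is 8 */

    printf("data = %d\n", (int) data);        /* prints: data = -1  */
    printf("byte = %u\n", (unsigned) byte);   /* prints: byte = 255 */

    return 0;
}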
So how do we do it right?
The key is to bypass any conversions on the data, implicit or explicit, that the compiler might apply. We do this by using the address at which the value is stored instead of the identifier name. An identifier name has a defined data type associated with it that the compiler knows about and will use whenever we access the value stored there. But an address is just a number - namely the number of the memory location where the value is stored.
We store addresses in variables of type pointer. A pointer variable is identified by preceding it with an asterisk when it is defined. Pointer variables also have a data type associated with them so that, when we access the data at the address stored in the pointer, the compiler knows what type of data it is and can do all of the things that it would do if we used a normal identifier. Therefore, to print out the internal bit pattern stored in the signed char data, we do the following:
signed char data;
unsigned char mask;
unsigned char *ptr;
ptr = (unsigned char *) &data;
for( mask = 1 << (CHAR_BIT - 1); mask; mask >>= 1)
PutC((*ptr & mask)?'1':'0');
Notice this is very close to our first incorrect method. The differences are as follows:
We declared a pointer to an unsigned char called ptr. We could have kept the name byte, but it is traditional to use 'p' or 'ptr' somewhere in the names of pointers just to help us humans keep track of what's what. Since we only have one pointer in this code fragment, there's no need to go beyond that.
The values stored in pointer variables are the addresses at which other things are located. The ampersand used in this context is the "address of" operator (not to be confused with the bitwise-AND operator) and computes the address of where the variable data is located. The address of data is therefore the value that gets stored in the variable ptr.
But, if explicit conversions are to be avoided, why do we have the typecast operation on this line? Because &data is of type pointer to signed char and we are storing it in a variable of type pointer to unsigned char. The compiler will probably complain about a "suspicious pointer conversion" and the typecast tells it that we really do want to make this assignment. The key is that the value whose representation is getting changed, if necessary, is the representation of an address - not the representation of the data stored at that address.
The final change is that we use *ptr in the PutC() call. The asterisk used in this context is the "dereference operator" (not to be confused with multiplication). You can read this operator as something like, "the value stored at the address contained in ...". So you would read *ptr as "the value stored at the address contained in ptr".
Since the compiler has to know how values are represented in order to work with them, it has to know how to interpret the value stored at the address contained in ptr. It does this by looking up the data type associated with the variable ptr which, in this case, is pointer to unsigned char. At the time that this line of code is executed, the compiler has no idea how the value stored in ptr got there. It doesn't need to know, nor does it care. Likewise, it doesn't know how the bit pattern it finds at the address contained in ptr got there. They are merely bits that it is now being told to interpret as an unsigned char.
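Putting the pieces together, here is one way to turn the fragment we just walked through into a complete program - again a sketch that uses putchar() in place of the course's PutC(). On a typical two's complement machine with CHAR_BIT equal to 8 it prints 11010110 for the value -42:

#include <limits.h>   /* for CHAR_BIT */
#include <stdio.h>    /* for putchar  */

int main(void)
{
    signed char data = -42;   /* value whose internal bit pattern we want to see */
    unsigned char mask;
    unsigned char *ptr;

    ptr = (unsigned char *) &data;   /* reinterpret the byte of data - no value conversion */

    for (mask = 1 << (CHAR_BIT - 1); mask; mask >>= 1)
        putchar((*ptr & mask) ? '1' : '0');
    putchar('\n');

    return 0;
}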
Turning this block of code into a general-purpose function requires another little trick. The function is as follows:
int PutByte(void *address)
{
    unsigned char mask;
    unsigned char *ptr;
    int chars;

    ptr = (unsigned char *) address;
    for (chars = 0, mask = 1 << (CHAR_BIT - 1); mask; mask >>= 1)
    {
        PutC((*ptr & mask) ? '1' : '0');
        chars++;
    }
    return chars;
}
As with previous output functions that we have developed, the addition of the variable chars lets it keep track of how many characters are actually printed so that we can return that to the calling function. With this feature, our program can keep track of things like how long a line of text has become so as to decide when to issue a newline character. For our purposes here it is completely extraneous.
We might have been tempted to declare the variable ptr in the function header (as a pointer to unsigned char) and this would have worked as expected. The problem is that we would have gotten "suspicious pointer conversion" warnings every time we passed the address of a variable to the function unless we did an explicit conversion to type pointer to unsigned char. But we can assign the value stored in any data pointer, regardless of type, to a void pointer without generating a warning because a void pointer has no type information associated with it. It is, in essence, a pure pointer in that it only contains an address value.
The consequence is that we can't use a void pointer for very many things. We can't access the value stored at the address pointed to because the compiler has no information about how to interpret the bits it will find there. We can't perform pointer arithmetic using a void pointer because the compiler doesn't know how big each element in that block of memory is.
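A short sketch of what a void pointer does and does not let us do - the two commented-out lines are the ones standard C rejects:

#include <stdio.h>

int main(void)
{
    double x = 3.14;
    int    k = 42;
    void  *vptr;

    vptr = &x;   /* any data pointer can be assigned to a void pointer... */
    vptr = &k;   /* ...regardless of the type it points to                */

    /* printf("%d\n", *vptr); */      /* error: cannot dereference a void pointer      */
    /* vptr = vptr + 1;       */      /* error: no pointer arithmetic on void pointers */

    printf("%d\n", *(int *) vptr);    /* fine: the cast supplies the type information  */

    return 0;
}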
By using a void pointer we have taken on the responsibility of keeping track of that information separately and on our own and, if the compiler needs it, we will give it that information by some other means. In this function we do that by copying the value of the void pointer to the variable ptr, which has type information bound to it. Since the value we are storing there does not carry type information, we explicitly convert the void pointer to a pointer with type information via a type cast (strictly speaking, C performs this particular conversion implicitly, but the cast makes our intent clear). In essence, we are using ptr as a "middleman" in this process and we could get by without it. If we wanted to use address directly in our PutC() function call, it would have looked like:
PutC( (*((unsigned char *)address) & mask)? '1':'0');
First we cast the value stored in address from a void pointer to a pointer to unsigned char. We then dereference this pointer to get at the value stored there. Finally we perform a bitwise-AND between that value and the value in mask as before. This is a perfectly legal statement that potentially saves us both a bit of memory and a bit of processing time (the compiler might be clever enough to optimize out the extra overhead in the way we originally did it). But the logic in the statement has become rather convoluted and it is generally better to break the logic up into smaller, more easily understood steps.
The only thing left to accomplish is being able to print out the bit pattern for a value stored in a data type that uses more than one byte. This process is quite straightforward. We start at the first byte, print out its bit pattern, then move on to the next byte. We need to know two things - where the individual bytes are stored and how many bytes total we have to work with.
The first issue is taken care of by the C language specification: the bytes that make up a multi-byte object occupy a single contiguous block of memory, and the address of the object is the address of the lowest-addressed byte in that block.
So if the address of an eight-byte value is memory location 300, we know that the eight bytes are stored at memory locations 300 through 307.
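One way to watch this guarantee in action is to print the address of each byte of a multi-byte object and see that the addresses come out consecutive. This is only a sketch (it leans on the sizeof operator discussed next to know how many bytes to step through, and the exact form of the addresses printed by %p is implementation-defined):

#include <stdio.h>

int main(void)
{
    double x = 3.14;
    unsigned char *ptr = (unsigned char *) &x;
    size_t i;

    /* The bytes of x occupy sizeof x consecutive addresses, starting at &x */
    for (i = 0; i < sizeof x; i++)
        printf("byte %u is at address %p\n", (unsigned) i, (void *) (ptr + i));

    return 0;
}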
The second issue is taken care of by the sizeof operator. This operator returns the number of bytes required to store the type of object indicated as its operand. We may either use an actual data type, which must be enclosed in parentheses, such as:
sizeof(double) /* Bytes required to store a value of type double */
sizeof(int *)  /* Bytes required to store a pointer to a value of type int */
Or we may use the name of any object that the compiler knows the type of:
int k;
double x;
double *ptr;
void *vptr;
x = 3.14;
ptr = &x;
vptr = ptr;
k = sizeof x; /* Bytes required to store a value of type double */
k = sizeof ptr; /* Bytes required by a pointer to a double */
k = sizeof vptr; /* Bytes required by a void pointer */
k = sizeof *ptr; /* Bytes required by the value pointed to by a pointer to a double */
k = sizeof *vptr; /* ERROR - Bytes required by the value pointed to by a void pointer */
All of these are completely well defined except the last one. Remember that a void pointer is nothing more than a pointer where we have chosen not to tell the compiler what type of data it points to. The compiler therefore knows how much memory is required to store the pointer itself, but it has no way of knowing how much memory is required to store the value that it points to.
So, using a variable of type double as an example, the code that prints out the bit representation is:
double data;
unsigned char *ptr;
int i;

ptr = (unsigned char *) &data;
i = sizeof data;
while (i--)
{
    PutByte(ptr);
    ptr++;
}
The sizeof operator computes the number of bytes required to store the object referenced by its operand (which is located to the right of the operator). This is an operator and not a function, so when the operand is an expression it does not need to be surrounded by parentheses (since there is no argument list); when the operand is a type name, as in sizeof(double), the parentheses are required. It is perfectly legal to parenthesize an expression operand and many programmers do. Other programmers avoid doing so because they don't want anything looking like a function call unless it is a function call. It's purely a style issue.
After using PutByte() to put out the bit pattern of the byte stored at the location pointed to by ptr, we increment the value of ptr. We continue doing this as long as there are bytes that haven't been printed.
We can convert this to a function as well:
int PutBytes(void *addr, int bytes)
{
    unsigned char *ptr;
    int i, count;

    for (i = count = 0, ptr = (unsigned char *) addr; i < bytes; i++, ptr++)
        count += PutByte(ptr);
    return count;
}
Consider the following main() function (available in bitrep1.c):
int main(void)
{
    unsigned int i;

    for (i = 0; i < 10; i++)
    {
        Putf_u(i, 3, 3); PutC(':');
        PutBytes(&i, sizeof i);
        PutC('\n');
    }
    return EXIT_PASS;
}
Compiling and running this on the Borland TurboC/C++ v4.5 compiler yields the following output:
0000:0000000000000000
0001:0000000100000000
0002:0000001000000000
0003:0000001100000000
0004:0000010000000000
0005:0000010100000000
0006:0000011000000000
0007:0000011100000000
0008:0000100000000000
0009:0000100100000000
Does this make sense? The expression sizeof(int) evaluates to 2 on this compiler and the value of CHAR_BIT is 8, so a value of type int requires 16 bits, which is what we have. But, in pure binary, the value 9, for instance, is represented by the bit pattern 1001. If we look at the last line of the output, we can see that pattern, but not where we probably expected it (which was at the far right of the output). To gain more insight into how this is represented internally, let's modify our code slightly to the following (bitrep2.c):
int main(void)
{
    int i;
    unsigned long k;

    for (i = 0, k = 1; i < (sizeof k) * CHAR_BIT; k *= 2, i++)
    {
        Putf_u(i, 3, 3); PutC(':');
        PutBytes(&k, sizeof k);
        PutC('\n');
    }
    return EXIT_PASS;
}
A value of type long is four bytes (hence 32 bits) on this compiler, so we run through the loop 32 times. We start out with a value of 1 stored in this variable and then, on each pass, we multiply the value by 2. This should shift the bit that is set one place to the left each time. The result we get, however, is this:
0000:00000001000000000000000000000000
0001:00000010000000000000000000000000
0002:00000100000000000000000000000000
0003:00001000000000000000000000000000
0004:00010000000000000000000000000000
0005:00100000000000000000000000000000
0006:01000000000000000000000000000000
0007:10000000000000000000000000000000
0008:00000000000000010000000000000000
0009:00000000000000100000000000000000
0010:00000000000001000000000000000000
0011:00000000000010000000000000000000
0012:00000000000100000000000000000000
0013:00000000001000000000000000000000
0014:00000000010000000000000000000000
0015:00000000100000000000000000000000
0016:00000000000000000000000100000000
0017:00000000000000000000001000000000
0018:00000000000000000000010000000000
0019:00000000000000000000100000000000
0020:00000000000000000001000000000000
0021:00000000000000000010000000000000
0022:00000000000000000100000000000000
0023:00000000000000001000000000000000
0024:00000000000000000000000000000001
0025:00000000000000000000000000000010
0026:00000000000000000000000000000100
0027:00000000000000000000000000001000
0028:00000000000000000000000000010000
0029:00000000000000000000000000100000
0030:00000000000000000000000001000000
0031:00000000000000000000000010000000
The above pattern reveals the following behavior:
Within a given byte, the weighting of the bits increases by a factor of two as you move to the left.
Within a multi-byte value, the least significant byte is stored at the lowest address.
Each successive byte is stored at the next higher address.
The most significant byte is stored at the highest address.
Another way of describing this is to say that multi-byte values are stored in a contiguous block of memory starting with the least significant byte at the base address. We certainly could have reversed this and put the most significant byte at the base address. In point of fact, both methods are used - the method above is used by the Intel microprocessors used in PCs while the other method is used by most other processors such as the Motorola processors used in the Apple Macintosh. Both methods have advantages and disadvantages relative to the other.
Because the "Little End" of the value is stored at the base address in a PC, it is called the "Little Endian" storage convention. Similarly, processors such as those made by Motorola use a "Big Endian" convention that stores the "Big End" of the number at the base address.
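Using nothing more than the tools developed in this lesson, a program can report which convention its own processor uses. The sketch below relies only on the fact that the value 1, in pure binary, has its single set bit in the least significant byte:

#include <stdio.h>

int main(void)
{
    unsigned int value = 1;
    unsigned char *ptr = (unsigned char *) &value;

    /* On a Little Endian machine the least significant byte - the one
       holding the 1 - is the byte at the base address; on a Big Endian
       machine the byte at the base address is the most significant one. */
    if (*ptr == 1)
        printf("Little Endian\n");
    else
        printf("Big Endian\n");

    return 0;
}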