ECE-1021 Lesson 5 

Bytes by Bit Banging

(Last Mod: 27 November 2010 21:38:39 )


Overview

In a prior lesson we developed a function that allowed us to print out integers in a specified number base. One of the bases we could use was base two - or binary. The question, then, is: does this function allow us to view the internal state of the bits in a data object?

The answer is a resounding...sometimes.

That function prints out the binary representation of the value stored in a variable - and there is a very large distinction between the value and how that value is represented in memory. For instance, we might have the same value, 42, stored in memory in four places. First it might appear as a 16-bit integer, then it might appear elsewhere as a 32-bit integer. Someplace else it might appear as a 32-bit single-precision floating point value and, finally, it might appear in the fourth location as a 64-bit double-precision floating point value. While the value, namely 42, stored in all four locations is the same, the pattern of 1's and 0's stored in each of those segments of memory is very different - and not just in how many bits are used to store the value.
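As a rough illustration - a minimal sketch using the standard library, with the caveat that the exact sizes are implementation-defined - the following program stores the value 42 in four different types and reports how many bytes each representation occupies. By the end of this lesson we will be able to display the differing bit patterns inside those bytes as well.

#include <stdio.h>

int main(void)
{
    short  a = 42;   /* 16-bit integer on many compilers */
    long   b = 42;   /* 32-bit integer on many compilers */
    float  c = 42;   /* 32-bit single-precision floating point on many compilers */
    double d = 42;   /* 64-bit double-precision floating point on many compilers */

    /* Same value, four different representations (and sizes) */
    printf("short : %u bytes\n", (unsigned) sizeof a);
    printf("long  : %u bytes\n", (unsigned) sizeof b);
    printf("float : %u bytes\n", (unsigned) sizeof c);
    printf("double: %u bytes\n", (unsigned) sizeof d);

    return 0;
}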

Consider a situation where we were asked to calculate how much water we would have if we added 6 gallons to a tank presently containing 103 pints. Whether we express the amount of water in the tank in gallons or pints or liters or any other unit of fluid measure does not change the value - which is the amount of water in the tank - it merely changes how that value is expressed, or represented.

In order to perform the addition we would need to express both values - the water presently in the tank and the water to be added to the tank - using the same representation. Since there are 8 pints in a gallon, we might convert the 6 gallons to 48 pints, add it to the 103 pints already present, and obtain a new value expressed in a particular way - in this case 151 pints. We can now, if necessary, convert that to a different representation, say 18.875 gallons, but in doing so we have not changed the value - the amount of water in the tank has not changed, merely its representation.

Consider the following code fragment:

int k, m;

double x, y;

 

k = 15;

x = 15;

m = k + x;

y = k + x;

Although k and x have the same value, the bits actually stored in memory are very different because they are stored using different representations.

When we add k and x, the compiler has to translate the representation of at least one of the values so that they are compatible. Once they are in the same form, the operation can be carried out and the result is in a particular representation - or data type. In this case the result of the addition operation would be a value represented as a double. When we then store that value in the variable m, it is converted to the representation of that value as an int. When we store it in the variable y, it can be stored directly using the same representation it is already in. But, in both cases, the values stored in the two variables are the same (because both variables use a representation capable of representing the value 30).
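To make those conversions visible, the assignments above are roughly equivalent to writing the casts out explicitly - a sketch of what the compiler does on our behalf, not code we would normally write:

#include <stdio.h>

int main(void)
{
    int k, m;
    double x, y;

    k = 15;
    x = 15;

    m = (int)((double) k + x);   /* what m = k + x; does: k converted to double, sum converted back to int */
    y = (double) k + x;          /* what y = k + x; does: k converted to double, sum stored as a double    */

    printf("m = %d\n", m);       /* prints m = 30 */
    printf("y = %f\n", y);       /* prints y = 30.000000 */

    return 0;
}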

Most of the time, we leave it up to the compiler and the processor to deal with issues associated with representation. We don't want to deal with those issues - that's why we spent good money on a compiler. For instance, the technique that we used to print out a value in a particular number base worked with the value - we didn't care how that value was actually represented in memory. But, as a result, we would get the same results - the same pattern of zeros and ones on the screen - regardless of what the bit pattern actually was in memory.

The C Language Standard recognizes that, in almost all cases, we are concerned with working with values, and not their representations, and therefore it places relatively few restrictions on how values must be represented internally. Those issues are left up to the discretion of the team developing the compiler. If they can use a different representation and get better performance, then they have the option to do that. With almost all data types and operators, the compiler designer is expected to perform whatever translations and manipulations are necessary to implement the expected behavior on the values that are being used. If we try to rely on how those translations and manipulations affect the internal representations, we do so at our own peril.

But all is not lost. The C Language Standard recognizes that programmers do, at times, need or want to deal with the internal representations directly. To make this possible, a limited set of data types and operators are required to have well-defined effects on the internal representations stored in memory. Because the need to use these capabilities is rather rare, and because supporting these behaviors is an added burden on the compiler designer which could potentially limit their ability to generate well-optimized executable code, the elements that are defined at this level are kept to a minimum. We must be careful not to stray outside of those elements and assume that what works for them works for other data types and operators; otherwise we quickly find ourselves taking a walk through the world of undefined behavior.


The C Bit Banging Bag

Many C programmers have a tendency to make wrong assumptions about how data is represented internally or how we can interact with that data. Some of these assumptions are quite reasonable - but that doesn't change the fact that they are wrong. In many cases the text and/or course that the programmer learned from simply neglected to discuss these details. This is understandable since C is a language that can be worked with at many different levels of detail and any treatment must decide what levels to focus on and what levels to bypass.

It is recommended that you examine the more common fallacies that people have about C to make sure that assumptions you have made do not get in the way of understanding how to work with the internal representations.

So, after dispelling all of these fallacies regarding tricks we can use to get at the internal data representation, what's left? What's left is a small set of highly reliable tools. The very fact that the set only contains a few elements works in our favor because we only have to learn how to use those elements.

The unsigned char Data Type

The only - let's rephrase that, the ONLY - data type whose behavior is completely defined for all bit positions is the unsigned char. The characteristics we can count on with this data type are:

  1. It occupies exactly CHAR_BIT bits, all of which participate in the value - there are no padding bits.
  2. Values are stored in pure binary - the least significant bit has a weight of one and each successive bit doubles that weight.
  3. Every possible bit pattern is a valid value, ranging from 0 through UCHAR_MAX.

All other data types have at least one characteristic that is implementation-defined.

The bitwise logic operations

The bitwise operations are carried out independently for each bit position within an integer with no interaction with any other bits in the value. The operations available are the bitwise-AND (&), the bitwise-OR (|), the bitwise-XOR (^), and the bitwise-complement (~).

The bit shift operations

The left shift (<<) and right shift (>>) operators move the bits in a value by a specified number of positions. For an unsigned char, bits shifted off either end are discarded and the vacated positions are filled with zeros.

The indirection operations

The address operator (&) gives us the address at which an object is stored, and the dereference operator (*) gives us access to the object stored at a given address. We will put both to heavy use shortly.

In addition, we can always use any of the other operators - particularly the arithmetic, relational, and logical operators - knowing that the values stored in an unsigned char are represented using pure binary, with the least significant bit having a weight of one and each successive bit doubling that weight. Any other data type has at least one characteristic that may be implementation-defined - for instance, how larger data types order their bytes in memory or how negative values are represented.
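As a quick demonstration - a minimal sketch, with the printed results assuming CHAR_BIT is 8 - here is how these operators behave on a pair of unsigned char values:

#include <stdio.h>

int main(void)
{
    unsigned char a = 0xC5;   /* 1100 0101 */
    unsigned char b = 0x2F;   /* 0010 1111 */

    printf("a & b  = 0x%02X\n", (unsigned)(a & b));                   /* 0x05 */
    printf("a | b  = 0x%02X\n", (unsigned)(a | b));                   /* 0xEF */
    printf("a ^ b  = 0x%02X\n", (unsigned)(a ^ b));                   /* 0xEA */
    printf("~a     = 0x%02X\n", (unsigned)(unsigned char)~a);         /* 0x3A */
    printf("a << 1 = 0x%02X\n", (unsigned)(unsigned char)(a << 1));   /* 0x8A */
    printf("a >> 1 = 0x%02X\n", (unsigned)(a >> 1));                  /* 0x62 */

    return 0;
}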

With those fallacies dispelled and our small toolkit in hand, let's put it to work.


Printing out the Bits in an unsigned char

Let's say that we have a value stored in an unsigned char and we want to print out the bit pattern that is stored there. Since we know it is pure binary, we could use the functions we have already developed to print out the value stored there in binary. This would work. But let's see how to do it using the tools in our Bit Banging Bag.

Top Level Pseudocode:

  1. Start at the Most Significant Bit.
  2. Print a '0' if that bit is cleared and a '1' if that bit is set.
  3. Move to the next bit position to the right.
  4. Repeat Steps 2 and 3 until all bits have been printed.

Putting the tools we already know about to work, the code for this task looks like the following:

unsigned char byte, mask;

 

/* Create bit mask for MSB - CHAR_BIT, from <limits.h>, is the number of bits in a char */

mask = 1 << (CHAR_BIT - 1);

 

/* Print out bits one-by-one starting with MSB */

while (mask)

{

    PutC((byte & mask)? '1':'0');

    mask >>= 1;

}

If we look at this code, we see that it actually lends itself to a very compact representation using a for() loop:

unsigned char byte, mask;

 

for( mask = 1 << (CHAR_BIT - 1); mask; mask >>= 1)

    PutC((byte & mask)? '1':'0');
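Wrapped up as a complete program - a sketch in which the standard library's putchar() stands in for the course's PutC() and the test value is arbitrary - this looks like:

#include <stdio.h>
#include <limits.h>

int main(void)
{
    unsigned char byte = 0xA7;   /* arbitrary test value: 1010 0111 */
    unsigned char mask;

    for (mask = 1 << (CHAR_BIT - 1); mask; mask >>= 1)
        putchar((byte & mask)? '1':'0');
    putchar('\n');

    return 0;
}

On a machine where CHAR_BIT is 8, this prints 10100111.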


Printing Out the Bits in Bytes Other Than unsigned char

The code above works fine if the data is of type unsigned char - but what if it is of type signed char?

Before getting into how we can do it, let's be sure we understand the two tempting ways that we cannot use.

We cannot do something like the following:

signed char data;

unsigned char byte, mask;

 

byte = data;

for( mask = 1 << (CHAR_BIT - 1); mask; mask >>= 1)

    PutC((byte & mask)? '1':'0');

The assignment operator works with values and makes whatever conversions in representation are necessary to maintain the same value (to the degree possible) under the new representation. The above code will perform an implicit conversion from a signed char to an unsigned char and so the data in byte may or may not have the same bit pattern as the data in data.

Another way that will not work:

signed char data;

unsigned char mask;

 

for( mask = 1 << (CHAR_BIT - 1); mask; mask >>= 1)

    PutC(( (unsigned char) data  & mask)? '1':'0');

This results in an explicit conversion and, unlike the previous case where the compiler will probably issue a warning regarding possible loss of data, explicit conversions like this generally suppress such warnings. Other than that, the two code fragments will produce the same, potentially wrong, output.

So how do we do it right?

The key is to bypass any conversions on the data, implicit or explicit, that the compiler might apply. We do this by using the address at which the value is stored instead of the identifier name. An identifier name has a defined data type associated with it that the compiler knows about and will use whenever we access the value stored there. But an address is just a number - namely the number of the memory location where the value is stored.

We store addresses in variables of type pointer. A pointer variable is identified by preceding it with an asterisk when it is defined. Pointer variables also have a data type associated with them so that, when we access the data at the address stored in the pointer, the compiler knows what type of data it is and can do all of the things that it would do if we used a normal identifier. Therefore, to print out the internal bit pattern stored in the signed char data, we do the following:

signed char data;

unsigned char  mask;

unsigned char *ptr;

 

ptr = (unsigned char *) &data;

for( mask = 1 << (CHAR_BIT - 1); mask; mask >>= 1)

    PutC((*ptr & mask)?'1':'0');

Notice this is very close to our first incorrect method. The differences are as follows:

We declared a pointer to an unsigned char called ptr. We could have kept the name byte, but it is traditional to use 'p' or 'ptr' somewhere in the names of pointers just to help us humans keep track of what's what. Since we only have one pointer in this code fragment, there's no need to go beyond that.

The values in pointer variables are addresses where other things are located. The ampersand used in this context is the "address operator" (not to be confused with the bitwise-AND operator) and computes the address of where the variable data is located. The address of data is therefore the value that gets stored in the variable ptr.

But, if performing explicit conversions is to be avoided, why do we have the typecast operation on this line? Because &data is of type pointer to signed char and we are storing it in a variable of type pointer to unsigned char. The compiler will probably complain about a "suspicious pointer conversion" and the typecast tells it that we really do want to make this assignment. The key is that the value whose representation is getting changed, if necessary, is the representation of an address - not the representation of the data stored at that address.

The final change is that we use *ptr in the PutC() call. The asterisk used in this context is the "dereference operator" (not to be confused with multiplication). You can read this operator as something like, "the value stored at the address contained in ...". So you would read *ptr as "the value stored at the address contained in ptr".

Since the compiler has to know how values are represented in order to work with them, it has to know how to interpret the value stored at the address contained in ptr. It does this by looking up the data type associated with the variable ptr which, in this case, is an unsigned char. At the time that this line of code is executed, the compiler has no idea how the value stored in ptr got there. It doesn't need to know nor does it care. Likewise, it doesn't know how the bit pattern it finds at the address contained in ptr got there. They are merely bits that it is now being told to interpret as an unsigned char.
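Here is the technique as a complete, standalone program - again a sketch, with putchar() standing in for PutC() and an arbitrary test value:

#include <stdio.h>
#include <limits.h>

int main(void)
{
    signed char    data = -42;   /* arbitrary test value */
    unsigned char  mask;
    unsigned char *ptr;

    ptr = (unsigned char *) &data;   /* reinterpret the same memory, no value conversion */

    for (mask = 1 << (CHAR_BIT - 1); mask; mask >>= 1)
        putchar((*ptr & mask)? '1':'0');
    putchar('\n');

    return 0;
}

On a two's complement machine with CHAR_BIT equal to 8, this prints 11010110 - the actual bit pattern used to store -42.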

To turn the original code fragment into a general-purpose function requires another little trick. The function is as follows:

int PutByte(void *address)

{

    unsigned char  mask;

    unsigned char *ptr;

    int chars;

 

    ptr = (unsigned char *) address;

    for (chars = 0, mask = 1 << (CHAR_BIT - 1); mask; mask >>= 1)

    {

        PutC((*ptr & mask)? '1':'0');

        chars++;

    }

    return chars;

}

As with previous output functions that we have developed, the addition of the variable chars lets it keep track of how many characters are actually printed so that we can return that to the calling function. With this feature, our program can keep track of things like how long a line of text has become so as to decide when to issue a new line character. For our purposes here it is completely extraneous.

We might have been tempted to declare the parameter in the function header as a pointer to unsigned char, and this would have worked as expected. The problem is that we would have gotten "suspicious pointer conversion" warnings every time we passed the address of a variable to the function unless we did an explicit conversion to type pointer to unsigned char. But we can assign the value stored in any pointer, regardless of type, to a void pointer without generating a warning because a void pointer has no type information associated with it. It is, in essence, a pure pointer in that it only contains an address value.

The consequence is that we can't use a void pointer for very many things. We can't access the value stored at the address pointed to because the compiler has no information about how to interpret the bits it will find there. We can't perform pointer arithmetic using a void pointer because the compiler doesn't know how big each element in that block of memory is.

By using a void pointer we have taken on the responsibility of keeping track of that information separately and on our own and, if the compiler needs it, we will give it that information by some other means. In this function we do that by copying the value of the void pointer to the variable ptr, which has type information bound to it. We make the conversion explicit with a type cast (strictly speaking, C will convert a void pointer to any other object pointer type without a cast, but writing the cast out documents our intent). In essence, we are using ptr as a "middleman" in this process and we could get by without it. If we wanted to use address directly in our PutC() function call, it would have looked like:

        PutC( (*((unsigned char *)address) & mask)? '1':'0');

First we cast the value stored in address from a void pointer to a pointer to unsigned char. We then dereference this pointer to get at the value stored there. Finally we perform a bitwise-AND between that value and the value in mask as before. This is a perfectly legal statement that potentially saves us both a bit of memory and a bit of processing time (the compiler might be clever enough to optimize out the extra overhead in the way we originally did it). But the logic in the statement has become rather convoluted and it is generally better to break the logic up into smaller, more easily understood steps.
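A brief usage sketch follows. So that it can stand alone, PutByte() is repeated from above and PutC() is assumed to be a simple wrapper around the standard library's putchar() - the course's actual PutC() may differ. The point is that, because the parameter is a void pointer, we can pass the address of any object without a cast and without warnings:

#include <stdio.h>
#include <limits.h>

int PutC(int c) { return putchar(c); }   /* assumed stand-in for the course's PutC() */

int PutByte(void *address)
{
    unsigned char  mask;
    unsigned char *ptr;
    int chars;

    ptr = (unsigned char *) address;
    for (chars = 0, mask = 1 << (CHAR_BIT - 1); mask; mask >>= 1)
    {
        PutC((*ptr & mask)? '1':'0');
        chars++;
    }
    return chars;
}

int main(void)
{
    int    i = 1021;
    double x = -0.5;

    PutByte(&i); PutC('\n');   /* first byte of an int - no cast, no warning */
    PutByte(&x); PutC('\n');   /* first byte of a double - no cast, no warning */

    return 0;
}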


Printing Out the Bits in Multi-Byte Representations

The only thing left to accomplish is being able to print out the bit pattern for a value stored in a data type that uses more than one byte. This process is quite straightforward. We start at the first byte, print out its bit pattern, then move on to the next byte. We need to know two things - where the individual bytes are stored and how many bytes total we have to work with. 

The first issue is taken care of by the C language specification: a multi-byte object occupies a contiguous block of sizeof(object) bytes, and the address of the object is the address of the first byte in that block - each remaining byte is found at the next higher address.

So if the address of an eight-byte value is memory location 300, we know that the eight bytes are stored at memory locations 300 through 307.
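We can see this for ourselves with a short sketch that uses the standard library's printf() to display the address of each byte - the actual numbers printed will vary from machine to machine and run to run, but they will always be consecutive:

#include <stdio.h>

int main(void)
{
    double x = 3.14;
    unsigned char *p = (unsigned char *) &x;
    unsigned int i;

    /* The bytes of x occupy sizeof x consecutive addresses starting at &x */
    for (i = 0; i < sizeof x; i++)
        printf("byte %u lives at address %p\n", i, (void *)(p + i));

    return 0;
}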

The second issue is taken care of by the sizeof operator. This operator returns the number of bytes required to store the type of object indicated by its operand. We may either use an actual data type, which must be enclosed in parentheses, such as:

sizeof(double) /* Bytes required to store a value of type double */

sizeof(int *) /* Bytes required to store a pointer to a value of type int */

Or we may use the name of any object whose type the compiler knows:

int k;

double x;

double *ptr;

void *vptr;

 

x = 3.14;

ptr = &x;

vptr = ptr;

 

k = sizeof x;     /* Bytes required to store a value of type double */

k = sizeof ptr;   /* Bytes required by a pointer to a double */

k = sizeof vptr;  /* Bytes required by a void pointer */ 

k = sizeof *ptr;  /* Bytes required by the value pointed to by a pointer to a double */

k = sizeof *vptr; /* ERROR - Bytes required by the value pointed to by a void pointer */

 

All of these are completely well defined except the last one. Remember that a void pointer is nothing more than a pointer where we have chosen not to tell the compiler what type of data it points to. The compiler therefore knows how much memory is required to store the pointer itself, but it has no way of knowing how much memory is required to store the value that it points to.

So, using a variable of type double as an example, the code that prints out the bit representation is:

double data;

unsigned char  mask;

unsigned char *ptr;

int i;

 

ptr = (unsigned char *) &data;

i = sizeof data;

 

while (i--)

{

    PutByte(ptr);

    ptr++;

}

The sizeof operator computes the number of bytes required to store the object referenced by its operand (which is located to the right of the operator). This is an operator and not a function, so when the operand is an expression - such as data above - it does not need to be surrounded by parentheses (since there is no argument list); only a type name, such as (double), must be parenthesized. It is perfectly legal to parenthesize an expression operand and many programmers do. Other programmers avoid doing so because they don't want anything looking like a function call unless it is a function call. With an expression operand, it is purely a style issue.
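For example - a small sketch; the particular sizes printed depend on the compiler:

#include <stdio.h>

int main(void)
{
    double data = 0.0;

    printf("%u\n", (unsigned) sizeof data);     /* expression operand - no parentheses needed */
    printf("%u\n", (unsigned) sizeof (data));   /* parentheses allowed - purely a style choice */
    printf("%u\n", (unsigned) sizeof(double));  /* type name operand - parentheses required */

    return 0;
}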

After using PutByte() to put out the bit pattern of the byte stored at the location pointed to by ptr, we increment the value of ptr. We continue doing this as long as there are bytes that haven't been printed.

We can convert this to a function as well:

int PutBytes(void *addr, int bytes)

{

    unsigned char *ptr;

    int i, count;

 

    for (i=count=0, ptr = (unsigned char *) addr; i < bytes; i++, ptr++)

        count += PutByte(ptr);

 

    return count;

}


Interpreting the Results

Consider the following main() function (available in bitrep1.c):

int main(void)
{
    unsigned int i;

    for(i = 0; i < 10; i++)
    {
        Putf_u(i, 3, 3); PutC(':');
        PutBytes(&i, sizeof i);
        PutC('\n');
    }
    return EXIT_PASS;
}

Compiling and running this on the Borland TurboC/C++ v4.5 compiler yields the following output:

0000:0000000000000000
0001:0000000100000000
0002:0000001000000000
0003:0000001100000000
0004:0000010000000000
0005:0000010100000000
0006:0000011000000000
0007:0000011100000000
0008:0000100000000000
0009:0000100100000000

Does this make sense? The expression sizeof(int) evaluates to 2 on this compiler and the value of CHAR_BIT is 8, so a value of type int requires 16 bits, which is what we have. But, in pure binary, the value 9, for instance, is represented by the bit pattern 1001. If we look at the last line of the output, we can see that pattern, but not where we probably expected it (which was at the far right of the output). To gain more insight into how this is represented internally, let's modify our code slightly to the following (bitrep2.c):

int main(void)
{
    int i;
    unsigned long k;

    for(i = 0, k = 1; i < (sizeof k)*CHAR_BIT; k*=2, i++)
    {
        Putf_u(i, 3, 3); PutC(':');
        PutBytes(&k, sizeof k);
        PutC('\n');
    }
    return EXIT_PASS;
}

On this compiler a value of type unsigned long is four bytes (hence 32 bits), so we run through the loop 32 times. We start out with a value of 1 stored in this variable and then on each pass we multiply the value by 2. This should shift the bit that is set one place to the left each time. The result we get, however, is this:

0000:00000001000000000000000000000000
0001:00000010000000000000000000000000
0002:00000100000000000000000000000000
0003:00001000000000000000000000000000
0004:00010000000000000000000000000000
0005:00100000000000000000000000000000
0006:01000000000000000000000000000000
0007:10000000000000000000000000000000
0008:00000000000000010000000000000000
0009:00000000000000100000000000000000
0010:00000000000001000000000000000000
0011:00000000000010000000000000000000
0012:00000000000100000000000000000000
0013:00000000001000000000000000000000
0014:00000000010000000000000000000000
0015:00000000100000000000000000000000
0016:00000000000000000000000100000000
0017:00000000000000000000001000000000
0018:00000000000000000000010000000000
0019:00000000000000000000100000000000
0020:00000000000000000001000000000000
0021:00000000000000000010000000000000
0022:00000000000000000100000000000000
0023:00000000000000001000000000000000
0024:00000000000000000000000000000001
0025:00000000000000000000000000000010
0026:00000000000000000000000000000100
0027:00000000000000000000000000001000
0028:00000000000000000000000000010000
0029:00000000000000000000000000100000
0030:00000000000000000000000001000000
0031:00000000000000000000000010000000

The above pattern reveals the following behavior: within each byte, the bits behave just as we would expect - as the value doubles, the set bit marches to the left. But once the set bit reaches the top of one byte, it next appears at the bottom of the byte stored at the next higher address. In other words, the least significant byte of the value is stored first, at the lowest address, and each successive byte holds an increasingly significant portion of the value.

Another way of describing this is to say that multi-byte values are stored in a contiguous block of memory starting with the least significant byte at the base address. We certainly could have reversed this and put the most significant byte at the base address. In point of fact, both methods are used - the method above is used by the Intel processors found in PCs, while the reverse method is used by many other processors, such as the Motorola processors used in older Apple Macintosh computers. Both methods have advantages and disadvantages relative to the other.

 

Because the "Little End" of the value is stored at the base address in a PC, it is called the "Little Endian" storage convention. Similarly, processors such as those made by Motorola use a "Big Endian" convention that stores the "Big End" of the number at the base address.
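As a closing sketch, the same pointer trick lets a program discover at run time which convention its own machine follows - a minimal standalone example using the standard library:

#include <stdio.h>

int main(void)
{
    unsigned int   value = 1;
    unsigned char *ptr   = (unsigned char *) &value;

    /* If the byte at the base address holds the 1, the "little end" is stored first */
    if (*ptr == 1)
        printf("This machine is Little Endian\n");
    else
        printf("This machine is Big Endian\n");

    return 0;
}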