(Last Mod: 27 November 2010 21:38:42 )
In the overwhelming majority of computers, values are stored in base-2, also known as binary. The reason for this is simple - the circuits necessary to store and access one of two possible values from each memory cell are simple, small, and easy to build upon. As a result, the natural number base to represent integers and other values in a computer is base-2.
But this is not a convenient number base for humans to work with: even reasonably small values require many digits to represent - ten digits just for the number one thousand - and with only two different symbols, humans are very prone to make mistakes even in simply transcribing a number from one place to another.
It would be possible for humans to always work in the familiar base-10 and convert numbers back and forth between binary and decimal representations when needed and, in fact, we do this a great deal. But there are many times when we need to work at a level very close to the actual representation and the sheer number of conversions would become daunting and error prone. What is needed is a compromise - a number base that allows rapid conversions between it and binary but that has a sufficient number of symbols to permit humans to develop a reasonable utility in manipulating it - if for no other reason than to minimize transcription errors.
For a long time, the accepted number base was base-8, also known as "octal", where each octal digit maps directly to exactly three binary digits. But over time the standard length of a memory location settled on the "8-bit byte" and power-of-two multiples of it such as 16-, 32-, and 64-bit groups. Unless the number base chosen is 2^N, where N is an integral factor or multiple of 8, problems arise when working with such groupings. Thus we really need to use either a base-4, base-16, or base-256 number system since these represent 2-, 4-, and 8-bit groupings respectively. Base-4 is too small and has many of the same problems as base-2. Base-256 is far too large and just dealing with that many symbols would be a challenging task. Fortunately the remaining alternative, base-16, is close enough to base-10 that humans can work with it quite readily.
Base-16, also known as "hexadecimal" or simply "hex", uses sixteen symbols to represent the digit values 0 through 15. The first ten of these are the same as the Base-10 symbols, for obvious reasons. The remaining six are simply the first six characters of the alphabet, A through F. To denote that a number is written in a particular base, the number base is ideally written as a subscript following the number. However, this is not something that can be done in a typical text file and so an alternate means is necessary. Two common ways have evolved over time for denoting hexadecimal numbers. The first is to place an 'H' following the number, such as 93H. The second is to preface the number with the characters '0x' such as 0x93. We will generally use the latter since this is the method used by the C programming language.
Hex | Dec | Bin | # x 16 | # / 16 | 1's comp | Hex | Dec | Bin | # x 16 | # / 16 | 1's comp |
0 | 0 | 0000 | 0 | 0.0000 | F | 8 | 8 | 1000 | 128 | 0.5000 | 7 |
1 | 1 | 0001 | 16 | 0.0625 | E | 9 | 9 | 1001 | 144 | 0.5625 | 6 |
2 | 2 | 0010 | 32 | 0.1250 | D | A | 10 | 1010 | 160 | 0.6250 | 5 |
3 | 3 | 0011 | 48 | 0.1875 | C | B | 11 | 1011 | 176 | 0.6875 | 4 |
4 | 4 | 0100 | 64 | 0.2500 | B | C | 12 | 1100 | 192 | 0.7500 | 3 |
5 | 5 | 0101 | 80 | 0.3125 | A | D | 13 | 1101 | 208 | 0.8125 | 2 |
6 | 6 | 0110 | 96 | 0.3750 | 9 | E | 14 | 1110 | 224 | 0.8750 | 1 |
7 | 7 | 0111 | 112 | 0.4375 | 8 | F | 15 | 1111 | 240 | 0.9375 | 0 |
The table above gives the decimal and binary representations of the sixteen hexadecimal digits as well as the result when the digit is multiplied by sixteen and when it is divided by sixteen. The last column is the "one's complement" of the hexadecimal digit. The products and quotients are useful when converting between decimal and hexadecimal. The one's complement, which is simply the result of inverting each of the digits in the binary representation, is useful for taking the two's complement (i.e., the negative) of a hexadecimal number, as will be shown shortly.
Different memories have different ways of physically storing a '0' or a '1': a hard disk stores it as one of two orientations of a magnetic region on the disk; a CD stores it as one of two reflective states at a region of the disk; main memory generally stores it as either a charged or a discharged capacitor; while faster memories typically store it as a high voltage or a low voltage on a specific node. What is important is that each bit can take on one of two physical states and these states are somewhat arbitrarily assigned the logical designations '0' and '1', or 'LO' and 'HI', or 'SET' and 'CLR', or 'FALSE' and 'TRUE'.
On some devices a low voltage might represent a logical 'FALSE' while in another device on the same computer a low voltage might represent a logical 'TRUE'. Making sure that all of these representations interact properly is the responsibility of the hardware engineers and the operating system designers. At the level you will be working with (in this course), it is sufficient to understand that, at the level of a single bit, a '1', a 'HI', a 'SET', and a 'TRUE' all have the same representation as far as your programs are concerned.
As humans, we are very conversant with variable width numbers. If we need five digits to represent a number, we use five digits. If we multiply two numbers together and the result is nine digits, we use nine digits. But computers don't work that way. They have a fixed number of bits (which is short for "binary digits") in a given representation, they use all of those bits to represent any number that can be represented by that representation, and they are incapable of representing values that require more bits than that. This is a major constraint that has major implications for computer programs. As it turns out, it also opens the door for a few tricks that allow us to turn these restrictions to our advantage as we shall see when we develop a means for representing negative integers.
Because the width of a given representation is so important, a whole class of terms has arisen to describe them. The most fundamental, the "bit", represents a single binary digit. As pointed out previously, the word "bit" itself is a shortened form of the phrase "binary digit". Because computer programmers are notorious for their (some would say twisted) sense of humor, a bunch of "bits" became known collectively as a "byte". Today, the overwhelming majority of processors in production are grounded in the 8-bit byte, but it is good to understand that "byte" does not necessarily mean eight bits; it is simply the "atomic" data width for a particular processor, meaning the smallest width data packet that can be written to or read from memory. There are processors with 4-bit bytes, 6-bit bytes, 9-bit bytes, and other widths as well.
A "nibble" (again reflecting some programmer's sense of humor) seems to be pretty universally agreed upon as being four bits, regardless of the size of the byte on a particular machine. This is probably because it requires four bits to represent all of the decimal digit values. This is also why hexadecimal is so prevalent in computer programming, even though it is more difficult to work with than octal, since a 4-bit nibble exactly corresponds to a single hexadecimal digit.
Larger width representations are almost always integer multiples of bytes for a given machine. A "word" is generally taken to mean two bytes or, for most systems, sixteen bits, and a "double word" or "dword" is generally taken to mean two words, or four bytes, or thirty-two bits for most systems. A "quad word" or "qword" is four words or eight bytes, which is sixty-four bits on most systems. As with the term "byte", the meanings of these terms are not universally agreed upon. For instance, when working with the MIPS processor a "word" is 32 bits wide and a "half-word" is the term used for 16 bits.
In this course, unless specifically stated otherwise, the following widths will be associated with each of these terms:
size | bits | bytes |
bit | 1 | 1/8 |
nibble | 4 | 1/2 |
byte | 8 | 1 |
word | 16 | 2 |
dword | 32 | 4 |
qword | 64 | 8 |
If a byte is the basic unit of data storage on a machine, then if we have a representation that consists of more than one byte, in what order are those bytes stored in memory? This is far from being a trivial question. Consider, for example, the following snapshot of a portion of memory:
ADDRESS | C32E | C32F | C330 | C331 | C332 | C333 | C334 | C335 | C336 | C337 |
CONTENTS | F3 | FF | DA | 0F | 49 | 40 | 2E | 00 | 21 | 13 |
Note: All values in hex
Each memory location is capable of storing a single byte of data, so if our value (we'll call it an 'object') is represented using more than one byte, there has to be agreement on how those bytes are stored and how we refer to them. When we indicate where an object is stored, we seldom specify the entire range of memory locations that are used. Instead, we give the address of a single byte and it is understood that the object we are referring to is stored in consecutive memory locations and that the address we referred to somehow tells us specifically which bytes they are.
For instance, let's assume that we have stored a particular value using a two-byte signed integer representation and stored it at memory location 0xC336. What does this mean? Is the address the location of the most significant byte, or the least significant byte? Is it the smallest address in the block of memory, or the largest? Is the value stored with the most significant byte at the largest address, or at the smallest? Clearly there are several possible conventions and unless everyone is in agreement on which one is actually used, we are in serious trouble.
So how many conventions are we really dealing with? It turns out that there are four, based on the possible combinations of whether the specified address is for the MSB or the LSB and whether the byte located at the highest address is the MSB or the LSB. For our two-byte signed integer stored at 0xC336 this leaves us with:
Convention | given address is | highest address holds | C335 | C336 | C337 | hex | dec |
1 | MSB | MSB | LSB | MSB | xxx | 0x2100 | 8448 |
2 | MSB | LSB | xxx | MSB | LSB | 0x2113 | 8467 |
3 | LSB | MSB | xxx | LSB | MSB | 0x1321 | 4897 |
4 | LSB | LSB | MSB | LSB | xxx | 0x0021 | 33 |
You can see that which convention is used has an enormous impact on what value is retrieved from memory.
Of the four possible conventions, only two are found in common use. By nearly universal convention, the "address" of an object is the lowest numbered address in the block of memory belonging to that object. This is generally referred to as the "base address" of the object and all other addresses in the representation are at higher locations. This characteristic eliminates Conventions #1 and #4 from the above table.
Of the remaining two conventions, we can distinguish them by noting that in Convention #2 the "end" of the value (the LSB) is at the "big" address. Conversely, in Convention #3 the "end" of the value (the LSB) is at the "little" address. These observations lead to the descriptive names that are generally used for the two conventions - Convention #2 is referred to as "Big Endian" and Convention #3 is referred to as "Little Endian".
Another way of interpreting the names Big Endian and Little Endian is that in a Big Endian representation, the "big end" of the value (the MSB) is at the base address. Conversely, in the Little Endian format the "little end" of the value (the LSB) is at the base address.
Storage Conventions for an n-byte value
Convention | Base Address + 0 | Base Address + (n-1) |
Big Endian | MSB | LSB |
Little Endian | LSB | MSB |
Each convention has its share of supporters and critics. Because Motorola (and therefore the Mac) uses Big Endian while Intel (and therefore the PC) uses Little Endian, the arguments of both sides sometimes appear to be pronouncements of religious conviction rather than statements of technical merit. As you should already suspect, the reason that no single convention has gained universal acceptance is because each convention has advantages and disadvantages compared to the other. For instance, in a Big Endian machine, magnitude comparisons can be performed starting with the base address; whereas in a Little Endian machine, most arithmetic operations can be performed starting with the base address.
There are a couple of very common mistakes that people make when they are dealing with the issue of byte ordering. First, they want to apply it to any and every byte in sight and second, they want to apply it to the components that make up a byte.
The most common example of the first mistake is a desire to reverse the order of the bytes that make up a string of characters. But a string of characters is a sequence of individual values each of which represents a single character. Hence the first character is stored at the base address, the second character immediately following the first, and so on. If each character happens to be represented by more than one byte, then the bytes that make up each character are ordered according to that machine's convention.
The most common example of the second mistake is a desire to reverse the order of the nibbles or even the individual bits within a byte. Since a byte is the smallest packet of data we can work with, how it is stored in memory is irrelevant - we work with it as a byte. If we store a byte and then later read that byte, it will have the same value that we stored. Since we store and read the byte as a unit, the whole concept of bit ordering within a byte is largely devoid of meaning. What tricks most people is that they see a value represented as a string of hexadecimal digits and, because they are memorizing things instead of understanding them, they want to reverse the digits instead of the bytes. You must identify where the byte boundaries are before dealing with the issue of byte ordering.
Just remember that byte ordering applies to the individual bytes that make up the multi-byte representation of a single value.
Example: The four-byte pure binary representation for 847389 is stored at memory location 0xC32E, the string "Wow" is stored at memory location 0xC332, and the two byte pure binary representation for 4242 is stored at memory location 0xC336. Show which values are stored in each memory address under both byte-ordering conventions:
Identifying the bytes in each case, we have:
Notice that the string does not have an MSB or LSB identified. Each byte is a separate value.
BIG ENDIAN
ADDRESS | C32E | C32F | C330 | C331 | C332 | C333 | C334 | C335 | C336 | C337 |
CONTENTS | 00 | 0C | EE | 1D | 57 | 6F | 77 | 00 | 10 | 92 |
LITTLE ENDIAN
ADDRESS | C32E | C32F | C330 | C331 | C332 | C333 | C334 | C335 | C336 | C337 |
CONTENTS | 1D | EE | 0C | 00 | 57 | 6F | 77 | 00 | 92 | 10 |
Limiting the discussion to nonnegative (i.e., unsigned) integers for the moment, the most common way of representing an integer using the pattern of zeroes and ones stored in a computer's memory is weighted binary, which is simply the positional numbering system we have been referring to since the beginning, using base-2. This representation is commonly referred to as either "straight binary" or "pure binary".
Recall that numbers in memory have fixed widths, so if we are working with a 1-byte integer and are representing the number sixteen, it is not completely correct to say that the representation stored in memory is 10000. The actual representation is 00010000.
In many cases, ignoring the leading zeroes causes no problem, but in other cases it leads to incorrect results. It is best to include the representation of all the bits by default and only take shortcuts as a conscious decision based on an assessment of the risk of making a mistake as a consequence. As you work with the various ways values stored in memory get manipulated, you will gain the necessary insight in order to make informed assessments of this risk. Until then, play it safe and be explicit in your representations.
Thinking back to our decimal system, if asked how many values can be represented by a six-digit decimal number, the answer is one million. Note that we cannot represent the number one million itself, as this requires seven digits. But we can represent a total of one million values because one of those values is zero, and we can represent every value from zero up to but not including one million.
Expressing this in terms of the number of digits N and the base B, we can represent B^N values that range from 0 to (B^N - 1). For our basic widths, this means the largest value (in decimal) we can represent is:
size | bits | largest representable value |
bit | 1 | 1 |
nibble | 4 | 15 |
byte | 8 | 255 |
word | 16 | 65,535 |
dword | 32 | 4,294,967,295 |
qword | 64 | 18,446,744,073,709,551,615 |
It should be noted that, while many of the newer processors are internally capable of working with 64-bit words, few programming languages yet support this. This increased ability is primarily used to optimize operations and increase data transfer speeds.
For the most part, little is lost as a result because integers greater than 32-bits are rarely needed. This is not to say that they are never needed and programming languages and extensions to languages do exist to work with arbitrarily large integers - even on machines that don't support data that wide directly. The computations are merely done in steps by software once the intrinsic limits of the processor are reached.
To show how large a 64-bit value is, consider a computer that did nothing but count, starting from zero and incrementing its count at 1 GHz (1 billion times per second): it would take more than 584 years to reach the representation limit of a 64-bit value. By comparison, the 32-bit limit would be reached in less than five seconds!
If the largest value that can be represented by a 16-bit unsigned integer is 65,535 then what happens when 536 is added to 65,000? The result, 65,536, would require 17 bits to represent and therefore cannot be represented. This is known as an "overflow" condition. Likewise, subtracting a larger number from a smaller number results in an overflow because negative quantities can't be represented (and this is properly called "overflow" and not "underflow" - the term "underflow" is a different phenomenon discussed elsewhere).
Even though the ideal result can't be represented, there is still a result and that result still represents some value. Those 16 bits still exist and each one of them still has either a '0' or a '1' stored in it. The resulting 16-bit unsigned integer still has a value. So what value is it?
Before getting into any more detail in this topic, it is important to note that we are venturing into what is known as "implementation dependent behavior" - meaning that when something like overflow occurs it is entirely possible to get different results on different computers or when using programs written in different languages. Some implementations might crash, some implementations might generate a warning flag or a warning message, and others might blindly proceed as best they can.
In the C programming language, the overflow behavior of unsigned integer representations is well-defined by the C Language Standard. In short, if an arithmetic operation produces a value outside the representable range of 0 to 2^N - 1, then 2^N is repeatedly added or subtracted, as appropriate, until the result is a value that can be represented. That value is the result of the operation. In practice, this means that the lowest N bits of the ideal result are kept. Hence, 536 added to 65,000 is zero, because the only bit that would have been a '1' was the 17th bit.
The same is true for subtraction. If 2 is subtracted from 1, the result would be 65,535 because the normal rules for subtraction would result in borrows flowing out past the 16th bit where they become lost leaving all of the remaining sixteen bits set to a '1'.
Therefore, the integer data formats generally exhibit what is known as "wrap-around" when they overflow. Although this behavior can sometimes be useful, it is more often the cause of subtle logic errors that are difficult to track down unless the person doing the tracking is aware of and conversant with these limitations.
Now the time has come to finally tackle the issue of how to represent negative integers in a computer. In another module we said that humans invented the "minus sign" that we tack on to the left of a negative quantity. Can we do this in a binary representation? The answer is "yes", and it is a viable way of doing it, but it is also not a particularly efficient way.
To use this method, we devote the leftmost bit as a "sign bit" and interpret the remaining bits as the unsigned (i.e., absolute value) of the number being represented. This representation is known as "signed binary" and is used from time to time. But it is not very efficient for two principal reasons. First, it requires that a lot of special rules be employed - just like the rules that we unconsciously apply when we work with signed arithmetic. Second, it requires that different algorithms be used to add two integers that have the same bit pattern depending on whether the bits represent signed or unsigned numbers.
For instance, consider the possibilities in a 3-bit system:
pattern | unsigned | signed binary |
000 | 0 | 0 |
001 | 1 | 1 |
010 | 2 | 2 |
011 | 3 | 3 |
100 | 4 | -0 |
101 | 5 | -1 |
110 | 6 | -2 |
111 | 7 | -3 |
The first thing to note is that there are two representations of "zero" - basically a "positive zero" and a "negative zero". Most systems that use signed binary go to some lengths to trap occurrences of "negative zero" and convert it to the normal representation of zero.
It is instructive to note at this point that we can always expect to have at least one troublesome value when we deal with signed values. The reason is that the total number of values we can represent is even; yet if we are going to be able to represent a set of strictly positive integers, their additive inverses, and zero, we need an odd number of representations. Therefore we will always either have a value for which we cannot represent its additive inverse, or we are going to have more than one way of representing at least one value (usually zero).
Unsigned Addition Table
+ | 000 (0) | 001 (1) | 010 (2) | 011 (3) | 100 (4) | 101 (5) | 110 (6) | 111 (7) |
000 (0) | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
001 (1) | 001 | 010 | 011 | 100 | 101 | 110 | 111 | [000] |
010 (2) | 010 | 011 | 100 | 101 | 110 | 111 | [000] | [001] |
011 (3) | 011 | 100 | 101 | 110 | 111 | [000] | [001] | [010] |
100 (4) | 100 | 101 | 110 | 111 | [000] | [001] | [010] | [011] |
101 (5) | 101 | 110 | 111 | [000] | [001] | [010] | [011] | [100] |
110 (6) | 110 | 111 | [000] | [001] | [010] | [011] | [100] | [101] |
111 (7) | 111 | [000] | [001] | [010] | [011] | [100] | [101] | [110] |
Signed Addition Table (using signed-binary representation of negative numbers)
+ | 000 (0) | 001 (1) | 010 (2) | 011 (3) | 100 (-0) | 101 (-1) | 110 (-2) | 111 (-3) |
000 (0) | 000 | 001 | 010 | 011 | 000 | 101 | 110 | 111 |
001 (1) | 001 | 010 | 011 | [100] | 001 | 000 | 101 | 110 |
010 (2) | 010 | 011 | [100] | [101] | 010 | 001 | 000 | 101 |
011 (3) | 011 | [100] | [101] | [110] | 011 | 010 | 001 | 000 |
100 (-0) | 000 | 001 | 010 | 011 | 000 | 101 | 110 | 111 |
101 (-1) | 101 | 000 | 001 | 010 | 101 | 110 | 111 | [000] |
110 (-2) | 110 | 101 | 000 | 001 | 110 | 111 | [000] | [001] |
111 (-3) | 111 | 110 | 101 | 000 | 111 | [000] | [001] | [010] |
Where overflow occurs, remember that the result is implementation dependent and therefore the values in the above table are only one possible result. The combinations above that are in shaded boxes represent combinations that resulted in some kind of overflow. These combinations are also surrounded by square brackets in case the shading does not show up on your display device. The remaining boxes, however, represent combinations that did not result in any kind of overflow and therefore must yield the results shown.
Note that, given the same pattern of 0's and 1's as arguments, say adding 101 to 001, the result is a very different pattern of 0's and 1's depending on whether the arguments represent signed (result is 000) or unsigned (result is 110) integers. This means that the actual hardware that adds signed integers must be different from the hardware that adds unsigned integers. It turns out that the hardware to do unsigned addition is very simple while the hardware to do signed addition on signed-binary numbers is considerably more complicated - because of all the special rules mentioned previously.
Another way of representing negative numbers using binary integers is to simply shift the values by a fixed amount. If we consider the unsigned representation, we note that the values represented go from the minimum to the maximum as the binary values go from all zeroes to all ones. What if we simply shift the representations by subtracting four from each unsigned value but keep the order the same?
pattern | unsigned | offset binary |
000 | 0 | -4 |
001 | 1 | -3 |
010 | 2 | -2 |
011 | 3 | -1 |
100 | 4 | 0 |
101 | 5 | 1 |
110 | 6 | 2 |
111 | 7 | 3 |
A common name for this particular example would be "excess-4" because the unsigned interpretation of the pattern is four in excess of the value actually represented.
Notice that we still preserve the concept of a "sign bit" - it just happens to be inverted from our prior representation. If we prepare the addition table for this representation, we get:
Unsigned Addition Table (same table as before)
+ | 000 (0) | 001 (1) | 010 (2) | 011 (3) | 100 (4) | 101 (5) | 110 (6) | 111 (7) |
000 (0) | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
001 (1) | 001 | 010 | 011 | 100 | 101 | 110 | 111 | [000] |
010 (2) | 010 | 011 | 100 | 101 | 110 | 111 | [000] | [001] |
011 (3) | 011 | 100 | 101 | 110 | 111 | [000] | [001] | [010] |
100 (4) | 100 | 101 | 110 | 111 | [000] | [001] | [010] | [011] |
101 (5) | 101 | 110 | 111 | [000] | [001] | [010] | [011] | [100] |
110 (6) | 110 | 111 | [000] | [001] | [010] | [011] | [100] | [101] |
111 (7) | 111 | [000] | [001] | [010] | [011] | [100] | [101] | [110] |
Signed Addition Table (using offset binary representation of negative numbers)
+ | 000 (-4) | 001 (-3) | 010 (-2) | 011 (-1) | 100 (0) | 101 (1) | 110 (2) | 111 (3) |
000 (-4) | [100] | [101] | [110] | [111] | 000 | 001 | 010 | 011 |
001 (-3) | [101] | [110] | [111] | 000 | 001 | 010 | 011 | 100 |
010 (-2) | [110] | [111] | 000 | 001 | 010 | 011 | 100 | 101 |
011 (-1) | [111] | 000 | 001 | 010 | 011 | 100 | 101 | 110 |
100 (0) | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
101 (1) | 001 | 010 | 011 | 100 | 101 | 110 | 111 | [000] |
110 (2) | 010 | 011 | 100 | 101 | 110 | 111 | [000] | [001] |
111 (3) | 011 | 100 | 101 | 110 | 111 | [000] | [001] | [010] |
Although the pattern of 0's and 1's is different in the two tables, there is a very simple mapping between them - the most significant bit (sign bit) is inverted and everything else is the same. This means that it is possible to take the hardware adder that works for unsigned integers and, with only a simple modification, make it work for both signed and unsigned integers, although it still must be told which representation is being used.
The principal drawback of offset binary is that, for numbers that exist in both the signed and the unsigned representations, the representations are different. It would be convenient if this were not the case. In fact, the C Language Standard requires that the representation used for signed integer data types represent non-negative values using the exact same representation as those same values in the corresponding unsigned data type. Hence you will never find a conforming C compiler that uses offset binary to represent signed integers.
If we must keep the representations of the positive values the same as in pure binary, yet signed binary is not acceptable because of the hardware overhead for performing arithmetic operations, what might be an alternative? Perhaps the simplest would be to take the representation for the positive value and invert each of the bits to get the representation for its negative counterpart. The result is a system known as "1's complement".
pattern | unsigned | 1's complement |
000 | 0 | 0 |
001 | 1 | 1 |
010 | 2 | 2 |
011 | 3 | 3 |
100 | 4 | -3 |
101 | 5 | -2 |
110 | 6 | -1 |
111 | 7 | -0 |
We once again have two representations for 0 which carries with it the associated overhead mentioned previously.
In addition to making it very easy (both manually and in hardware) to take the negative of a number, this system also has the useful property that if a value and its negative are added together (in an unsigned fashion) the result is zero (albeit "negative zero"). It also almost retains the useful property that adding 1 (again, in an unsigned fashion) to the pattern for any representable value yields a value that is one greater, provided the result is representable. Hence, adding 1 to the pattern 100 (-3) results in the pattern 101 (-2). Adding 1 to the pattern 011 (3) does not result in the pattern for 4, but this is expected since 4 is not representable. However, adding 1 to the pattern 111 (-0) results in the pattern 000 (0), which is not the behavior we would like - adding 1 to a representation of zero should yield 1.
The hardware patch needed to overcome this deficiency is not too excessive, but very few machines actually implement it, especially since the next representation, two's complement, is available.
It would be very convenient if we could devise a representation where our hardware adder didn't have to distinguish between signed and unsigned integers. We can accomplish this as long as each non-overflowing sum has the same pattern of 0's and 1's regardless of whether the input pattern is for a signed or an unsigned integer. The difference would come at the end, when it came time to interpret what value the resulting pattern of 0's and 1's represented.
To devise this representation, we draw upon two properties that we have discussed previously. First, the negative of a number is defined as being the additive inverse, meaning that the addition of a number and its negative yields a result of zero. The second property is the overflow property of fixed-width integers, whereby if the result of an addition operation yields 2^N, where N is the number of bits, then normal overflow behavior will wrap around and yield a result of zero.
Combining these two properties, we can then define the negative of a number, for a fixed width representation, as follows;
Given a number A, the number (-A) is the negative of A if the following is true:
A + (-A) = 0 (always true)
A + (-A) = 2^N (for an N-bit representation, where 2^N wraps around to 0)
Hence,
(-A) = 2^N - A
It is important to note that this result is not the general representation for the negative of A, which is given by the original expression A + (-A) = 0. The pattern that results from using this expression is a specific representation of (-A) known as the "N-bit 2's complement of A".
It is a simple matter to build up a table of the 2's complements for all of the values in our N-bit number system. For our three-bit system it yields:
pattern | 3-bit 2's complement |
000 | 000 |
001 | 111 |
010 | 110 |
011 | 101 |
100 | 100 |
101 | 011 |
110 | 010 |
111 | 001 |
In building up this table, we could do the math in either decimal or binary - both are about the same difficulty.
Example: Find the 3-bit 2's complement of 3.
Solution:
A = 3
(-A) = 23 - 3 = 8 - 3 = 5
(-A) = 1012
Or:
A = 3 = 0112
(-A) = 2N - 0112 = 10002 - 0112
(-A) = 1012
Note that the two's complements are unique in that a given pattern has exactly one two's complement, and that the two's complements appear in pairs - 011 is the two's complement of 101 and 101 is the two's complement of 011. This is the behavior that we expect from numbers and their additive inverses.
The next step is to decide which of the two complementary values is positive and which is negative. While technically an arbitrary choice, there is really only one reasonable choice. There is no reason for the representation of a positive number in a signed system to differ from the representation of that same positive number in an unsigned system if it is easily avoidable. Therefore the number in each pair that has a leading zero is considered the positive member of the pair and is interpreted according to the normal rules for unsigned integers.
pattern | decimal | 2's comp | negative |
011 | 3 | 101 | -3 |
010 | 2 | 110 | -2 |
001 | 1 | 111 | -1 |
000 | 0 | 000 | 0 |
111 | -1 | 001 | 1 |
110 | -2 | 010 | 2 |
101 | -3 | 011 | 3 |
100 | ? | 100 | ? |
Examination of the above table yields the useful property that, as with signed-binary, the leftmost bit (also known as the "leading" bit) serves as a "sign bit" with positive values having a leading '0' and negative values having a leading '1'.
As with the signed-binary representation, we have one value that can cause us problems. The pattern 100 (or, in general, the N-bit pattern consisting of all 0's except a leading '1') can either be interpreted as -4 (in the 3-bit case), based on it being the result of (-3) - (1), or as an alternate representation for zero, based on the fact that it is its own two's complement and normally only zero can be its own additive inverse. Almost all implementations treat it as the most negative representable number because this is the behavior it exhibits under most operations that are carried out on it. The operations for which it is ill-behaved are then considered to be situations in which overflow has occurred, and hence the pattern that results from the operation has no particular meaning. Consider the case of taking its negative, for instance. The negative of -4 should be +4, but we can't represent +4 in our system, so overflow had to have taken place. Although the resulting pattern doesn't have to have any particular property, it is interesting to note that we get the same pattern from trying to take the negative of -4 as we get if we try to add 1 to 3.
The utility of the two's complement representation starts to become apparent when you examine the above table more closely. Notice that if the unsigned values 111 and 001 are added together, normal overflow yields a result of 000, which is the correct result for adding the signed integers 111 and 001 (if the two's complement representation is used). The same is true for all of the other values and can be summed up by the observation that the normal wraparound behavior for unsigned integers produces the correct transitions between the positive and negative regions for signed integers using the two's complement representation. This is a powerful statement because it means that hardware that can handle addition and subtraction for unsigned integers can automatically handle addition and subtraction for signed integers if they use the two's complement representation. The same is not true for other operations, but addition is by far one of the most fundamental and frequently used operations in a processor - in fact many processors do not support multiplication or division at all.
Another useful artifact of the two's compliment notation is that if we are comparing two signed integers to see which is larger (with any positive number being larger than any negative number) we see that we can first compare the most significant bit. If they are different, then we need go no further and the value that starts with a '1' is larger if the representations are unsigned, and the value that starts with a '0' is larger if the representations are signed. If the most significant bits are the same, then we can compare the remaining bits as simple unsigned integers without regard to whether they represent a signed, unsigned, positive, or negative value. We don't even have to make the effort to exclude the most significant bit from the comparison because, since they are the same in this case, they have no impact on the outcome.
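The comparison rule just described can be sketched in Python (the helper names signed_value and signed_less_than are ours): check the sign bits first, and when they agree, compare the raw patterns as plain unsigned integers.

```python
# Sketch of the two's-complement comparison rule for 3-bit patterns.
N = 3

def signed_value(p, n=N):
    # interpret an n-bit pattern as a two's-complement integer
    return p - 2**n if p >> (n - 1) else p

def signed_less_than(a, b, n=N):
    sa, sb = a >> (n - 1), b >> (n - 1)   # compare the leading (sign) bits first
    if sa != sb:
        return sa == 1                     # a leading '1' is the smaller signed value
    return a < b                           # same sign: compare the raw patterns as unsigned

# the rule agrees with direct interpretation for every pair of 3-bit patterns
assert all(signed_less_than(a, b) == (signed_value(a) < signed_value(b))
           for a in range(2**N) for b in range(2**N))
```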
Unsigned Addition Table (same table as before)
+ | 000 (0) | 001 (1) | 010 (2) | 011 (3) | 100 (4) | 101 (5) | 110 (6) | 111 (7) |
000 (0) | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
001 (1) | 001 | 010 | 011 | 100 | 101 | 110 | 111 | [000] |
010 (2) | 010 | 011 | 100 | 101 | 110 | 111 | [000] | [001] |
011 (3) | 011 | 100 | 101 | 110 | 111 | [000] | [001] | [010] |
100 (4) | 100 | 101 | 110 | 111 | [000] | [001] | [010] | [011] |
101 (5) | 101 | 110 | 111 | [000] | [001] | [010] | [011] | [100] |
110 (6) | 110 | 111 | [000] | [001] | [010] | [011] | [100] | [101] |
111 (7) | 111 | [000] | [001] | [010] | [011] | [100] | [101] | [110] |
Signed Addition Table (using two's complement representation of negative numbers)
+ | 000 (0) | 001 (1) | 010 (2) | 011 (3) | 100 (-4) | 101 (-3) | 110 (-2) | 111 (-1) |
000 (0) | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
001 (1) | 001 | 010 | 011 | [100] | 101 | 110 | 111 | 000 |
010 (2) | 010 | 011 | [100] | [101] | 110 | 111 | 000 | 001 |
011 (3) | 011 | [100] | [101] | [110] | 111 | 000 | 001 | 010 |
100 (-4) | 100 | 101 | 110 | 111 | [000] | [001] | [010] | [011] |
101 (-3) | 101 | 110 | 111 | 000 | [001] | [010] | [011] | 100 |
110 (-2) | 110 | 111 | 000 | 001 | [010] | [011] | 100 | 101 |
111 (-1) | 111 | 000 | 001 | 010 | [011] | 100 | 101 | 110 |
Notice that the patterns of 0's and 1's in both tables are identical. This means that we have achieved the goal of being able to use the same hardware to carry out addition of both unsigned integers and signed integers that use the two's complement format.
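This shared-adder property can be verified exhaustively with a short sketch (Python; to_signed is an illustrative helper): one wraparound addition of the raw patterns serves both interpretations.

```python
# Sketch: one wraparound adder, two interpretations (3-bit system).
N = 3
MASK = 2**N - 1

def to_signed(p):
    # interpret a pattern as a two's-complement value
    return p - 2**N if p >> (N - 1) else p

for a in range(2**N):
    for b in range(2**N):
        pattern = (a + b) & MASK           # the single piece of adder hardware
        # unsigned view: correct modulo 2**N by construction
        assert pattern == (a + b) % 2**N
        # signed view: correct whenever the true sum fits the signed range
        s = to_signed(a) + to_signed(b)
        if -(2**(N - 1)) <= s < 2**(N - 1):
            assert to_signed(pattern) == s
```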
It turns out that determining whether overflow has occurred, while depending on whether the arguments and results are interpreted as signed or unsigned, is also very straightforward in either case and can be carried out by looking only at the leading bits of both arguments and the result.
Overflow results for C = A + B
(leading bit of A, B, and C)
A | B | C | unsigned | signed |
0 | 0 | 0 | no | no |
0 | 0 | 1 | no | yes |
0 | 1 | 0 | yes | no |
0 | 1 | 1 | no | no |
1 | 0 | 0 | yes | no |
1 | 0 | 1 | no | no |
1 | 1 | 0 | yes | yes |
1 | 1 | 1 | yes | no |
The pattern 100 is still not well behaved, but notice that the overflow test does work as long as this pattern is interpreted as -4.
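Both checks can be sketched in code (Python; add_with_flags is an illustrative name): the unsigned test recovers the carry out of the leading column from the three leading bits alone, and the signed test flags a same-sign pair whose sum has the opposite sign. Note that when both operands have a leading '1' they are each at least half the range, so their unsigned sum always overflows.

```python
# Sketch of overflow detection from leading bits only (3-bit default width).
def add_with_flags(a, b, n=3):
    mask = (1 << n) - 1
    c = (a + b) & mask
    msb = n - 1
    ab, bb, cb = (a >> msb) & 1, (b >> msb) & 1, (c >> msb) & 1
    carry_in = ab ^ bb ^ cb                        # carry into the leading column
    unsigned_ovf = bool((ab & bb) | (ab & carry_in) | (bb & carry_in))
    signed_ovf = (ab == bb) and (cb != ab)         # same-sign inputs, flipped result sign
    return c, unsigned_ovf, signed_ovf

# 3 + 3 -> [110]: signed overflow only; 7 + 7 -> [110]: unsigned overflow only
assert add_with_flags(0b011, 0b011) == (0b110, False, True)
assert add_with_flags(0b111, 0b111) == (0b110, True, False)
```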
If asked to determine the two's complement representation of a number, you must first know what the width of the representation is. Remember, the trick that we used to develop this representation relied on the overflow behavior that occurs because of the finite width available to store a value.
Let's say that we are asked to compute the 8-bit two's complement representation for -73. We can do this a couple of different ways. We could do this by using our definition and saying that the representation is the same as that for the 8-bit unsigned value for (256-73)
256 - 73 = 183
Convert 183 to hex with the aid of our Hexadecimal Reference Table
183 / 16 = 11.4375 The 0.4375 means a digit of '7'
11 / 16 = 0.6875 The 0.6875 means a digit of '11' or 'B'
The zero whole number part means we are done. Of course, in practice, as soon as the whole number part drops below the size of the base (sixteen in our case) there is no need to actually perform the last division - it was done above just to conform to the given algorithm.
So our 8-bit binary representation for -73 is
-73 = 0xB7 = 1011 0111
Few people actually compute it this way because, frankly, few people understand that this is the defining condition for the two's complement format. Instead, they have been shown a technique that works even though they probably have no idea why it works. You will.
Consider our defining condition again:
A + (-A) = 2^N
(-A) = 2^N - A
Note that 2^N is NOT an N-bit number. But what if we write it as
2^N = (2^N - 1) + 1
So now we have
(-A) = (2^N - 1) + 1 - A = [(2^N - 1) - A] + 1
The representation for (2^N - 1) is an N-bit number consisting of all 1's. Just as is the case if we subtract a decimal number from an equal-length decimal number consisting of all 9's, we are guaranteed to generate no borrow operations. So we can treat each digit one at a time. If we subtract a '0' from '1' we get '1', whereas if we subtract '1' from '1' we get '0'. Hence the result is the same as that achieved by simply inverting each bit (changing each '1' to a '0' and vice versa) in the number being subtracted. We'll represent this "bitwise inversion" operation with the tilde character '~'. This means that:
[(2^N - 1) - A] = ~A
leaving us with the result
(-A) = ~A + 1
This is the formula that most people blindly follow. They couldn't tell you where it comes from or why it works. Furthermore, most people have a hard time remembering whether you invert the bits and then add a one, or if you add a one and then invert the bits. You should never have this problem because you know where the formula comes from and can re-derive it from scratch if need be. That's the difference between comprehending something and simply memorizing it.
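The derivation can be confirmed exhaustively with a small sketch (Python; the function names are ours): the "invert and add one" shortcut agrees with the defining 2^N - A for every 8-bit value.

```python
# Sketch: (-A) = ~A + 1 versus the defining (-A) = 2^N - A, for N = 8.
# The mask confines Python's unbounded integers to N bits.
N = 8
MASK = (1 << N) - 1

def neg_by_definition(a):
    return (2**N - a) & MASK

def neg_by_invert_add_one(a):
    return (~a + 1) & MASK      # invert every bit, then add one

assert all(neg_by_definition(a) == neg_by_invert_add_one(a) for a in range(2**N))
print(hex(neg_by_invert_add_one(73)))   # -> 0xb7, matching the worked example
```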
So let's see how we use this formula since it is quite useful when we are starting out with a representation that is already in binary or, as we shall see, in hexadecimal. Using the definition directly is more useful when we work in decimal initially, as we did above.
First, if we are going to determine ~A, we must first express A in binary.
A = 73
73 / 16 = 4.5625 => A = 0x49 => 0100 1001
~A = 1011 0110
(-A) = ~A + 1 = 1011 0111 = 0xB7
The subtraction of a number from a pattern consisting of all 1's is called the "one's complement" of the number. Notice that the one's complement is included in our Hexadecimal Reference Table so that we can avoid going all the way to binary:
A = 73
73 / 16 = 4.5625 => A = 0x49
~A = 0xB6
(-A) = ~A + 1 = 0xB6 + 0x01= 0xB7
Aside: To avoid confusion, recognize that you can't think of the "one's complement" and "two's complement" as being the same thing except with a different number or a different base. In base B, the complements are:
The N-digit B's complement of A in base B is B^N - A
The N-digit (B-1)'s complement of A in base B is (B^N - 1) - A
For any given base this method of representing negative integers is valid. So for base-10, we can represent negative values using the ten's complement. Mechanical computing devices commonly did this before the advent of electronic calculators. We can compute the ten's complement very easily by first taking the nine's complement and adding one. But we could also work in base-9, where we would also have a nine's complement. However, the base-9 nine's complement and the base-10 nine's complement are fundamentally different concepts.
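The two definitions generalize to any base, which a short sketch makes concrete (Python; the function names are ours): the radix complement is always the diminished-radix complement plus one, in base 10 and base 2 alike.

```python
# Sketch of the two complements in an arbitrary base B, per the definitions above.
def radix_complement(a, n, base):
    # the N-digit B's complement: B^N - a (with 0 mapping back to 0)
    return (base**n - a) % base**n

def diminished_complement(a, n, base):
    # the N-digit (B-1)'s complement: (B^N - 1) - a
    return base**n - 1 - a

# ten's complement = nine's complement + 1 (3 digits, base 10)
assert radix_complement(73, 3, 10) == diminished_complement(73, 3, 10) + 1 == 927
# the familiar binary case: the 3-bit 2's complement of 011 is 101
assert radix_complement(0b011, 3, 2) == 0b101
```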
The following table shows the mapping of the bit pattern to the value represented using each of the five integer representations that have been discussed for a 4-bit integer.
PATTERN | UNSIGNED | SIGNED BIN | OFFSET BIN | ONE'S COMP | TWO'S COMP |
0000 | 0 | 0 | -8 | 0 | 0 |
0001 | 1 | 1 | -7 | 1 | 1 |
0010 | 2 | 2 | -6 | 2 | 2 |
0011 | 3 | 3 | -5 | 3 | 3 |
0100 | 4 | 4 | -4 | 4 | 4 |
0101 | 5 | 5 | -3 | 5 | 5 |
0110 | 6 | 6 | -2 | 6 | 6 |
0111 | 7 | 7 | -1 | 7 | 7 |
1000 | 8 | -0 | 0 | -7 | -8 |
1001 | 9 | -1 | 1 | -6 | -7 |
1010 | 10 | -2 | 2 | -5 | -6 |
1011 | 11 | -3 | 3 | -4 | -5 |
1100 | 12 | -4 | 4 | -3 | -4 |
1101 | 13 | -5 | 5 | -2 | -3 |
1110 | 14 | -6 | 6 | -1 | -2 |
1111 | 15 | -7 | 7 | -0 | -1 |
UNSIGNED
PROS
- The bits in the pattern determine the value represented according to the normal positional number system rules.
CONS
- Cannot represent negative numbers.
SIGNED BINARY
PROS
- Simple conceptually - especially for humans.
- Positive values use same representation as pure binary.
CONS
- Two representations for zero.
- Algebraic Order not the same as pure binary
OFFSET BINARY
PROS
- Algebraic order the same as pure binary.
CONS
- Positive numbers do not share same representation as in pure binary.
ONE'S COMPLEMENT
PROS
- Positive values use same representation as pure binary.
- Hardware is simple (for many operations)
CONS
- Two representations for zero.
- Algebraic Order not the same as pure binary
TWO'S COMPLEMENT
PROS
- Positive values use same representation as pure binary.
- Same hardware used for pure binary can be used with only minor changes.
CONS
- Algebraic Order not the same as pure binary
From a purist's perspective, our normal integer representations are simply a special case of a fixed-point representation where the radix point happens to be located just to the right of the rightmost digit. While true, when someone talks about using a "fixed point" representation they almost always mean that the programmer uses an integer representation but chooses to interpret the results as though the radix point existed at some location other than after the rightmost digit. Usually the effective radix point is moved to the left so that they can work with fractional values, but it is equally valid to move it to the right so that they can work with larger values than they would otherwise have been able to represent.
Generic fixed point representations are seldom an explicit choice supported by a processor's instruction set. As a result, the use of fixed-point arithmetic almost always involves a software implementation of the representation along with all of the manipulation rules.
On today's modern processors and for the majority of applications, the use of fixed point arithmetic is seldom justified given the adequacy of simply using the intrinsic floating point support that is available. But when working with high precision applications, applications where speed and performance are essential, or when working with the slower, less sophisticated processors found in embedded systems the use of fixed-point arithmetic is very common.
One advantage of fixed-point arithmetic is that it permits working with fractional values even on machines that only support integer operations. Another advantage comes from the fact that, on most machines that support both integer and floating point computations, integer operations are much faster and generally more precise than equivalent floating point counterparts.
The primary disadvantage is that the programmer is responsible both for choosing the representation and for implementing the software needed to manipulate values that use it - although having this level of control opens the door for the programmer to further optimize the algorithms to attain better performance. Another limitation is that the range of values that can be represented falls well short of that available to comparable floating point representations.
For an example, let's say that we want to use a decimal-based fixed point representation to do our accounting. We want to achieve the following goals: (1) be able to work with any quantity of money less than one million dollars; (2) be able to report quantities to the nearest penny; and (3) minimize the impact of accumulated round-off errors by tracking quantities to the nearest 1/100 of a penny in calculations.
On paper, we would normally just be sure to track all quantities to at least four decimal places in order to give us the 1/100 of a penny requested by specification (3) above. But here we are using a fixed-point representation using integers, so we do the following:
We need to work with integers having at least ten digits - six to represent the whole dollar amounts from $0 through $999,999; two more to represent the number of cents; and a final two to represent the number of hundredths of a cent. So the quantity $50,264.81 would be represented as the integer 0502648100.
Some of the rules that need to be observed when manipulating values stored in this representation include: values may be added or subtracted directly, provided both use the same implied radix point; the product of two values carries the scale factor twice and must therefore be divided by the scale factor (10^4 here) after multiplying; and when dividing one value by another, the dividend should be multiplied by the scale factor first so that precision is not lost.
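A rough sketch of how such a representation might be manipulated in practice (Python, with illustrative names; the 10^4 scale factor comes from the four implied decimal places described above):

```python
# Sketch: money as integers scaled by 10^4 (dollars down to 1/100 of a cent).
SCALE = 10_000

def to_fixed(amount):
    # decimal amount -> scaled integer, e.g. $50,264.81 -> 502648100
    return round(amount * SCALE)

def to_string(x):
    # report to the nearest penny, per the specification
    return f"${x / SCALE:,.2f}"

balance = to_fixed(50264.81)
deposit = to_fixed(1999.995)          # a quantity carrying fractional cents
balance += deposit                    # add/subtract directly on scaled integers

# multiplying by a rate: the product carries SCALE twice, so rescale once
rate = to_fixed(0.05)                 # 5% as a scaled integer
interest = balance * rate // SCALE
```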
Switching over to an 8-bit unsigned binary fixed point representation, let's discuss how to convert decimal numbers to the fixed point representation and back. Our representation will place the radix point after the second bit:
b7 | b6 | (rp) | b5 | b4 | b3 | b2 | b1 | b0 |
M3 | M2 | . | M1 | M0 | E3 | E2 | E1 | E0 |
Note: Only the bits labeled bn are actually stored.
The direct interpretation of the bits stored under this representation means that the value represented is given by:
v = (b7 x 2^1) + (b6 x 2^0) + (b5 x 2^-1) + (b4 x 2^-2) + (b3 x 2^-3) + (b2 x 2^-4) + (b1 x 2^-5) + (b0 x 2^-6)
Using this relationship, let's represent the quantity π as best we can using this representation:
π ≈ 3.14159265
r = 3.14159265 (r is the remainder)
To get b7: is r ≥ 2^1? (2^1 = 2)
Yes: b7 = 1 and r = r - 2^1 = 1.14159265
To get b6: is r ≥ 2^0? (2^0 = 1)
Yes: b6 = 1 and r = r - 2^0 = 0.14159265
To get b5: is r ≥ 2^-1? (2^-1 = 0.5)
No: b5 = 0
To get b4: is r ≥ 2^-2? (2^-2 = 0.25)
No: b4 = 0
To get b3: is r ≥ 2^-3? (2^-3 = 0.125)
Yes: b3 = 1 and r = r - 2^-3 = 0.01659265
To get b2: is r ≥ 2^-4? (2^-4 = 0.0625)
No: b2 = 0
To get b1: is r ≥ 2^-5? (2^-5 = 0.03125)
No: b1 = 0
To get b0: is r ≥ 2^-6? (2^-6 = 0.015625)
Yes: b0 = 1 and r = r - 2^-6 = 0.00096765
The next bit would be a 0, so we don't need to round up; hence our fixed point binary representation for π is 1100 1001, which represents 11.001001₂.
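The bit-by-bit procedure just worked through can be sketched as a loop (Python; the helper name is ours). Note that this sketch truncates rather than rounds the final bit, which suffices here since the next bit of π is 0.

```python
# Sketch of the greedy bit-by-bit conversion: two integer bits, six
# fractional bits, testing the remainder against descending powers of two.
def to_fixed_binary(x, int_bits=2, frac_bits=6):
    bits = []
    r = x                                     # the running remainder
    for i in range(int_bits - 1, -(frac_bits + 1), -1):
        weight = 2.0 ** i
        if r >= weight:
            bits.append("1")
            r -= weight
        else:
            bits.append("0")
    return "".join(bits)

print(to_fixed_binary(3.14159265))   # -> '11001001', as in the worked example
```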
We can certainly work with this expression and do what we need to do, but working with negative powers and the consequent division operations and/or multiplication by fractional values is time consuming and error prone. The above example should make you cry out for a better way.
Plus, we have already seen that we can be much more efficient if we can perform much of our work in hexadecimal converting back and forth to binary only when necessary. But the above representation, as given, does not appear to lend itself to working with hexadecimal because the nibble boundaries do not match hexadecimal digit boundaries.
But what happens if we multiply a quantity by a constant and then divide it by that same constant? As long as the constant is not zero, we have not changed anything at all. Considering a decimal example, we can write:
3.14159 = (3.14159 x 100000) / 100000 = 314159 / 100000
Notice that the quantities in the rightmost expression are all integers and the division operation can be carried out by simply shifting the decimal point five places to the left.
The exact same property applies to binary values:
π ≈ 11.001001₂ = (11.001001₂ x 2^6) / 2^6 = 11001001₂ / 2^6
Looking at the rightmost expression, we see an integer divided by the base raised to the sixth power, which is the same as moving the radix point six places to the left - which is precisely where we want the radix point to be in our representation.
So the bit pattern in the integer 11001001₂ is the bit pattern we need to store. If we didn't already know this bit pattern, we could easily get it by using the following steps:
π = X / 2^6
X = 2^6 x π = 64π ≈ 201.1
Since we want only the integer X, we round 201.1 to the nearest integer and convert it to binary:
201 / 16 = 12.5625 => (0xC).(0x9) => 0xC9 = 1100 1001₂
Hopefully you will agree that this is a much faster, easier, and less error prone method.
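The scale-and-round shortcut amounts to two lines of code (Python; names are illustrative):

```python
import math

# Sketch: multiply by 2^6, round to the nearest integer, keep the bit pattern.
FRAC_BITS = 6

def to_fixed(x):
    return round(x * 2**FRAC_BITS)

pattern = to_fixed(math.pi)           # round(64 * pi) = round(201.06...) = 201
assert pattern == 0xC9
print(f"{pattern:08b}")               # -> 11001001
```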
The same method can be used to convert a fixed point binary value back to decimal.
What is the decimal value represented by the binary value 1010 1110 under the fixed point system we have been using - namely six bits to the right of the binary point?
With six bits to the right of the binary point, the value represented is the integer value divided by 26 or:
v = 1010 1110₂ / 2^6
v = 0xAE / 64 = ((10 x 16) + 14) / 64 = 174 / 64 = 2.71875
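The reverse conversion is equally mechanical in code (Python; the helper name is ours), and the division is exact here because the divisor is a power of two:

```python
# Sketch: recover the decimal value by dividing the raw pattern by 2^6.
FRAC_BITS = 6

def from_fixed(pattern):
    return pattern / 2**FRAC_BITS

print(from_fixed(0b10101110))   # -> 2.71875, matching the worked example
```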
The author would like to acknowledge the following individual(s) for their contributions to this module: