ECE-1021

Intro to Strings and Pointers in C

(Last Mod: 27 November 2010 21:38:44 )

String Representation

Being able to represent individual characters in specified memory locations is very useful, but it is not very convenient for the way we normally want to work with text information. Typically, when we work with text, we work with "strings" of characters. The obvious way to represent a string in memory is as a sequence of ASCII codes - and this is exactly what is done.

But there is a subtlety that we must deal with - how do we know where the string ends? We have a few options: we could specify a fixed length for all strings and pad any unused portion at the end with a code that marks it as unused; we could keep track of both the address where the string starts and how long the string is; or we could keep track of where the string starts and embed a "delimiter" in the string data itself to mark the end of the string. In some languages, such as Pascal, the first byte of a string is not interpreted as an ASCII code; instead, it is interpreted as the number of characters in the string. In other languages, including C, a particular code is placed after the last actual character in the string. In C, this character is the NUL character (ASCII code 0x00) and, therefore, we refer to C as working with "null terminated strings". Note that not all languages that use a terminating character use the NUL character.
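As a minimal sketch of what the terminator makes possible, the function below finds the length of a string by walking forward until it reaches the NUL character. The name count_chars is hypothetical and used only for illustration; the standard library function strlen does the same job.

```c
#include <stddef.h>

/* Count the characters in a null-terminated string by scanning
   forward until the NUL terminator (code 0x00) is found.
   count_chars is an illustrative name, not a library function. */
size_t count_chars(const char *s)
{
    size_t n = 0;

    while ('\0' != s[n])   /* stop at the terminator, which is not counted */
        n++;

    return n;
}
```

Note that nothing other than the terminator tells the function where to stop. If the NUL were missing, the scan would simply run on into whatever memory follows the string.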

Just as the ASCII code for an individual character is expressed in C by surrounding the character by single quotes, a string of ASCII codes, including the terminator, is expressed in C by surrounding the string of characters by double quotes. The NUL terminator does not need to be expressly included in the string - the compiler will add it automatically.
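One way to see the automatically added terminator is to ask the compiler how many bytes a string literal occupies. The sketch below uses hypothetical function names; it relies only on the sizeof operator and on the fact that a literal can be indexed like any other string.

```c
#include <stddef.h>

/* Number of bytes the compiler sets aside for the literal "ECE":
   three characters plus the NUL terminator it appends for us. */
size_t bytes_in_literal(void)
{
    return sizeof("ECE");
}

/* The byte just past the last visible character is the terminator. */
char byte_after_last_character(void)
{
    const char *s = "ECE";

    return s[3];
}
```

The first function reports four bytes even though only three characters were typed between the double quotes, and the second returns the NUL character.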

Example: The string "ECE-1021 Module #1" is stored at location 0xDC08. Draw the relevant memory map.

0 1 2 3 4 5 6 7 8 9 A B C D E F

0xDC00 ?? ?? ?? ?? ?? ?? ?? ?? 'E' 'C' 'E' '-' '1' '0' '2' '1'

0xDC10 SP 'M' 'o' 'd' 'u' 'l' 'e' SP '#' '1' NUL ?? ?? ?? ?? ??

The above uses the characters to represent the memory, and this is generally the quickest and most useful way of representing the codes. For completeness, the actual values (in hex) are shown below.

in hex 0 1 2 3 4 5 6 7 8 9 A B C D E F

0xDC00 ?? ?? ?? ?? ?? ?? ?? ?? 45 43 45 2D 31 30 32 31

0xDC10 20 4D 6F 64 75 6C 65 20 23 31 00 ?? ?? ?? ?? ??
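The hex values in the table can be reproduced with a short routine. This is only a sketch: codes_in_hex is a hypothetical name, the caller is assumed to supply a large enough buffer, and the values match the table only on a system that uses ASCII.

```c
#include <stdio.h>

/* Write the code of every byte in s, terminator included, into out
   as two-digit hex values separated by single spaces.
   out must have room for 3 bytes per byte of s (counting the NUL). */
void codes_in_hex(const char *s, char *out)
{
    const unsigned char *p = (const unsigned char *)s;

    for (;;)
    {
        out += sprintf(out, "%02X", (unsigned)*p);  /* two hex digits */
        if (0 == *p)
            break;          /* the terminator has been written - done */
        *out++ = ' ';       /* separator before the next code */
        p++;
    }
}
```

Passing the string from the example fills the buffer with the same sequence of codes shown in the table, ending in 00 for the NUL terminator.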

Accessing Strings in Memory

It's one thing to say that a particular string is stored at a particular memory location - it is quite another to then actually be able to do something with that string.

Let's say that we wanted to print out the first and last characters of the string in the previous example. In words, what we want to accomplish is very simple:

Problem: A null-terminated string is stored at memory location 0xDC08. Print out the first and the last character of that string.

Solution:

1. Set ptr = 0xDC08 (the address of the first character in the string).
2. Retrieve the value that is stored at memory location ptr.
3. If that value is the NUL character, there is nothing to print - end the task.
4. If the value is not the NUL character, print it to the screen.
5. Increment the value of ptr (so that it now holds the address of the next character).
6. Retrieve the value that is stored at memory location ptr.
7. If that value is the NUL character, print out the previous character and end the task.
8. Go to Step #5.

Putting this into our more structured pseudocode:

SET: ptr = address of string
SET: c = value stored at address contained in ptr.
IF: (NUL NE c) (NE = Not Equal)
1. PUT: The character c to the screen
2. SET: ptr = ptr + 1
3. SET: c = value stored at address contained in ptr.
4. WHILE: (NUL NE c)
  1. SET: ptr = ptr + 1
  2. SET: c = value stored at address contained in ptr.
5. SET: c = the value stored at one less than the address contained in ptr.
6. PUT: The character c to the screen.

Implementing this in C code, we would have:

char *ptr;
char c;

ptr = "First String";
c = *ptr;
if ( '\0' != c )
{
   PutC(c);
   ptr++;
   c = *ptr;
   while ( '\0' != c )
   {
      ptr++;
      c = *ptr;
   }
   c = *(ptr - 1);
   PutC(c);
}

The above code can be condensed quite a bit, but before doing so, let's make sure we fully understand the new elements.

First, there's the variable declaration:

char *ptr;

The variable ptr is known as a "pointer" variable because the value stored in it is intended to be a memory address where some other piece of information is stored, hence it "points" to that other piece of data. When we declare a pointer variable, we indicate that it is a pointer by preceding its name with an asterisk.

But it is not sufficient to just know the address where a piece of data is stored - we must also know such things as how many bytes the data item consists of, how it is ordered in memory, and how to interpret the pattern of bits stored in those bytes in order to recover the information stored there. If the compiler knows the data type of the object stored at the location being pointed to, then it has all three of these pieces of information. Hence when we declare a pointer variable, we must also indicate the data type that it is pointing to.

If asked what the data type of ptr is, the most correct response would be that it is a "pointer to an object of type char". This is frequently stated in shorter terms by simply calling it a "pointer to char" or a "char pointer".

Second, there's the line:

ptr = "First String";

The string literal is stored someplace in memory - very possibly as part of the program code itself. At compile time, a string literal used in an expression evaluates to the address at which that string is stored. It is very important to recognize that, while we can read a string literal in the same way as any other string, we cannot write to it or in any way attempt to modify it. We may get unlucky and be able to do it and obtain the desired results, but we have invoked undefined behavior, and the next time we compile and run the program we may end up crashing the system.
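If a modifiable version of the text is needed, one common approach is to copy the literal into memory that we own and change the copy instead. The sketch below assumes the caller supplies a buffer of at least 13 bytes (the 12 characters of "First String" plus the terminator); make_modified_copy is a hypothetical name.

```c
#include <string.h>

/* Make a private, writable copy of a string literal and change the
   first character of the copy. buf must hold at least 13 bytes. */
void make_modified_copy(char *buf)
{
    const char *literal = "First String";  /* read only: never write through this */

    strcpy(buf, literal);   /* copies the characters and the terminator */
    buf[0] = 'f';           /* legal: buf is ordinary writable memory */
}
```

Declaring the pointer to the literal as const char * is a design choice, not a requirement: it lets the compiler flag an accidental attempt to write through that pointer.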

Next, there's the line:

c = *ptr;

When an asterisk is used as a unary operator - i.e., an operator with only one operand - it can't be the multiplication operator since that requires two operands. In this situation, it is interpreted as being the "dereference" operator, also known as the "indirection" operator. The value stored in ptr is known as a "reference" to the object being pointed to - hence the term "dereference". Similarly, the variable ptr only indirectly tells us the information we are looking for - hence the term "indirection".

One useful way to read the dereference operator is to think of it as saying, "the value stored at". Combined with thinking of a variable name as saying, "the value stored in the variable", this makes statements such as the one above quite understandable. It reads:

(The value stored in the variable c) (is set equal to) (the value stored at) (the value stored in the variable ptr)
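The same reading works on either side of an assignment. As a small sketch (the function name is hypothetical), the code below first reads a value through a pointer and then stores a new value at the location being pointed to.

```c
/* Read a char through a pointer, then write a new value through it.
   ptr must point to writable memory, not to a string literal. */
char read_then_overwrite(char *ptr, char replacement)
{
    char c;

    c = *ptr;             /* c is set equal to the value stored at ptr */
    *ptr = replacement;   /* the value stored at ptr is set equal to replacement */

    return c;             /* hand back the value that was there originally */
}
```

To try it, declare an ordinary variable such as char x = 'A'; and call read_then_overwrite(&x, 'B'), where &x is the address of x. The call returns 'A' and leaves 'B' stored in x.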

Fourth, there's the line:

ptr++;

This is the same increment operator that we've used extensively already. But there is a subtlety when performing arithmetic involving pointers that is easy to overlook. Since the compiler knows what type of data the pointer is pointing to, and since pointers are expected to point to objects, the result of adding an integer to a pointer value should yield a pointer value that points to an object.

For instance, let's say that ptr is equal to 82. If we increment this value the way we would probably be tempted to, we would get a value of 83. This is fine if the object being pointed to is only one byte wide, because then we could have a different object of that same data type at location 83. But what if the data type being pointed to is four bytes wide? Now memory location 83 does not point to an object, since an object's address is the lowest numbered memory location of the block of memory containing the object. The next higher value that could possibly serve as a pointer to an object is 86. Hence the expression ptr+1 would yield a value of 86, and not 83. This is not too counterintuitive as long as you think of the integer in the expression as the number of objects to be added to the pointer and not as the number of bytes.

Although there is nothing that guarantees that the next object in memory is of the same type as the one originally pointed to by ptr, when we use pointer arithmetic the compiler makes two assumptions and leaves it up to us to write our code so that the assumptions prove to be valid. The compiler assumes that the object pointed to by the new value is of the same type as the object pointed to by the original value, and it further assumes that the memory between the old value and the new value consists entirely of closely-packed objects of that same data type.

Hence, continuing the four-byte example with ptr now equal to 86, the expression ptr+20 would yield a pointer value of 86+20*4 or 166, while the expression ptr-10 would yield 86-10*4 or 46.
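The scaling can be observed directly. The sketch below uses a small array of int only as a convenient source of closely-packed objects of one type; the function names are hypothetical. An int is commonly four bytes wide, but that is not guaranteed, so the result is best compared against sizeof(int).

```c
#include <stddef.h>

/* Distance in bytes between ptr and ptr + 1 for a pointer to int.
   Converting to char * makes the subtraction count single bytes. */
size_t bytes_moved_by_plus_one(void)
{
    int values[3] = { 10, 20, 30 };   /* closely-packed objects of one type */
    int *ptr = values;                /* points to the first int */
    int *next = ptr + 1;              /* points to the second int */

    return (size_t)((char *)next - (char *)ptr);
}

/* ptr + 2 skips two whole objects, so this fetches the third int. */
int value_two_objects_on(void)
{
    int values[3] = { 10, 20, 30 };
    int *ptr = values;

    return *(ptr + 2);
}
```

The first function returns sizeof(int) rather than 1, and the second returns 30, the value two objects beyond the one that ptr points to.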

At this point it will hopefully not come as a surprise that, if ptr1 and ptr2 point to values of a data type that is eight bytes wide, and ptr1 is equal to 2064 while ptr2 is equal to 2016, then the expression ptr1-ptr2 is equal to 6 and not 48. Remember that an integer added to a pointer value represents the number of objects of that data type to be skipped. The same is true for subtracting two pointers - the integer that results does not represent the number of bytes between the two addresses, but the number of objects of that data type between the two addresses.
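A short sketch of the same idea follows, with hypothetical function names. It uses double, which is eight bytes wide on most systems although the language does not guarantee it, so the byte count is best compared against 6 * sizeof(double). Both pointers are assumed to point into the same block of closely-packed objects.

```c
#include <stddef.h>

/* Number of objects between two pointers to double. */
ptrdiff_t objects_between(const double *ptr1, const double *ptr2)
{
    return ptr1 - ptr2;     /* counted in objects, not bytes */
}

/* Number of bytes between the same two addresses. Converting to
   char * first makes the subtraction count single bytes. */
ptrdiff_t bytes_between(const double *ptr1, const double *ptr2)
{
    return (const char *)ptr1 - (const char *)ptr2;
}
```

Given an array declared as double data[8];, calling objects_between(&data[6], &data[0]) yields 6, while bytes_between with the same arguments yields 6 * sizeof(double), which is 48 when a double is eight bytes wide.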

While it may not be obvious at first glance, the addition of two pointer values is not defined in C, and the compiler will reject an attempt to add them. In terms of objects of a particular data type, the sum of two addresses is completely devoid of meaning.

Finally, let's consider the line:

c = *(ptr - 1);

Everything needed to interpret this line has already been discussed, but this is a good time to point out one of the common mistakes that programmers make when working with pointers and performing pointer arithmetic - namely that the dereference operator has higher priority than the arithmetic operators, hence if we had written:

c = *ptr-1;

The result would have been perfectly legal code that would have compiled and executed and done something we hadn't wanted it to do - it would have gone to the address pointed to by ptr, retrieved a value of type char from there, subtracted one from that value, and stored the result in the variable c. What we wanted it to do was to calculate the address of the object of type char immediately preceding the one presently pointed to by ptr, retrieve the value from that location, and store the result in the variable c - hence the parentheses are essential to getting the code to behave as intended.
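The two expressions can be compared side by side. In this sketch the function names are hypothetical, and ptr is assumed to point at a character that has at least one character before it in the same string.

```c
/* *ptr - 1 : fetch the char that ptr points to, then subtract one
   from that value. The pointer itself is never moved. */
char value_minus_one(const char *ptr)
{
    return (char)(*ptr - 1);
}

/* *(ptr - 1) : step back one object first, then fetch the char
   that is stored at that earlier address. */
char previous_value(const char *ptr)
{
    return *(ptr - 1);
}
```

If ptr points at the 'C' in the string "AC", value_minus_one returns 'B' on an ASCII system (the code for 'C' less one), while previous_value returns 'A', the character actually stored just before it.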

Now let's write the original code in the more condensed format alluded to previously:

char *ptr;

ptr = "First String";
if ( '\0' != *ptr )
{
   PutC(*ptr++);
   while ( '\0' != *ptr )
      ptr++;
   PutC(*(ptr-1));
}
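For readers who want to experiment, the condensed code might be packaged as a function along the following lines. This is a sketch under stated assumptions: PutC is defined here as a thin wrapper around the standard putchar, which is only a guess at what the course's PutC does, and the function name and its return value are additions made for illustration.

```c
#include <stdio.h>

/* Stand-in for the course's PutC(): assumed here to simply write
   one character to standard output. */
static void PutC(char c)
{
    putchar(c);
}

/* Print the first and last characters of a null-terminated string.
   Returns the number of characters printed: 0 for an empty string,
   otherwise 2. */
int put_first_and_last(const char *ptr)
{
    if ('\0' == *ptr)
        return 0;            /* empty string: nothing to print */

    PutC(*ptr++);            /* print the first character, then advance */

    while ('\0' != *ptr)     /* walk forward to the terminator */
        ptr++;

    PutC(*(ptr - 1));        /* the character just before the terminator */

    return 2;
}
```

Calling put_first_and_last("First String") prints Fg. For a string containing a single character, the first and last characters are the same one, so it is printed twice, which is also how the original code behaves.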

Acknowledgements

The author would like to acknowledge the following individual(s) for their contributions to this module:

Loren Blaney from the 6502 Users Group for proofreading the entire module, pointing out countless typos and grammatical errors, and making numerous suggestions that led to a better module.
