ECE-1021

Representing ASCII Codes Symbolically in C

(Last Mod: 27 November 2010 21:38:43 )

Objectives
Character Constants
Escape Sequences
Character Expressions

Objectives

Understand how ASCII codes are represented symbolically in C.
Understand character constants.
Understand character escape sequences.

Character Constants

Whenever the compiler sees a single character surrounded by single quotes (unless it is within a string literal or a comment), it treats it as a "character constant": the compiler looks up the code for that character and uses that integer value in place of the quoted character. So the following line of code:

putc('H', stdout);

Would translate (on most machines) to the following line of code at compile time:

putc(72, stdout);

Because 72 (in decimal) is the ASCII code for an uppercase H.

By allowing us to use 'H' instead of 72 the compiler greatly increases the convenience of using the text functions. In fact, because of this character-to-integer translation facility built into the compiler, we very rarely need to be concerned with what the actual values of the codes are that represent specific characters.

But this facility serves another very useful purpose as well. We could have typed 72 as the first argument and on almost all C compilers it would have worked as we expected it to. But the assumption we would have been making is that all C compilers use the ASCII code to represent characters - and not all of them do. Not all processors support ASCII, particularly older processors or even some modern microcontrollers that are not intended to perform text operations. So the writer of the C compiler for such a platform may have had little choice but to use something other than ASCII to represent characters. In that case, 72 is probably not going to be the correct code for an uppercase H - but 'H' will still work because the person who wrote the compiler also wrote the character-to-integer translation facility for that compiler.
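The difference between the two spellings can be sketched as follows. This is only an illustration, assuming a compiler that provides <stdio.h>; the function names write_uppercase_h and charset_looks_like_ascii are invented for this example. The first function never mentions 72, so it makes no assumption about the character set. The second simply reports whether 'H' happens to be 72 on the compiler at hand.

```c
#include <stdio.h>

/* Writes an uppercase H to the given stream using the character constant,
   so no particular character code is assumed.
   Returns whatever putc returns (the character written, or EOF on error). */
int write_uppercase_h(FILE *out)
{
    return putc('H', out);
}

/* Returns 1 if this compiler maps 'H' to 72, as ASCII does, and 0 otherwise. */
int charset_looks_like_ascii(void)
{
    return 'H' == 72;
}
```

Calling write_uppercase_h(stdout) prints an H regardless of the character set, while charset_looks_like_ascii() would return 0 on a compiler that does not use ASCII codes.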

Escape Sequences

Most characters can be represented as character constants in just the manner described above - namely placing the character between single quotes. But how would we represent a character constant consisting of a single quote? We can't - at least not using the method described above. Similarly, there are other characters that present problems. For instance, what if we want to specify the code for a carriage return or the code that sounds the alarm?

We can specify a character that we can't use directly by way of an "escape sequence" which is a backslash character, '\', followed by one or more characters that tell the compiler which code to actually use. There are four types of escape sequences - simple, octal, hexadecimal, and universal character name. The last is beyond the scope of this lesson and will not be discussed.

Simple escape sequences are quite friendly but only give access to a few of the commonly used character codes that can't be represented directly. The other two allow us to specify any character code but are much less readable.

All of these methods begin the same way - by initiating an "escape sequence" using the backslash character. When the compiler encounters a backslash character in a context in which it is expecting to see a character constant, it skips the backslash and interprets the character following it as a special instruction and not as a character constant. If the character following the backslash is not one of the specific characters described in the sections below, the sequence is not a valid escape sequence.

Simple Escape Sequences

There are four characters that can give us problems when we try to specify them directly. First, we have already seen that the single-quote will be troublesome and, second, we can expect the backslash to be troublesome. We will discover in a later lesson that the double-quote is troublesome within a string literal. Finally, the C standard also provides for the possibility that the question mark could be troublesome in some implementations, although this is not an issue for TurboC v4.5.

These four characters are specified by simply following the backslash with that character as follows:

\' - specifies a single quote character
\" - specifies a double quote character
\\ - specifies a single backslash
\? - specifies a question mark.

In instances where using one of these characters directly would not cause confusion for the compiler, you may either use it directly or use the corresponding escape sequence.
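As a small sketch of the four simple escape sequences, the functions below return or compare the codes involved. The function names are made up for this illustration. Inside a character constant the escape is required for the single quote and the backslash, and optional for the double quote and the question mark.

```c
/* Each of these returns the code of a character that cannot be written
   directly between single quotes. */
int single_quote_code(void) { return '\''; }
int backslash_code(void)    { return '\\'; }

/* Returns 1 if the escaped and the direct spellings of the double quote
   and the question mark give the same code. */
int optional_escapes_match(void)
{
    return ('\"' == '"') && ('\?' == '?');
}
```

On a compiler that uses ASCII, single_quote_code() returns 39 and backslash_code() returns 92; optional_escapes_match() should return 1 on any compiler.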

In addition to the four characters above, there are seven additional character escape sequences - known as "alphabetic escape sequences" - used to represent the more commonly used nongraphic characters. Most of these are used to move the active position of the display without actually displaying anything.

\a - (alert) (ASCII 0x07) (7 in decimal)
- Audible or visual alert - no effect on active position.
\b - (backspace) (ASCII 0x08) (8 in decimal)
- Moves the active position back one position on the current line.
- Behavior is unspecified if the active position is the first position on the line.
\f - (form feed) (ASCII 0x0C) (12 in decimal)
- Moves the active position to the initial position on the next logical page.
\n - (new line) (ASCII 0x0A) (10 in decimal)
- Moves the active position to the start of the next line.
\r - (carriage return) (ASCII 0x0D) (13 in decimal)
- Moves the active position to the start of the current line.
\t - (horizontal tab) (ASCII 0x09) (9 in decimal)
- Moves the active position to the next tab position on the current line.
- Behavior is unspecified if there are no further tab positions.
\v - (vertical tab) (ASCII 0x0B) (11 in decimal)
- Moves the active position to the initial position of the next vertical tab position.
- Behavior is unspecified if there are no further tab positions.

As with all things C, these escape sequences are case sensitive.

These escape sequences are specified to have the behaviors indicated whether or not ASCII is used and regardless of the actual value each sequence maps to within the character set that is used.
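One way to see that the alphabetic escape sequences are ordinary character constants is to use them as case labels. The following is a minimal sketch; the function name escape_letter is hypothetical. Because the switch compares against the escape sequences themselves rather than against numeric codes, it does not depend on the character set in use.

```c
/* Returns the letter that follows the backslash in the alphabetic escape
   sequence whose code is c, or 0 if c is not one of the seven codes. */
int escape_letter(int c)
{
    switch (c) {
    case '\a': return 'a';   /* alert */
    case '\b': return 'b';   /* backspace */
    case '\f': return 'f';   /* form feed */
    case '\n': return 'n';   /* new line */
    case '\r': return 'r';   /* carriage return */
    case '\t': return 't';   /* horizontal tab */
    case '\v': return 'v';   /* vertical tab */
    default:   return 0;     /* not an alphabetic escape */
    }
}
```

For example, escape_letter('\n') returns 'n'. If the compiler uses ASCII, escape_letter(10) gives the same result, since 10 is the ASCII code for new line.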

Octal Escape Sequences

If the backslash is followed by one to three octal digits (i.e., the digits from 0 through 7), then the value represented by those digits, in base-8, is the explicit value of the constant. This is known as an "octal escape sequence". For instance, since 110 in base-8 (72 in decimal) is the ASCII code for an uppercase H, the following could be used to print an uppercase H:

putc('\110', stdout);

Unlike the simple escape sequences, this method directly specifies the value to use regardless of what the character set is or what that value will actually do when displayed. Hence, it is not recommended to specify character codes directly unless necessary.
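The base-8 arithmetic behind an octal escape sequence can be sketched as below. This is an illustration only: the function octal_escape_value is invented here, and it takes its digits as a string, which looks ahead to a later lesson. It mimics the conversion the compiler performs on the one to three digits that follow the backslash.

```c
/* Converts a string of one to three octal digits (for example "110") to the
   value an octal escape sequence with those digits would have.
   Returns -1 if the string is empty, longer than three digits, or contains
   a character that is not an octal digit. */
int octal_escape_value(const char *digits)
{
    int value = 0;
    int count = 0;

    while (digits[count] != '\0') {
        if (count == 3 || digits[count] < '0' || digits[count] > '7')
            return -1;
        /* Shift the running value one octal place and add the new digit. */
        value = value * 8 + (digits[count] - '0');
        count++;
    }
    return (count == 0) ? -1 : value;
}
```

With the digits "110" the function computes 1*64 + 1*8 + 0 = 72, the same value as the character constant '\110'.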

One very important exception to this is the NUL character, which is used to terminate character strings in C - the topic of a future lesson. The NUL character is defined as being a character constant with all bits cleared (i.e., equal to zero). The integer equivalent of this bit pattern is zero and hence we can use '\0' to represent this character constant regardless of what character set is used. It is the only character code that has a specified value independent of the character set.
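As a brief look ahead, the sketch below shows the typical use of '\0': scanning a character string until the terminating NUL is found. The function name count_before_nul is hypothetical, and strings themselves are covered in a later lesson.

```c
#include <stddef.h>

/* Counts the characters in s that come before the terminating NUL. */
size_t count_before_nul(const char *s)
{
    size_t n = 0;

    while (s[n] != '\0')
        n++;
    return n;
}
```

Because '\0' is zero in every character set, this loop does not rely on ASCII.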

Hexadecimal Escape Sequences

If we want to use a hexadecimal representation of the character code, we can do so by following the backslash character with a lowercase x followed by one or more hexadecimal digits. For instance:

putc('\x0048', stdout);

The same caution about using this method to directly specify characters that applied to octal escape sequences also applies to these sequences.
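To tie the notations together, the following sketch checks that the character constant, the octal escape, and the hexadecimal escape for an uppercase H all have the same value. The function name three_spellings_agree is made up for this example, and the result is expected to be true only on a compiler that uses ASCII, where 'H' is 72.

```c
/* Returns 1 if the direct, octal, and hexadecimal spellings of uppercase H
   all have the same value on this compiler, and 0 otherwise.
   Leading zeros in the hexadecimal form do not change the value. */
int three_spellings_agree(void)
{
    return ('H' == '\110') && ('\110' == '\x48') && ('\x48' == '\x0048');
}
```

On a compiler with a different character set, the two numeric escapes would still equal each other, but they would no longer equal 'H'.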

Character Expressions

People frequently lose sight of the fact that characters, including the ones you are viewing on the screen at this moment (assuming you aren't looking at a hardcopy printout), are represented as integer values. We see characters only because some piece of hardware looked at that integer value and, having been told to interpret it as the code for a particular character, managed to generate the control signals necessary to display something that looks like the character that code represents on our screen.

The fact that we can choose to represent a particular integer by typing a character constant does not alter the fact that all we have really done is simply enter an integer constant in a different way. That integer is still an integer and can be used any place an integer constant can be used.

One way to avoid falling into this trap is to read the expression 'H' not as "an uppercase H" but rather as "the ASCII code for an uppercase H." If you do this consciously and consistently for just a short time, you will soon discover that you have started to view character constants automatically as integer values.
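The sketch below makes the same point in code. The function names are hypothetical. The first relies on the fact that in C (though not in C++) a character constant has type int; the second uses a character constant as a loop bound, a place where an integer is expected.

```c
/* Returns 1 if a character constant occupies the same storage as an int. */
int char_constant_is_int_sized(void)
{
    return sizeof('H') == sizeof(int);
}

/* Counts the integers from 0 up to, but not including, the code for 'H'. */
int count_up_to_h(void)
{
    int n = 0;
    int i;

    for (i = 0; i < 'H'; i++)
        n++;
    return n;
}
```

With an ASCII compiler, count_up_to_h() returns 72, exactly as if 72 had been typed in the loop condition.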

For all of the following examples, we assume that our compiler uses ASCII to represent characters.

Example 1: Arbitrary use of character constants

For instance, the following lines of code are perfectly valid:

double x['\n']['A'];

x['\a'+2][' '+'\t'] = 2.0 * 'Z' - 'T'*'e';

These lines would simply translate to:

double x[10][65];

x[7+2][32+9] = 2.0 * 90 - 84*101;

This is not to say that the above code, while legal, is acceptable. It isn't.

Example 2: Printing a decimal digit

Assuming that the variable i has been set to some value between 0 and 9 (i.e., one of the decimal digits), the code to print out this character would be:

putc( '0'+i , stdout);

Whatever the ASCII code for the character that looks like the digit zero is, the code for the character that looks like the value stored in the variable i is simply i greater.
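The same expression can be wrapped in a small function, sketched here with the invented name digit_to_char and an added range check that the original one-liner does not have. The C standard requires the codes for '0' through '9' to be consecutive, so this particular trick is not limited to ASCII.

```c
/* Returns the character code for the decimal digit i (0 through 9),
   or -1 if i is outside that range. */
int digit_to_char(int i)
{
    if (i < 0 || i > 9)
        return -1;
    return '0' + i;
}
```

A call such as putc(digit_to_char(7), stdout) would print the character 7.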

Example 3: Printing a hexadecimal digit

Assuming that the variable i has been set to some value between 0 and 15 (i.e., one of the hexadecimal digits), the code to print out this character might be:

putc( (i<10)? '0'+i : 'A'+(i-10) , stdout);

If the conditional test (i<10) evaluates as true, we use the first expression, which is the same one used to print out a decimal digit. If it evaluates as false, we use the second expression, which uses the same idea to map the values 10 through 15 to the characters 'A' through 'F'.
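As a sketch, the conditional expression can be packaged in a function. The name hex_digit_to_char and the range check are additions for this illustration. The 'A' + (i - 10) branch assumes the codes for 'A' through 'F' are consecutive, which holds for ASCII.

```c
/* Returns the character code for the hexadecimal digit i (0 through 15),
   using uppercase letters for 10 through 15, or -1 if i is out of range. */
int hex_digit_to_char(int i)
{
    if (i < 0 || i > 15)
        return -1;
    return (i < 10) ? '0' + i : 'A' + (i - 10);
}
```

For example, hex_digit_to_char(11) yields the code for 'B', since 11 is one more than 10 and 'B' is one more than 'A'.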

Example 4: Converting an upper case character to lower case

If the value in c is the code for an uppercase character, we want to convert it to its lowercase equivalent. If it is not an uppercase character, we wish to leave it unchanged.

c = ( ('A' <= c) && (c <= 'Z') )? ( 'a' + (c - 'A') ) : ( c ) ;

If we examine the ASCII codes carefully, we will see that each uppercase letter and its lowercase equivalent are separated by 32 - the ASCII code for the space character. If we also note that the uppercase characters precede the lowercase characters - which is easy to remember, since this is probably the order we would want if we were alphabetizing a list - then we could write the above statement simply as:

c += ( ('A' <= c) && (c <= 'Z') )? ' ' : 0;

The above statement can only be expected to work with ASCII, whereas the previous version is likely to work with most character sets, since all it requires is that the codes within each case be grouped in ascending order, regardless of where the uppercase and lowercase sets sit relative to each other.
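Both versions are sketched below as functions so that they can be compared directly. The function names are made up for this illustration, and the comparison helper assumes an ASCII compiler, the only setting in which the second version is expected to be correct.

```c
/* Converts an uppercase letter to lowercase by its offset from 'A';
   any other code is returned unchanged. */
int to_lower_by_offset(int c)
{
    return (('A' <= c) && (c <= 'Z')) ? ('a' + (c - 'A')) : c;
}

/* Converts an uppercase letter to lowercase by adding the code for a space
   (32 in ASCII); any other code is returned unchanged. ASCII only. */
int to_lower_ascii_only(int c)
{
    return c + ((('A' <= c) && (c <= 'Z')) ? ' ' : 0);
}

/* Returns 1 if the two versions agree for every code from 0 through 127. */
int versions_agree(void)
{
    int c;

    for (c = 0; c < 128; c++)
        if (to_lower_by_offset(c) != to_lower_ascii_only(c))
            return 0;
    return 1;
}
```

In everyday code the standard tolower function from <ctype.h> is the usual choice; the versions here exist only to show the character arithmetic.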