ECE-1021

ASCII

(Last Mod: 27 November 2010 21:38:39 )


Character Representation

While the ability to represent numerical values in a computer is critical, there is another type of data that is also extremely important - text. Therefore we need to devise a means of representing textual information in some form of binary format since that is the only format available to us in a computer.

If you are like many people, at some point while you were growing up you exchanged "secret messages" with a friend using a code where you represented the letter 'A' by the number '1' and so forth, ending with 'Z' being represented by the number '26'. In essence, you were encoding text information as numbers and storing those numbers on some type of media (a piece of paper). You then gave that media to someone else who interpreted the numbers stored there according to that same code and, as a result, was able to recover the original text information. Anyone else that saw this list of numbers simply saw a list of numbers - you hoped, anyway. They might think it was a set of lottery numbers, or locker combinations, or any of a number of different things. The only time these numbers served to convey a text message was when the receiving party understood that they represented a text message and understood how that text was represented. Both of these had to be true.

The means by which we store text in computer memory is almost identical to this approach - as is the fact that, once we store that text someplace, the successful retrieval of that text, either by us or someone else, requires that the retriever first know that a certain set of binary values actually represents a text message and, second, know how to decode that text from those values. Otherwise, those binary values are just that - a bunch of binary values that could mean anything.
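The childhood code described above can be sketched in a few lines of C. This is only an illustration: the function names are made up for this example, and the arithmetic assumes that the letters 'A' through 'Z' have consecutive codes, which is true of ASCII but is not promised by every character set.

```c
#include <stddef.h>

/* Hypothetical helpers for the "secret message" code:
   'A' -> 1, 'B' -> 2, ..., 'Z' -> 26.
   Assumes 'A'..'Z' are consecutive, as they are in ASCII. */

/* Encode one uppercase letter; anything else encodes as 0. */
int encode_letter(char letter)
{
    if (letter >= 'A' && letter <= 'Z')
        return letter - 'A' + 1;
    return 0;
}

/* Decode one number; anything outside 1..26 decodes as '?'. */
char decode_number(int number)
{
    if (number >= 1 && number <= 26)
        return (char)('A' + number - 1);
    return '?';
}

/* Decode a list of numbers into a NUL-terminated string.
   The caller must provide room for count + 1 characters. */
void decode_message(const int *numbers, size_t count, char *text)
{
    size_t i;

    for (i = 0; i < count; i++)
        text[i] = decode_number(numbers[i]);
    text[count] = '\0';
}
```

Given the list 8, 5, 12, 12, 15, decode_message() produces the text HELLO. Without knowing the code, the same list is just five numbers.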


ASCII

The most common code for text in use today is the American Standard Code for Information Interchange, or ASCII for short (pronounced ask-key). The ASCII code was developed in the 1960s for the purpose of permitting data processing equipment, such as teletypewriters, to communicate with each other. It was determined that seven bits were necessary to provide enough codes for all of the letters, digits, and punctuation marks found on a standard typewriter keyboard plus a host of special codes used to control how the data was displayed on the receiving equipment or other aspects of data transmission and reception.

Although each machine would be expected to have its own peculiarities about how it displayed text, the ASCII standard specified a minimum set of behaviors that was expected to be supported. One of the subtler points in this was cursor control. The cursor, in this case, is nothing more than a means of describing where the next character would be printed. On some machines, a person looking at the screen would be able to see the cursor - much as most people are familiar with doing today. On other machines the cursor would not be represented on the display at all and on some machines, such as teletypewriters, such a display would not even be possible. 

Keeping in mind that the display devices at the time were dominated by teletypewriters, the actions required to be supported by the ASCII standard are limited to those that can be supported by a typical teletypewriter. That is why there is no means of arbitrarily moving the cursor to a new location or moving up a line. Other operations, such as the backspace and delete codes, were not required to be supported but were provided because of common tricks used to produce special characters on typewriters by printing multiple characters at the same location. For instance, one commonly used sign for division came about by printing a colon and then backspacing and printing a hyphen. 

Since almost all machines today are centered on the eight-bit byte, we use a single byte for each ASCII code with the most significant bit set to zero. Most machines also support a second set of characters that have the most significant bit set to one. Unfortunately, these "extended" characters are not standardized and vary greatly from machine to machine and even compiler to compiler. It is best to avoid the use of these extensions unless you are willing to accept the extreme compatibility and portability limitations that will result.
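Whether a byte holds a standard ASCII code or one of the non-standard "extended" characters can be checked by testing the most significant bit. The following is a minimal sketch; the function name is hypothetical.

```c
/* Returns 1 if the byte holds a standard 7-bit ASCII code, meaning
   that the most significant bit (0x80) is zero. Returns 0 for the
   non-standard "extended" codes 0x80 through 0xFF. */
int is_standard_ascii(unsigned char byte)
{
    return (byte & 0x80) == 0;
}
```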

7-bit ASCII Chart

     0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
00  NUL SOH STX ETX EOT ENQ ACK BEL BS  HT  LF  VT  FF  CR  SO  SI
10  DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM  SUB ESC FS  GS  RS  US
20  SP  !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /
30  0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?
40  @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
50  P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _
60  `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o
70  p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~   DEL

In the above table, the upper (most significant) nibble is in the leftmost column while the lower (least significant) nibble is in the top row. The actual hex value is obtained by adding the row and column labels together for the character in question. For instance, the ASCII code for the letter 'G' is 0x47 while the ASCII code for the tilde, '~', is 0x7E.
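The row-plus-column lookup can be expressed directly in C. The sketch below uses hypothetical function names and takes the row and column labels exactly as they appear in the chart (rows 0x00 to 0x70, columns 0x0 to 0xF).

```c
/* Combine a row label (0x00, 0x10, ..., 0x70) and a column label
   (0x0 through 0xF) from the chart into the ASCII code. */
int ascii_from_chart(int row, int column)
{
    return row + column;
}

/* Going the other way: split a 7-bit code into its chart labels. */
int chart_row(int code)
{
    return code & 0x70;     /* upper nibble, kept in place */
}

int chart_column(int code)
{
    return code & 0x0F;     /* lower nibble */
}
```

For example, row 0x40 and column 0x7 combine to 0x47, the code for 'G', while 0x7E splits into row 0x70 and column 0xE.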

Code 0x20, SP, is simply the blank space character. Denoting it by SP makes it clear that there is a character there, as opposed to a box left inadvertently empty.

Of the 128 codes in the standard ASCII set, 95 of them are what are known as "printing characters" meaning that they print something - although in the case of the space character nothing is physically printed - and advance the active position of the display one position. The remaining 33 characters are known as "control codes" or "control characters". Notice that the use of the terms "code" and "character" are pretty much interchangeable - use whichever one makes the most sense in the context of the discussion. Most of these control codes are for equipment no longer in use today. The C Language Standard only specifies the expected behavior for eight of them. Even then, the only ones you are likely to deal with on a regular basis are the Null Character, the Horizontal Tab, the Line Feed, and the Carriage Return. The Bell can also be useful to sound an audible tone to get the User's attention - perhaps letting them know that some long computation has completed or that an error has been detected.

Because control codes cannot be entered directly as text - they are not printing characters - a means of embedding them into a program's source code must be devised. The C language uses so-called "escape sequences" to perform this task. An escape sequence is nothing more than a backslash character followed by another character that specifies the actual character that is intended. The following is the list of the simple escape sequences specified by the C Language Standard.

Simple Escape Sequences

Hex   Decimal  Code  Name             C Escape Sequence
0x00  0        NUL   Null Character   \0
0x07  7        BEL   Bell (Buzzer)    \a
0x08  8        BS    Backspace        \b
0x09  9        HT    Horizontal Tab   \t
0x0A  10       LF    Line Feed        \n
0x0B  11       VT    Vertical Tab     \v
0x0C  12       FF    Form Feed        \f
0x0D  13       CR    Carriage Return  \r
0x22  34       "     Double Quote     \"
0x27  39       '     Single Quote     \'
0x3F  63       ?     Question Mark    \?
0x5C  92       \     Backslash        \\

The first of these is, technically, not a simple escape sequence (the phrase "simple escape sequence" has a specific meaning in the C Language Standard). Instead, it is an "octal escape sequence". Any code can be specified by following the backslash character with a one- to three-digit octal number. The code zero just happens to be the NUL character.

The next seven escape sequences are the "alphabetic escape sequences" for fairly obvious reasons. They are the only recognized escape sequences that are indicated by an alphabetic character. The final four escape sequences are defined so as to provide a means of specifying characters that otherwise might be interpreted by the compiler as meaning something else. For instance, as soon as the compiler sees a backslash character it will interpret it as signaling an escape sequence - so if an actual backslash character is meant it must be indicated via an escape sequence.
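As an illustration of how these escape sequences look in source code, the sketch below maps each character from the table back to the way it is spelled in a C program. The function name is hypothetical. Notice that every backslash that should appear in the returned text must itself be written as an escape sequence.

```c
#include <stddef.h>

/* Returns the source-code spelling of a character from the escape
   sequence table, or NULL if the character is not in that table.
   Hypothetical helper written for this example only. */
const char *escape_spelling(char c)
{
    switch (c) {
    case '\0': return "\\0";    /* octal escape: NUL          */
    case '\a': return "\\a";    /* alphabetic escapes ...     */
    case '\b': return "\\b";
    case '\t': return "\\t";
    case '\n': return "\\n";
    case '\v': return "\\v";
    case '\f': return "\\f";
    case '\r': return "\\r";
    case '\"': return "\\\"";   /* escapes for characters the */
    case '\'': return "\\'";    /* compiler would otherwise   */
    case '\?': return "\\?";    /* interpret differently      */
    case '\\': return "\\\\";
    default:   return NULL;
    }
}
```

An octal escape sequence works for any code, not just NUL. For example, '\101' has the value 65 (0x41), which is the ASCII code for 'A', so escape_spelling('\101') returns NULL because 'A' needs no escape sequence.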

ASCII Control Codes

For the sake of completeness, the following table gives the full names of all of the control codes.

CODE  NAME                 CODE  NAME             CODE  NAME                       CODE  NAME
NUL   Null                 BS    Backspace        DLE   Data Link Escape           CAN   Cancel
SOH   Start of Heading     HT    Horizontal Tab   DC1   Device Control 1           EM    End of Medium
STX   Start of Text        LF    Line Feed        DC2   Device Control 2           SUB   Substitute
ETX   End of Text          VT    Vertical Tab     DC3   Device Control 3           ESC   Escape
EOT   End of Transmission  FF    Form Feed        DC4   Device Control 4           FS    File Separator
ENQ   Enquiry              CR    Carriage Return  NAK   Negative Acknowledge       GS    Group Separator
ACK   Acknowledge          SO    Shift Out        SYN   Synchronous Idle           RS    Record Separator
BEL   Bell                 SI    Shift In         ETB   End of Transmission Block  US    Unit Separator
                                                                                   DEL   Delete

ASCII Elements

Control Codes

The control codes are those codes that control the equipment, including the cursor, without producing any text on the screen. All codes between 0x00 and 0x1F are control codes, as is the last code, 0x7F. With the exception of the DEL code (0x7F), notice that a code is a control code exactly when its upper two bits (upper three bits in an eight-bit byte) are set to zero (also referred to as being cleared).
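The bit pattern just described translates into a one-line test. This is a sketch with a hypothetical function name. It assumes the code is passed as a value from 0 to 255, so it masks the upper three bits of an eight-bit byte.

```c
/* Returns 1 if the code is an ASCII control code.
   Expects a value in the range 0..255. */
int is_control_code(int code)
{
    /* Upper three bits of the byte all clear: 0x00 through 0x1F.
       DEL (0x7F) is the one exception that must be checked by itself. */
    return (code & 0xE0) == 0 || code == 0x7F;
}
```

The standard library function iscntrl() from <ctype.h> serves a similar purpose and is generally the better choice in real programs.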

Printing Characters

The printing characters are the complement of the control codes in that all of the ASCII codes fall into exactly one of these two categories. In general - and as the name suggests - the printing characters are codes that cause exactly one character to be printed and the cursor to be advanced one place to the right of its prior position (and possibly then wrapped to the next line). There is one code that does not cause anything visible to be printed, namely the space (SP) character (0x20). This character is included in the set of printing characters because it has the same effect on the cursor as any of the other printing characters.

Graphical Characters

As noted above, the space character is different from the others in that it produces no visible effect on the screen - remember, we are assuming that the cursor is not visible. Therefore a slightly smaller subset of characters is the graphical characters, which actually produce some type of graphical image on the display. In short, the set of graphical characters is simply the set of printing characters less the space character. Conversely, the set of printing characters is the combination of the graphical characters and the space character.
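The relationship between the printing and graphical characters can be sketched as two range tests. The function names are hypothetical, and the ranges are written as hex codes so that they can be compared directly with the ASCII chart.

```c
/* Printing characters: everything from SP (0x20) through '~' (0x7E). */
int is_printing(int code)
{
    return code >= 0x20 && code <= 0x7E;
}

/* Graphical characters: the printing characters less the space. */
int is_graphical(int code)
{
    return is_printing(code) && code != 0x20;
}

/* Count how many of the 128 ASCII codes belong to a given set. */
int count_codes(int (*belongs)(int))
{
    int code;
    int count = 0;

    for (code = 0x00; code <= 0x7F; code++)
        if (belongs(code))
            count++;
    return count;
}
```

Calling count_codes(is_printing) yields 95 and count_codes(is_graphical) yields 94, which, together with the 33 control codes, accounts for all 128 codes. The library functions isprint() and isgraph() in <ctype.h> make the corresponding tests.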

White space

White space refers to any code that moves the cursor without producing any visible effect. It includes not only the blank space character, but also the six control codes that produce a backspace (BS), horizontal tab (HT), line feed (LF), vertical tab (VT), form feed (FF), or carriage return (CR). Notice that all six of these control codes are grouped together between 0x08 and 0x0D (inclusive).
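Because the six control codes are contiguous, the white space test as defined in this section reduces to one range check plus a check for the space character. The function name below is hypothetical. Be aware that this follows the definition given here; the library function isspace() in <ctype.h> differs in that it does not treat the backspace (0x08) as white space.

```c
/* White space as defined in this section: the space character plus
   the six control codes BS, HT, LF, VT, FF and CR (0x08..0x0D).
   Note: unlike this function, the standard isspace() excludes BS. */
int is_white_space(int code)
{
    return code == 0x20 || (code >= 0x08 && code <= 0x0D);
}
```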

Uppercase Characters

The set of uppercase characters is self-explanatory and is composed simply of the codes between 'A' and 'Z' inclusive.

Lowercase Characters

The set of lowercase characters is self-explanatory and is composed simply of the codes between 'a' and 'z' inclusive.

Numerical Characters

The set of numerical characters is the set of characters that represent each of the decimal digits. Since, like the alphabetical characters, these are arranged in order from '0' to '9', the codes that make up this set are simply those between '0' and '9' inclusive.

Alphabetical Characters

This is simply the union of the set of uppercase characters with the set of lowercase characters.

Alphanumerical Characters

This is simply the union of the set of alphabetical characters with the set of numerical characters.
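The uppercase, lowercase, numerical, alphabetical, and alphanumerical sets build on one another, which the following sketch mirrors. The function names are hypothetical. The range tests on letters assume an ASCII character set, where 'A' to 'Z' and 'a' to 'z' are each contiguous.

```c
/* Range tests; the letter ranges assume ASCII ordering. */
int is_uppercase(int code)
{
    return code >= 'A' && code <= 'Z';
}

int is_lowercase(int code)
{
    return code >= 'a' && code <= 'z';
}

int is_numerical(int code)
{
    return code >= '0' && code <= '9';
}

/* Union of the uppercase and lowercase sets. */
int is_alphabetical(int code)
{
    return is_uppercase(code) || is_lowercase(code);
}

/* Union of the alphabetical and numerical sets. */
int is_alphanumerical(int code)
{
    return is_alphabetical(code) || is_numerical(code);
}
```

The <ctype.h> functions isupper(), islower(), isdigit(), isalpha(), and isalnum() provide the corresponding tests in the standard library.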

Punctuation Characters

Any graphical character that is not a member of the set of alphanumerical characters is considered a punctuation character. Because these are scattered around the table of codes to fill in the gaps between the more organized groups, the easiest way to check if a character is a punctuation character is just as defined - check if it is both a graphical character and not an alphanumerical character.
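The definition above can be coded exactly as stated. The sketch below is self-contained, so it repeats hypothetical helper tests for the graphical and alphanumerical sets, written with hex codes taken from the ASCII chart.

```c
/* Graphical characters: '!' (0x21) through '~' (0x7E). */
static int is_graphical(int code)
{
    return code >= 0x21 && code <= 0x7E;
}

/* Alphanumerical characters: '0'..'9', 'A'..'Z', 'a'..'z'. */
static int is_alphanumerical(int code)
{
    return (code >= 0x30 && code <= 0x39) ||
           (code >= 0x41 && code <= 0x5A) ||
           (code >= 0x61 && code <= 0x7A);
}

/* Punctuation: graphical, but not alphanumerical. */
int is_punctuation(int code)
{
    return is_graphical(code) && !is_alphanumerical(code);
}
```

Of the 94 graphical characters, 62 are alphanumerical, which leaves 32 punctuation characters.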

Hexadecimal Characters

The set of hexadecimal characters, like the numerical characters, is the set of characters that represent each of the hexadecimal digits. In addition to the normal decimal digits, the first six alphabetical characters (both uppercase and lowercase) are included as well.
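A test for hexadecimal characters, along with the digit value each one represents, might be sketched as follows. The function names are hypothetical, and the letter ranges assume that 'A' to 'F' and 'a' to 'f' are contiguous, as they are in ASCII.

```c
/* Returns 1 if the code is one of '0'..'9', 'A'..'F' or 'a'..'f'. */
int is_hex_digit(int code)
{
    return (code >= '0' && code <= '9') ||
           (code >= 'A' && code <= 'F') ||
           (code >= 'a' && code <= 'f');
}

/* Returns the value (0..15) that a hexadecimal character represents,
   or -1 if the code is not a hexadecimal character. */
int hex_digit_value(int code)
{
    if (code >= '0' && code <= '9')
        return code - '0';
    if (code >= 'A' && code <= 'F')
        return code - 'A' + 10;
    if (code >= 'a' && code <= 'f')
        return code - 'a' + 10;
    return -1;
}
```

The standard library offers isxdigit() in <ctype.h> for the membership test.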


Acknowledgements

The author would like to acknowledge the following individual(s) for their contributions to this module: