Understanding Text Encodings
From REALbasicWiki
| Overall article skill | ✭ |
What is a “string”? A string is a sequence of bytes that represent characters, together with (implied or explicit) information about the length of the string and how to interpret the bytes.
There are different kinds of strings. In C, a string is an array of bytes (type “char” or “unsigned char”) that has a fixed size, and by convention the meaningful part of the array, which is the actual string, ends when the first byte with the value zero is reached. That is what you call a CString or zero-terminated string.
In Pascal, things are different. A string is an array of bytes as well, but the first byte of that array does not contain a character; it is a “length byte” that tells you how long the string is. So you might have an array of 256 bytes, with the first byte containing the value 0x0A (decimal 10), telling you that the actual string characters are in the second through 11th bytes of the array. What comes after this is undefined; it might or might not be zero bytes. Because the length has to fit into a single byte, such a string can hold at most 255 characters. This way of storing strings is called a PString.
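The two layouts can be made visible from within REALbasic by writing the same text into a MemoryBlock in both styles. This is only a small sketch; the block size of 16 bytes and the text "Hello" are arbitrary choices for illustration.

```
Dim mb As MemoryBlock
mb = New MemoryBlock(16)   // 16 bytes of scratch space

// C style: the characters, followed by a terminating zero byte
mb.CString(0) = "Hello"
// mb.Byte(0) to mb.Byte(4) now hold 48 65 6C 6C 6F (hex), mb.Byte(5) holds 00

// Pascal style: a length byte, followed by the characters
mb.PString(0) = "Hello"
// mb.Byte(0) now holds 05, mb.Byte(1) to mb.Byte(5) hold 48 65 6C 6C 6F (hex)
```

Neither layout records which encoding the bytes are in, which is the problem the rest of this article deals with.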
But over the years, a problem came up. The ASCII code, which until then was used almost exclusively for character-to-byte mapping, contains only 128 characters. 33 of them are control characters with no human-readable meaning, so all in all it is a pretty limited character set. And a waste, too: it uses only 7 of the available 8 bits per byte. So various developers started to “extend” the ASCII character set: they assigned special characters (accented letters, umlauts, ...) to the values 128..255. But since no global standard existed, and different regions had different requirements for special characters, they implemented it differently.
Now a program that reads a series of bytes from disk has to decide which characters (“glyphs”) to display for each byte value. In the early days of DOS it was simple: your whole PC was set up for one “code page” that defined the character mappings. But if I created a text file containing an “Ä” character, my DOS PC would write the byte 0x8E onto the 5.25” floppy disk. If I then sent that disk to you by snail mail, your Apple IIe “PC”, which is set up for a different character mapping, would display a completely different character for the byte 0x8E. Result: garbage on screen.
Unicode
The computer engineers, smart as they were, spotted the dilemma and thought it would be nice if every PC had the same character set, containing all possible characters on this planet (and even Klingon [1][2], no kidding). So “Unicode” was invented. In Unicode, every character is represented by a “code point”, a number in the range 0 to 0x10FFFF, which fits comfortably into 32 bits. But as usual with us computer guys, the solution brought a new problem with it: how do you represent such a number in an array of bytes? Different geeks, different opinions: least-significant byte (LSB) first, most-significant byte (MSB) first, variable-length characters, and more. So different forms of representation were created, among them UTF-8 and UTF-16.
But one basic problem was not solved yet: when a computer receives a text file, which is just the bunch of bytes that represent the characters, the computer has to know which encoding was used to convert the characters to bytes, so it can apply the same encoding scheme to decode the bytes back to characters. This problem remains largely unsolved, as there is no universal way yet to include encoding information in a plain-text document.
How this applies to strings
With the above two string storage methods, the same problem of not knowing the encoding of text data applies to strings used inside a program. As long as you use only one encoding, you can simply stick to that, and you know that all your strings are encoded in the same way. But today you can never be sure what systems your program will run on, which countries they are in, and what language is used on that system. And when you have to read strings from files, you can almost be sure to have to deal with different encodings.
REALbasic solved this problem in an elegant way: It doesn’t use PStrings or CStrings, but rather created a String class. Internally, all strings in RB are objects, even though you don’t use them that way. The advantage of a string class is that it can store supplementary information; for example, which encoding was used for that string.
To use that feature correctly, you have to make sure that RB knows the encoding of a string when it “enters” RB. For string literals from your code, such as
mystring = "Hello world"
it’s simple, because by convention, all literal strings in RB use the UTF-8 encoding, and so the String object mystring has its ‘Encoding’ property set to UTF-8. But for strings from outside RB, for example when reading from a file or database, RB cannot know which encoding was used when the byte sequence was created. So you have to define the encoding of the string for RB.
Define and Convert
A great source of confusion in RB is when to use DefineEncoding() and ConvertEncoding(). But it’s quite simple actually.
- You use DefineEncoding() to tell RB which encoding was used when a string was created.
- You use ConvertEncoding() to ask RB to switch the encoding of a string that already has a known (to RB) encoding.
The examples below will show how to make use of them.
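As a compact illustration of the difference, the following sketch starts with a single raw byte, labels it, and then converts it. It assumes the global function forms DefineEncoding(str, enc) and ConvertEncoding(str, enc); the byte value 0xC4 is chosen because it stands for Ä in Windows Latin-1.

```
Dim s As String
s = ChrB(&hC4)                                  // one raw byte, encoding unknown to RB
s = DefineEncoding(s, Encodings.WindowsLatin1)  // still the single byte C4, now labelled as Latin-1, so RB reads it as Ä
s = ConvertEncoding(s, Encodings.UTF8)          // same character Ä, but now stored as the two bytes C3 84
```

DefineEncoding() only attaches a label and leaves the bytes alone; ConvertEncoding() rewrites the bytes so that the characters stay the same.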
Encodings as Objects
In RB, encoding information comes in the form of the TextEncoding class. There are several ways to obtain such a TextEncoding object, but the most common form is to use the constants of the *global* Encodings object, like this:
dim myencoding as TextEncoding
myencoding = Encodings.WindowsLatin1
Whenever RB wants encoding information from you, it simply expects a TextEncoding object.
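Going the other way, you can ask RB which encoding it has on record for a string. The sketch below assumes the global Encoding() function and the internetName property of TextEncoding; the variable names are only for illustration.

```
Dim s As String
Dim enc As TextEncoding
s = "Hello world"
enc = Encoding(s)   // the TextEncoding RB has stored for s, or nil
If enc = Nil Then
  MsgBox "RB does not know the encoding of this string"
Else
  MsgBox "Encoding: " + enc.internetName
End If
```

A check like this is handy while debugging, because a nil encoding is usually the first hint that a string arrived from outside RB without being defined.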
When to use DefineEncoding
As mentioned above, you use DefineEncoding() when you know the encoding of a string and want to let RB know. Most often this happens when you read a string from a file:
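dim fi as FolderItem
dim strm as TextInputStream
fi = GetFolderItem("mywindowsfile.txt")
if fi <> nil and fi.Exists then strm = fi.OpenAsTextFile // open only if the file is really there
if strm <> nil then
mystring = strm.ReadAll // read from the stream, not from the FolderItem
mystring = mystring.DefineEncoding(Encodings.WindowsLatin1) // we tell RB that this string comes from Windows; DefineEncoding returns the labelled string
strm.close
end if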
Let’s say the text file mywindowsfile.txt contains just four characters: ABCÄ. That’s three simple ASCII characters and one non-ASCII character. Because this file was created on Windows, the text file contains these four bytes:
0x41 0x42 0x43 0xC4
0xC4 is the character code of the character Ä on Windows systems that use the Latin-1 encoding. We have to tell RB that this string comes from Windows, so it knows it has to return a character Ä when it encounters the byte value 0xC4, so we use DefineEncoding(). Before calling DefineEncoding(), the Encoding property of mystring is nil, which means the encoding is unknown or undefined. Note that the DefineEncoding() call does not change the actual string bytes of mystring: it still has 0x41 0x42 0x43 0xC4 stored internally.
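As an alternative to calling DefineEncoding() after reading, the encoding can be declared on the stream itself. The sketch below assumes that your version of RB offers the Encoding property on TextInputStream; if it does not, stick to DefineEncoding(). The file name is the one from the example above.

```
Dim fi As FolderItem
Dim strm As TextInputStream
Dim s As String
fi = GetFolderItem("mywindowsfile.txt")
If fi <> Nil And fi.Exists Then strm = fi.OpenAsTextFile
If strm <> Nil Then
  strm.Encoding = Encodings.WindowsLatin1   // everything read from this stream is labelled as Latin-1
  s = strm.ReadAll
  strm.Close
End If
```

Either way the bytes of the string are left untouched; only the label that RB keeps with the string differs.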
Now we need to have the string in a different encoding, because we want to save the string to a text file that can be read by BBEdit. BBEdit expects text files to be in Macintosh encoding, MacRoman to be exact [3]. So before we write this file to disk, we convert it from whatever encoding it is in now (which happens to be WindowsLatin1 in this case). The code:
dim fi as FolderItem
dim strm as TextOutputStream
fi = GetSaveFolderItem("text/plain", "MyBBEditFile.txt")
if fi <> nil then strm = fi.CreateTextFile
if strm <> nil then
mystring = mystring.ConvertEncoding(Encodings.MacRoman)
strm.Write mystring
strm.close
end if
We call mystring.ConvertEncoding() and assign the result back to mystring. This means we have actually changed the byte representation of mystring; RB takes each character of the source string and converts it to the byte(s) that make up the same character in the target encoding. In our example, the first three bytes remain unchanged (as all bytes below 128, i.e. ASCII characters, have the same character representation in both encodings). But for the fourth byte, RB sees the byte 0xC4 and the encoding of the source string (WindowsLatin1), and therefore knows it is meant to be the character Ä. Now it finds, with the help of the OS, the correct byte value that makes up the same character in MacRoman, which happens to be 0x80. So the output of mystring.ConvertEncoding(Encodings.MacRoman) is a string that contains the four bytes 0x41 0x42 0x43 0x80 and has the encoding information MacRoman.
When we finally call strm.Write mystring, RB writes the four bytes 0x41 0x42 0x43 0x80 to the disk file. The bytes are different from what we received in the original Windows file, and still they mean the same thing on their respective platforms.
Of Strings, Characters and Bytes
Another source of confusion is the distinction between bytes and characters. “What’s the difference?”, many people ask. There’s a very simple answer to this: since the introduction of Unicode, we have multi-byte characters. That means that a string containing n characters will not necessarily be made up of n bytes; in fact, the possible range is [n .. n*4]. And the number of bytes is not always a multiple of n; in UTF-8, each character is made up of 1 to 4 bytes, depending on the character.
Knowing this is important when you have to decide whether to use the byte or the character version of an RB string function. For example, RB has both a Len() and a LenB() function. The first one returns the number of characters in a string, and the second returns the number of bytes. In some encodings, like WindowsLatin1 and MacRoman, the two are always equal. In UTF-8, they are equal only if the string contains nothing but ASCII characters. So instead of taking chances, make sure you use the appropriate version of the function. If you want to create a MemoryBlock to pass a string to a Declare function, you need to size the MemoryBlock to hold the bytes of the string, which will often be more than the number of characters.
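The MemoryBlock case might look like the following sketch. It assumes the receiving function wants a zero-terminated UTF-8 string; the Declare itself is left out, since it depends entirely on the library you are calling.

```
Dim s As String
Dim mb As MemoryBlock
s = ConvertEncoding("ABCÄ", Encodings.UTF8)

// Len(s) counts characters, LenB(s) counts bytes.
// For this string in UTF-8 the two differ, because Ä takes two bytes.
mb = New MemoryBlock(LenB(s) + 1)   // one extra byte for the terminating zero
mb.CString(0) = s                   // copy the bytes and append the zero byte
```

Sizing the block with Len(s) + 1 instead would leave it one byte too short for this string.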
Expanding on the above example, the string ABCÄ is represented in different encodings as follows:
- in MacRoman as 0x41 0x42 0x43 0x80 (4 bytes)
- in Windows Latin-1 as 0x41 0x42 0x43 0xC4 (4 bytes)
- in UTF-8 as 0x41 0x42 0x43 0xC3 0x84 (5 bytes)
- in UTF-16 (big-endian byte order) as 0x00 0x41 0x00 0x42 0x00 0x43 0x00 0xC4 (8 bytes).
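To inspect such byte sequences yourself, a small helper that lists the bytes of a string in hexadecimal is useful. HexDump is a hypothetical name for a method you would add to your own project; it is not part of RB.

```
Function HexDump(s As String) As String
  // Returns the bytes of s as two-digit hex values separated by spaces.
  Dim i As Integer
  Dim h As String
  Dim parts() As String
  For i = 1 To LenB(s)
    h = Hex(AscB(MidB(s, i, 1)))     // value of the i-th byte, in hex
    If Len(h) < 2 Then h = "0" + h   // pad single-digit values
    parts.Append h
  Next
  Return Join(parts, " ")
End Function
```

Calling it on ConvertEncoding("ABCÄ", Encodings.MacRoman), and likewise with Encodings.WindowsLatin1, Encodings.UTF8 and Encodings.UTF16BE, should show the differences between the encodings directly. The helper uses the byte versions LenB(), MidB() and AscB() throughout, because the character versions would group the bytes according to the string's encoding.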
References and notes
1. See: Proposal to encode Klingon in Plane 1 of ISO/IEC 10646-2.
2. Update: it looks like the Unicode Consortium has rejected the request.
3. Actually, the encoding expected by BBEdit depends on your default Mac OS language. But for languages written in the Roman script, like English or German, that is MacRoman.
Copyright notice
This article was originally written by Frank Bitterlich and published on the now-defunct RBDocs site under the Creative-Commons Non-Commercial, Attribution-Required, Share-Alike License.
