Character Encoding

This is a fairly technical topic, but we’ll try to make it as simple as we can.


Character Encoding is the way text is represented inside a computer, which stores every character as a number. So let’s start by explaining a little about text.

If you are a native English speaker, you probably think of text as the uppercase letters A through Z, lowercase letters a through z, numeric digits 0 through 9, and some punctuation marks, such as periods (or full stops), commas, question marks, curved, curly, and square brackets, and all those characters above the numbers on your keyboard. And of course, the space character.

If you are a native Spanish speaker, you have those, plus some accented characters, like these: á é í ó ú ü ñ. And additional punctuation marks, like ¡ and ¿ (upside down ! and ?).

Character Encoding Schemes

Most websites, and most browsers, are capable of representing more than one language. One of the older Character Encoding Schemes (CES) is ISO 8859-1. This CES handles many of the Western European languages, including English, Spanish, and French. Some websites in Belize use this. It’s OK, as long as you never use any other languages.

Another is Windows-1252. Microsoft likes to do their thing their way, so this is another CES that will work with English, Spanish, French, and others. The letters, including accented characters, are handled the same way as in ISO 8859-1, but Windows-1252 adds some punctuation marks, like typographical quotes, in positions that ISO 8859-1 reserves for control codes, so those marks do not carry over between the two.
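For the technically curious, here is a minimal Python sketch of how the two schemes compare. It assumes Python's built-in codec names `latin-1` (for ISO 8859-1) and `cp1252` (for Windows-1252); the variable names are made up for illustration.

```python
# Accented letters get the same byte in both encodings.
e_acute_latin1 = "é".encode("latin-1")
e_acute_cp1252 = "é".encode("cp1252")
print(e_acute_latin1, e_acute_cp1252)  # b'\xe9' b'\xe9'

# A typographical (curly) apostrophe has a byte in Windows-1252 ...
curly_cp1252 = "’".encode("cp1252")
print(curly_cp1252)  # b'\x92'

# ... but ISO 8859-1 has no character for it, so encoding fails.
try:
    "’".encode("latin-1")
    curly_in_latin1 = True
except UnicodeEncodeError:
    curly_in_latin1 = False
print(curly_in_latin1)  # False
```

So plain and accented letters survive a move between the two schemes, while typographical quotes are the part that tends to break.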

Other schemes cover languages outside of Western Europe, such as the Eastern European languages, Arabic, Greek, and Hebrew.

Recently, the movement has been toward UTF-8, an encoding of the Unicode character set. This encoding is really great, because it can represent virtually any known language’s characters (assuming you have the font in your computer to handle it). This is the encoding we use on all our websites.
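As a small illustration, the Python sketch below encodes a few words from different scripts to UTF-8 and back again. The word choices and variable names are only examples, not part of any particular system.

```python
# Example words in Latin (Turkish), Greek, and Japanese scripts.
words = ["üçgen", "ελληνικά", "日本語"]

round_trip_ok = []
byte_lengths = []
for word in words:
    encoded = word.encode("utf-8")      # text -> bytes
    decoded = encoded.decode("utf-8")   # bytes -> text again
    round_trip_ok.append(decoded == word)
    byte_lengths.append(len(encoded))
    # Plain ASCII letters take 1 byte each; other characters take 2 to 4.
    print(len(word), "characters,", len(encoded), "bytes")

print(all(round_trip_ok))  # True
```

One encoding handles all three scripts, which is something neither ISO 8859-1 nor Windows-1252 can do.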

Encoding problems

If you have spent a good bit of time browsing the web, you have seen websites where you see funny-looking characters where normal characters should be. Like this:

Hon. King�s Secret Recording Could Have Proved Pivotal At Today�s House Meeting

This usually happens when text using one character encoding is pasted into text using a different encoding. You’ll notice that it usually involves quotation marks, or characters that are not "normal" English text characters.

We can represent the above easily in UTF-8:

Hon. King’s Secret Recording Could Have Proved Pivotal At Today’s House Meeting.
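The garbled headline can be reproduced in a few lines of Python. This is a sketch under the assumption that the text was saved as Windows-1252 and then read as UTF-8, which is one common way this kind of damage happens; we do not know what actually happened on that particular website.

```python
# Bytes as a Windows-1252 program would save "King’s"
# (the curly apostrophe becomes the single byte 0x92).
saved_bytes = "King’s".encode("cp1252")

# Read back with the wrong encoding: 0x92 on its own is not valid UTF-8,
# so the decoder substitutes U+FFFD, the replacement character.
garbled = saved_bytes.decode("utf-8", errors="replace")

# Read back with the right encoding: the original text is recovered.
recovered = saved_bytes.decode("cp1252")

print(saved_bytes)               # b'King\x92s'
print(garbled == "King\ufffds")  # True
print(recovered == "King’s")     # True
```

The bytes themselves are not damaged. They are only being interpreted with the wrong scheme, which is why knowing the original encoding is usually enough to repair the text.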

Best solution: use UTF-8

Because we are using UTF-8, if your computer has the necessary fonts, you should be able to see these foreign words:

  • üçgen (Turkish)
  • ελληνικά (Greek)
  • עברית חדשה (Hebrew)
  • ฝาน (Thai)
  • هاتف (Arabic)
  • 简体字 (Simplified Chinese), and
  • 日本語 (Japanese).

But if you can’t

Where we sometimes run into trouble is when our customers are using something like Microsoft Word to compose text to send us. Word is probably using a Windows CES, so typographical quotes and some other characters will cause trouble with the UTF-8 CES. See if you can find out how to turn off typographical quotes, and things like the autocorrection that changes something like 2nd into the same thing with raised (superscript) letters. So instead of “quote” you could type "quote", and maybe instead of 2nd you could type second. We should be able to convert anything you send us, but if we have to work harder at it, we may have to charge a little more for that.
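For readers who want to see what such a conversion might look like, here is a rough Python sketch. It assumes the incoming bytes really are Windows-1252; the function name `windows_text_to_utf8` and the table `SMART_TO_PLAIN` are hypothetical, invented for this example, and this is not a description of our actual conversion process.

```python
# Hypothetical table: typographical punctuation -> plain keyboard characters.
SMART_TO_PLAIN = str.maketrans({
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2026": "...",  # ellipsis
})


def windows_text_to_utf8(raw: bytes, plain_quotes: bool = False) -> bytes:
    """Decode Windows-1252 bytes and re-encode them as UTF-8.

    If plain_quotes is True, typographical quotes are first replaced
    with their plain keyboard equivalents.
    """
    # Raises UnicodeDecodeError if a byte is undefined in Windows-1252.
    text = raw.decode("cp1252")
    if plain_quotes:
        text = text.translate(SMART_TO_PLAIN)
    return text.encode("utf-8")


# 0x93 and 0x94 are the curly double quotes in Windows-1252.
sample = b"\x93quote\x94"
kept = windows_text_to_utf8(sample)
plain = windows_text_to_utf8(sample, plain_quotes=True)
print(kept)   # b'\xe2\x80\x9cquote\xe2\x80\x9d'
print(plain)  # b'"quote"'
```

Either result is valid UTF-8. The first keeps the typographical quotes, and the second swaps them for the plain ones you would type with typographical quotes turned off.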

