I thought I'd take a break from my series on PHP session security and write about character encoding on websites, as I've been reading up on it recently. I'm not going very in-depth on the subject, but I've provided links at the bottom to other sites which do. I'll probably write my own detailed post on it one day; it's on my list.
What is character encoding?
It's the method computers use to translate the characters of human languages into a form the computer can understand and store. For example, your computer cannot simply write down the letter A; it has to store it as a string of 1s and 0s, and that string has to mean the same thing on every computer, otherwise the letters would appear different when you send a document from one computer to another. There is a bit more to it than that, but I'm keeping this quite basic as the specifics shouldn't be too important for understanding what encoding is.
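To make the "string of 1s and 0s" idea concrete, here is a minimal Python sketch. The helper name to_bits is made up for this illustration; it just encodes some text and prints each resulting byte in binary.

```python
def to_bits(text, encoding="utf-8"):
    """Return the encoded bytes of text as space-separated 8-bit groups.

    to_bits is a hypothetical helper written for this post, not a library function.
    """
    return " ".join(format(byte, "08b") for byte in text.encode(encoding))


# The letter A is stored as a single byte with the value 65.
print(to_bits("A"))
```

Any computer that agrees on the same encoding will turn those bits back into the same letter.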
Should I set encoding on my website?
In short, yes. If you don't, you're reliant on the client's browser working it out correctly, which isn't easy or 100% accurate, and the result will depend on their browser, its version and the default language settings they have chosen.
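As a rough illustration of what a wrong guess looks like, the Python sketch below encodes a word as UTF-8 and then decodes the same bytes as Latin-1. Latin-1 is only an assumed stand-in for whatever a browser might guess; real browsers use their own detection rules.

```python
# The bytes a server might send for a UTF-8 page containing the word "café".
page_bytes = "café".encode("utf-8")

# A client that wrongly assumes Latin-1 turns the two-byte "é" into two characters.
wrong_guess = page_bytes.decode("latin-1")

# A client that is told the page is UTF-8 gets the original text back.
right_guess = page_bytes.decode("utf-8")

print(wrong_guess)
print(right_guess)
```

The bytes are identical in both cases; only the client's assumption about the encoding changes what the visitor sees.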
What encoding should I use?
This depends on a few things, such as whether you're supporting older applications (and which encoding they used) and which languages you need to support, but in most situations it's best to use Unicode Transformation Format-8 (UTF-8). It is the most common encoding in use today, and it interoperates as well as it can with the other encodings out there, at least as far as English is concerned, because plain ASCII text is stored as exactly the same bytes in UTF-8. Unfortunately that interoperability isn't possible for most other languages, but as long as everything is in UTF-8 it supports any language you're ever likely to use.
One quick note to avoid confusion with anything else you read: UTF-8 is an encoding of the ISO/IEC 10646 character set (the same characters as Unicode), and it is defined in RFC 3629.
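The interoperability point can be checked with a short Python sketch. The sample text and the choice of Latin-1 as the "other" encoding are just assumptions for the example.

```python
# Plain English (ASCII) text produces the same bytes in ASCII, Latin-1 and UTF-8.
english = "Hello"
same_for_english = (
    english.encode("ascii") == english.encode("latin-1") == english.encode("utf-8")
)

# An accented letter is one byte in Latin-1 but two bytes in UTF-8,
# so the two encodings no longer agree.
accented = "é"
latin1_bytes = accented.encode("latin-1")
utf8_bytes = accented.encode("utf-8")

# UTF-8 uses between one and four bytes per character.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in ["A", "é", "€", "😀"]}

print(same_for_english, latin1_bytes, utf8_bytes, byte_lengths)
```

This is why an English-only page often looks fine whatever encoding is assumed, while pages in other languages break as soon as the guess is wrong.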
How do I use the encoding?
This depends on what type of document you're using and how it is being served. I'll cover what I think are the most common cases:
- XHTML served as XML (i.e. properly using XHTML)
- XHTML served as HTML
- HTML4 & below
- HTML5
Below is some example code showing how to declare the encoding in your web documents. It can be copied and pasted directly into your documents and will work. Just so you are aware, these examples are all for UTF-8.
XHTML served as XML / plain XML
Put this line of code right at the top of your document, even before the doctype declaration.
<?xml version="1.0" encoding="UTF-8"?>
XHTML served as HTML / HTML4 & below
Put this bit of code right at the top of the <head> area in your document.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
HTML5
Put this line of code right at the top of the <head> area in your document. The meta tag used in HTML4 would still be valid in an HTML5 document for declaring the character set; the new HTML5 method is just simpler.
<meta charset="utf-8" />
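If you want to check which encoding a document declares, the three declarations above can be picked out with a small Python sketch. The function name declared_charset and the regular expressions are my own rough approximation for illustration; they are not how browsers actually parse a page.

```python
import re

# Hypothetical patterns for the declarations shown in this post.
_PATTERNS = [
    # <?xml version="1.0" encoding="UTF-8"?>
    re.compile(r'<\?xml[^>]*encoding=["\']([\w-]+)["\']', re.IGNORECASE),
    # <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    # and <meta charset="utf-8" />
    re.compile(r'<meta[^>]*charset=["\']?([\w-]+)', re.IGNORECASE),
]


def declared_charset(markup):
    """Return the declared charset in lowercase, or None if nothing is declared."""
    for pattern in _PATTERNS:
        match = pattern.search(markup)
        if match:
            return match.group(1).lower()
    return None


print(declared_charset('<meta charset="utf-8" />'))
```

A result of None means the document is leaving the decision to the browser.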