Character Sets

It is easy to forget the importance of character sets when developing web applications. I would venture to guess that many developers don't know how to deal with BiDi text or how to handle multi-byte characters correctly. I have spent a considerable portion of the last month experimenting with sending Unicode characters to a web browser, and I have discovered a few hints and "gotchas" that I would like to pass along.

  • PHP is pretty much multi-byte handicapped: its core string functions work on bytes, not characters. Period.
  • In Java servlets (and I suppose JSPs), add the preferred character encoding to the Content-Type header, for example with response.setContentType("text/html; charset=utf-8"), before getting the PrintWriter from the response. This sets the encoding of the Writer to the charset named in the Content-Type header. I had been unnecessarily wrapping OutputStreamWriters around the HttpServletResponse's OutputStream to accomplish this manually. It's so much nicer (and cleaner) to have the underlying implementation do this for you.
  • Since ColdFusion runs on Java, it too automatically determines the character set and encoding to use from the Content-Type header if you specify one. To do this, use the <cfcontent> tag like this: <cfcontent type="text/html; charset=utf-8">.
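  • Here's the gotcha with ColdFusion (and perhaps JSP; I haven't tried it): templates are read using the platform's default character set, which is Windows-1252 on English Windows systems. The characters are stored internally as 16-bit Unicode (UCS-2, or UTF-16 on newer JVMs; it's Java after all) and encoded using your specified character set before being sent to the output buffer. So a template saved as UTF-8 gets misread on the way in, even though the output encoding is correct. The encoding used to read the template can be overridden with the pageEncoding attribute of the <cfprocessingdirective> tag.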
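
To make the two Java-side points above concrete, here is a minimal sketch using only the JDK, not the servlet API or ColdFusion. The class and method names (CharsetDemo, encode, decode) are made up for illustration. encode does by hand roughly what I assume the servlet container does once a charset is declared, and decode imitates a template reader that assumes a particular charset. It also assumes your JVM ships the windows-1252 charset, which common JDKs do.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; none of these names come from the servlet API.
class CharsetDemo {

    // Encode text by wrapping an OutputStreamWriter around a byte stream,
    // i.e. the manual approach described in the servlet bullet.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decode raw bytes using whatever charset the reader assumes.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Unicode escape keeps this source file itself encoding-proof.
        String original = "caf\u00e9";

        // A "template" saved as UTF-8: the e-acute becomes two bytes.
        byte[] utf8 = encode(original, StandardCharsets.UTF_8);
        System.out.println("UTF-8 byte count: " + utf8.length);

        // Read back with the platform default assumed in the post.
        Charset windows1252 = Charset.forName("windows-1252");
        String misread = decode(utf8, windows1252);
        System.out.println("Misread length: " + misread.length());

        // Read back with the charset the bytes were written in.
        String correct = decode(utf8, StandardCharsets.UTF_8);
        System.out.println("Round trip intact: " + correct.equals(original));
    }
}
```

The four-character string becomes five bytes in UTF-8, and reading those bytes as Windows-1252 yields five characters, because each byte of the two-byte sequence is treated as a character of its own. That is the same kind of damage you get when a UTF-8 template is read with the wrong default.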