"am i correctly supporting utf-8 in my php apps?" Answer’s

Do I need to convert everything that I receive from the user agent (HTML forms and URIs) to UTF-8 when the page loads?

No. The user agent should be submitting data in UTF-8 format; if it isn't, you are losing the benefit of Unicode.

The way to ensure a user-agent submits in UTF-8 format is to serve the page containing the form it's submitting in UTF-8 encoding. Use the Content-Type header (and meta http-equiv too if you intend the form to be saved and work standalone).
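A minimal sketch of serving a form page as UTF-8 from PHP. The function name `render_form_page`, the field name `comment`, and the action `/submit.php` are made up for illustration; the only requirement taken from the text above is that the encoding is declared in the Content-Type header (and optionally in a meta element for saved copies).

```php
<?php
// Hypothetical helper: returns the HTML for a page containing a form.
function render_form_page(): string
{
    // The meta element only matters if the page is saved and opened standalone.
    return '<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Comment</title>
</head>
<body>
<form method="post" action="/submit.php">
<input type="text" name="comment">
<input type="submit">
</form>
</body>
</html>';
}

// Declare the encoding in the HTTP header; this must run before any output.
header('Content-Type: text/html; charset=utf-8');
echo render_form_page();
```

Note that the form carries no accept-charset attribute; the page encoding alone decides what the browser submits.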

I have heard that you should mark your forms as UTF-8 also (accept-charset="UTF-8")

Don't. It was a nice idea in the HTML standard, but IE never got it right. It was supposed to state an exclusive list of allowable charsets, but IE treats it as a list of additional charsets to try, on a per-field basis. So if you have an ISO-8859-1 page and an “accept-charset="UTF-8"” form, IE will first try to encode a field as ISO-8859-1, and if there's a non-8859-1 character in there, then it'll resort to UTF-8.

But since IE does not tell you whether it has used ISO-8859-1 or UTF-8, that's of absolutely no use to you. You would have to guess, for each field separately, which encoding was in use! Not useful. Omit the attribute and serve your pages as UTF-8; that's the best you can do at the moment.

If a UTF string is improperly encoded, will something go wrong?

If you let such a sequence get through to the browser you could be in trouble. There are ‘overlong sequences’ which encode a low-numbered code point in a longer sequence of bytes than is necessary. This means if you are filtering ‘<’ by looking for that ASCII character in a sequence of bytes, you could miss one, and let a script element into what you thought was safe text.
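To make the problem concrete, here is a small sketch of how an overlong form arises and why a byte-level search misses it. The helper name `overlong_two_byte` is hypothetical and exists only for illustration; nothing should ever produce these bytes deliberately.

```php
<?php
// Illustration only: build the (invalid) two-byte overlong form of an ASCII code point.
function overlong_two_byte(int $codePoint): string
{
    // Two-byte UTF-8 layout: 110xxxxx 10xxxxxx. For code points below 0x80
    // the top five payload bits are all zero, which is what makes the form overlong.
    return chr(0xC0 | ($codePoint >> 6)) . chr(0x80 | ($codePoint & 0x3F));
}

$overlong = overlong_two_byte(ord('<'));   // bytes C0 BC
echo bin2hex($overlong), "\n";

// A naive byte search for '<' finds nothing, because byte 0x3C never appears.
var_dump(strpos($overlong . 'script', '<'));
```

A decoder that tolerates overlong forms would still turn those two bytes back into ‘<’, which is why the filter has to reject them rather than just search for the ASCII byte.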

Overlong sequences were banned back in the early days of Unicode, but it took Microsoft a very long time to get their shit together: IE would interpret the byte sequence ‘\xC0\xBC’ as a ‘<’ up until IE6 Service Pack 1. Opera also got it wrong up to (about, I think) version 7. Luckily these older browsers are dying out, but it's still worth filtering overlong sequences in case those browsers are still about now (or new idiot browsers make the same mistake in future). You can do this, and fix other bad sequences, with a regex that allows only proper UTF-8 through, such as this one from W3.
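A sketch of such a check in PHP, adapted from the W3C pattern the paragraph above refers to; compare it against the W3C's published version before relying on it. The function name `is_valid_utf8` is an assumption, and the possessive `*+` quantifier is my addition to keep backtracking down on long inputs.

```php
<?php
// Hypothetical helper: true only if $bytes is well-formed UTF-8 with no
// overlong forms, no surrogates, and no control characters other than
// tab (9), line feed (10) and carriage return (13).
function is_valid_utf8(string $bytes): bool
{
    // No /u modifier: the pattern is matched byte by byte.
    $pattern = '/\A(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*+\z/x';

    return preg_match($pattern, $bytes) === 1;
}

var_dump(is_valid_utf8("caf\xC3\xA9"));  // well-formed two-byte sequence
var_dump(is_valid_utf8("\xC0\xBC"));     // overlong '<'
```

The lead bytes 0xC0 and 0xC1 appear in no branch, so every two-byte overlong form is rejected outright.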

If you are using mb_* functions in PHP, you might be insulated from these issues. I can't say for sure, as mb_* was unusably fragile when I was still writing PHP.
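For what it's worth, current PHP versions ship `mb_check_encoding()`, which can be used for the validity check when the mbstring extension is enabled. The sketch below assumes a reasonably recent PHP; the wrapper name `looks_like_valid_utf8` is hypothetical, and the fallback relies on `preg_match()` refusing malformed subjects under the `/u` modifier.

```php
<?php
// Hypothetical helper: prefers mbstring, falls back to PCRE's own UTF-8 check.
function looks_like_valid_utf8(string $bytes): bool
{
    if (function_exists('mb_check_encoding')) {
        return mb_check_encoding($bytes, 'UTF-8');
    }
    // With /u, preg_match() returns false when the subject is malformed UTF-8.
    return preg_match('//u', $bytes) === 1;
}

var_dump(looks_like_valid_utf8("caf\xC3\xA9"));  // well-formed
var_dump(looks_like_valid_utf8("\xC0\xBC"));     // overlong '<'
```

Unlike the regex approach, neither check rejects control characters, so those still need separate handling.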

In any case, this is also a good time to remove control characters, which are a large and generally unappreciated source of bugs. I would remove chars 9 and 13 from submitted strings in addition to the others the W3 regex takes out; it is also worth removing plain newlines from strings you know aren't supposed to come from multiline textboxes.
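One way to sketch that clean-up, assuming the input has already passed a UTF-8 validity check. The function name `strip_control_chars` and its `$multiline` flag are made up for this example.

```php
<?php
// Hypothetical helper: removes C0 control characters and DEL from a string.
// Tab (9) and carriage return (13) are always removed; line feed (10) is
// kept only when the field is expected to be multiline.
function strip_control_chars(string $text, bool $multiline = false): string
{
    $pattern = $multiline
        ? '/[\x00-\x09\x0B-\x1F\x7F]/'   // everything except line feed
        : '/[\x00-\x1F\x7F]/';           // every control character

    // Bytes below 0x20 never occur inside a multi-byte UTF-8 sequence,
    // so byte-wise removal cannot damage valid text.
    return (string) preg_replace($pattern, '', $text);
}

echo strip_control_chars("a\tb\r\nc"), "\n";        // single-line field
echo strip_control_chars("a\tb\r\nc", true), "\n";  // multiline field
```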

Was UTF-16 written to address a limit in UTF-8?

No. UTF-16 is an encoding built on two-byte code units, used to make indexing Unicode strings easier in memory. It dates from the days when all of Unicode would fit in two bytes; code points added since then need two units (a surrogate pair), but systems like Windows and Java still work that way. Unlike UTF-8 it is not compatible with ASCII, and is of little to no use on the Web. But you occasionally meet it in saved files, usually ones saved by Windows users who have been misled by Windows's description of UTF-16LE as “Unicode” in Save-As menus.
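If such a file does turn up, a rough sketch of converting it follows. It assumes PHP's iconv extension is enabled and that the file starts with the UTF-16LE byte order mark (FF FE), which Windows-saved files typically do; the function name `decode_if_utf16le` is hypothetical.

```php
<?php
// Hypothetical helper: converts UTF-16LE input (detected by its BOM) to UTF-8.
// Anything without the BOM is returned unchanged.
function decode_if_utf16le(string $bytes): string
{
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        // Drop the two BOM bytes, then convert the remainder.
        $converted = iconv('UTF-16LE', 'UTF-8', substr($bytes, 2));
        if ($converted !== false) {
            return $converted;
        }
    }
    return $bytes;
}

// "hi" as UTF-16LE: each ASCII letter is followed by a zero byte,
// which is why the encoding is not ASCII-compatible.
echo decode_if_utf16le("\xFF\xFE" . "h\x00i\x00"), "\n";
```

BOM sniffing is a heuristic: UTF-32LE files start with the same two bytes, so treat this as a sketch rather than a general detector.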

seems_utf8

This is very inefficient compared to the regex!

Also, make sure to use utf8_unicode_ci on all of your tables.

You can actually sort of get away without this, treating MySQL as a store for nothing but bytes and only interpreting them as UTF-8 in your script. The advantage of using utf8_unicode_ci is that it will collate (sort and do case-insensitive compares) with knowledge about non-ASCII characters, so that, e.g., a non-ASCII letter such as ‘é’ and its uppercase form ‘É’ are treated as the same character. If you use a non-UTF8 collation you should stick to binary (case-sensitive) matching.

Whichever you choose, do it consistently: use the same character set for your tables as you do for your connection. What you want to avoid is a lossy character set conversion between your scripts and the database.
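A sketch of keeping the two in step with PDO. The helper name `mysql_utf8_dsn`, the `comments` table, and the credentials in the usage comment are placeholders, not anything from the original question. Note that MySQL's `utf8` character set stores at most three bytes per character; if you need characters outside the Basic Multilingual Plane, `utf8mb4` with `utf8mb4_unicode_ci` is the equivalent pairing.

```php
<?php
// Hypothetical helper: PDO DSN for MySQL with the connection character set spelled out.
function mysql_utf8_dsn(string $host, string $dbname): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8', $host, $dbname);
}

// Table definition using the same character set as the connection.
$createTable = 'CREATE TABLE comments (
    id INT AUTO_INCREMENT PRIMARY KEY,
    body TEXT NOT NULL
) CHARACTER SET utf8 COLLATE utf8_unicode_ci';

// Usage, assuming a reachable server and placeholder credentials:
// $pdo = new PDO(mysql_utf8_dsn('localhost', 'app'), 'app_user', 'secret');
// $pdo->exec($createTable);
```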

Wednesday, March 31, 2021
 
Floris
answered 11 Months ago