UTF-8 can store any Unicode character. Every character in ISO-8859-1, Windows-1252 and practically any other encoding you are likely to meet has a Unicode code point, so UTF-8 can store all of them. You don't have to worry about losing characters when you convert a string from another encoding to UTF-8.
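To see that nothing is lost, here is a minimal sketch of a round trip through UTF-8. It assumes the iconv extension is available, and the sample bytes are my own illustration rather than anything from the question.

```php
<?php
// Bytes for "café €" in Windows-1252: é is 0xE9, € is 0x80.
$cp1252 = "caf\xE9 \x80";

// Convert to UTF-8; each Windows-1252 character maps to a Unicode code point.
$utf8 = iconv('CP1252', 'UTF-8', $cp1252);

// Converting back recovers the original bytes, so nothing was lost.
$roundTrip = iconv('UTF-8', 'CP1252', $utf8);

echo bin2hex($utf8), "\n";          // 636166c3a920e282ac
var_dump($roundTrip === $cp1252);   // bool(true)
```

The reverse direction is where loss can happen: a UTF-8 string may contain characters that the single-byte target encoding simply does not have.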

Further, both ISO-8859-1 and Windows-1252 are single-byte encodings where virtually any byte is valid, so it is not technically possible to tell them apart by inspecting the bytes. I would choose Windows-1252 as your default match for non-UTF-8 sequences, as the only bytes that decode differently are in the range 0x80-0x9F. In Windows-1252 these decode to characters like smart quotes and the Euro sign, whereas in ISO-8859-1 they are invisible control characters which are almost never used. Web browsers may sometimes say they are using ISO-8859-1, but often they will really be using Windows-1252.
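The difference is easy to see by decoding the same byte under each assumption. This is only an illustrative sketch, again assuming the iconv extension; the byte 0x93 is my own choice of example.

```php
<?php
// The same byte, 0x93, decoded under each assumption.
$byte = "\x93";

$asCp1252 = iconv('CP1252', 'UTF-8', $byte);      // U+201C, left double quotation mark
$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $byte);  // U+0093, a C1 control character

echo bin2hex($asCp1252), "\n"; // e2809c
echo bin2hex($asLatin1), "\n"; // c293

// Outside 0x80-0x9F the two encodings agree, e.g. 0xE9 is é in both.
$sameOutsideRange = iconv('CP1252', 'UTF-8', "\xE9") === iconv('ISO-8859-1', 'UTF-8', "\xE9");
var_dump($sameOutsideRange);   // bool(true)
```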

> Would this code ensure that a string is safe to insert into a UTF-8 encoded document?

You would certainly want to set the optional ‘strict’ parameter to TRUE for this purpose. But I'm not sure this actually covers all invalid UTF-8 sequences. The function does not claim to check a byte sequence for UTF-8 validity explicitly. There have been known cases where mb_detect_encoding would guess UTF-8 incorrectly before, though I don't know if that can still happen in strict mode.
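For reference, a strict-mode call might look like the sketch below. The helper name looks_like_utf8 is hypothetical, the mbstring extension is assumed to be loaded, and given the caveats above I would not treat this as a full validity check.

```php
<?php
// Hypothetical helper: asks mbstring whether $s decodes as UTF-8 in strict mode.
function looks_like_utf8(string $s): bool
{
    // The third argument (true) turns on strict detection.
    return mb_detect_encoding($s, 'UTF-8', true) === 'UTF-8';
}

$valid   = looks_like_utf8("caf\xC3\xA9"); // well-formed UTF-8 bytes for "café"
$invalid = looks_like_utf8("caf\xE9");     // a lone 0xE9 is not a complete UTF-8 sequence

var_dump($valid, $invalid);
```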

If you want to be sure, do it yourself using the W3C-recommended regex:

// Valid UTF-8 is returned unchanged; anything else is assumed to be Windows-1252.
if (preg_match('%^(?:
      [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*$%xs', $string)) {
    return $string;
}
return iconv('CP1252', 'UTF-8', $string);
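
Wrapped in a function, the check and fallback could be used roughly like this. The function name to_utf8 and the sample inputs are hypothetical, and the sketch assumes the iconv extension is available.

```php
<?php
// Hypothetical wrapper name; the body is the same check-and-fallback idea.
function to_utf8(string $string)
{
    $validUtf8 = '%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*$%xs';

    if (preg_match($validUtf8, $string)) {
        return $string;                        // already valid UTF-8
    }
    return iconv('CP1252', 'UTF-8', $string);  // assume Windows-1252
}

echo bin2hex(to_utf8("caf\xC3\xA9")), "\n"; // unchanged: 636166c3a9
echo bin2hex(to_utf8("caf\xE9")), "\n";     // converted: 636166c3a9
echo bin2hex(to_utf8("\x93hi\x94")), "\n";  // e2809c6869e2809d
```

Note that an overlong sequence such as 0xC0 0x80 fails the check as well, so it gets reinterpreted as two Windows-1252 characters rather than passed through.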
Wednesday, March 31, 2021
