Asked  7 Months ago    Answers:  5   Viewed   47 times

This is how my connection is set:
Connection conn = DriverManager.getConnection(url + dbName + "?useUnicode=true&characterEncoding=utf-8", userName, password);

And I'm getting the following error when trying to add a row to a table:
Incorrect string value: '\xF0\x90\x8D\x83\xF0\x90...' for column 'content' at row 1

I'm inserting thousands of records, and I always get this error when the text contains \xF0 (i.e. the incorrect string value always starts with \xF0).

The column's collation is utf8_general_ci.

What could be the problem?



MySQL's utf8 permits only the Unicode characters that can be represented with at most 3 bytes in UTF-8. Here you have a character that needs 4 bytes: \xF0\x90\x8D\x83 (U+10343 GOTHIC LETTER SAUIL).
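
To see the byte counts for yourself, here is a minimal Java sketch. The class and method names (Utf8Width, utf8Length) are made up for illustration; only the standard library is used.

```java
import java.nio.charset.StandardCharsets;

class Utf8Width {
    // Number of bytes the given Unicode code point occupies when encoded as UTF-8.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+10343 (GOTHIC LETTER SAUIL) lies outside the Basic Multilingual Plane.
        System.out.println(utf8Length(0x10343)); // 4 bytes: too wide for MySQL's utf8
        // U+20AC (EURO SIGN) and U+010D (c with caron) fit within 3 bytes.
        System.out.println(utf8Length(0x20AC));  // 3 bytes
        System.out.println(utf8Length(0x010D));  // 2 bytes
    }
}
```

Any code point above U+FFFF gives 4 here, and those are the characters that a utf8 column rejects.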

If you have MySQL 5.5 or later you can change the column encoding from utf8 to utf8mb4. This encoding allows storage of characters that occupy 4 bytes in UTF-8.
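
As a rough sketch of what that change could look like from JDBC: the helper names below are hypothetical, and the column type (TEXT in the usage note) and the utf8mb4_unicode_ci collation are assumptions, since the question does not state the column definition. MODIFY requires the full column type to be restated, so substitute the real one.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

class Utf8mb4Migration {
    // Builds an ALTER TABLE statement that switches one column to utf8mb4.
    // The identifiers are concatenated as-is, so pass only trusted names.
    static String alterColumnSql(String table, String column, String columnType) {
        return "ALTER TABLE `" + table + "` MODIFY `" + column + "` " + columnType
                + " CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci";
    }

    // Runs the statement on an already open connection.
    static void convertColumn(Connection conn, String table, String column, String columnType)
            throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.executeUpdate(alterColumnSql(table, column, columnType));
        }
    }
}
```

For a table hypothetically named posts, convertColumn(conn, "posts", "content", "TEXT") would issue the ALTER for the content column from the error message.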

You may also have to set the server property character_set_server to utf8mb4 in the MySQL configuration file. It seems that Connector/J defaults to 3-byte Unicode otherwise:

For example, to use 4-byte UTF-8 character sets with Connector/J, configure the MySQL server with character_set_server=utf8mb4, and leave characterEncoding out of the Connector/J connection string. Connector/J will then autodetect the UTF-8 setting.
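
In code, following that advice might look roughly like the sketch below. The class and method names are hypothetical, and the host, port and database values are placeholders. It builds a URL with no characterEncoding parameter and reads back character_set_server, so you can confirm that the server really is configured for utf8mb4.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class Utf8mb4Connection {
    // JDBC URL without characterEncoding, leaving encoding detection to the driver.
    static String jdbcUrl(String host, int port, String dbName) {
        return "jdbc:mysql://" + host + ":" + port + "/" + dbName;
    }

    static Connection open(String host, int port, String dbName, String user, String password)
            throws SQLException {
        return DriverManager.getConnection(jdbcUrl(host, port, dbName), user, password);
    }

    // Returns the server's character_set_server value, or null if no row comes back.
    static String serverCharset(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SHOW VARIABLES LIKE 'character_set_server'")) {
            // SHOW VARIABLES returns two columns: the variable name and its value.
            return rs.next() ? rs.getString(2) : null;
        }
    }
}
```

If serverCharset returns something other than utf8mb4, the server configuration still needs changing before 4-byte characters can get through.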

Tuesday, June 1, 2021
answered 7 Months ago

It looks like you have a normalization problem in your database. Instead of storing the same state as a string over and over again, put all state names into a table of their own and reference them.

This also ensures that you do not accidentally store strings that look identical but differ at the byte level in different rows, which, as you just found, cannot be matched up reliably later.

Alternatively, query the distinct rows and update them so that same-named states at least share the same binary string data. This helps if, for example, MySQL is able to match these state strings but PHP, whose strings are plain byte sequences, is not.

Saturday, May 29, 2021
answered 7 Months ago

I just figured out one method to avoid the above errors.

Save to database

user.first_name = u'Rytis'.encode('unicode_escape')
user.last_name = u'Slatkevičius'.encode('unicode_escape')

print user.last_name
>>> Slatkevi\u010dius
print user.last_name.decode('unicode_escape')
>>> Slatkevičius

Is this the only method to save strings like that into a MySQL table and decode it before rendering to templates for display?

Wednesday, June 2, 2021
answered 7 Months ago

Try specifying the encoding in the DB URL:


Here's some more information regarding my answer:

The following is taken from the MySQL documentation:

All strings sent from the JDBC driver to the server are converted automatically from native Java Unicode form to the client character encoding, including all queries sent using Statement.execute(), Statement.executeUpdate(), Statement.executeQuery() as well as all PreparedStatement and CallableStatement parameters with the exclusion of parameters set using setBytes(), setBinaryStream(), setAsciiStream(), setUnicodeStream() and setBlob().

Setting the Character Encoding
The character encoding between client and server is automatically detected upon connection. You specify the encoding on the server using the character_set_server for server versions 4.1.0 and newer. The driver automatically uses the encoding specified by the server. To override the automatically detected encoding on the client side, use the characterEncoding property in the URL used to connect to the server. To allow multiple character sets to be sent from the client, use the UTF-8 encoding, either by configuring utf8 as the default server character set, or by configuring the JDBC driver to use UTF-8 through the characterEncoding property.

I encountered a similar problem a few months ago. I checked the default value of character_set_server on my MySQL (using the "mysqld --verbose --help" command). It was latin1.

Saturday, August 28, 2021
answered 4 Months ago

Characters that require utf8mb4 are represented as a surrogate pair in Java, and occupy 2 chars. A simple way to detect them is therefore checking if the length of the string in chars is the same as the number of code points:

boolean requiresMb4(String s) {
    int len = s.length();
    // A supplementary character takes two chars but counts as one code point.
    return len != s.codePointCount(0, len);
}
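
To try the check against the character from the question, a small self-contained sketch follows. The wrapper class Mb4Check and the helper supplementaryCodePoints are additions for illustration and are not part of the answer above. The helper lists which code points are the offending ones.

```java
import java.util.List;
import java.util.stream.Collectors;

class Mb4Check {
    // Same check as above: true if any character needs a surrogate pair.
    static boolean requiresMb4(String s) {
        int len = s.length();
        return len != s.codePointCount(0, len);
    }

    // Lists the code points above U+FFFF, formatted as U+XXXX.
    static List<String> supplementaryCodePoints(String s) {
        return s.codePoints()
                .filter(Character::isSupplementaryCodePoint)
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String gothic = "a\uD800\uDF43b";        // U+10343 written as a surrogate pair
        String latin = "Slatkevi\u010dius";      // U+010D fits in the BMP
        System.out.println(requiresMb4(gothic)); // true
        System.out.println(requiresMb4(latin));  // false
        System.out.println(supplementaryCodePoints(gothic)); // [U+10343]
    }
}
```

This lets you flag, log or strip such strings before the insert if changing the column to utf8mb4 is not an option.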
Tuesday, August 31, 2021
answered 3 Months ago