Asked  7 Months ago    Answers:  5   Viewed   133 times

How can I convert this string:

This string contains the Unicode character Pi(Π)

into an escaped ASCII string:

This string contains the Unicode character Pi(\u03a0)

and vice versa?

The current Encoding available in C# converts the Π character to "?". I need to preserve that character.



This converts back and forth to and from the \uXXXX format.

using System;
using System.Globalization;
using System.Text;
using System.Text.RegularExpressions;

class Program {
    static void Main( string[] args ) {
        string unicodeString = "This function contains a unicode character pi (\u03a0)";

        Console.WriteLine( unicodeString );

        string encoded = EncodeNonAsciiCharacters( unicodeString );
        Console.WriteLine( encoded );

        string decoded = DecodeEncodedNonAsciiCharacters( encoded );
        Console.WriteLine( decoded );
    }

    static string EncodeNonAsciiCharacters( string value ) {
        StringBuilder sb = new StringBuilder();
        foreach( char c in value ) {
            if( c > 127 ) {
                // This character is too big for ASCII: emit a literal backslash, 'u', and 4 hex digits
                string encodedValue = "\\u" + ((int) c).ToString( "x4" );
                sb.Append( encodedValue );
            }
            else {
                sb.Append( c );
            }
        }
        return sb.ToString();
    }

    static string DecodeEncodedNonAsciiCharacters( string value ) {
        // Match a literal \u followed by exactly four hex digits, captured as "Value"
        return Regex.Replace(
            value,
            @"\\u(?<Value>[a-fA-F0-9]{4})",
            m => {
                return ((char) int.Parse( m.Groups["Value"].Value, NumberStyles.HexNumber )).ToString();
            } );
    }
}


Outputs:

This function contains a unicode character pi (Π)

This function contains a unicode character pi (\u03a0)

This function contains a unicode character pi (Π)
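
For comparison, the same round trip can be sketched in Python. This is only an illustration of the technique, not part of the answer above: the names `encode_non_ascii` and `decode_escapes` are made up for this example, and the sketch assumes the input does not already contain literal `\uXXXX` sequences that should stay untouched. It works on UTF-16 code units so that it behaves like a C# `char` loop.

```python
import re


def encode_non_ascii(text: str) -> str:
    """Escape every UTF-16 code unit above 127 as \\uXXXX (hypothetical helper)."""
    # Encode to UTF-16 so each 2-byte unit corresponds to one C# char.
    data = text.encode("utf-16-be", "surrogatepass")
    units = [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]
    return "".join(chr(u) if u <= 127 else f"\\u{u:04x}" for u in units)


# A literal backslash, 'u', then exactly four hex digits.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_escapes(text: str) -> str:
    """Turn \\uXXXX escapes back into characters (hypothetical helper)."""
    raw = _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Passing through UTF-16 merges escaped surrogate pairs into one code point.
    return raw.encode("utf-16-be", "surrogatepass").decode("utf-16-be", "surrogatepass")


if __name__ == "__main__":
    original = "This function contains a unicode character pi (\u03a0)"
    encoded = encode_non_ascii(original)
    print(encoded)
    print(decode_escapes(encoded) == original)
```

Like the C# version, the decoder cannot tell an escape it produced from a backslash-u sequence that was already in the input, so a real implementation would probably need to escape the backslash itself as well.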

Tuesday, June 1, 2021
answered 7 Months ago

See unicodedata.normalize:

>>> import unicodedata
>>> title = u"Klüft skräms inför på fédéral électoral große"
>>> unicodedata.normalize('NFKD', title).encode('ascii', 'ignore').decode('ascii')
'Kluft skrams infor pa federal electoral groe'
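
To make it clearer why this works and what it loses, here is a small sketch. NFKD splits an accented letter into a base letter plus a combining mark, and encoding to ASCII with `'ignore'` then drops the marks along with every other non-ASCII character, which is why the ß in "große" vanishes. The helper names `to_ascii_lossy` and `strip_accents` are invented for this illustration.

```python
import unicodedata


def to_ascii_lossy(text: str) -> str:
    """Decompose, then drop everything that is not ASCII (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def strip_accents(text: str) -> str:
    """Drop only combining marks, keeping other non-ASCII letters (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    # 'ü' decomposes into 'u' followed by U+0308 COMBINING DIAERESIS.
    print([hex(ord(ch)) for ch in unicodedata.normalize("NFKD", "ü")])
    print(to_ascii_lossy("große"))  # the ß has no ASCII decomposition and is dropped
    print(strip_accents("große"))   # the ß survives because it is not a combining mark
```

Which variant is appropriate depends on whether the output really has to be pure ASCII or just free of diacritics.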
Tuesday, June 1, 2021
answered 7 Months ago

Ok, it appears that a package has been developed to enhance and simplify the string manipulation toolbox in R (finally!). It is called stringi and looks very promising. Its documentation is very well written, and in particular I find the pages about encodings and locales much more enlightening than some of the standard R documentation on the subject.

It has Unicode normalization functions, as I was looking for (here form C):

> stri_trans_nfc('\u00e9') == stri_trans_nfc('\u0065\u0301')
[1] TRUE

It also contains a smart comparison function which integrates these normalization questions and lessens the pain of having to think about them:

> stri_compare('\u00e9', '\u0065\u0301')
[1] 0
# i.e. equal;
# otherwise it returns 1 or -1, i.e. greater or lesser in alphabetical order.

Thanks to the developers, Marek Gągolewski and Bartek Tartanus, and to Kurt Hornik for the info!
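
For readers without R at hand, the normalization point can be sketched with Python's standard `unicodedata` module. This is an analogue of the `stri_trans_nfc` comparison, not a reimplementation of stringi, and the helper name `nfc_equal` is made up for the example. It does not attempt the locale-aware ordering that `stri_compare` provides.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to form C (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    precomposed = "\u00e9"        # é as a single code point
    decomposed = "\u0065\u0301"   # e followed by COMBINING ACUTE ACCENT
    print(precomposed == decomposed)          # raw code points differ
    print(nfc_equal(precomposed, decomposed)) # equal once both are in NFC
```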

Monday, July 19, 2021
answered 5 Months ago

The solution is to encode the text in UTF-8 and prepend a BOM to signal that the string is actually UTF-8.

Here it works:
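
The demonstration referred to is not reproduced here. As a rough Python illustration of the idea (not the answerer's code), the standard `utf-8-sig` codec prepends the UTF-8 BOM when encoding and strips it, if present, when decoding. The function names below are hypothetical.

```python
import codecs


def encode_with_bom(text: str) -> bytes:
    """Encode as UTF-8 with a leading BOM (hypothetical helper)."""
    return text.encode("utf-8-sig")


def decode_with_optional_bom(data: bytes) -> str:
    """Decode UTF-8, dropping a leading BOM if there is one (hypothetical helper)."""
    return data.decode("utf-8-sig")


if __name__ == "__main__":
    payload = encode_with_bom("Pi: \u03a0")
    print(payload.startswith(codecs.BOM_UTF8))  # the three BOM bytes come first
    print(decode_with_optional_bom(payload))
```

Whether a BOM helps depends on the consumer: some tools use it to detect UTF-8, while others treat it as stray bytes at the start of the text.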

Wednesday, July 28, 2021
answered 5 Months ago

This is the kind of simple code Jon Skeet had in mind in his comment:

final String in = "šđčćasdf";
final StringBuilder out = new StringBuilder();
for (int i = 0; i < in.length(); i++) {
  final char ch = in.charAt(i);
  if (ch <= 127) out.append(ch);
  // "\\u" is a literal backslash followed by 'u', then four hex digits
  else out.append("\\u").append(String.format("%04x", (int)ch));
}
System.out.println(out.toString());

As Jon said, surrogate pairs will be represented as a pair of \u escapes.
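
To show what that means for a character outside the Basic Multilingual Plane, here is a small Python sketch of the standard UTF-16 surrogate arithmetic. The function name `utf16_escapes` is invented for this example. It takes a single character and returns one escape for a BMP character or two for anything above U+FFFF.

```python
def utf16_escapes(ch: str) -> str:
    """Return the \\uXXXX escape(s) for one character (hypothetical helper)."""
    cp = ord(ch)
    if cp <= 0xFFFF:
        # BMP characters fit in a single UTF-16 code unit.
        return f"\\u{cp:04x}"
    # Above U+FFFF: split the 20-bit offset into a high and a low surrogate.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return f"\\u{high:04x}\\u{low:04x}"


if __name__ == "__main__":
    print(utf16_escapes("š"))           # one escape
    print(utf16_escapes("\U0001F600"))  # two escapes, a surrogate pair
```

A Java or C# loop over `char` values produces the same two escapes automatically, because those strings already store the character as two UTF-16 code units.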

Friday, July 30, 2021
answered 5 Months ago