Asked  7 Months ago    Answers:  5   Viewed   133 times

How can I convert this string:

This string contains the Unicode character Pi(?)

into an escaped ASCII string:

This string contains the Unicode character Pi(u03a0)

and vice versa?

The current Encoding available in C# converts the ? character to "?". I need to preserve that character.

 Answers

22

This goes back and forth to and from the uXXXX format.

class Program {
    static void Main( string[] args ) {
        string unicodeString = "This function contains a unicode character pi (u03a0)";

        Console.WriteLine( unicodeString );

        string encoded = EncodeNonAsciiCharacters(unicodeString);
        Console.WriteLine( encoded );

        string decoded = DecodeEncodedNonAsciiCharacters( encoded );
        Console.WriteLine( decoded );
    }

    static string EncodeNonAsciiCharacters( string value ) {
        StringBuilder sb = new StringBuilder();
        foreach( char c in value ) {
            if( c > 127 ) {
                // This character is too big for ASCII
                string encodedValue = "\u" + ((int) c).ToString( "x4" );
                sb.Append( encodedValue );
            }
            else {
                sb.Append( c );
            }
        }
        return sb.ToString();
    }

    static string DecodeEncodedNonAsciiCharacters( string value ) {
        return Regex.Replace(
            value,
            @"\u(?<Value>[a-zA-Z0-9]{4})",
            m => {
                return ((char) int.Parse( m.Groups["Value"].Value, NumberStyles.HexNumber )).ToString();
            } );
    }
}

Outputs:

This function contains a unicode character pi (?)

This function contains a unicode character pi (u03a0)

This function contains a unicode character pi (?)

Tuesday, June 1, 2021
 
The_Perfect_Username
answered 7 Months ago
52

See unicodedata.normalize

title = u"Klüft skräms inför på fédéral électoral große"
import unicodedata
unicodedata.normalize('NFKD', title).encode('ascii', 'ignore')
'Kluft skrams infor pa federal electoral groe'
Tuesday, June 1, 2021
 
Puneet
answered 7 Months ago
24

Ok, it appears that a package has been developed to enhance and simplify the string manipulation toolbox in R (finally!). It is called stringi and looks very promising. Its documentation is very well written, and in particular I find the pages about encodings and locales much more enlightening than some of the standard R documentation on the subject.

It has Unicode normalization functions, as I was looking for (here form C):

> stri_trans_nfc('u00e9') == stri_trans_nfc('u0065u0301')
[1] TRUE

It also contains a smart comparison function which integrates these normalization questions and lessens the pain of having to think about them:

> stri_compare('u00e9', 'u0065u0301')
[1] 0
# i.e. equal ;
# otherwise it returns 1 or -1, i.e. greater or lesser, in the alphabetic order.

Thanks to the developers, Marek G?golewski and Bartek Tartanus, and to Kurt Hornik for the info!

Monday, July 19, 2021
 
CodeCaster
answered 5 Months ago
62

The solution that comes up, is to encode the text in UTF-8 and add a BOM to specify that the string is actually in UTF-8.

Here it works :

  • http://chart.apis.google.com/chart?cht=qr&chs=200x200&choe=UTF-8&chl=%EF%BB%BFR%C3%A9my+Hubscher
Wednesday, July 28, 2021
 
Juriy
answered 5 Months ago
36

This is the kind of simple code Jon Skeet had in mind in his comment:

final String in = "šđčćasdf";
final StringBuilder out = new StringBuilder();
for (int i = 0; i < in.length(); i++) {
  final char ch = in.charAt(i);
  if (ch <= 127) out.append(ch);
  else out.append("\u").append(String.format("%04x", (int)ch));
}
System.out.println(out.toString());

As Jon said, surrogate pairs will be represented as a pair of u escapes.

Friday, July 30, 2021
 
toesslab
answered 5 Months ago
Only authorized users can answer the question. Please sign in first, or register a free account.
Not the answer you're looking for? Browse other questions tagged :
 
Share