Thursday, October 14, 2010

Of char, int and unicode

So my overflowing smartness, and largely my ignorance, made me believe that I could convert an HTML-encoded string into its Unicode characters by doing a normal split and join. Something like this:

String testStr = "&#21568;&#25968;&#36798;";
String[] splitStr = testStr.split("&#");
StringBuilder bldr = new StringBuilder();
for (String str : splitStr)
{
    if (str.isEmpty()) continue; // split("&#") leaves an empty first element
    str = str.replace(";", "");
    bldr.append("\\u" + Integer.toHexString(Integer.parseInt(str)));
}

So I was thinking that the string I had created, "\u1234\u2345\u3456", was what I was looking for.

Well, WRONG! The thing is, a string written that way in source code has length 3 (the compiler treats each \uXXXX escape as a single character), while what I was actually building was a string of length 18, with the backslashes sitting there as literal text. So how do I fix this? I ended up doing something I never thought I would: casting an int to a char!
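To see the difference concretely, here is a minimal check (the class and variable names are my own):

```java
public class EscapeDemo {
    public static void main(String[] args) {
        // The compiler resolves the \uXXXX escape into a single char...
        String escaped = "\u5440";
        // ...while escaping the backslash keeps all six characters literal.
        String literal = "\\u5440";
        System.out.println(escaped.length()); // prints 1
        System.out.println(literal.length()); // prints 6
    }
}
```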

A Java char is really just an unsigned 16-bit integer holding the character's code point, and that is what makes the cast work. So my code now looks like:

String testStr = "&#21568;&#25968;&#36798;";
String[] splitStr = testStr.split("&#");
StringBuilder bldr = new StringBuilder();
char chr;
int i;
for (String str : splitStr)
{
    if (str.isEmpty()) continue; // split("&#") leaves an empty first element
    str = str.replace(";", "");
    i = Integer.parseInt(str);
    chr = (char) i; // reinterpret the code point as a char
    bldr.append(chr);
}
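One caveat: a cast to char only works for code points in the Basic Multilingual Plane (up to U+FFFF). A more robust sketch (the class name and regex are my own, not from the code above) pulls out each &#...; reference with a regex and uses StringBuilder.appendCodePoint, which also handles higher code points by emitting surrogate pairs:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityDecoder {
    // Matches decimal numeric character references such as &#21568;
    private static final Pattern ENTITY = Pattern.compile("&#(\\d+);");

    public static String decode(String input) {
        StringBuilder bldr = new StringBuilder();
        Matcher m = ENTITY.matcher(input);
        while (m.find()) {
            // appendCodePoint handles values above U+FFFF as surrogate pairs
            bldr.appendCodePoint(Integer.parseInt(m.group(1)));
        }
        return bldr.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("&#21568;&#25968;&#36798;")); // prints 呀数达
    }
}
```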
