Tag Archives: collation

Using CONVERT with COLLATE to strip accents from Unicode strings

Today Mladen Prajdic  posted a on Twitter a SQL statement that captured my attention: (http://twitter.com/MladenPrajdic/status/9878219727)

SELECT
 CONVERT(varchar(50), N'æøåáäĺćçčéđńőöřůýţžš')
 COLLATE Cyrillic_General_CI_AI
returns aoaaalcccednooruytzs

If you notice the conversion to Cyrillic_General_CI_AI collation his shows an interesting mechanism for code page translation that we could not explain:

Why does Cyrillic_General_CI_AI or   Greek_CI_AI collation successfully convert accents to their base ascii chars?

 I found an answer to this on this KB page: http://support.microsoft.com/kb/153449

What Happens when a Character on One Code Page is Unavailable on Another

 If a conversion occurs and a character on one code page is unavailable on another, the character gets converted to its closest equivalent character on the other code page. For example, ASCII 224 (“alpha”) on code page 437 is converted to 97(“a”) on the ANSI code page. When you convert it back to code page 437, you receive 97(“a”). This is a problem, because you have now lost the original 224(“alpha”). This can potentially cause incorrect data to be saved on your server, as happens in the following two examples:

Example 1:

You have an OEM server in which you are inserting data from an ANSI client, and your AutoAnsiToOem option is enabled. If the character that you are inserting is valid in both code pages, the correct data will be saved. However, if the character is valid in ANSI but not in OEM, the correct data will not be saved.

Example 2:

You insert data into an ANSI server from an OEM client, and the AutoAnsiToOem option is enabled. If the character that you are inserting is valid in both code pages, the correct data will be saved. However, if the character is valid in OEM and not in ANSI, then correct data will not be saved.

 Nevertheless I this is very useful to exotic strip accents from strings (i.e before an export).