Drawing Blanks

Premature Optimization is a Prerequisite for Success

Archive for September 2013

SQL Server collations and string comparison issues

leave a comment »

Evil Kitty 😼 equals to the empty string

With the “default” collation setting (SQL_Latin1_General_CP1_CI_AS) SQL Server cannot properly compare Unicode strings that contain so called Supplementary Characters (4-byte characters). For example, the “kitty face” below is equal to the empty string:

(If you don’t see the kitty faces then your browser and/or OS sucks as does Chrome on Windows. The symbol is U+1F63C “cat face with wry smile” http://en.wikipedia.org/wiki/Emoticons_(Unicode_block) )

SELECT ‘oops’ WHERE N’😼’ = N”

Moreover, the default collation considers phonetically equivalent Katakana and Hiragana characters to be equal:

SELECT ‘oops’ WHERE N’お’ = N’オ’

These issues are easy to work around by using a binary collation. But in most scenarios we want case-insensitive comparison, so binary is not an option.

Here is a collation that is case insensitive and doesn’t have issues with Supplementary Characters or with Kana:

Latin1_General_100_CI_AS_KS_WS (CI – case-sensitive, AS – accent sensitive, KS – Kana-sensitive, WS – width-sensitive).

In SQL 2012 this one also works: Latin1_General_100_CI_AS_KS_SC (SC –  supplementary characters).

SELECT ‘oops’ WHERE N’A’ COLLATE Latin1_General_100_CI_AS_KS_WS != N’a’ — Case insensitive

SELECT ‘oops’ WHERE N’お’ COLLATE Latin1_General_100_CI_AS_KS_WS = N’オ’ — Kana sensitive

SELECT ‘oops’ WHERE N’😼’ COLLATE Latin1_General_100_CI_AS_KS_WS = N” — Supplements differ from ‘’

SELECT ‘oops’ WHERE N’😼’ COLLATE Latin1_General_100_CI_AS_KS_WS = N’😽’ — Different Supplements differ

SELECT ‘oops’ WHERE N’😼’ COLLATE Latin1_General_100_CI_AS_KS_WS != N’😼’– Equal supplements are equal

One remaining issue that I’m aware of is the German Eszett: ß equals to “ss” under any non-binary collation.

SELECT ‘oops’ WHERE N’ß’ COLLATE Latin1_General_100_CI_AS_KS_WS = N’ss’

Microsoft acknowledged the Eszett issue as a bug in 2008: http://connect.microsoft.com/SQLServer/feedback/details/341130/-and-ss-are-not-equal

But the real question is:

Is it the correct design to use the same collations for the purpose of sorting and for the purpose of equality comparison?


Written by bbzippo

09/10/2013 at 1:22 am

Posted in Uncategorized