Sunday, March 25, 2012

Using Unicode Data Types


In a previous post, we discussed collation and how non-Unicode character data is interpreted based on the collation used. We also discussed how each collation maps to a specific code page, and how a single code page can serve multiple collations. But if your application supports international users, it becomes increasingly difficult to find one collation that covers your entire user base.
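
You can see this collation-to-code-page mapping for yourself with the standard system functions sys.fn_helpcollations() and COLLATIONPROPERTY (a minimal sketch; the ordering is just to group collations that share a code page):

-- List every collation and the code page it uses;
-- many collations share the same code page.
SELECT name,
       COLLATIONPROPERTY(name, 'CodePage') AS code_page
FROM   sys.fn_helpcollations()
ORDER BY code_page, name;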

Another issue is that some characters cannot be encoded in the single byte that a non-Unicode scheme uses per character. One byte can represent only 256 different characters, but some scripts, such as Japanese Kanji or Korean Hangul, have thousands of characters that must be encoded with 2 bytes each.

To rephrase: a non-Unicode encoding scheme encodes each character with 1 byte and can therefore represent only 256 characters. A Unicode encoding scheme encodes each character using 2 bytes and can represent over 65,000 characters. (You can still store characters that require 2 bytes in a non-Unicode column, in which case SQL Server uses a double-byte character set (DBCS) code page.)
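
As a quick illustration of that 2-byte range, the built-in NCHAR() and UNICODE() functions convert between a code point and its character (a sketch; the Hangul syllable shown is just an arbitrary example):

-- Code point 44032 (0xAC00) is the first Hangul syllable,
-- well beyond the 256 characters a single byte can represent.
SELECT NCHAR(44032)   AS hangul_char,   -- returns N'가'
       UNICODE(N'가') AS code_point;    -- returns 44032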

A Unicode scheme is more universal, and SQL Server doesn't use code pages to interpret Unicode characters. For every non-Unicode character data type in SQL Server, there is an equivalent Unicode character data type (char > nchar, varchar > nvarchar, text > ntext). If all the variables in your application and stored procedures also use Unicode data types, there is no need to perform character translations, which yields a performance gain, and users anywhere in the world will see the same characters.
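
In practice, this means declaring variables and parameters with the n-prefixed types and prefixing string literals with N so they are treated as Unicode from the start (a minimal sketch with made-up names):

-- Hypothetical example: keep variables, parameters, and
-- literals Unicode end to end to avoid translation.
DECLARE @CustomerName nvarchar(100);
SET @CustomerName = N'Müller';   -- the N prefix makes this a Unicode literal

SELECT @CustomerName AS customer_name;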

SQL Server itself stores all system catalog data, such as table, view, and stored procedure names, in Unicode columns (check out the views, tables, and system stored procedures in the master database).
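
You can verify this from the metadata itself: the name columns of the catalog views use the sysname type, which is nvarchar(128) underneath (a sketch using the standard catalog views):

-- Inspect the column types of a catalog view such as sys.objects;
-- the name columns come back as nvarchar (sysname).
SELECT c.name                       AS column_name,
       TYPE_NAME(c.system_type_id)  AS type_name
FROM   sys.all_columns AS c
WHERE  c.object_id = OBJECT_ID('sys.objects');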

The Unicode Standard is maintained by the Unicode Consortium, and SQL Server supports Unicode Standard Version 3.2.

Performance Impact
When you declare a column with a Unicode data type, every character (regardless of whether it could be represented by 1 byte or requires 2) is stored using 2 bytes, which doubles the storage size. For example, while a varchar column can store a maximum of 8,000 characters (the size of a data page), an nvarchar column can store a maximum of only 4,000. Beyond that storage limitation, whether Unicode storage has a performance impact depends largely on your specific situation.
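
The doubling is easy to see with DATALENGTH, which returns the number of bytes a value occupies (a minimal sketch):

-- Same three characters, twice the bytes when stored as Unicode.
SELECT DATALENGTH('abc')  AS varchar_bytes,   -- 3
       DATALENGTH(N'abc') AS nvarchar_bytes;  -- 6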

For example, if you define a non-Unicode data type but use a Windows collation, SQL Server applies Unicode sorting rules, which are more complex and resource intensive, so the performance will be about the same as with a Unicode data type. If, however, you use a non-Unicode data type with a SQL Server collation, sorting and table scans are less resource intensive and faster. In addition, sorting large amounts of Unicode data can be slower simply because each character is stored in 2 bytes instead of 1. Another cost can come from conversions between Unicode and non-Unicode, for example when your application sends non-Unicode data but the database columns use Unicode data types (or vice versa).
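
One common way that conversion cost shows up is an implicit conversion in a WHERE clause: because nvarchar has higher data type precedence than varchar, comparing a varchar column against an N'...' literal converts the column, which can prevent an index seek (a sketch with a hypothetical Customers table and an index on LastName):

-- The N'...' literal is nvarchar, so the varchar LastName column
-- is implicitly converted; the index on LastName may be scanned
-- rather than seeked.
SELECT CustomerId
FROM   Customers
WHERE  LastName = N'Smith';

-- Matching the literal type to the column avoids the conversion.
SELECT CustomerId
FROM   Customers
WHERE  LastName = 'Smith';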

Non-Unicode data types are generally fine if your entire user base can be supported by a SQL Server collation. If you must use a Windows collation, or if you must offer universal support, it is best to use Unicode data types.


