diff options
Diffstat (limited to 'deps/mysqllite/strings/CHARSET_INFO.txt')
-rw-r--r-- | deps/mysqllite/strings/CHARSET_INFO.txt | 281 |
1 files changed, 0 insertions, 281 deletions
diff --git a/deps/mysqllite/strings/CHARSET_INFO.txt b/deps/mysqllite/strings/CHARSET_INFO.txt deleted file mode 100644 index 6f0a810be3..0000000000 --- a/deps/mysqllite/strings/CHARSET_INFO.txt +++ /dev/null @@ -1,281 +0,0 @@ - -CHARSET_INFO -============ -A structure containing data for charset+collation pair implementation. - -Virtual functions that use this data are collected into separate -structures, MY_CHARSET_HANDLER and MY_COLLATION_HANDLER. - - -typedef struct charset_info_st -{ - uint number; - uint primary_number; - uint binary_number; - uint state; - - const char *csname; - const char *name; - const char *comment; - - uchar *ctype; - uchar *to_lower; - uchar *to_upper; - uchar *sort_order; - - uint16 *tab_to_uni; - MY_UNI_IDX *tab_from_uni; - - uchar state_map[256]; - uchar ident_map[256]; - - uint strxfrm_multiply; - uint mbminlen; - uint mbmaxlen; - uint16 max_sort_char; /* For LIKE optimization */ - - MY_CHARSET_HANDLER *cset; - MY_COLLATION_HANDLER *coll; - -} CHARSET_INFO; - - -CHARSET_INFO fields description: -=============================== - - -Numbers (identifiers) ---------------------- - -number - an ID uniquely identifying this charset+collation pair. - -primary_number - ID of a charset+collation pair, which consists -of the same character set and the default collation of this -character set. Not really used now. Intended to optimize some -parts of the code where we need to find the default collation -using its non-default counterpart for the given character set. - -binary_number - ID of a charset+collation pair, which consists -of the same character set and the binary collation of this -character set. Not really used now. - -Names ------ - - csname - name of the character set for this charset+collation pair. - name - name of the collation for this charset+collation pair. - comment - a text comment, displayed in "Description" column of - SHOW CHARACTER SET output. - -Conversion tables ------------------ - - ctype - pointer to array[257] of "type of characters" - bit mask for each character, e.g., whether a - character is a digit, letter, separator, etc. - - Monty 2004-10-21: - If you look at the macros, we use ctype[(char)+1]. - ctype[0] is traditionally in most ctype libraries - reserved for EOF (-1). The idea is that you can use - the result from fgetc() directly with ctype[]. As - we have to be compatible with external ctype[] versions, - it's better to do it the same way as they do... - - to_lower - pointer to array[256] used in LCASE() - to_upper - pointer to array[256] used in UCASE() - sort_order - pointer to array[256] used for strings comparison - -In all Asian charsets these arrays are set up as follows: - -- All bytes in the range 0x80..0xFF were marked as letters in the - ctype array. - -- The to_lower and to_upper arrays map only ASCII letters. - UPPER() and LOWER() doesn't really work for multi-byte characters. - Most of the characters in Asian character sets are ideograms - anyway and they don't have case mapping. However, there are - still some characters from European alphabets. - For example: - _ujis 0x8FAAF2 - LATIN CAPITAL LETTER Y WITH ACUTE - _ujis 0x8FABF2 - LATIN SMALL LETTER Y WITH ACUTE - - But they don't map to each other with UPPER and LOWER operations. - -- The sort_order array is filled case insensitively for the - ASCII range 0x00..0x7F, and in "binary" fashion for the multi-byte - range 0x80..0xFF for these collations: - - cp932_japanese_ci, - euckr_korean_ci, - eucjpms_japanese_ci, - gb2312_chinese_ci, - sjis_japanese_ci, - ujis_japanese_ci. - - So multi-byte characters are sorted just according to their codes. - - -- Two collations are still case insensitive for the ASCII characters, - but have special sorting order for multi-byte characters - (something more complex than just according to codes): - - big5_chinese_ci - gbk_chinese_ci - - So handlers for these collations use only the 0x00..0x7F part - of their sort_order arrays, and apply the special functions - for multi-byte characters - -In Unicode character sets we have full support of UPPER/LOWER mapping, -for sorting order, and for character type detection. -"utf8_general_ci" still has the "old-fashioned" arrays -like to_upper, to_lower, sort_order and ctype, but they are -not really used (maybe only in some rare legacy functions). - - - -Unicode conversion data ------------------------ -For 8-bit character sets: - -tab_to_uni : array[256] of charset->Unicode translation -tab_from_uni: a structure for Unicode->charset translation - -Non-8-bit charsets have their own structures per charset -hidden in corresponding ctype-xxx.c file and don't use -tab_to_uni and tab_from_uni tables. - - -Parser maps ------------ -state_map[] -ident_map[] - -These maps are used to quickly identify whether a character is an -identifier part, a digit, a special character, or a part of another -SQL language lexical item. - -Probably can be combined with ctype array in the future. -But for some reasons these two arrays are used in the parser, -while a separate ctype[] array is used in the other part of the -code, like fulltext, etc. - - -Miscellaneous fields --------------------- - - strxfrm_multiply - how many times a sort key (that is, a string - that can be passed into memcmp() for comparison) - can be longer than the original string. - Usually it is 1. For some complex - collations it can be bigger. For example, - in latin1_german2_ci, a sort key is up to - two times longer than the original string. - e.g. Letter 'A' with two dots above is - substituted with 'AE'. - mbminlen - minimum multi-byte sequence length. - Now always 1 except for ucs2. For ucs2, - it is 2. - mbmaxlen - maximum multi-byte sequence length. - 1 for 8-bit charsets. Can be also 2 or 3. - - max_sort_char - for LIKE range - in case of 8-bit character sets - native code - of maximum character (max_str pad byte); - in case of UTF8 and UCS2 - Unicode code of the maximum - possible character (usually U+FFFF). This code is - converted to multi-byte representation (usually 0xEFBFBF) - and then used as a pad sequence for max_str. - in case of other multi-byte character sets - - max_str pad byte (usually 0xFF). - -MY_CHARSET_HANDLER -================== - -MY_CHARSET_HANDLER is a collection of character-set -related routines. Defined in m_ctype.h. Have the -following set of functions: - -Multi-byte routines ------------------- -ismbchar() - detects whether the given string is a multi-byte sequence -mbcharlen() - returns length of multi-byte sequence starting with - the given character -numchars() - returns number of characters in the given string, e.g. - in SQL function CHAR_LENGTH(). -charpos() - calculates the offset of the given position in the string. - Used in SQL functions LEFT(), RIGHT(), SUBSTRING(), - INSERT() - -well_formed_len() - - returns length of a given multi-byte string in bytes - Used in INSERTs to shorten the given string so it - a) is "well formed" according to the given character set - b) can fit into the given data type - -lengthsp() - returns the length of the given string without trailing spaces. - - -Unicode conversion routines ---------------------------- -mb_wc - converts the left multi-byte sequence into its Unicode code. -mc_mb - converts the given Unicode code into multi-byte sequence. - - -Case and sort conversion ------------------------- -caseup_str - converts the given 0-terminated string to uppercase -casedn_str - converts the given 0-terminated string to lowercase -caseup - converts the given string to lowercase using length -casedn - converts the given string to lowercase using length - -Number-to-string conversion routines ------------------------------------- -snprintf() -long10_to_str() -longlong10_to_str() - -The names are pretty self-describing. - -String padding routines ------------------------ -fill() - writes the given Unicode value into the given string - with the given length. Used to pad the string, usually - with space character, according to the given charset. - -String-to-number conversion routines ------------------------------------- -strntol() -strntoul() -strntoll() -strntoull() -strntod() - -These functions are almost the same as their STDLIB counterparts, -but also: - - accept length instead of 0-terminator - - are character set dependent - -Simple scanner routines ------------------------ -scan() - to skip leading spaces in the given string. - Used when a string value is inserted into a numeric field. - - - -MY_COLLATION_HANDLER -==================== -strnncoll() - compares two strings according to the given collation -strnncollsp() - like the above but ignores trailing spaces -strnxfrm() - makes a sort key suitable for memcmp() corresponding - to the given string -like_range() - creates a LIKE range, for optimizer -wildcmp() - wildcard comparison, for LIKE -strcasecmp() - 0-terminated string comparison -instr() - finds the first substring appearance in the string -hash_sort() - calculates hash value taking into account - the collation rules, e.g. case-insensitivity, - accent sensitivity, etc. - - |