Class Collator
A class that implements a locale-sensitive comparator function for use with sorting functions. The comparator function assumes that the strings it is comparing contain Unicode characters encoded in UTF-16.
Collations usually depend only on the language, because most collation orders are shared between locales that speak the same language. There are, however, a number of instances where a locale collates differently than other locales that share the same language. There are also a number of instances where a locale collates differently based on the script used. This object can handle these cases automatically if a full locale is specified in the options rather than just a language code.
Options
The options parameter can contain any of the following properties:
- locale - String|Locale. The locale which the comparator function will collate with. Default: the current iLib locale.
- sensitivity - String. Sensitivity or strength of collator. This is one of
"primary", "base", "secondary", "accent", "tertiary", "case", "quaternary", or
"variant". Default: "primary"
- base or primary - Only the primary distinctions between characters are significant. In other words, the collator is case-, accent-, and variation-insensitive, and only distinguishes between the base characters.
- case or secondary - Both the primary and secondary distinctions between characters are significant. That is, the collator will be accent- and variation-insensitive and will distinguish between base characters and character case.
- accent or tertiary - The primary, secondary, and tertiary distinctions between characters are all significant. That is, the collator will be variation-insensitive, but accent-, case-, and base-character-sensitive.
- variant or quaternary - All distinctions between characters are significant. That is, the algorithm is base character-, case-, accent-, and variation-sensitive.
- upperFirst - boolean. When collating case-sensitively in a script that has the concept of case, put upper-case characters first, otherwise lower-case will come first. Warning: some browsers do not implement this feature or at least do not implement it properly, so if you are using the native collator with this option, you may get different results in different browsers. To guarantee the same results, set useNative to false to use the ilib collator implementation. This of course will be somewhat slower, but more predictable. Default: true
- reverse - boolean. Return the list sorted in reverse order. When the upperFirst option is also set to true, upper-case characters would then come at the end of the list. Default: false.
- scriptOrder - string. When collating strings in multiple scripts, this property specifies the order in which those scripts should be sorted. The default Unicode Collation Algorithm (UCA) already has a default order for scripts, but this can be tailored via this property. The value of this option is a space-separated list of ISO 15924 script codes. If a code is specified in this property, its default data must be included using the JS assembly tool. If the data is not included, the ordering for the script will be ignored. Default: the default order defined by the UCA.
- style - The value of the style parameter is dependent on the locale.
For some locales, there are different styles of collating strings depending
on what kind of strings are being collated or what the preference of the user
is. For example, in German, there is a phonebook order and a dictionary ordering
that sort the same array of strings slightly differently.
The static method Collator#getAvailableStyles will return a list of styles that ilib
currently knows about for any given locale. If the value of the style option is
not recognized for a locale, it will be ignored. Default style is "standard".
- usage - Whether this collator will be used for searching or sorting. Valid values are simply the strings "sort" or "search". When used for sorting, it is a good idea if a collator produces a stable sort. That is, the order of the sorted array of strings should not depend on the order of the strings in the input array. As such, when a collator is supposed to act case insensitively, it nonetheless still distinguishes between case after all other criteria are satisfied so that strings that are distinguished only by case do not sort randomly. For searching, we would like to match two strings that differ only by case, so the collator must return equals in that situation instead of further distinguishing by case. Default is "sort".
- numeric - Treat the left and right strings as if they started with numbers and sort them numerically rather than lexically.
- ignorePunctuation - Skip punctuation characters when comparing the strings.
- onLoad - a callback function to call when the collator object is fully loaded. When the onLoad option is given, the collator object will attempt to load any missing locale data using the ilib loader callback. When the constructor is done (even if the data is already preassembled), the onLoad function is called with the current instance as a parameter, so this callback can be used with preassembled or dynamic loading or a mix of the two.
- sync - tell whether to load any missing locale data synchronously or asynchronously. If this option is given as "false", then the "onLoad" callback must be given, as the instance returned from this constructor will not be usable for a while.
- loadParams - an object containing parameters to pass to the loader callback function when locale data is missing. The parameters are not interpreted or modified in any way. They are simply passed along. The object may contain any property/value pairs as long as the calling code is in agreement with the loader callback function as to what those parameters mean.
- useNative - when this option is true, use the native Intl object provided by the Javascript engine, if it exists, to implement this class. If it doesn't exist, or if this parameter is false, then this class uses a pure Javascript implementation, which is slower and uses a lot more memory, but works everywhere that ilib works. Default is "true".
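As a rough illustration of the sensitivity and numeric options, the sketch below calls the standard Intl.Collator directly, which is the native object that useNative refers to. It is not ilib's own Collator class, and the option names are Intl's; they are only assumed to behave like the ilib options of the same name. Only the "base" and "variant" levels are shown.

```javascript
// Sketch using the native Intl.Collator, not ilib's Collator class.
// Assumption: the Javascript engine ships collation data for "en".
const base = new Intl.Collator("en", { sensitivity: "base" });
const variant = new Intl.Collator("en", { sensitivity: "variant" });

// "base": case and accent differences are ignored, so both results are 0.
const baseCase = base.compare("a", "A");
const baseAccent = base.compare("a", "ä");

// "variant": every distinction is significant, so both results are non-zero.
const variantCase = variant.compare("a", "A");
const variantAccent = variant.compare("a", "ä");

// numeric: digit sequences compare by value instead of character by character.
const lexical = ["10", "9", "2"].sort(new Intl.Collator("en").compare);
const numeric = ["10", "9", "2"].sort(
  new Intl.Collator("en", { numeric: true }).compare
);

console.log(lexical); // ["10", "2", "9"]
console.log(numeric); // ["2", "9", "10"]
```

Results from the native collator can vary between engines, which is the reason the upperFirst warning above suggests setting useNative to false when identical output everywhere matters.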
Operation
The Collator constructor returns a collator object tailored with the above options. The object contains an internal compare() method which compares two strings according to those options. This can be used directly to compare two strings, but is not useful for passing to the Javascript sort function because then it will not have its collation data available. Instead, use the getComparator() method to retrieve a function that is bound to the collator object. (You could also bind it yourself using ilib.bind()). The bound function can be used with the standard Javascript array sorting algorithm, or as a comparator with your own sorting algorithm.
Example using the standard Javascript array sorting call with the bound function:
var arr = ["ö", "oe", "ü", "o", "a", "ae", "u", "ß", "ä"];
var collator = new Collator({locale: 'de-DE', style: "dictionary"});
arr.sort(collator.getComparator());
console.log(JSON.stringify(arr));
Would give the output:
["a", "ae", "ä", "o", "oe", "ö", "ß", "u", "ü"]
When sorting an array of Javascript objects according to one of the
string properties of the objects, wrap the collator's compare function
in your own comparator function that knows the structure of the objects
being sorted:
var collator = new Collator({locale: 'de-DE'});
var myComparator = function (collator) {
    var comparator = collator.getComparator();
    // left and right are your own objects
    return function (left, right) {
        return comparator(left.x.y.textProperty, right.x.y.textProperty);
    };
};
arr.sort(myComparator(collator));
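The reason for using getComparator() rather than passing compare() directly can be reproduced with a toy class. In the sketch below, MiniCollator and its rank table are hypothetical stand-ins and not ilib's implementation; they only show how an unbound method loses access to the instance data it needs.

```javascript
// Hypothetical stand-in for a collator; not ilib's implementation.
class MiniCollator {
  constructor(alphabet) {
    // Map each character to its position in the given alphabet.
    this.rank = new Map([...alphabet].map((ch, i) => [ch, i]));
  }

  // Compares only the first character of each string, using this.rank.
  compare(left, right) {
    return this.rank.get(left[0]) - this.rank.get(right[0]);
  }

  // Returns compare() bound to this instance so it keeps access to this.rank.
  getComparator() {
    return this.compare.bind(this);
  }
}

const mini = new MiniCollator("zyx"); // deliberately reversed alphabet
const sorted = ["x", "z", "y"].sort(mini.getComparator());
console.log(sorted); // ["z", "y", "x"]

// Passing the method itself detaches it from the instance.
let unboundFailed = false;
try {
  ["x", "z", "y"].sort(mini.compare); // `this` is undefined inside compare
} catch (e) {
  unboundFailed = e instanceof TypeError;
}
console.log(unboundFailed); // true
```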
Sort Keys
The collator class also has a method to retrieve the sort key for a string. The sort key is an array of values that represent how each character in the string should be collated according to the characteristics of the collation algorithm and the given options. Thus, sort keys can be compared directly value-for-value with other sort keys that were generated by the same collator, and the resulting ordering is guaranteed to be the same as if the original strings were compared by the collator. Sort keys generated by different collators are not guaranteed to give any reasonable results when compared together unless the two collators were constructed with exactly the same options and therefore end up representing the exact same collation sequence.
A good rule of thumb is that you would use a sort key if you had 10 or more items to sort or if your array might be resorted arbitrarily. For example, if your user interface was displaying a table with 100 rows in it, and each row had 4 sortable text columns which could be sorted in ascending or descending order, the recommended practice would be to generate a sort key for each of the 4 sortable fields in each row and store that in the Javascript representation of the table data. Then, when the user clicks on a column header to resort the table according to that column, the resorting would be relatively quick because it would only be comparing arrays of values, and not recalculating the collation values for each character in each string for every comparison.
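The precompute-then-compare pattern described above might look like the following sketch. The toySortKey and compareKeys functions are hypothetical: the key here is just an array of lower-cased code points, whereas a real collator would emit collation weights derived from its locale and options.

```javascript
// Hypothetical sort key: one number per character (lower-cased code point).
// A real collator would produce collation weights instead.
function toySortKey(str) {
  return Array.from(str.toLowerCase(), (ch) => ch.codePointAt(0));
}

// Compare two sort keys value for value; a key that is a prefix of the
// other sorts first.
function compareKeys(a, b) {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    if (a[i] !== b[i]) return a[i] - b[i];
  }
  return a.length - b.length;
}

// Compute each key once and store it alongside the row data.
const rows = [{ name: "pear" }, { name: "Apple" }, { name: "fig" }].map(
  (row) => ({ ...row, nameKey: toySortKey(row.name) })
);

// Re-sorting in either direction only compares the stored number arrays.
const ascending = [...rows].sort((l, r) => compareKeys(l.nameKey, r.nameKey));
const descending = [...rows].sort((l, r) => compareKeys(r.nameKey, l.nameKey));

console.log(ascending.map((row) => row.name)); // ["Apple", "fig", "pear"]
console.log(descending.map((row) => row.name)); // ["pear", "fig", "Apple"]
```

With the collator itself, the stored value would come from sortKey(str), and keys from collators constructed with different options should not be mixed.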
For tables that are large, it is usually a better idea to do the sorting on the server side, especially if the table is the result of a database query. In this case, the table is usually a view of the cursor of a large results set, and only a few entries are sent to the front end at a time. In order to sort the set efficiently, it should be done on the database level instead.
Data
Doing correct collation entails a huge amount of mapping data, much of which is not necessary when collating in one language with one script, which is the most common case. Thus, ilib implements a number of ways to include the data you need or leave out the data you don't need using the JS assembly tool:
- Full multilingual data - if you are sorting multilingual data and need to collate text written in multiple scripts, you can use the directive "!data collation/ducet" to load in the full collation data. This allows the collator to perform the entire Unicode Collation Algorithm (UCA) based on the Default Unicode Collation Element Table (DUCET). The data is very large, on the order of multiple megabytes, but sometimes it is necessary.
- A few scripts - if you are sorting text written in only a few scripts, you may want to include only the data for those scripts. Each ISO 15924 script code has its own data available in a separate file, so you can use the data directive to include only the data for the scripts you need. For example, use "!data collation/Latn" to retrieve the collation information for the Latin script. Because the "ducet" table mentioned in the previous point is a superset of the tables for all other scripts, you do not need to include explicitly the data for any particular script when using "ducet". That is, you either include "ducet" or you include a specific list of scripts.
- Only one script - if you are sorting text written only in one script, you can either include the data directly as in the previous point, or you can rely on the locale to include the correct data for you. In this case, you can use the directive "!data collate" to load in the locale's collation data for its most common script.
If this collator encounters a character for which it has no collation data, it will
sort those characters by pure Unicode value after all characters for which it does have
collation data. For example, if you only loaded in the German collation data (i.e. the
data for the Latin script tailored to German) to sort a list of person names, but that
list happens to include the names of a few Japanese people written in Japanese
characters, the Japanese names will sort at the end of the list after all German names,
and will sort according to the Unicode values of the characters.
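The fallback rule in the preceding paragraph can be sketched for single characters as follows. The rank table and compareChars function are hypothetical and stand in for real collation data; they are not how ilib stores or looks up its data.

```javascript
// Hypothetical collation table covering only a few Latin letters.
const rank = new Map([["a", 0], ["ä", 1], ["b", 2], ["z", 3]]);

// Compare two single characters. Characters with collation data sort by
// rank and come before characters without data, which sort among
// themselves by Unicode code point.
function compareChars(left, right) {
  const l = rank.get(left);
  const r = rank.get(right);
  if (l !== undefined && r !== undefined) return l - r;
  if (l !== undefined) return -1; // only left has collation data
  if (r !== undefined) return 1; // only right has collation data
  return left.codePointAt(0) - right.codePointAt(0);
}

const result = ["田", "z", "あ", "ä", "a"].sort(compareChars);
// Characters with data first, then あ (U+3042) before 田 (U+7530).
console.log(result); // ["a", "ä", "z", "あ", "田"]
```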
Defined in: ilib-full-dyn.js.
Constructor Attributes | Constructor Name and Description |
---|---|
 | Collator(options) |

Method Attributes | Method Name and Description |
---|---|
 | compare(left, right): Compare two strings together according to the rules of this collator instance. |
<static> | Collator.getAvailableScripts(): Retrieve the list of ISO 15924 script codes that are available in this copy of ilib. |
<static> | Collator.getAvailableStyles(locale): Retrieve the list of collation style names that are available for the given locale. |
 | getComparator(): Return a comparator function that can compare two strings together according to the rules of this collator instance. |
 | sortKey(str): Return a sort key string for the given string. |
Collator(options)
- Parameters:
- {Object} options
- options governing how the resulting comparator function will operate
compare(left, right)
- Parameters:
- {string} left
- the left string to compare
- {string} right
- the right string to compare
- Returns:
- {number} a negative number if left comes before right, a positive number if right comes before left, and zero if left and right are equivalent according to this collator
Collator.getAvailableScripts()
- Returns:
- {Array} an array of ISO 15924 script codes that are available
Collator.getAvailableStyles(locale)
- Parameters:
- {Locale|string=} locale
- The locale for which the available styles are being sought
- Returns:
- {Array} an array of style names that are available for the given locale
getComparator()
- Returns:
- {function(...)|undefined} a comparator function that can compare two strings together according to the rules of this collator instance
sortKey(str)
The sort key string can be treated as a regular, albeit somewhat odd-looking, string. That is, it can be passed to regular Javascript functions without problems.
- Parameters:
- {string} str
- the original string to generate the sort key for
- Returns:
- {string} a sort key string for the given string