A class that implements a locale-sensitive comparator function
for use with sorting functions. The comparator function
assumes that the strings it is comparing contain Unicode characters
encoded in UTF-16.
Collations usually depend only on the language, because most collation orders
are shared between locales that use the same language. There are, however, a
number of instances where a locale collates differently than other locales
that share the same language. There are also a number of instances where a
locale collates differently based on the script used. This object can handle
these cases automatically if a full locale is specified in the options rather
than just a language code.
Options
The options parameter can contain any of the following properties:
- locale - String|Locale. The locale which the comparator function
will collate with. Default: the current iLib locale.
- sensitivity - String. Sensitivity or strength of collator. This is one of
"primary", "base", "secondary", "accent", "tertiary", "case", "quaternary", or
"variant". Default: "primary"
- base or primary - Only the primary distinctions between characters are significant.
Another way of saying that is that the collator will be case-, accent-, and
variation-insensitive, and only distinguish between the base characters.
- case or secondary - Both the primary and secondary distinctions between characters
are significant. That is, the collator will be accent- and variation-insensitive
and will distinguish between base characters and character case.
- accent or tertiary - The primary, secondary, and tertiary distinctions between
characters are all significant. That is, the collator will be
variation-insensitive, but accent-, case-, and base-character-sensitive.
- variant or quaternary - All distinctions between characters are significant. That is,
the algorithm is base character-, case-, accent-, and variation-sensitive.
- upperFirst - boolean. When collating case-sensitively in a script that
has the concept of case, a value of true puts upper-case characters first;
otherwise, lower-case characters come first. Warning: some browsers do
not implement this feature or at least do not implement it properly, so if you are
using the native collator with this option, you may get different results in different
browsers. To guarantee the same results, set useNative to false to use the ilib
collator implementation. This of course will be somewhat slower, but more
predictable. Default: true
- reverse - boolean. Return the list sorted in reverse order. When the
upperFirst option is also set to true, upper-case characters would then come at
the end of the list. Default: false.
- scriptOrder - string. When collating strings in multiple scripts,
this property specifies the order in which those scripts should be sorted. The
Unicode Collation Algorithm (UCA) already has a default order for scripts, but
this can be tailored via this property. The value of this option is a
space-separated list of ISO 15924 script codes. If a code is specified in this
property, its default data must be included using the JS assembly tool. If the
data is not included, the ordering for the script will be ignored. Default:
the default order defined by the UCA.
- style - The value of the style parameter is dependent on the locale.
For some locales, there are different styles of collating strings depending
on what kind of strings are being collated or what the preference of the user
is. For example, in German, there is a phonebook order and a dictionary ordering
that sort the same array of strings slightly differently.
The static method Collator#getAvailableStyles will return a list of styles that ilib
currently knows about for any given locale. If the value of the style option is
not recognized for a locale, it will be ignored. Default style is "standard".
- usage - Whether this collator will be used for searching or sorting.
Valid values are simply the strings "sort" or "search". When used for sorting,
it is a good idea if a collator produces a stable sort. That is, the order of the
sorted array of strings should not depend on the order of the strings in the
input array. As such, when a collator is supposed to act case-insensitively,
it nonetheless still distinguishes between case after all other criteria
are satisfied so that strings that are distinguished only by case do not sort
randomly. For searching, we would like to match two strings that differ only
by case, so the collator must return equals in that situation instead of
further distinguishing by case. Default is "sort".
- numeric - Treat the left and right strings as if they started with
numbers and sort them numerically rather than lexically.
- ignorePunctuation - Skip punctuation characters when comparing the
strings.
- onLoad - a callback function to call when the collator object is fully
loaded. When the onLoad option is given, the collator object will attempt to
load any missing locale data using the ilib loader callback.
When the constructor is done (even if the data is already preassembled), the
onLoad function is called with the current instance as a parameter, so this
callback can be used with preassembled or dynamic loading or a mix of the two.
- sync - tell whether to load any missing locale data synchronously or
asynchronously. If this option is given as "false", then the "onLoad"
callback must be given, as the instance returned from this constructor will
not be usable for a while.
- loadParams - an object containing parameters to pass to the
loader callback function when locale data is missing. The parameters are not
interpreted or modified in any way. They are simply passed along. The object
may contain any property/value pairs as long as the calling code is in
agreement with the loader callback function as to what those parameters mean.
- useNative - when this option is true, use the native Intl object
provided by the Javascript engine, if it exists, to implement this class. If
it doesn't exist, or if this parameter is false, then this class uses a pure
Javascript implementation, which is slower and uses a lot more memory, but
works everywhere that ilib works. Default is "true".
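As a rough illustration of how sensitivity and numeric comparison change the result of a comparison, the sketch below calls the engine's native Intl.Collator directly rather than the ilib Collator. That substitution is an assumption made to keep the example self-contained: the option names used here ("base", "variant", numeric) belong to Intl.Collator, they may not map one-to-one onto the ilib options described above, and the helper nativeCompare is hypothetical.

```javascript
// Stand-in sketch: this uses the native Intl.Collator, NOT the ilib Collator.
// nativeCompare() is a hypothetical helper for this example only.
function nativeCompare(options, left, right) {
    return new Intl.Collator("en", options).compare(left, right);
}

// "base": only base characters matter, so case and accents are ignored.
var baseCase = nativeCompare({sensitivity: "base"}, "a", "A");
var baseAccent = nativeCompare({sensitivity: "base"}, "a", "á");

// "variant": every distinction between characters matters.
var variantCase = nativeCompare({sensitivity: "variant"}, "a", "A");
var variantAccent = nativeCompare({sensitivity: "variant"}, "a", "á");

// numeric: digit sequences compare by value instead of character by character.
var lexical = nativeCompare({numeric: false}, "10", "9");
var numeric = nativeCompare({numeric: true}, "10", "9");

console.log(baseCase === 0, baseAccent === 0);        // equal at "base"
console.log(variantCase !== 0, variantAccent !== 0);  // distinct at "variant"
console.log(lexical < 0, numeric > 0);                // "10" sorts before "9" only lexically
```

Only the sign of each result is checked, since the exact non-zero value returned by a comparator is not something callers should rely on.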
Operation
The Collator constructor returns a collator object tailored with the above
options. The object contains an internal compare() method which compares two
strings according to those options. This method can be called directly on the
collator to compare two strings, but it should not be passed as-is to the
Javascript sort function, because it would then be called without its collator
object and would not have its collation data available. Instead, use the
getComparator() method to retrieve a function that is bound to the collator
object. (You could also bind it yourself using ilib.bind().) The bound function
can be used with the standard Javascript array sorting algorithm, or as a
comparator with your own sorting algorithm.
Example using the standard Javascript array sorting call with the bound
function:
var arr = ["ö", "oe", "ü", "o", "a", "ae", "u", "ß", "ä"];
var collator = new Collator({locale: 'de-DE', style: "dictionary"});
arr.sort(collator.getComparator());
console.log(JSON.stringify(arr));
Would give the output:
["a", "ae", "ä", "o", "oe", "ö", "ß", "u", "ü"]
When sorting an array of Javascript objects according to one of the
string properties of the objects, wrap the collator's compare function
in your own comparator function that knows the structure of the objects
being sorted:
var collator = new Collator({locale: 'de-DE'});
var myComparator = function (collator) {
    var comparator = collator.getComparator();
    // left and right are your own objects
    return function (left, right) {
        return comparator(left.x.y.textProperty, right.x.y.textProperty);
    };
};
arr.sort(myComparator(collator));
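The reason a bound function is needed can be sketched without ilib at all. The class below is a hypothetical stand-in (MiniCollator is not part of ilib) that only mimics the compare()/getComparator() shape described in this section: its compare() method reads state from the instance, so it fails when the sort function calls it without that instance.

```javascript
// Hypothetical stand-in class; it only mimics the compare()/getComparator()
// shape described above and is not the ilib implementation.
class MiniCollator {
    constructor(ignoreCase) {
        this.ignoreCase = ignoreCase;   // instance state that compare() depends on
    }

    compare(left, right) {
        var l = this.ignoreCase ? left.toLowerCase() : left;
        var r = this.ignoreCase ? right.toLowerCase() : right;
        return l < r ? -1 : (l > r ? 1 : 0);
    }

    getComparator() {
        // Bind compare() so that "this" still refers to the collator
        // when Array.prototype.sort calls the function on its own.
        return this.compare.bind(this);
    }
}

var mini = new MiniCollator(true);
var words = ["pear", "Apple", "banana"];

// The bound comparator works as expected.
var sorted = words.slice().sort(mini.getComparator());

// The unbound method loses its instance: "this" is undefined inside compare().
var unboundFailed = false;
try {
    words.slice().sort(mini.compare);
} catch (e) {
    unboundFailed = e instanceof TypeError;
}

console.log(sorted);         // ["Apple", "banana", "pear"]
console.log(unboundFailed);  // true
```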
Sort Keys
The collator class also has a method to retrieve the sort key for a
string. The sort key is an array of values that represent how each
character in the string should be collated according to the characteristics
of the collation algorithm and the given options. Thus, sort keys can be
compared directly value-for-value with other sort keys that were generated
by the same collator, and the resulting ordering is guaranteed to be the
same as if the original strings were compared by the collator.
Sort keys generated by different collators are not guaranteed to give
any reasonable results when compared together unless the two collators
were constructed with
exactly the same options and therefore end up representing the exact same
collation sequence.
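The relationship between a comparator and its sort keys can be sketched with a toy collation, here a case-insensitive ordering by UTF-16 code unit. The helpers toySortKey, compareKeys, and toyCompare are hypothetical and are not ilib functions; a real collator's keys encode far more than this, but the value-for-value comparison works the same way.

```javascript
// Toy collation: case-insensitive ordering by UTF-16 code unit.
// toySortKey(), compareKeys() and toyCompare() are hypothetical helpers.
function toySortKey(str) {
    var key = [];
    var lower = str.toLowerCase();
    for (var i = 0; i < lower.length; i++) {
        key.push(lower.charCodeAt(i));   // one collation value per code unit
    }
    return key;
}

// Compare two keys value-for-value; a key that is a prefix of another sorts first.
function compareKeys(a, b) {
    var len = Math.min(a.length, b.length);
    for (var i = 0; i < len; i++) {
        if (a[i] !== b[i]) {
            return a[i] - b[i];
        }
    }
    return a.length - b.length;
}

// The direct comparator redoes the case folding on every call.
function toyCompare(left, right) {
    var l = left.toLowerCase();
    var r = right.toLowerCase();
    return l < r ? -1 : (l > r ? 1 : 0);
}

var names = ["delta", "Alpha", "charlie", "Bravo", "alphabet"];

var byComparator = names.slice().sort(toyCompare);

var byKey = names
    .map(function (name) { return {name: name, key: toySortKey(name)}; })
    .sort(function (l, r) { return compareKeys(l.key, r.key); })
    .map(function (entry) { return entry.name; });

console.log(byComparator);  // ["Alpha", "alphabet", "Bravo", "charlie", "delta"]
console.log(byKey);         // same order as byComparator
```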
A good rule of thumb is that you would use a sort key if you had 10 or more
items to sort or if your array might be resorted arbitrarily. For example, if your
user interface was displaying a table with 100 rows in it, and each row had
4 sortable text columns which could be sorted in ascending or descending order,
the recommended practice would be to generate a sort key for each of the 4
sortable fields in each row and store that in the Javascript representation of the
table data. Then, when the user clicks on a column header to resort the
table according to that column, the resorting would be relatively quick
because it would only be comparing arrays of values, and not recalculating
the collation values for each character in each string for every comparison.
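The table scenario above might look like the following sketch. It assumes a hypothetical makeKey function standing in for the collator's sort key method, and a counter is included only to show that keys are computed once per cell no matter how often the table is re-sorted.

```javascript
var keyCalls = 0;

// Hypothetical key function standing in for a collator's sort key method.
function makeKey(str) {
    keyCalls++;
    return str.toLowerCase().split("").map(function (ch) {
        return ch.charCodeAt(0);
    });
}

// Compare two stored keys value-for-value.
function compareKeys(a, b) {
    var len = Math.min(a.length, b.length);
    for (var i = 0; i < len; i++) {
        if (a[i] !== b[i]) {
            return a[i] - b[i];
        }
    }
    return a.length - b.length;
}

// Compute the keys once, when the table data is built.
function buildRows(records, columns) {
    return records.map(function (record) {
        var keys = {};
        columns.forEach(function (col) {
            keys[col] = makeKey(record[col]);
        });
        return {data: record, keys: keys};
    });
}

// Re-sort by any column in either direction using only the stored keys.
function sortRows(rows, column, descending) {
    var sign = descending ? -1 : 1;
    return rows.sort(function (l, r) {
        return sign * compareKeys(l.keys[column], r.keys[column]);
    });
}

var rows = buildRows([
    {name: "Meyer", city: "Bonn"},
    {name: "adler", city: "Ulm"},
    {name: "Zorn", city: "Aachen"}
], ["name", "city"]);

sortRows(rows, "name", false);
var byName = rows.map(function (row) { return row.data.name; });

sortRows(rows, "city", true);
var byCityDesc = rows.map(function (row) { return row.data.city; });

console.log(byName);      // ["adler", "Meyer", "Zorn"]
console.log(byCityDesc);  // ["Ulm", "Bonn", "Aachen"]
console.log(keyCalls);    // 6: one key per cell, none added by re-sorting
```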
For tables that are large, it is usually a better idea to do the sorting
on the server side, especially if the table is the result of a database
query. In this case, the table is usually a view of the cursor of a large
results set, and only a few entries are sent to the front end at a time.
In order to sort the set efficiently, it should be done on the database
level instead.
Data
Doing correct collation entails a huge amount of mapping data, much of which is
not necessary when collating in one language with one script, which is the most
common case. Thus, ilib implements a number of ways to include the data you
need or leave out the data you don't need using the JS assembly tool:
- Full multilingual data - if you are sorting multilingual data and need to collate
text written in multiple scripts, you can use the directive "!data collation/ducet" to
load in the full collation data. This allows the collator to perform the entire
Unicode Collation Algorithm (UCA) based on the Default Unicode Collation Element
Table (DUCET). The data is very large, on the order of multiple megabytes, but
sometimes it is necessary.
- A few scripts - if you are sorting text written in only a few scripts, you may
want to include only the data for those scripts. Each ISO 15924 script code has its
own data available in a separate file, so you can use the data directive to include
only the data for the scripts you need. For example, use
"!data collation/Latn" to retrieve the collation information for the Latin script.
Because the "ducet" table mentioned in the previous point is a superset of the
tables for all other scripts, you do not need to include explicitly the data for
any particular script when using "ducet". That is, you either include "ducet" or
you include a specific list of scripts.
- Only one script - if you are sorting text written only in one script, you can
either include the data directly as in the previous point, or you can rely on the
locale to include the correct data for you. In this case, you can use the directive
"!data collate" to load in the locale's collation data for its most common script.
With any of the above ways of including the data, the collator will perform correct
language-sensitive sorting only for the given locale. All other scripts will be
sorted in the default manner according to the UCA. For example, if you include the
"ducet" data and pass in "de-DE" (German for Germany) as the locale spec, then
only the Latin script (the default script for German) will be sorted according to
German rules. All other scripts in the DUCET, such as Japanese or Arabic, will use
the default UCA collation rules.
If this collator encounters a character for which it has no collation data, it will
sort those characters by pure Unicode value after all characters for which it does have
collation data. For example, if you only loaded in the German collation data (i.e. the
data for the Latin script tailored to German) to sort a list of person names, but that
list happens to include the names of a few Japanese people written in Japanese
characters, the Japanese names will sort at the end of the list after all German names,
and will sort according to the Unicode values of the characters.
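The fallback behavior described above can be sketched with a toy weight table. Everything here is hypothetical and greatly simplified: KNOWN stands in for loaded collation data covering a few Latin letters, and weightOf places every character without data after all known characters, ordered by its Unicode code point. It is not how ilib stores or applies its collation data.

```javascript
// Toy stand-in for loaded collation data: a few Latin letters in sort order.
// KNOWN, weightOf() and fallbackCompare() are hypothetical, for illustration.
var KNOWN = "aäbcdefghijklmnoöpqrsßtuüvwxyz";
var UNKNOWN_BASE = KNOWN.length;   // every unknown character sorts above this

function weightOf(ch) {
    var index = KNOWN.indexOf(ch.toLowerCase());
    // Known characters use their table position; unknown characters fall
    // back to their Unicode code point, offset past all known weights.
    return index >= 0 ? index : UNKNOWN_BASE + ch.codePointAt(0);
}

function fallbackCompare(left, right) {
    var l = Array.from(left);      // iterate by code point, not code unit
    var r = Array.from(right);
    var len = Math.min(l.length, r.length);
    for (var i = 0; i < len; i++) {
        var diff = weightOf(l[i]) - weightOf(r[i]);
        if (diff !== 0) {
            return diff;
        }
    }
    return l.length - r.length;
}

var people = ["田中", "Müller", "鈴木", "Abel", "Mueller"];
people.sort(fallbackCompare);

console.log(people);  // ["Abel", "Mueller", "Müller", "田中", "鈴木"]
```

The two Japanese names end up after all names covered by the table, and 田中 (U+7530) precedes 鈴木 (U+9234) purely because of code point order.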