Class NormString
Extends GlyphString.
Create a new normalized string instance. This string inherits from
the GlyphString class, and adds the normalize method. It can be
used anywhere that a normal JavaScript string is used.
Defined in: ilib-full-dyn.js.
Constructor Attributes | Constructor Name and Description |
---|---|
 | NormString(str) |
Method Attributes | Method Name and Description |
---|---|
&lt;static&gt; | NormString.init(options) — Initialize the normalized string routines statically. |
 | normalize(form) — Perform the Unicode Normalization Algorithm upon the string and return the resulting new string. |
- Methods borrowed from class GlyphString:
- ellipsize, truncate
- Methods borrowed from class IString:
- charAt, charCodeAt, codePointAt, codePointLength, concat, forEach, forEachCodePoint, format, formatChoice, getLocale, indexOf, iterator, lastIndexOf, match, replace, search, setLocale, slice, split, substr, substring, toLowerCase, toString, toUpperCase, valueOf
- Parameters:
- {string|IString=} str
- initialize this instance with this string
The options parameter may contain any of the following properties:
- form - {string} the normalization form to load
- script - {string} load the normalization for this script. If the script is given as "all" then the normalization data for all scripts is loaded at the same time
- sync - {boolean} whether to load the files synchronously or not
- loadParams - {Object} parameters to the loader function
- onLoad - {function()} a function to call when the files are done being loaded
- Parameters:
- {Object} options
- an object containing properties that govern how to initialize the data
Forms
The forms of possible normalizations are defined by the Unicode Standard Annex (UAX) #15. The form parameter is a string that may have one of the following values:
- nfd - Canonical decomposition. This decomposes characters into their exactly equivalent forms. For example, "ü" would decompose into a "u" followed by the combining diaeresis character.
- nfc - Canonical decomposition followed by canonical composition. This decomposes and then recomposes characters into their shortest exactly equivalent forms by recomposing as many combining characters as possible. For example, "ü" followed by a combining macron character would decompose into a "u" followed by the combining diaeresis character and the combining macron character, and then be recomposed into the single "u with diaeresis and macron" character "ǖ". The reason that the "nfc" form decomposes and then recomposes is that combining characters have a specific order under the Unicode Normalization Algorithm, and partly composed characters such as the "ü" followed by combining marks may change the order of the combining marks when decomposed and recomposed.
- nfkd - Compatibility decomposition. This decomposes characters into compatible forms that may not be exactly equivalent semantically, and also performs canonical decomposition. For example, the "ﬁ" ligature character decomposes to the two characters "fi" because they are compatible, even though they are not exactly the same semantically.
- nfkc - Compatibility decomposition followed by canonical composition. This decomposes characters into compatible forms, then recomposes characters using the canonical composition. That is, it breaks down characters into their compatible forms, and then recombines all combining marks it can with their base characters. For example, the "ﬁ" ligature followed by a combining acute accent would first be decomposed into "f", "i", and the combining acute accent, and then the "i" and the accent would be recomposed into "í", yielding "fí".
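As a point of comparison, the same four forms (under their uppercase names) are implemented by the String.prototype.normalize() method built into modern JavaScript engines; the following sketch illustrates the behaviour described above using that built-in method rather than ilib:

```javascript
// nfd: "ü" (U+00FC) decomposes to "u" + combining diaeresis (U+0308)
console.log("\u00FC".normalize("NFD") === "u\u0308");        // true

// nfc: "u" + combining diaeresis recomposes to the single "ü" character
console.log("u\u0308".normalize("NFC") === "\u00FC");        // true

// nfkd: the "ﬁ" ligature (U+FB01) decomposes to the compatible pair "fi"
console.log("\uFB01".normalize("NFKD") === "fi");            // true

// nfkc: "ﬁ" + combining acute decomposes to "f", "i", and the acute,
// then the "i" and the acute recompose into "í" (U+00ED)
console.log("\uFB01\u0301".normalize("NFKC") === "f\u00ED"); // true
```

With ilib, the equivalent calls would use the lowercase form names on a NormString instance, e.g. `new NormString(str).normalize("nfc")`, once the normalization data for the relevant scripts has been loaded.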
Operation
Two strings a and b can be said to be canonically equivalent if normalize(a) = normalize(b) under the nfc normalization form. Two strings can be said to be compatible if normalize(a) = normalize(b) under the nfkc normalization form.

Canonical normalization is often used to check whether strings are equivalent to each other, and thus is useful when implementing parsing algorithms or exact matching algorithms. It can also be used to ensure that any string output produces a predictable sequence of characters.
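A canonical-equivalence check of this kind can be sketched with the standard built-in String.prototype.normalize() as a stand-in for NormString's normalize method:

```javascript
// The same text "café" represented as two different code point sequences
var precomposed = "caf\u00E9";  // "é" as a single precomposed code point
var decomposed  = "cafe\u0301"; // "e" followed by a combining acute accent

// Raw comparison sees two different sequences
console.log(precomposed === decomposed); // false

// After nfc normalization the two strings are canonically equivalent
console.log(precomposed.normalize("NFC") === decomposed.normalize("NFC")); // true
```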
Compatibility normalization does not always preserve the semantic meaning of all the characters, although this is sometimes the behaviour that you are after. It is useful, for example, when doing searches of user input against text in documents where the matches are supposed to be "fuzzy". In this case, both the query string and the document string would be mapped to their compatibility-normalized forms, and then compared.
Compatibility normalization also does not guarantee round-trip conversion to and from legacy character sets, as the normalization is "lossy". It is akin to doing a lower- or upper-case conversion on text -- after casing, you cannot tell what case each character had in the original string. It is good for matching and searching, but rarely good for output, because some distinctions or meanings in the original text have been lost.
Note that W3C normalization for HTML also escapes and unescapes HTML character entities such as "&uuml;" for u with diaeresis. This method does not do such escaping or unescaping. If normalization is required for HTML strings with entities, unescaping should be performed on the string prior to calling this method.
Data
Normalization requires a fair amount of mapping data, much of which you may not need for the characters expected in your texts. It is possible to assemble a copy of ilib that saves space by only including normalization data for those scripts that you expect to encounter in your data.

The normalization data is organized by normalization form, and within that by script. To include the normalization data for a particular script with a particular normalization form, use the directive:
!depends <form>/<script>.js
Where <form> is the normalization form ("nfd", "nfc", "nfkd", or "nfkc"), and
<script> is the ISO 15924 code for the script you would like to
support. Example: to load in the NFC data for Cyrillic, you would use:
!depends nfc/Cyrl.js
Note that because certain normalization forms include others in their algorithm,
their data also depends on the data for the other forms. For example, if you
include the "nfc" data for a script, you will automatically get the "nfd" data
for that same script as well because the NFC algorithm does NFD normalization
first. Here are the dependencies:
- NFD -> no dependencies
- NFC -> NFD
- NFKD -> NFD
- NFKC -> NFKD, NFD, NFC
By default, none of the normalization data is automatically included in the preassembled iliball.js file. If you would like to normalize strings, you must assemble your own copy of ilib and explicitly include the normalization data for the scripts you need, as per the instructions above. This normalize method will still produce output even without the normalization data; however, the output will simply be the same as its input for all scripts except Korean Hangul and Jamo, which are decomposed and recomposed algorithmically and therefore do not rely on mapping data.
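The algorithmic Hangul handling can be seen with the built-in String.prototype.normalize(), which likewise decomposes and recomposes Hangul syllables without consulting mapping tables:

```javascript
// The Hangul syllable 가 (U+AC00) decomposes algorithmically into its
// Jamo components: ᄀ (U+1100) + ᅡ (U+1161)
console.log("\uAC00".normalize("NFD") === "\u1100\u1161"); // true

// ... and the Jamo sequence recomposes back into the syllable
console.log("\u1100\u1161".normalize("NFC") === "\uAC00"); // true
```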
If characters are encountered for which there is no normalization data, they will be passed through to the output string unmodified.
- Parameters:
- {string} form
- The normalization form requested
- Returns:
- {IString} a new instance of an IString that has been normalized according to the requested form. The current instance is not modified.