Class NormString

Extends GlyphString.
Create a new normalized string instance. This string inherits from the GlyphString class, and adds the normalize method. It can be used anywhere that a normal JavaScript string is used.

The options parameter is optional, and may contain any combination of the following properties:

onLoad - a callback function to call when the locale data are fully loaded. When the onLoad option is given, this object will attempt to load any missing locale data using the ilib loader callback. When the constructor is done (even if the data is already preassembled), the onLoad function is called with the current instance as a parameter, so this callback can be used with preassembled or dynamic loading or a mix of the two.
sync - tell whether to load any missing locale data synchronously or asynchronously. If this option is given as "false", then the "onLoad" callback must be given, as the instance returned from this constructor will not be usable for a while.
loadParams - an object containing parameters to pass to the loader callback function when locale data is missing. The parameters are not interpreted or modified in any way. They are simply passed along. The object may contain any property/value pairs as long as the calling code is in agreement with the loader callback function as to what those parameters mean.
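
As a rough illustration of how these options fit together, the sketch below creates an instance asynchronously and hands it to a callback. It assumes the NormString constructor is supplied by the caller, because the import path depends on how ilib is packaged. The helper name createNormString and the contents of loadParams are hypothetical.

```javascript
// Sketch: create a NormString asynchronously and pass it to a callback.
// NormStringClass is injected rather than imported, because the module path
// depends on how ilib is packaged in your project (an assumption here).
function createNormString(NormStringClass, text, done) {
  new NormStringClass(text, {
    sync: false,                       // do not block while data loads
    loadParams: { base: "/locale" },   // hypothetical; passed to the loader untouched
    onLoad: function (instance) {
      // Called with the finished instance, whether the data was
      // preassembled or had to be loaded dynamically.
      done(instance);
    }
  });
}
```

Because the instance may not be usable when the constructor returns, the sketch only touches it inside onLoad.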

Defined in: NormString.js.

Class Summary
Constructor Attributes	Constructor Name and Description
	NormString(str, options)

Method Summary
Method Attributes	Method Name and Description
	charIterator()
<static>	NormString.init(options) Initialize the normalized string routines statically.
	normalize(form) Perform the Unicode Normalization Algorithm upon the string and return the resulting new string.

Methods borrowed from class GlyphString: ellipsize, truncate
Methods borrowed from class IString: charAt, charCodeAt, codePointAt, codePointLength, concat, forEach, forEachCodePoint, format, formatChoice, getLocale, indexOf, iterator, lastIndexOf, match, replace, search, setLocale, slice, split, substr, substring, toLowerCase, toString, toUpperCase, valueOf

Class Detail

NormString(str, options)

Parameters:
{string|IString=} str: initialize this instance with this string
{Object=} options: options governing the way this instance works

Method Detail

{Object} charIterator()

Returns: {Object} an iterator that iterates through all the characters in the string

<static> NormString.init(options)

Initialize the normalized string routines statically. This is intended to be called in a dynamic-load version of ilib to load the data needed to normalize strings before any instances of NormString are created.

The options parameter may contain any of the following properties:

form - {string} the normalization form to load
script - {string} load the normalization for this script. If the script is given as "all" then the normalization data for all scripts is loaded at the same time
sync - {boolean} whether to load the files synchronously or not
loadParams - {Object} parameters to the loader function
onLoad - {function()} a function to call when the files are done being loaded

Parameters:
{Object} options: an object containing properties that govern how to initialize the data
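
A minimal sketch of preloading data and then normalizing, assuming the class is passed in by the caller and that an instance created after init has finished can normalize right away. The helper name preloadAndNormalize is hypothetical.

```javascript
// Sketch: preload normalization data, then normalize once it is available.
// NormStringClass is injected; how it is imported depends on your ilib build.
function preloadAndNormalize(NormStringClass, text, form, script, done) {
  NormStringClass.init({
    form: form,      // e.g. "nfc"
    script: script,  // ISO 15924 code such as "Cyrl", or "all"
    sync: false,     // load the files asynchronously
    onLoad: function () {
      // Assumption: the data requested above is now loaded, so a new
      // instance does not need to load anything further.
      var result = new NormStringClass(text).normalize(form);
      done(result.toString());
    }
  });
}
```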

{IString} normalize(form)

Perform the Unicode Normalization Algorithm upon the string and return the resulting new string. The current string is not modified.

Forms

The forms of possible normalizations are defined by the Unicode Standard Annex (UAX) 15. The form parameter is a string that may have one of the following values:

nfd - Canonical decomposition. This decomposes characters into their exactly equivalent forms. For example, "ü" would decompose into a "u" followed by the combining diaeresis character.
nfc - Canonical decomposition followed by canonical composition. This decomposes and then recomposes characters into their shortest exactly equivalent forms by recomposing as many combining characters as possible. For example, "ü" followed by a combining macron character would decompose into a "u" followed by the combining diaeresis character and the combining macron character, and then be recomposed into the single u with diaeresis and macron character "ǖ". The reason that the "nfc" form decomposes and then recomposes is that combining characters have a specific order under the Unicode Normalization Algorithm, and partly composed characters such as the "ü" followed by combining marks may change the order of the combining marks when decomposed and recomposed.
nfkd - Compatibility decomposition. This performs canonical decomposition and also decomposes characters into compatible forms that may not be exactly equivalent semantically. For example, the "ﬁ" ligature character decomposes to the two characters "fi" because they are compatible even though they are not exactly the same semantically.
nfkc - Compatibility decomposition followed by canonical composition. This decomposes characters into compatible forms, then recomposes characters using the canonical composition. That is, it breaks down characters into the compatible forms, and then recombines all combining marks it can with their base characters. For example, the single character "ǆ" (U+01C6) would be normalized to the two characters "dž" by first decomposing the character into "d" followed by "z" followed by the combining caron, and then recomposing it to a "d" followed by the "z" with caron.
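
The four forms can be compared side by side with a short script. The sketch below uses the built-in String.prototype.normalize as a stand-in so that it runs without ilib; the built-in takes upper-case form names, whereas this class takes lower-case ones. It is an assumption that NormString produces the same results once the relevant data is loaded. The helper codePoints is hypothetical.

```javascript
// Stand-in demo using the JavaScript built-in, not NormString itself.
// Prints each result as a list of code points so invisible combining
// marks are easy to see.
function codePoints(s) {
  return Array.from(s).map(function (ch) {
    return "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
  }).join(" ");
}

var nfd  = "\u00FC".normalize("NFD");               // u with diaeresis
var nfc  = "u\u0308".normalize("NFC");              // u + combining diaeresis
var nfkd = "\uFB01".normalize("NFKD");              // "fi" ligature
var nfkc = "\uFB01\u0065\u0301".normalize("NFKC");  // ligature + e + combining acute

console.log(codePoints(nfd));   // U+0075 U+0308
console.log(codePoints(nfc));   // U+00FC
console.log(codePoints(nfkd));  // U+0066 U+0069
console.log(codePoints(nfkc));  // U+0066 U+0069 U+00E9
```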

Operation

Two strings a and b can be said to be canonically equivalent if normalize(a) = normalize(b) under the nfc normalization form. Two strings can be said to be compatible if normalize(a) = normalize(b) under the nfkc normalization form.
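
These two definitions translate directly into comparison helpers. The sketch below uses the built-in String.prototype.normalize so that it is self-contained; with this class, the comparison would presumably be made on the toString() values of the normalized instances. The function names are hypothetical.

```javascript
// Canonical equivalence: equal after NFC normalization.
function canonicallyEquivalent(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

// Compatibility: equal after NFKC normalization.
function compatible(a, b) {
  return a.normalize("NFKC") === b.normalize("NFKC");
}

// Precomposed u-diaeresis versus u + combining diaeresis.
console.log(canonicallyEquivalent("\u00FC", "u\u0308")); // true
// The "fi" ligature is only compatible with "fi", not canonically equivalent.
console.log(canonicallyEquivalent("\uFB01", "fi"));      // false
console.log(compatible("\uFB01", "fi"));                 // true
```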

The canonical normalization is often used to see if strings are equivalent to each other, and thus is useful when implementing parsing algorithms or exact matching algorithms. It can also be used to ensure that any string output produces a predictable sequence of characters.

Compatibility normalization does not always preserve the semantic meaning of all the characters, although this is sometimes the behaviour that you are after. It is useful, for example, when doing searches of user input against text in documents where the matches are supposed to be "fuzzy". In this case, both the query string and the document string would be mapped to their compatibility normalized forms, and then compared.
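
A fuzzy search along these lines might look like the following sketch. It uses the built-in String.prototype.normalize rather than this class so that it runs on its own, and the function name fuzzyIndexOf is hypothetical.

```javascript
// Map both the document and the query to NFKC, then search.
// The returned index refers to the normalized document text, which may
// differ in length from the original.
function fuzzyIndexOf(documentText, query) {
  var doc = documentText.normalize("NFKC");
  var q = query.normalize("NFKC");
  return doc.indexOf(q);
}

// The document contains the "fi" ligature (U+FB01); the query does not.
console.log(fuzzyIndexOf("head of\uFB01ce", "office")); // 5
```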

Compatibility normalization also does not guarantee round-trip conversion to and from legacy character sets as the normalization is "lossy". It is akin to doing a lower- or upper-case conversion on text -- after casing, you cannot tell what case each character is in the original string. It is good for matching and searching, but it is rarely good for output because some distinctions or meanings in the original text have been lost.

Note that W3C normalization for HTML also escapes and unescapes HTML character entities such as "&uuml;" for u with diaeresis. This method does not do such escaping or unescaping. If normalization is required for HTML strings with entities, unescaping should be performed on the string prior to calling this method.
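
As a sketch of the unescape-then-normalize order described here, the code below handles numeric entities and a tiny, hypothetical table of named entities. It is not a complete HTML entity decoder, and it uses the built-in String.prototype.normalize as a stand-in for this method.

```javascript
// Hypothetical, deliberately tiny table; real HTML defines many more names.
var NAMED_ENTITIES = { uuml: "\u00FC", eacute: "\u00E9", amp: "&" };

function unescapeEntities(html) {
  return html.replace(/&(#x[0-9a-fA-F]+|#[0-9]+|[A-Za-z]+);/g, function (match, body) {
    if (body.charAt(0) === "#") {
      var isHex = body.charAt(1) === "x";
      var code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched.
      return code <= 0x10FFFF ? String.fromCodePoint(code) : match;
    }
    // Unknown names are passed through unchanged.
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
}

// Unescape first, then normalize the resulting text.
function normalizeHtmlText(html, form) {
  return unescapeEntities(html).normalize(form);
}
```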

Data

Normalization requires a fair amount of mapping data, much of which you may not need for the characters expected in your texts. It is possible to assemble a copy of ilib that saves space by only including normalization data for those scripts that you expect to encounter in your data.

The normalization data is organized by normalization form and within there by script. To include the normalization data for a particular script with a particular normalization form, make the following call:


NormString.init({
  form: "<form>",
  script: "<script>"
});

Where <form> is the normalization form ("nfd", "nfc", "nfkd", or "nfkc"), and <script> is the ISO 15924 code for the script you would like to support. Example: to load in the NFC data for Cyrillic, you would use:


NormString.init({
  form: "nfc",
  script: "Cyrl"
});

Note that because certain normalization forms include others in their algorithm, their data also depends on the data for the other forms. For example, if you include the "nfc" data for a script, you will automatically get the "nfd" data for that same script as well because the NFC algorithm does NFD normalization first. Here are the dependencies:

NFD -> no dependencies
NFC -> NFD
NFKD -> NFD
NFKC -> NFKD, NFD, NFC
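
The dependency list above can be expressed as a small lookup that reports every form whose data is pulled in by a request. This is only an illustration of the table, not how ilib resolves dependencies internally; the names DEPENDS_ON and requiredForms are hypothetical.

```javascript
// Direct dependencies, copied from the list above.
var DEPENDS_ON = {
  nfd: [],
  nfc: ["nfd"],
  nfkd: ["nfd"],
  nfkc: ["nfkd", "nfd", "nfc"]
};

// Return the requested form plus everything it depends on, sorted by name.
function requiredForms(form) {
  if (!Object.prototype.hasOwnProperty.call(DEPENDS_ON, form)) {
    throw new Error("Unknown normalization form: " + form);
  }
  var seen = [];
  (function visit(f) {
    if (seen.indexOf(f) !== -1) return;  // already collected
    seen.push(f);
    DEPENDS_ON[f].forEach(function (dep) { visit(dep); });
  })(form);
  return seen.sort();
}

console.log(requiredForms("nfc"));   // [ 'nfc', 'nfd' ]
console.log(requiredForms("nfkc"));  // [ 'nfc', 'nfd', 'nfkc', 'nfkd' ]
```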

A special value for the script dependency is "all", which will cause the data for all scripts to be loaded for that normalization form. This would be useful if you know that you are going to normalize a lot of multilingual text or cannot predict which scripts will appear in the input. Because the NFKC form depends on all others, you can get all of the data for all forms automatically by depending on "nfkc/all.js". Note that the normalization data for practically all scripts automatically depends on data for the Common script (code "Zyyy"), which contains all of the characters that are commonly used in many different scripts. Examples of characters in the Common script are the ASCII punctuation characters, or the ASCII Arabic numerals "0" through "9".

By default, none of the data for normalization is automatically included in the preassembled ilib files. (For size "full".) If you would like to normalize strings, you must assemble your own copy of ilib and explicitly include the normalization data for those scripts. This normalization method will produce output even without the normalization data. However, the output will simply be the same as the input for all scripts except Korean Hangul and Jamo, which are decomposed and recomposed algorithmically and therefore do not rely on data.
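
To see why Hangul needs no data, the sketch below decomposes a precomposed syllable using the arithmetic from the Unicode Standard. It is an independent illustration, not ilib's implementation; the function name decomposeHangulSyllable is hypothetical.

```javascript
// Constants for Hangul syllable arithmetic from the Unicode Standard.
var S_BASE = 0xAC00, L_BASE = 0x1100, V_BASE = 0x1161, T_BASE = 0x11A7;
var L_COUNT = 19, V_COUNT = 21, T_COUNT = 28;
var N_COUNT = V_COUNT * T_COUNT;   // 588
var S_COUNT = L_COUNT * N_COUNT;   // 11172

// Decompose one precomposed syllable into its leading consonant, vowel,
// and optional trailing consonant. Other characters pass through unchanged.
function decomposeHangulSyllable(ch) {
  var sIndex = ch.charCodeAt(0) - S_BASE;
  if (sIndex < 0 || sIndex >= S_COUNT) {
    return ch;
  }
  var l = L_BASE + Math.floor(sIndex / N_COUNT);
  var v = V_BASE + Math.floor((sIndex % N_COUNT) / T_COUNT);
  var t = T_BASE + (sIndex % T_COUNT);
  var out = String.fromCharCode(l, v);
  return t === T_BASE ? out : out + String.fromCharCode(t);
}

// U+D55C decomposes to U+1112 U+1161 U+11AB.
console.log(decomposeHangulSyllable("\uD55C") === "\u1112\u1161\u11AB"); // true
```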

If characters are encountered for which there are no normalization data, they will be passed through to the output string unmodified.

Parameters:
{string} form: The normalization form requested

Returns: {IString} a new instance of an IString that has been normalized according to the requested form. The current instance is not modified.
