
Class NormString


Extends GlyphString.
Create a new normalized string instance. This string inherits from the GlyphString class and adds the normalize method. It can be used anywhere that a normal JavaScript string is used.


Defined in: ilib-full-dyn.js.

Class Summary
Constructor Attributes Constructor Name and Description
 
NormString(str)
Method Summary
Method Attributes Method Name and Description
 
<static>  
NormString.init(options)
Initialize the normalized string routines statically.
 
normalize(form)
Perform the Unicode Normalization Algorithm upon the string and return the resulting new string.
Methods borrowed from class GlyphString:
ellipsize, truncate
Methods borrowed from class IString:
charAt, charCodeAt, codePointAt, codePointLength, concat, forEach, forEachCodePoint, format, formatChoice, getLocale, indexOf, iterator, lastIndexOf, match, replace, search, setLocale, slice, split, substr, substring, toLowerCase, toString, toUpperCase, valueOf
Class Detail
NormString(str)
Parameters:
{string|IString=} str
initialize this instance with this string
Method Detail
{Object} charIterator()
Returns:
{Object} an iterator that iterates through all the characters in the string

<static> NormString.init(options)
Initialize the normalized string routines statically. This is intended to be called in a dynamic-load version of ilib to load the data needed to normalize strings before any instances of NormString are created.

The options parameter may contain any of the following properties:

Parameters:
{Object} options
an object containing properties that govern how to initialize the data

{IString} normalize(form)
Perform the Unicode Normalization Algorithm upon the string and return the resulting new string. The current string is not modified.

Forms

The possible normalization forms are defined by Unicode Standard Annex #15 (UAX #15). The form parameter is a string that may have one of the following values: "nfd", "nfc", "nfkd", or "nfkc".

Operation

Two strings a and b can be said to be canonically equivalent if normalize(a) = normalize(b) under the nfc normalization form. Two strings can be said to be compatible if normalize(a) = normalize(b) under the nfkc normalization form.
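The two definitions above can be sketched in a few lines. This example is an illustration only: it uses the built-in JavaScript String.prototype.normalize as a stand-in for NormString so that it runs without ilib or its data, and the function names canonicallyEquivalent and compatible are hypothetical. Note that the built-in method takes upper-case form names, whereas this class documents lower-case ones.

```javascript
// Stand-in sketch: native String.prototype.normalize is used in place of
// NormString.normalize so the example runs without any ilib data loaded.

// True if a and b are canonically equivalent (equal under NFC).
function canonicallyEquivalent(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

// True if a and b are compatible (equal under NFKC).
function compatible(a, b) {
  return a.normalize("NFKC") === b.normalize("NFKC");
}

const precomposed = "\u00FC"; // u with diaeresis as a single code point
const decomposed = "u\u0308"; // "u" followed by a combining diaeresis
const ligature = "\uFB01";    // the "fi" ligature

const sameCanonically = canonicallyEquivalent(precomposed, decomposed); // true
const ligatureCanonical = canonicallyEquivalent(ligature, "fi");        // false
const ligatureCompatible = compatible(ligature, "fi");                  // true
```

The ligature case shows why the two notions differ: canonical normalization keeps the ligature as a distinct character, while compatibility normalization folds it into its plain letters.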

The canonical normalization is often used to see if strings are equivalent to each other, and thus is useful when implementing parsing algorithms or exact matching algorithms. It can also be used to ensure that any string output produces a predictable sequence of characters.

Compatibility normalization does not always preserve the semantic meaning of all the characters, although this is sometimes the behaviour that you are after. It is useful, for example, when searching for user input in document text where the matches are supposed to be "fuzzy". In this case, both the query string and the document string would be mapped to their compatibility normalized forms, and then compared.
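A minimal sketch of that fuzzy matching approach follows. It again assumes the built-in String.prototype.normalize as a stand-in for this class, and fuzzyIncludes is a hypothetical helper name, not part of ilib.

```javascript
// Stand-in sketch: both sides are mapped to NFKC before comparing, so that
// compatibility variants (ligatures, fullwidth letters and digits) match
// their plain counterparts.
function fuzzyIncludes(documentText, query) {
  const fold = (s) => s.normalize("NFKC");
  return fold(documentText).includes(fold(query));
}

const withLigature = "Open the \uFB01le menu";              // contains the "fi" ligature
const fullwidth = "\uFF21\uFF22\uFF23\uFF11\uFF12\uFF13";   // fullwidth "ABC123"

const exactHit = withLigature.includes("file");        // false: raw comparison misses
const fuzzyHit = fuzzyIncludes(withLigature, "file");  // true after folding
const widthHit = fuzzyIncludes(fullwidth, "ABC123");   // true after folding
```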

Compatibility normalization also does not guarantee round-trip conversion to and from legacy character sets, because the normalization is "lossy". It is akin to doing a lower- or upper-case conversion on text: after casing, you cannot tell what case each character was in the original string. It is good for matching and searching, but it is rarely good for output because some distinctions or meanings in the original text have been lost.
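The loss of information can be seen directly. In this small sketch (built-in String.prototype.normalize used as a stand-in for this class), two different inputs collapse to the same compatibility form, so the original cannot be recovered from the result, while a canonical round trip preserves the text.

```javascript
// Stand-in sketch using native normalization.
const superscript = "x\u00B2"; // "x" followed by SUPERSCRIPT TWO
const plain = "x2";

const foldedSuperscript = superscript.normalize("NFKC");
const foldedPlain = plain.normalize("NFKC");

// Distinct inputs, identical outputs: the superscript distinction is gone.
const distinctionLost =
  superscript !== plain && foldedSuperscript === foldedPlain;

// Canonical forms, by contrast, round-trip for this precomposed character.
const precomposedU = "\u00FC";
const roundTripped = precomposedU.normalize("NFD").normalize("NFC");
const canonicalPreserved = roundTripped === precomposedU;
```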

Note that W3C normalization for HTML also escapes and unescapes HTML character entities such as "&uuml;" for u with diaeresis. This method does not do such escaping or unescaping. If normalization is required for HTML strings with entities, unescaping should be performed on the string prior to calling this method.
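The unescape-then-normalize order can be sketched as below. Everything here is an assumption for illustration: unescapeEntities is a hypothetical helper, its named-entity table is deliberately tiny, and the built-in String.prototype.normalize stands in for this class. A real application would use a complete HTML entity decoder.

```javascript
// Hypothetical, deliberately incomplete table of named entities.
const ENTITIES = { uuml: "\u00FC", eacute: "\u00E9", amp: "&" };

// Replace named, decimal and hexadecimal character references with the
// characters they denote. Unknown names are left untouched.
function unescapeEntities(html) {
  const pattern = /&(#x[0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);/g;
  return html.replace(pattern, (match, body) => {
    if (body[0] === "#") {
      const code = body[1] === "x"
        ? parseInt(body.slice(2), 16)
        : parseInt(body.slice(1), 10);
      return code <= 0x10FFFF ? String.fromCodePoint(code) : match;
    }
    return Object.prototype.hasOwnProperty.call(ENTITIES, body)
      ? ENTITIES[body]
      : match;
  });
}

// Unescape first, then normalize the resulting plain text.
const decoded = unescapeEntities("M&uuml;nchen");
const normalizedText = decoded.normalize("NFD"); // "Mu" + combining diaeresis + "nchen"
```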

Data

Normalization requires a fair amount of mapping data, much of which you may not need for the characters expected in your texts. It is possible to assemble a copy of ilib that saves space by only including normalization data for those scripts that you expect to encounter in your data.

The normalization data is organized by normalization form and within there by script. To include the normalization data for a particular script with a particular normalization form, use the directive:


!depends <form>/<script>.js
Where <form> is the normalization form ("nfd", "nfc", "nfkd", or "nfkc"), and <script> is the ISO 15924 code for the script you would like to support. Example: to load in the NFC data for Cyrillic, you would use:

!depends nfc/Cyrl.js
Note that because certain normalization forms include others in their algorithm, their data also depends on the data for the other forms. For example, if you include the "nfc" data for a script, you will automatically get the "nfd" data for that same script as well because the NFC algorithm does NFD normalization first. Here are the dependencies:

A special value for the script dependency is "all", which will cause the data for all scripts to be loaded for that normalization form. This is useful if you know that you are going to normalize a lot of multilingual text or cannot predict which scripts will appear in the input. Because the NFKC form depends on all others, you can get all of the data for all forms automatically by depending on "nfkc/all.js". Note that the normalization data for practically all scripts automatically depends on the data for the Common script (code "Zyyy"), which contains the characters that are commonly used in many different scripts. Examples of characters in the Common script are the ASCII punctuation characters and the ASCII Arabic numerals "0" through "9".

By default, none of the data for normalization is included in the preassembled iliball.js file. If you would like to normalize strings, you must assemble your own copy of ilib and explicitly include the normalization data for those scripts as per the instructions above. This normalization method will produce output even without the normalization data. However, the output will simply be identical to the input for all scripts except Korean Hangul and Jamo, which are decomposed and recomposed algorithmically and therefore do not rely on data.
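To illustrate why Hangul needs no mapping data, here is a sketch of the arithmetic decomposition of a precomposed syllable as described in the Unicode Standard. It is not taken from the ilib source; the function name decomposeHangulSyllable and the constant names are assumptions for this example.

```javascript
// Constants from the Unicode Standard's Hangul syllable decomposition.
const S_BASE = 0xAC00; // first precomposed syllable
const L_BASE = 0x1100; // first leading consonant jamo
const V_BASE = 0x1161; // first vowel jamo
const T_BASE = 0x11A7; // one before the first trailing consonant jamo
const V_COUNT = 21;
const T_COUNT = 28;
const N_COUNT = V_COUNT * T_COUNT; // 588
const S_COUNT = 19 * N_COUNT;      // 11172 precomposed syllables

// Decompose one precomposed Hangul syllable into its jamo by arithmetic.
// Any other character is returned unchanged.
function decomposeHangulSyllable(syllable) {
  const sIndex = syllable.codePointAt(0) - S_BASE;
  if (sIndex < 0 || sIndex >= S_COUNT) {
    return syllable;
  }
  const lead = L_BASE + Math.floor(sIndex / N_COUNT);
  const vowel = V_BASE + Math.floor((sIndex % N_COUNT) / T_COUNT);
  const trail = T_BASE + (sIndex % T_COUNT);
  const jamo = [lead, vowel];
  if (trail !== T_BASE) {
    jamo.push(trail); // only syllables with a final consonant have a trail
  }
  return String.fromCodePoint(...jamo);
}

const han = decomposeHangulSyllable("\uD55C"); // "\u1112\u1161\u11AB"
const ga = decomposeHangulSyllable("\uAC00");  // "\u1100\u1161"
```

Recomposition runs the same arithmetic in reverse, which is why no per-character table is needed for this script.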

If characters are encountered for which there are no normalization data, they will be passed through to the output string unmodified.

Parameters:
{string} form
The normalization form requested
Returns:
{IString} a new instance of an IString that has been normalized according to the requested form. The current instance is not modified.
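A short usage sketch based only on the signatures documented on this page: the constructor NormString(str), normalize(form), and the toString method borrowed from IString. How NormString is obtained depends on how your copy of ilib is assembled or loaded, so the constructor is passed in as a parameter here; normalizeToString is a hypothetical helper, not part of ilib.

```javascript
// Hypothetical helper. NormStringCtor is assumed to be the NormString
// constructor exposed by your ilib build.
function normalizeToString(NormStringCtor, text, form) {
  const original = new NormStringCtor(text);
  // normalize returns a new IString; `original` is not modified.
  const normalized = original.normalize(form);
  return normalized.toString();
}

// Example call, assuming NormString is in scope and the NFC data
// for the relevant script has been included:
//   const result = normalizeToString(NormString, "u\u0308", "nfc");
```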

Documentation generated by JsDoc Toolkit 2.4.0 on Mon Oct 21 2019 22:58:32 GMT-0700 (PDT)