Class

NormString

NormString(str, [options])

Create a new normalized string instance. This string inherits from the GlyphString class, and adds the normalize method. It can be used anywhere that a normal Javascript string is used.

The options parameter is optional, and may contain any combination of the following properties:

  • onLoad - a callback function to call when the locale data are fully loaded. When the onLoad option is given, this object will attempt to load any missing locale data using the ilib loader callback. When the constructor is done (even if the data is already preassembled), the onLoad function is called with the current instance as a parameter, so this callback can be used with preassembled or dynamic loading or a mix of the two.
  • sync - tell whether to load any missing locale data synchronously or asynchronously. If this option is given as "false", then the "onLoad" callback must be given, as the instance returned from this constructor will not be usable for a while.
  • loadParams - an object containing parameters to pass to the loader callback function when locale data is missing. The parameters are not interpreted or modified in any way. They are simply passed along. The object may contain any property/value pairs as long as the calling code is in agreement with the loader callback function as to what those parameters mean.
Constructor

# new NormString(str, [options])

Parameters:
Name Type Attributes Description
str string | IString

initialize this instance with this string

options Object <optional>

options governing the way this instance works

View Source NormString.js, line 63

Extends

Methods

# charAt(index) → {IString}

Same as String.charAt()

Parameters:
Name Type Description
index number

the index of the character being sought

Inherited From:

View Source IString.js, line 983

the character at the given index

IString

# charCodeAt(index) → {number}

Same as String.charCodeAt(). This only reports on 2-byte UCS-2 Unicode values, and does not take into account supplementary characters encoded in UTF-16. If you would like to take account of those characters, use codePointAt() instead.

Parameters:
Name Type Description
index number

the index of the character being sought

Inherited From:

View Source IString.js, line 997

the character code of the character at the given index in the string

number

# codePointAt(index) → {number}

Return the code point at the given index when the string is viewed as an array of code points. If the index is beyond the end of the array of code points or if the index is negative, -1 is returned.

Parameters:
Name Type Description
index number

index of the code point

Inherited From:

View Source IString.js, line 1456

code point of the character at the given index into the string

number

# codePointLength() → {number}

Return the number of code points in this string. This may be different than the number of characters, as the UTF-16 encoding that Javascript uses for its basis returns surrogate pairs separately. Two 2-byte surrogate characters together make up one character/code point in the supplementary character planes. If your string contains no characters in the supplementary planes, this method will return the same thing as the length() method.

Inherited From:

View Source IString.js, line 1520

the number of code points in this string

number
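
As a rough illustration of the difference between UTF-16 code units and code points, the following sketch uses only built-in JavaScript strings rather than ilib. The helper name countCodePoints is hypothetical and is not part of the IString API.

```javascript
// Count code points by iterating the string. The built-in string iterator
// joins each surrogate pair into a single code point.
function countCodePoints(str) {
  let count = 0;
  for (const ch of str) {
    count++;
  }
  return count;
}

const plain = "abc";
const withSupplementary = "a\uD83D\uDE00b"; // "a", U+1F600 (one surrogate pair), "b"

// No supplementary characters: both counts agree.
console.log(plain.length, countCodePoints(plain));
// One supplementary character: the UTF-16 length is one larger.
console.log(withSupplementary.length, countCodePoints(withSupplementary));
```

When the string contains a character from the supplementary planes, the UTF-16 length exceeds the code point count by one for each such character.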

# concat(strings) → {IString}

Same as String.concat()

Parameters:
Name Type Description
strings string

strings to concatenate to the current one

Inherited From:

View Source IString.js, line 1006

a concatenation of the given strings

IString

# ellipsize(length) → {string}

Truncate the current string at the given number of glyphs and add an ellipsis to indicate that there is more to the string. The ellipsis forms the last character in the string, so the string is actually truncated at length-1 glyphs.

Parameters:
Name Type Description
length number

the number of whole glyphs to keep in the string including the ellipsis

Inherited From:

View Source GlyphString.js, line 434

a string truncated to the requested number of glyphs with an ellipsis

string
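
The length-1 behaviour described above can be sketched in plain JavaScript. This is a simplified stand-in and not the GlyphString implementation: it treats every code point as one glyph, which is not true for combining sequences, and it assumes that a string already short enough is returned unchanged. The name ellipsizeSketch is hypothetical.

```javascript
// Simplified stand-in: each code point counts as one glyph.
function ellipsizeSketch(str, length) {
  const units = Array.from(str); // splits by code point, not by UTF-16 unit
  if (units.length <= length) {
    return str; // assumption: nothing to cut, so no ellipsis is added
  }
  if (length < 1) {
    return "";
  }
  // Keep length-1 units so that the ellipsis becomes the last one.
  return units.slice(0, length - 1).join("") + "\u2026";
}

console.log(ellipsizeSketch("abcdefgh", 5));
```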

# endsWith() → {boolean}

Same as String.endsWith().

Inherited From:

View Source IString.js, line 1154

true if the given characters are found at the end of the string, and false otherwise

boolean

# forEach(callback)

Call the callback with each character in the string one at a time, taking care to step through the surrogate pairs in the UTF-16 encoding properly.

The standard Javascript String's charAt() method only returns a particular 16-bit character in the UTF-16 encoding scheme. If the index to charAt() is pointing to a low- or high-surrogate character, it will return the surrogate character rather than the character in the supplementary planes that the two surrogates together encode. This function will call the callback with the full character, making sure to join two surrogates into one character in the supplementary planes where necessary.

Parameters:
Name Type Description
callback function

a callback function to call with each full character in the current string

Inherited From:

View Source IString.js, line 1320

# forEachCodePoint(callback)

Call the callback with each numeric code point in the string one at a time, taking care to step through the surrogate pairs in the UTF-16 encoding properly.

The standard Javascript String's charCodeAt() method only returns information about a particular 16-bit character in the UTF-16 encoding scheme. If the index to charCodeAt() is pointing to a low- or high-surrogate character, it will return the code point of the surrogate character rather than the code point of the character in the supplementary planes that the two surrogates together encode. This function will call the callback with the full code point of each character, making sure to join two surrogates into one code point in the supplementary planes.

Parameters:
Name Type Description
callback function

a callback function to call with each code point in the current string

Inherited From:

View Source IString.js, line 1349
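
The surrogate joining described above can be sketched without ilib. The function below is an illustration of the arithmetic only; its name is hypothetical and it is not the IString implementation.

```javascript
// Walk the UTF-16 code units and join each high/low surrogate pair
// into one code point before calling the callback.
function forEachCodePointSketch(str, callback) {
  for (let i = 0; i < str.length; i++) {
    const high = str.charCodeAt(i);
    if (high >= 0xD800 && high <= 0xDBFF && i + 1 < str.length) {
      const low = str.charCodeAt(i + 1);
      if (low >= 0xDC00 && low <= 0xDFFF) {
        // Standard UTF-16 decoding of a surrogate pair.
        callback(((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000);
        i++; // skip the low surrogate that was just consumed
        continue;
      }
    }
    callback(high); // ordinary character or unpaired surrogate
  }
}

const seen = [];
forEachCodePointSketch("a\uD83D\uDE00", function (cp) {
  seen.push(cp.toString(16));
});
console.log(seen);
```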

# format(params)

Format this string instance as a message, replacing the parameters with the given values.

The string can contain any text that a regular Javascript string can contain. Replacement parameters have the syntax:

{name}

Where "name" can be any string surrounded by curly brackets. The value of "name" is taken from the parameters argument.

Example:

var str = new IString("There are {num} objects.");
console.log(str.format({
  num: 12
}));

Would give the output:

There are 12 objects.

If a property is missing from the parameter block, the replacement parameter substring is left untouched in the string, and a different set of parameters may be applied a second time. This way, different parts of the code may format different parts of the message that they happen to know about.

Example:

var str = new IString("There are {num} objects in the {container}.");
console.log(str.format({
  num: 12
}));

Would give the output:

There are 12 objects in the {container}.

The result can then be formatted again with a different parameter block that specifies a value for the container property.
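
The two-pass behaviour can be sketched with a plain function that leaves unknown parameters in place. This is an illustration under the assumption that parameter names contain no curly brackets; formatSketch is a hypothetical name and this is not the IString implementation.

```javascript
// Replace {name} with params[name] when present; otherwise keep "{name}"
// so that a later pass can fill it in.
function formatSketch(template, params) {
  return template.replace(/\{([^{}]+)\}/g, function (match, name) {
    return Object.prototype.hasOwnProperty.call(params, name)
      ? String(params[name])
      : match;
  });
}

const firstPass = formatSketch("There are {num} objects in the {container}.", {
  num: 12
});
const secondPass = formatSketch(firstPass, {
  container: "box"
});

console.log(firstPass);
console.log(secondPass);
```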

Parameters:
Name Type Description
params

a Javascript object containing values for the replacement parameters in the current string

Inherited From:

View Source IString.js, line 609

a new IString instance with as many replacement parameters filled out as possible with real values.

# formatChoice(argIndex, params) → {string}

Format a string as one of a choice of strings dependent on the value of a particular argument index or array of indices.

The syntax of the choice string is as follows. The string contains a series of choices separated by a vertical bar character "|". Each choice has a value or range of values to match followed by a hash character "#" followed by the string to use if the variable matches the criteria.

Example string:

var num = 2;
var str = new IString("0#There are no objects.|1#There is one object.|2#There are {number} objects.");
console.log(str.formatChoice(num, {
  number: num
}));

Gives the output:

"There are 2 objects."

The strings to format may contain replacement variables that will be formatted using the format() method above and the params argument as a source of values to use while formatting those variables.

If the criterion for a particular choice is empty, that choice will be used as the default one for use when none of the other choice's criteria match.

Example string:

var num = 22;
var str = new IString("0#There are no objects.|1#There is one object.|#There are {number} objects.");
console.log(str.formatChoice(num, {
  number: num
}));

Gives the output:

"There are 22 objects."

If multiple choice patterns can match a given argument index, the first one encountered in the string will be used. If no choice patterns match the argument index, then the default choice will be used. If there is no default choice defined, then this method will return an empty string.

Special Syntax

For any choice format string, all of the patterns in the string should be of a single type: numeric, boolean, or string/regexp. The type of the patterns is determined by the type of the argument index parameter.

If the argument index is numeric, then some special syntax can be used in the patterns to match numeric ranges.

  • >x - match any number that is greater than x
  • >=x - match any number that is greater than or equal to x
  • <x - match any number that is less than x
  • <=x - match any number that is less than or equal to x
  • start-end - match any number in the range [start,end)
  • zero - match any number in the class "zero". (See below for a description of number classes.)
  • one - match any number in the class "one"
  • two - match any number in the class "two"
  • few - match any number in the class "few"
  • many - match any number in the class "many"
  • other - match any number in the other or default class

A number class defines a set of numbers that receive a particular syntax in the strings. For example, in Slovenian, integers ending in the digit "1" are in the "one" class, including 1, 21, 31, ... 101, 111, etc. Similarly, integers ending in the digit "2" are in the "two" class. Integers ending in the digits "3" or "4" are in the "few" class, and every other integer is handled by the default string.

The definition of what numbers are included in a class is locale-dependent. They are defined in the data file plurals.json. If your string is in a different locale than the default for ilib, you should call the setLocale() method of the string instance before calling this method.
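
The numeric matching rules above can be sketched as follows. This is a simplified illustration, not the IString implementation: it handles exact values, the comparison operators, the half-open start-end range, and the empty default criterion, but it deliberately ignores the locale-dependent number classes. The function names are hypothetical.

```javascript
// Decide whether a single numeric pattern matches n.
function matchesNumericPattern(pattern, n) {
  let m;
  if ((m = /^>=(-?\d+(?:\.\d+)?)$/.exec(pattern))) return n >= Number(m[1]);
  if ((m = /^<=(-?\d+(?:\.\d+)?)$/.exec(pattern))) return n <= Number(m[1]);
  if ((m = /^>(-?\d+(?:\.\d+)?)$/.exec(pattern))) return n > Number(m[1]);
  if ((m = /^<(-?\d+(?:\.\d+)?)$/.exec(pattern))) return n < Number(m[1]);
  if ((m = /^(\d+)-(\d+)$/.exec(pattern))) {
    return n >= Number(m[1]) && n < Number(m[2]); // range [start,end)
  }
  return Number(pattern) === n; // exact value; class names never match here
}

// Pick the first matching choice, falling back to the default (empty
// criterion) choice, and to an empty string when there is no default.
function chooseSketch(choiceString, n) {
  let defaultChoice = "";
  for (const choice of choiceString.split("|")) {
    const hash = choice.indexOf("#");
    if (hash < 0) continue;
    const pattern = choice.slice(0, hash);
    const text = choice.slice(hash + 1);
    if (pattern === "") {
      defaultChoice = text;
    } else if (matchesNumericPattern(pattern, n)) {
      return text;
    }
  }
  return defaultChoice;
}

const choices = "0#No items.|1#One item.|2-5#A few items.|>=5#Many items.|#Unknown.";
console.log(chooseSketch(choices, 3));
console.log(chooseSketch(choices, 5));
console.log(chooseSketch(choices, -1));
```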

Other Pattern Types

If the argument index is a boolean, the string values "true" and "false" may appear as the choice patterns.

If the argument index is of type string, then the choice patterns may contain regular expressions, or static strings as degenerate regexps.

Multiple Indexes

If you have 2 or more indexes to format into a string, you can pass them as an array. When you do that, the patterns to match should be a comma-separated list of patterns as per the rules above.

Example string:

var str = new IString("zero,zero#There are no objects on zero pages.|one,one#There is 1 object on 1 page.|other,one#There are {number} objects on 1 page.|#There are {number} objects on {pages} pages.");
var num = 4, pages = 1;
console.log(str.formatChoice([num, pages], {
  number: num,
  pages: pages
}));

Gives the output:

"There are 4 objects on 1 page."

Note that when there is a single index, you would typically leave the pattern blank to indicate the default choice. When there are multiple indices, sometimes one of the patterns has to be the default case when the other is not. Rather than leaving one or more of the patterns blank with commas that look out-of-place in the middle of it, you can use the word "other" to indicate a match with the default or other choice. The above example shows the use of the "other" pattern. That said, you are allowed to leave the pattern blank if you so choose. In the example above, the pattern for the third string could easily have been written as ",one" instead of "other,one" and the result will be the same.

Parameters:
Name Type Description
argIndex * | Array.<*>

The index into the choice array of the current parameter, or an array of indices

params Object

The hash of parameter values that replace the replacement variables in the string

useIntlPlural boolean <optional>

true if you are willing to use the Intl.PluralRules object. If it is omitted, the default value is true.

Inherited From:

View Source IString.js, line 850

"syntax error in choice format pattern: " if there is a syntax error

the formatted string

string

# getLocale() → {string}

Return the locale to use when processing choice formats. The locale affects how number classes are interpreted. In some cultures, the class "few" maps to "any integer that ends in the digits 2 to 9" and in yet others, "few" maps to "any integer that ends in the digits 3 or 4".

Inherited From:

View Source IString.js, line 1506

localespec to use when processing choice formats with this string

string

# includes() → {boolean}

Same as String.includes().

Inherited From:

View Source IString.js, line 1178

true if the search string is found anywhere within the current string, and false otherwise

boolean

# indexOf(searchValue, start) → {number}

Same as String.indexOf()

Parameters:
Name Type Description
searchValue string

string to search for

start number

index into the string to start searching, or undefined to search the entire string

Inherited From:

View Source IString.js, line 1018

index into the string of the string being sought, or -1 if the string is not found

number

# iterator() → {Object}

Return an iterator that will step through all of the characters in the string one at a time and return their code points, taking care to step through the surrogate pairs in UTF-16 encoding properly.

The standard Javascript String's charCodeAt() method only returns information about a particular 16-bit character in the UTF-16 encoding scheme. If the index is pointing to a low- or high-surrogate character, it will return a code point of the surrogate character rather than the code point of the character in the supplementary planes that the two surrogates together encode.

The iterator instance returned has two methods, hasNext() which returns true if the iterator has more code points to iterate through, and next() which returns the next code point as a number.
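
An iterator with this hasNext()/next() shape can be sketched in plain JavaScript. The sketch below is an illustration only; the factory name is hypothetical and it relies on the built-in String.prototype.codePointAt() rather than on ilib.

```javascript
// Return an object with hasNext() and next() that yields code points,
// advancing two UTF-16 units whenever a surrogate pair was read.
function codePointIteratorSketch(str) {
  let index = 0;
  return {
    hasNext: function () {
      return index < str.length;
    },
    next: function () {
      const cp = str.codePointAt(index);
      index += cp > 0xFFFF ? 2 : 1;
      return cp;
    }
  };
}

const it = codePointIteratorSketch("a\uD83D\uDE00b");
const collected = [];
while (it.hasNext()) {
  collected.push(it.next());
}
console.log(collected);
```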

Inherited From:

View Source IString.js, line 1380

an iterator that iterates through all the code points in the string

Object

# lastIndexOf(searchValue, start) → {number}

Same as String.lastIndexOf()

Parameters:
Name Type Description
searchValue string

string to search for

start number

index into the string to start searching, or undefined to search the entire string

Inherited From:

View Source IString.js, line 1030

index into the string of the string being sought, or -1 if the string is not found

number

# match(regexp) → {Array.<string>}

Same as String.match()

Parameters:
Name Type Description
regexp string

the regular expression to match

Inherited From:

View Source IString.js, line 1039

an array of matches

Array.<string>

# matchAll(regexp) → {iterator}

Same as String.matchAll()

Parameters:
Name Type Description
regexp string

the regular expression to match

Inherited From:

View Source IString.js, line 1048

an iterator of the matches

iterator

# normalize(form) → {IString}

Perform the Unicode Normalization Algorithm upon the string and return the resulting new string. The current string is not modified.

Forms

The forms of possible normalizations are defined by the Unicode Standard Annex (UAX) 15. The form parameter is a string that may have one of the following values:

  • nfd - Canonical decomposition. This decomposes characters into their exactly equivalent forms. For example, "ü" would decompose into a "u" followed by the combining diaeresis character.
  • nfc - Canonical decomposition followed by canonical composition. This decomposes and then recomposes characters into their shortest exactly equivalent forms by recomposing as many combining characters as possible. For example, "ü" followed by a combining macron character would decompose into a "u" followed by the combining diaeresis character and the combining macron character, and then be recomposed into the u with diaeresis and macron "ǖ" character. The reason that the "nfc" form decomposes and then recomposes is that combining characters have a specific order under the Unicode Normalization Algorithm, and partly composed characters such as the "ü" followed by combining marks may change the order of the combining marks when decomposed and recomposed.
  • nfkd - Compatibility decomposition. This decomposes characters into compatible forms that may not be exactly equivalent semantically, and also performs canonical decomposition. For example, the "ﬁ" ligature character decomposes to the two characters "fi" because they are compatible even though they are not exactly the same semantically.
  • nfkc - Compatibility decomposition followed by canonical composition. This decomposes characters into compatible forms, then recomposes characters using the canonical composition. That is, it breaks down characters into the compatible forms, and then recombines all combining marks it can with their base characters. For example, the character "ǆ" would be normalized to "dž" by first decomposing the character into "d" followed by "z" followed by the combining caron, and then recomposed to a "d" followed by the "z" with caron.
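
The four forms can be observed with the JavaScript engine's built-in String.prototype.normalize(), used here as a stand-in so that the sketch runs without ilib or its data. Note that the built-in method takes upper-case form names ("NFD"), whereas the form parameter described here is written in lower case.

```javascript
// Show the code points of a string as hexadecimal numbers.
function toHex(str) {
  return Array.from(str, function (ch) {
    return ch.codePointAt(0).toString(16);
  });
}

const precomposed = "\u00FC"; // "ü" as one precomposed character
const ligature = "\uFB01";    // the "ﬁ" ligature

const nfd = precomposed.normalize("NFD");   // "u" + combining diaeresis
const nfc = nfd.normalize("NFC");           // back to the precomposed form
const nfkd = ligature.normalize("NFKD");    // compatibility mapping to "fi"
const nfkc = ligature.normalize("NFKC");    // also "fi": nothing to recompose
const canonicalOnly = ligature.normalize("NFD"); // unchanged by canonical forms

console.log(toHex(nfd), toHex(nfc), nfkd, nfkc, toHex(canonicalOnly));
```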

Operation

Two strings a and b can be said to be canonically equivalent if normalize(a) = normalize(b) under the nfc normalization form. Two strings can be said to be compatible if normalize(a) = normalize(b) under the nfkc normalization form.
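
These two definitions translate directly into comparisons. The sketch below again uses the built-in String.prototype.normalize() as a stand-in for this class, and the function names are hypothetical.

```javascript
// Canonical equivalence: equal after NFC normalization.
function canonicallyEquivalent(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

// Compatibility: equal after NFKC normalization.
function compatible(a, b) {
  return a.normalize("NFKC") === b.normalize("NFKC");
}

// Precomposed "ü" versus "u" followed by a combining diaeresis.
console.log(canonicallyEquivalent("\u00FC", "u\u0308"));
// The "ﬁ" ligature versus the two letters "fi".
console.log(canonicallyEquivalent("\uFB01", "fi"), compatible("\uFB01", "fi"));
```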

The canonical normalization is often used to see if strings are equivalent to each other, and thus is useful when implementing parsing algorithms or exact matching algorithms. It can also be used to ensure that any string output produces a predictable sequence of characters.

Compatibility normalization does not always preserve the semantic meaning of all the characters, although this is sometimes the behaviour that you are after. It is useful, for example, when doing searches of user-input against text in documents where the matches are supposed to be "fuzzy". In this case, both the query string and the document string would be mapped to their compatibility normalized forms, and then compared.
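
A fuzzy search of this kind might look like the following sketch, which maps both the query and the document text to NFKC before comparing. It uses the built-in String.prototype.normalize() as a stand-in, and fuzzyIncludes is a hypothetical helper name.

```javascript
// Compare query and document text in their compatibility-normalized forms.
function fuzzyIncludes(documentText, query) {
  return documentText.normalize("NFKC").includes(query.normalize("NFKC"));
}

const documentText = "The \uFB01nal report"; // contains the "ﬁ" ligature
console.log(documentText.includes("final"));      // plain search misses it
console.log(fuzzyIncludes(documentText, "final")); // normalized search finds it
```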

Compatibility normalization also does not guarantee round-trip conversion to and from legacy character sets as the normalization is "lossy". It is akin to doing a lower- or upper-case conversion on text -- after casing, you cannot tell what case each character is in the original string. It is good for matching and searching, but it is rarely good for output because some distinctions or meanings in the original text have been lost.

Note that W3C normalization for HTML also escapes and unescapes HTML character entities such as "&uuml;" for u with diaeresis. This method does not do such escaping or unescaping. If normalization is required for HTML strings with entities, unescaping should be performed on the string prior to calling this method.
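
The unescape-then-normalize order can be sketched as below. The entity table is a deliberately tiny, hypothetical subset, the helper name is made up for illustration, and the built-in String.prototype.normalize() stands in for this method; a real application would use a complete HTML entity decoder.

```javascript
// Hypothetical, incomplete entity table for illustration only.
const ENTITIES = {
  "&uuml;": "\u00FC",
  "&eacute;": "\u00E9",
  "&amp;": "&"
};

// Replace known named entities; leave unknown ones untouched.
function unescapeEntitiesSketch(html) {
  return html.replace(/&[a-zA-Z]+;/g, function (entity) {
    return Object.prototype.hasOwnProperty.call(ENTITIES, entity)
      ? ENTITIES[entity]
      : entity;
  });
}

const raw = "M&uuml;nchen";
// Unescape first, then normalize the resulting text.
const normalized = unescapeEntitiesSketch(raw).normalize("NFD");
console.log(normalized, normalized.length);
```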

Data

Normalization requires a fair amount of mapping data, much of which you may not need for the characters expected in your texts. It is possible to assemble a copy of ilib that saves space by only including normalization data for those scripts that you expect to encounter in your data.

The normalization data is organized by normalization form and within there by script. To include the normalization data for a particular script with a particular normalization form, use the following call:


NormString.init({
  form: "<form>",
  script: "<script>"
});

Where <form> is the normalization form ("nfd", "nfc", "nfkd", or "nfkc"), and <script> is the ISO 15924 code for the script you would like to support. Example: to load in the NFC data for Cyrillic, you would use:


NormString.init({
  form: "nfc",
  script: "Cyrl"
});

Note that because certain normalization forms include others in their algorithm, their data also depends on the data for the other forms. For example, if you include the "nfc" data for a script, you will automatically get the "nfd" data for that same script as well because the NFC algorithm does NFD normalization first. Here are the dependencies:

  • NFD -> no dependencies
  • NFC -> NFD
  • NFKD -> NFD
  • NFKC -> NFKD, NFD, NFC
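
The dependency list above can be expressed as a small table from which the full set of forms to load is derived. This is only an illustration of the relationships listed here; the table and function names are hypothetical, and ilib resolves these dependencies itself.

```javascript
// Dependencies between normalization forms, as listed above.
const FORM_DEPENDENCIES = {
  nfd: [],
  nfc: ["nfd"],
  nfkd: ["nfd"],
  nfkc: ["nfkd", "nfd", "nfc"]
};

// Collect the requested form plus everything it depends on.
function formsToLoad(form) {
  const result = new Set();
  function visit(name) {
    if (result.has(name)) return;
    result.add(name);
    FORM_DEPENDENCIES[name].forEach(visit);
  }
  visit(form);
  return Array.from(result).sort();
}

console.log(formsToLoad("nfc"));
console.log(formsToLoad("nfkc"));
```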

A special value for the script dependency is "all" which will cause the data for all scripts to be loaded for that normalization form. This would be useful if you know that you are going to normalize a lot of multilingual text or cannot predict which scripts will appear in the input. Because the NFKC form depends on all others, you can get all of the data for all forms automatically by depending on "nfkc/all.js". Note that the normalization data for practically all scripts automatically depend on data for the Common script (code "Zyyy") which contains all of the characters that are commonly used in many different scripts. Examples of characters in the Common script are the ASCII punctuation characters, or the ASCII Arabic numerals "0" through "9".

By default, none of the data for normalization is automatically included in the preassembled ilib files. (For size "full".) If you would like to normalize strings, you must assemble your own copy of ilib and explicitly include the normalization data for those scripts. This normalization method will produce output, even without the normalization data. However, the output will be simply the same thing as its input for all scripts except Korean Hangul and Jamo, which are decomposed and recomposed algorithmically and therefore do not rely on data.
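
Hangul syllables need no data because their decomposition is pure arithmetic on the code point. The sketch below follows the constants given in the Unicode Standard for precomposed Hangul syllables; it illustrates the technique and is not the code used in NormString.js. The function name is hypothetical.

```javascript
// Constants from the Unicode Standard's Hangul syllable decomposition.
const S_BASE = 0xAC00; // first precomposed syllable
const L_BASE = 0x1100; // first leading consonant jamo
const V_BASE = 0x1161; // first vowel jamo
const T_BASE = 0x11A7; // one before the first trailing consonant jamo
const V_COUNT = 21;
const T_COUNT = 28;
const N_COUNT = V_COUNT * T_COUNT; // 588
const S_COUNT = 19 * N_COUNT;      // 11172 syllables

// Decompose one code point into jamo, or return it unchanged.
function decomposeHangulSyllable(cp) {
  const sIndex = cp - S_BASE;
  if (sIndex < 0 || sIndex >= S_COUNT) {
    return [cp]; // not a precomposed Hangul syllable
  }
  const l = L_BASE + Math.floor(sIndex / N_COUNT);
  const v = V_BASE + Math.floor((sIndex % N_COUNT) / T_COUNT);
  const t = T_BASE + (sIndex % T_COUNT);
  return t === T_BASE ? [l, v] : [l, v, t]; // no trailing consonant when t is T_BASE
}

const jamo = decomposeHangulSyllable(0xD55C); // the syllable "한"
console.log(jamo.map(function (cp) { return cp.toString(16); }));
```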

If characters are encountered for which there are no normalization data, they will be passed through to the output string unmodified.

Parameters:
Name Type Description
form string

The normalization form requested

Overrides:

View Source NormString.js, line 371

a new instance of an IString that has been normalized according to the requested form. The current instance is not modified.

IString

# padEnd() → {string}

Same as String.padEnd().

Inherited From:

View Source IString.js, line 1198

a string of the specified length with the pad string applied at the end of the current string

string

# padStart() → {string}

Same as String.padStart().

Inherited From:

View Source IString.js, line 1207

a string of the specified length with the pad string applied at the start of the current string

string

# repeat() → {string}

Same as String.repeat().

Inherited From:

View Source IString.js, line 1216

a new string containing the specified number of copies of the given string

string

# replace(searchValue, newValue) → {IString}

Same as String.replace()

Parameters:
Name Type Description
searchValue string

a regular expression to search for

newValue string

the string to replace the matches with

Inherited From:

View Source IString.js, line 1059

a new string with all the matches replaced with the new value

IString

# search(regexp) → {number}

Same as String.search()

Parameters:
Name Type Description
regexp string

the regular expression to search for

Inherited From:

View Source IString.js, line 1068

position of the match, or -1 for no match

number

# setLocale(locale, [sync], [loadParams], [onLoad])

Set the locale to use when processing choice formats. The locale affects how number classes are interpreted. In some cultures, the class "few" maps to "any integer that ends in the digits 2 to 9" and in yet others, "few" maps to "any integer that ends in the digits 3 or 4".

Parameters:
Name Type Attributes Description
locale Locale | string

locale to use when processing choice formats with this string

sync boolean <optional>

[optional] whether to load the locale data synchronously or not

loadParams Object <optional>

[optional] parameters to pass to the loader function

onLoad function <optional>

[optional] function to call when the loading is done

Inherited From:

View Source IString.js, line 1482

# slice(start, end) → {IString}

Same as String.slice()

Parameters:
Name Type Description
start number

first character to include in the string

end number

include all characters up to, but not including the end character

Inherited From:

View Source IString.js, line 1079

a slice of the current string

IString

# split(separator, limit) → {Array.<string>}

Same as String.split()

Parameters:
Name Type Description
separator string

regular expression to match to find separations between the parts of the text

limit number

maximum number of items in the final output array. Any items beyond that limit will be ignored.

Inherited From:

View Source IString.js, line 1092

the parts of the current string split by the separator

Array.<string>

# startsWith() → {boolean}

Same as String.startsWith().

Inherited From:

View Source IString.js, line 1169

true if the given characters are found at the beginning of the string, and false otherwise

boolean

# substr(start, length) → {IString}

Same as String.substr()

Parameters:
Name Type Description
start number

the index of the character that should begin the returned substring

length number

the number of characters to return after the start character.

Inherited From:

View Source IString.js, line 1104

the requested substring

IString

# substring(from, to) → {IString}

Same as String.substring()

Parameters:
Name Type Description
from number

the index of the character that should begin the returned substring

to number

the index where to stop the extraction. If omitted, extracts the rest of the string

Inherited From:

View Source IString.js, line 1124

the requested substring

IString

# toLocaleLowerCase() → {string}

Same as String.toLocaleLowerCase(). If the JS engine does not support this method, you can use the ilib CaseMapper class instead.

Inherited From:

View Source IString.js, line 1227

a new string representing the calling string converted to lower case, according to any locale-sensitive case mappings

string

# toLocaleUpperCase() → {string}

Same as String.toLocaleUpperCase(). If the JS engine does not support this method, you can use the ilib CaseMapper class instead.

Inherited From:

View Source IString.js, line 1238

a new string representing the calling string converted to upper case, according to any locale-sensitive case mappings

string

# toLowerCase() → {IString}

Same as String.toLowerCase(). Note that this method is not locale-sensitive.

Inherited From:

View Source IString.js, line 1134

a string with all of its characters lower-cased

IString

# toString() → {string}

Same as String.toString()

Inherited From:

View Source IString.js, line 966

this instance as a regular Javascript string

string

# toUpperCase() → {IString}

Same as String.toUpperCase(). Note that this method is not locale-sensitive. Use toLocaleUpperCase() instead to get locale-sensitive behaviour.

Inherited From:

View Source IString.js, line 1145

a string with all of its characters upper-cased

IString

# trim() → {string}

Same as String.trim().

Inherited From:

View Source IString.js, line 1247

a new string representing the calling string stripped of whitespace from both ends.

string

# trimEnd() → {string}

Same as String.trimEnd().

Inherited From:

View Source IString.js, line 1256

a new string representing the calling string stripped of whitespace from its (right) end.

string

# trimLeft() → {string}

Same as String.trimLeft().

Inherited From:

View Source IString.js, line 1283

A new string representing the calling string stripped of whitespace from its beginning (left end).

string

# trimRight() → {string}

Same as String.trimRight().

Inherited From:

View Source IString.js, line 1265

a new string representing the calling string stripped of whitespace from its (right) end.

string

# trimStart() → {string}

Same as String.trimStart().

Inherited From:

View Source IString.js, line 1274

A new string representing the calling string stripped of whitespace from its beginning (left end).

string

# truncate(length) → {string}

Truncate the current string at the given number of whole glyphs and return the resulting string.

Parameters:
Name Type Description
length number

the number of whole glyphs to keep in the string

Inherited From:

View Source GlyphString.js, line 399

a string truncated to the requested number of glyphs

string

# valueOf() → {string}

Same as String.valueOf()

Inherited From:

View Source IString.js, line 974

this instance as a regular Javascript string

string

# static init(options)

Initialize the normalized string routines statically. This is intended to be called in a dynamic-load version of ilib to load the data needed to normalize strings before any instances of NormString are created.

The options parameter may contain any of the following properties:

  • form - {string} the normalization form to load
  • script - {string} load the normalization for this script. If the script is given as "all" then the normalization data for all scripts is loaded at the same time
  • sync - {boolean} whether to load the files synchronously or not
  • loadParams - {Object} parameters to the loader function
  • onLoad - {function()} a function to call when the files are done being loaded
Parameters:
Name Type Description
options Object

an object containing properties that govern how to initialize the data

View Source NormString.js, line 94