Motivation

Most software applications need, at one point or another, to deal with text in a natural language, whether to analyze it or to perform certain operations on it.

Solutions to represent and deal with text in languages based on the Latin alphabet appeared very early in the history of software. However, not only did they fail to fulfill all linguistic needs, they also worked for only a subset of the languages they were supposed to handle. Solutions for text in other languages emerged over the following years, but each was still restricted to a specific language.

Since the late 1980s, efforts have been made to create a universal solution to represent and deal with text in any language, driven in particular by internationalization considerations.

The Unicode standard was thus born. It not only provides the means to encode text from virtually any natural language in digital form, but also categorizes characters, makes it possible to identify graphemes, words, sentences and lines, and defines how to perform case conversion and to sort text, in either a language-agnostic or a language-tailored manner.

This library aims to provide the mechanisms to deal with text in a natural language, in a language- and platform-agnostic way, building on the foundations of the Unicode standard.

In particular, a proper abstraction of what constitutes a character is considered important.
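
To illustrate why the notion of a character needs an abstraction, here is a minimal sketch in Python using only the standard library. It is not this library's API, and the variable names are chosen purely for illustration. It shows that a single letter a reader perceives as one character, "é", can be counted in several different ways depending on the level at which the text is examined.

```python
import unicodedata

# "é" written as a base letter followed by a combining acute accent (U+0301).
decomposed = "e\u0301"
# The same letter written as a single precomposed code point (U+00E9).
precomposed = "\u00e9"

# Python strings are sequences of code points, so len() counts code points.
code_points = len(decomposed)

# Number of UTF-8 code units (bytes): 1 for "e", 2 for U+0301.
utf8_units = len(decomposed.encode("utf-8"))

# Number of UTF-16 code units: each code unit is 2 bytes.
utf16_units = len(decomposed.encode("utf-16-le")) // 2

# The two spellings differ as code point sequences...
differ_raw = decomposed != precomposed

# ...but become identical after canonical composition (NFC).
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

print(code_points, utf8_units, utf16_units, differ_raw, same_after_nfc)
```

None of these counts matches the single grapheme a reader sees for the decomposed form, which is why treating code units or code points as "characters" tends to give surprising results.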