Why Unitex?
How the creation of Unitex has benefited the scientific community
Right from its creation in 2002, Unitex has provided the following features:
Unitex is an open-source system
The initial version of Unitex was written in 2002 to reproduce the functionality of Intex, which was not open source.
You can use Unitex while preserving your independence from its authors. If you need to recompile the system because of technical changes, you can do so from the source code, which is openly distributed (hence the term 'open source'). You are protected from the whims of a person or of an institution.
Software developers can improve and correct Unitex. Here are a few contributions made to Unitex through collaborative development since 2002. The PolyLex module has been adapted to Russian by Sebastian Nagel, a Russian-speaking developer. Thanks to Claude Devis, a developer working with Ancient Greek texts, you can produce concordances containing references to parts of a corpus. A user of a 64-bit computer made the Unitex code compatible with this type of machine. Agata Savary, a specialist in the inflection of multi-word units, contributed the Multiflex module. It would not be reasonable to ask a single developer to deal with all the aspects of such a complex system. Unitex can be used by other systems (Glossanet, EELO and SMELL use Unitex), and Unitex makes use of independently existing open-source systems (TRE and Xalign).
The methods implemented in an open-source system are transparent, i.e. open for anyone to examine. Anyone can discuss them and design improvements. There are no trade secrets about them.
The code of Unitex is distributed under the LGPL license. This allows users to include this code in commercial products, free of charge, and facilitates the use of Unitex to solve real-world problems.
Unitex works on Windows, Linux, MacOS and other systems
Unitex works on any operating system that supports Java and has a C compiler. In particular, it can be run on Linux, i.e. without paying for an operating-system license. The price of such a license can be an obstacle to the use of a system by universities, since it has to be multiplied by the number of students who use the system. In addition, Unitex can be run on Linux without using an emulator of another operating system. (When programs are implemented for Windows only, they can be run on Linux by using a Windows emulator, but this requires a Windows license and prevents interaction with other programs running directly on Linux.) For example, the Glossanet system runs Unitex with other programs directly on Linux.
The Unitex search function was innovative in terms of performance
The search function of Unitex was reused from the Aglaé system. The search function of Aglaé used an innovative representation of tokens and was significantly faster than comparable systems (Paumier, 2000, 2003). Later, the search function was further optimized by Alexis Némé. The search function of Unitex is very fast, even with large corpora and large lexicons, especially when run on Linux. Searching a 12-million-word corpus for expressions described in a 600-state, 3600-transition grammar typically takes about one minute, i.e. roughly 200,000 words per second.
Paumier, Sébastien. 2000. "Nouvelles méthodes pour la recherche d'expressions dans de grands corpus", in Anne Dister (ed.), Revue Informatique et Statistique dans les Sciences humaines 36, Actes des troisièmes journées INTEX, Liège, pp. 289-295.
Paumier, Sébastien. 2003. "A Time-Efficient Token Representation for Parsers", Proceedings of the EACL Workshop on Finite-State Methods in Natural Language Processing, Budapest, pp. 83-90.
Unitex adopted Unicode for character encoding
Unitex can process Chinese, Korean, and multilingual text with characters from several alphabets, e.g. the Latin and Greek alphabets, without having to encode one alphabet in terms of another. Special typographical characters, such as '—', pose no particular problem when they occur in text.
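To illustrate what Unicode makes possible here, the following Python sketch holds Latin, Greek, Chinese and Korean characters plus a typographical dash in a single string and labels each character. It is not taken from Unitex: the function names `script_label` and `scripts_in` are hypothetical, and deriving a label from the first word of the Unicode character name is only a rough heuristic chosen for this example.

```python
import unicodedata


def script_label(ch: str) -> str:
    """Return a rough label for one character (illustrative heuristic only)."""
    if unicodedata.category(ch).startswith("P"):
        return "PUNCTUATION"
    # The first word of the Unicode character name usually names the script,
    # e.g. "GREEK SMALL LETTER LAMDA" -> "GREEK".
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def scripts_in(text: str) -> set:
    """Collect the labels of all non-space characters in the text."""
    return {script_label(ch) for ch in text if not ch.isspace()}


# Latin, Greek, Chinese, Korean and a dash (U+2014) in one string,
# with no re-encoding of one alphabet into another.
sample = "word λογος 词 한 \u2014"
print(sorted(scripts_in(sample)))
```

Because every character has its own code point, no alphabet needs to be transliterated into another before processing.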
Unitex adapts its tokenization strategy to the language of the text
Tokenizing a text consists of segmenting it into elementary units, usually words, before further analysis. Languages written without word delimiters, such as Chinese, Japanese and Thai, require tokenization strategies different from those used for English.
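The difference between the two kinds of strategy can be sketched in Python as follows. This is a minimal illustration, not Unitex's actual tokenizer: the function `tokenize`, the language codes, and the choice of character-by-character segmentation for languages without word delimiters are all assumptions made for the example.

```python
import re
from typing import List

# Hypothetical set of language codes treated as written without word delimiters.
NO_DELIMITER_LANGUAGES = {"zh", "ja", "th"}


def tokenize(text: str, language: str) -> List[str]:
    """Segment text into elementary units using a language-dependent strategy."""
    if language in NO_DELIMITER_LANGUAGES:
        # No spaces separate words, so fall back to one token per character.
        return [ch for ch in text if not ch.isspace()]
    # Delimited languages: a run of letters or digits is one token,
    # and every other non-space character is a token of its own.
    return re.findall(r"\w+|[^\w\s]", text)


print(tokenize("Hello, world!", "en"))
print(tokenize("我爱你", "zh"))
# Applying the delimiter-based strategy to Chinese yields one oversized token.
print(tokenize("我爱你", "en"))
```

The last call shows why a single strategy is not enough: without spaces to rely on, the delimiter-based rule returns the whole sentence as one unit.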
The creation of Unitex was an incentive to release lexicons under the LGPLLR license
Once the LGPL license had been adopted, it became apparent that some of the electronic lexicons that Unitex requires in order to operate should also be released free of charge under a user license. Various institutions released their lexicons under the LGPLLR license.
University Paris-Est Marne-la-Vallée, IGM, NLP team