Unitex/GramLab is an open source, cross-platform, multilingual, lexicon- and grammar-based corpus processing suite
Unitex/GramLab has been selected as a Google Summer of Code 2016 mentor organization
Google Summer of Code (GSoC) is a global program that offers students stipends to write code for open source projects during summer break. This year, Unitex/GramLab has been selected as a Google Summer of Code mentor organization. If you’re interested in helping with GSoC or mentoring a student, or if you are a student yourself, we’d love to hear from you:
On the Unitex/GramLab forum, you can ask and answer questions and post your suggestions about Unitex and GramLab.
According to a study based on 377 job offers for NLP engineers published from March 2013 to July 2015, Unitex is among the most sought-after skills in terms of NLP tools (9%).
3rd Unitex/GramLab Workshop, 9-10 October 2014, Université François Rabelais de Tours, France
2nd Unitex/GramLab Workshop, 10-11 October 2013, Université Paris-Est Marne-la-Vallée
Unitex/GramLab 3.0 stable is now available
What is Unitex?
Unitex is a corpus processing system based on automata-oriented technology. The concept of this software was born at LADL (Laboratoire d'Automatique Documentaire et Linguistique), under its director, Maurice Gross. With this tool, you can handle electronic resources such as electronic dictionaries and grammars and apply them to texts. You can work at the levels of morphology, lexicon and syntax. The main functions are:
What is GramLab?
GramLab IDE is an Integrated Development Environment based on Unitex software components and designed for industrial use. Its main features are:
You can also visit the site of the GramLab project.
A multilingual platform
Unitex conforms to the Unicode 3.0 standard that allows users to handle virtually all the characters of all languages, including Asian languages. The Unitex programs have been designed to work for all writing rules. There is no difficulty in working with Asian languages, in spite of their particular spacing conventions.
Multi-platform software
The Unitex interface is written in Java and all other programs are written in C/C++. This allows Unitex to work on every system that supports Java 1.6 and that can compile C/C++ programs.
Unitex has been tested successfully on Windows (95, 98, NT, 2000, XP, ME and Vista), Linux and Mac OS X, and it now also runs on Solaris 8 SPARC.
Open source software
Versions 2.1 and newer of Unitex are freely distributed under the terms of the GNU Lesser General Public License (LGPL). This means that everyone can redistribute Unitex freely within the terms of the LGPL. It also means that you have access to the source code of all the Unitex programs, which is included in the zip file you download. The LGPL is more permissive than the GPL, because it allows you to reuse Unitex's code in non-free software.
Unitex is mainly developed by Sébastien Paumier at the Institut Gaspard-Monge (IGM), University of Paris-Est Marne-la-Vallée (France). You can consult the full list of contributors in the user manual.
Unitex also benefited from years of research, experiments and publications by Maurice Gross (1989, 1997), Dominique Revuz (1992), Emmanuel Roche (1992, 1997), Max Silberztein (1989, 1991, 1992, 1993, 1994, 1997) and other authors. Unitex would have been useless without the linguistic data (dictionaries and grammars) constructed by the laboratories of the RELEX network.
The locate pattern function was re-used from previous software known as AGLAE.
The adaptation for Ancient Greek was made by Claude Devis (CENTAL), who also included new code pages (Windows & ISO) in the transcoding program. Claude Devis also introduced morphological filters into Unitex, using an open source regular expression library written by Ville Laurikari.
The RebuildTfst program (previously known as MergeTextAutomaton) was written by Olivier Blanc (IGM).
Integration of the ELAG program was made by Olivier Blanc.
The text editor integrated in Unitex was written by Julien Decreton, who also developed the UNDO and REDO functions in the graph editor.
The adaptation of the PolyLex program for Russian was made by Sebastian Nagel, who also developed a set of Perl programs that can be used to manipulate and visualize automatically generated graphs, along with other Unitex-related tools.
The Tokenize and Dico programs were substantially optimized by Alexis Neme.
A new graph for French sentences was created by Anne Dister (CENTAL), Nathalie Friburger and Denis Maurel (LI, University François-Rabelais). The French proper noun dictionaries come from the Prolex Project of the University François-Rabelais. Today, the package contains two dictionaries: Toponyms and Countries&Capitals. More details are available on the Unitex page of the TLN website.
Tools for generation of Korean dictionaries were designed by Hyun-Gue Huh.
The Dico program was modified by Alexis Neme in order to allow the use of dictionary graphs.
This software is used by many people with different goals. Here are a few projects that use Unitex (see this page for a longer list):
A.W. Appel, G.J. Jacobson. 1988. The world's fastest Scrabble program, Comm. ACM 31(5), pp. 572-578 & 585.
Dister, Anne. 1998. Problématique des fins de phrase en traitement automatique du français. In À qui appartient la ponctuation ? Actes du colloque international et interdisciplinaire de Liège (13-15 mars 1997), pp. 437-447, Bruxelles : Duculot, Champs linguistiques.
Friburger, Nathalie; Dister, Anne; Maurel, Denis. 2000. Améliorer le découpage en phrases sous Intex. In Actes des troisièmes journées Intex, Liège, 13-14 juin 2000 (Anne Dister Ed.), in Revue, Informatique et Statistiques dans les sciences humaines 36 n°1-4, pp. 181-200.
Maurice Gross. 1989. The Use of Finite Automata in the Lexical Representation of Natural Language. In Electronic Dictionaries and Automata in Computational Linguistics, Lecture Notes in Computer Science 377, pp. 34-50, Berlin/New York: Springer.
Maurice Gross. 1997. The Construction of Local Grammars, in E. Roche and Y. Schabes (eds.), Finite-State Language Processing, Cambridge, Mass./London, The MIT Press, pp. 329-352.
Cláudio L. Lucchesi, Tomasz Kowaltowski. 1993. Applications of finite automata representing large vocabularies. Software - Practice and Experience 23(1), pp. 15-30, Wiley & Sons.
Sébastien Paumier. 2000. Nouvelles méthodes pour la recherche d'expressions dans de grands corpus. In Actes des troisièmes journées Intex, Liège, 13-14 juin 2000 (Anne Dister Ed.), in Revue, Informatique et Statistiques dans les sciences humaines 36 n°1-4, pp. 289-295.
Dominique Revuz. 1992. Minimization of acyclic deterministic automata in linear time. Theoretical Comput. Sci., vol. 92, n° 1, pp. 181-189.
Emmanuel Roche. 1992. Text disambiguation by finite-state automata: an algorithm and experiments on corpora. In COLING-92. Proceedings of the Conference, Nantes.
Emmanuel Roche. 1997. Parsing with finite state transducers. In E. Roche and Y. Schabes (eds.), Finite-State Language Processing, Cambridge, Mass./London, The MIT Press, pp. 241-281.
Max Silberztein. 1989. The lexical analysis of French, in Electronic Dictionaries and Automata in Computational Linguistics, Lecture Notes in Computer Science 377, Berlin/New York: Springer.
Max Silberztein. 1991. A new approach to tagging: the use of a large-coverage electronic dictionary, Applied Computer Translation 1(4).
Max Silberztein. 1992. Finite state descriptions of various levels of linguistic phenomena, Language Research 28(4), Seoul National University, pp. 731-748.
Max Silberztein. 1993. Dictionnaires électroniques et analyse automatique de textes. Le système INTEX, Paris, Masson, 234 p.
Max Silberztein. 1994. INTEX: a corpus processing system, in COLING 94 Proceedings, Kyoto, Japan.
Max Silberztein. 1997. The Lexical Analysis of Natural Languages, in Finite-State Language Processing, E. Roche and Y. Schabes (eds.), Cambridge, Mass./London, MIT Press, pp. 175-203.