Unitex/GramLab is an open source, cross-platform, multilingual, lexicon- and grammar-based corpus processing suite

Unitex/GramLab 3.1 Stable is now available

Unitex/GramLab has been selected as a Google Summer of Code 2016 mentor organization

Google Summer of Code (GSoC) is a global program that offers students stipends to write code for open source projects during their summer break. This year, Unitex/GramLab has been selected as a GSoC mentor organization. If you are interested in helping with GSoC, whether as a mentor or as a student, we’d love to hear from you.

If you have any questions, please do not hesitate to post on the users forum or to send a message to the developers mailing list.

On the Unitex/GramLab forum, you can ask and answer questions and post your suggestions about Unitex and GramLab.

Unitex/GramLab Forum

According to a study based on 377 job offers for NLP engineers published from March 2013 to July 2015, Unitex is among the most sought-after NLP tools (mentioned in 9% of offers).

3rd Unitex/GramLab Workshop
October 9-10, 2014, Université François Rabelais de Tours, France

2nd Unitex/GramLab Workshop
October 10-11, 2013, Université Paris-Est Marne-la-Vallée

Presentations:

  • Detecting and Encoding Interpersonal Relations with Unitex/Local Grammars (pdf)
    Sophia Stotz, Valentina Stuss, Matthias Reinert, University of Paderborn

  • Automatic Annotation of Motion Expressions and Place Named Entities (pdf)
    Ludovic Moncla, Université de Pau, LIUPPA

  • Unitex at the University of Belgrade (pdf)
    Duško Vitas and Cvetana Krstev, University of Belgrade

  • GlossaNet 3: a linguistic tool for monitoring online thematic corpora (demo, zip)
    Hubert Naets and Cédrick Fairon, Université catholique de Louvain, CENTAL

  • Advanced use of graph cascades in Unitex (CasSys) (zip)
    Denis Maurel and Nathalie Friburger, Université François-Rabelais de Tours

  • Presentation of the Unitex library. Usage examples (pdf)
    Gilles Vollant, Ergonotics

  • Presentation of the Unitex UIMA annotator in C++ (pptx)
    Demonstration of integrating linguistic resources into an annotator
    Sylvain Surcin, Kwaga

Unitex/GramLab Days Program

Unitex/GramLab 3.0 stable is now available

What is Unitex?

Unitex is a corpus processing system based on automata-oriented technology. The concept of this software was born at LADL (Laboratoire d'Automatique Documentaire et Linguistique), under the direction of Maurice Gross. With this tool, you can handle electronic resources such as electronic dictionaries and grammars and apply them to texts. You can work at the levels of morphology, the lexicon and syntax. The main functions are:
  • building, checking and applying electronic dictionaries
  • pattern matching with regular expressions and recursive transition networks
  • applying lexicon-grammar tables
  • handling ambiguity via the text automaton
  • aligning texts
  • building an automaton from a certified corpus
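
To give a rough idea of what pattern matching with a recursive transition network involves, here is a toy sketch in Python. It does not use any Unitex program or file format: the grammar encoding (a dictionary of named graphs, with ":name" marking a call to a subgraph) and the function names match and locate are assumptions made purely for this illustration.

```python
# Toy recursive transition network (RTN), for illustration only.
# Each named graph is a list of alternative paths; a path is a sequence of
# items. An item is either a literal token or a call to another graph,
# written ":name". This encoding is hypothetical, not the Unitex graph format.
GRAMMAR = {
    "NP": [["the", ":N"], ["the", ":ADJ", ":N"]],
    "ADJ": [["big"], ["small"]],
    "N": [["cat"], ["dog"], ["cat", "of", ":NP"]],
}


def match(graph, tokens, pos, grammar=GRAMMAR):
    """Return the set of end positions reachable by matching `graph` at `pos`."""
    ends = set()
    for path in grammar[graph]:
        positions = {pos}
        for item in path:
            reached = set()
            for p in positions:
                if item.startswith(":"):
                    # Recursive call into a subgraph.
                    reached |= match(item[1:], tokens, p, grammar)
                elif p < len(tokens) and tokens[p] == item:
                    # Literal token: consume it.
                    reached.add(p + 1)
            positions = reached
            if not positions:
                break  # this path cannot match any further
        ends |= positions
    return ends


def locate(graph, tokens, grammar=GRAMMAR):
    """Return (start, end) spans of the longest match starting at each position."""
    spans = []
    for start in range(len(tokens)):
        ends = match(graph, tokens, start, grammar)
        if ends:
            spans.append((start, max(ends)))
    return spans


tokens = "I saw the big dog and the cat of the dog".split()
for start, end in locate("NP", tokens):
    print(" ".join(tokens[start:end]))
```

The sketch assumes the grammar has no left recursion (a graph never calls itself before consuming a token), otherwise match would not terminate. Because the N graph calls NP, which calls N again, the network can recognize nested phrases such as "the cat of the dog", which a flat list of token sequences could not express.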

What is GramLab?

GramLab IDE is an Integrated Development Environment, based on Unitex software components and designed for industrial use. Its main features are:
  • a project-oriented design allowing the user to work on several projects in parallel
  • a brand new Eclipse-like interface, remembering the user configuration when restarted
  • integrated SVN support for data sharing, including facilities for comparing and merging graphs
  • the ability to export linguistic components as Maven artifacts
  • a workflow approach allowing the user to run a full processing sequence (preprocessing, locate, concordance) with one click
  • UTF-8 support

From this site, you can download Unitex/GramLab and find information about linguistic data.

You can also visit the site of the GramLab project.

Why Unitex?

A multilingual platform

Unitex conforms to the Unicode 3.0 standard that allows users to handle virtually all the characters of all languages, including Asian languages. The Unitex programs have been designed to work for all writing rules. There is no difficulty in working with Asian languages, in spite of their particular spacing conventions.

Multi-platform software

The Unitex interface is written in Java and all other programs are written in C/C++. This allows Unitex to work on every system that supports Java 1.6 and that can compile C/C++ programs.

Unitex has been tested successfully on Windows (95, 98, NT, 2000, XP, ME and Vista), Linux and Mac OS X, and now also runs on Solaris 8 SPARC.

Open source software

Versions 2.1 and newer of Unitex are freely distributed under the terms of the Lesser General Public License (LGPL). This means that everyone can redistribute Unitex freely within the terms of the LGPL. It also means that you have access to the source code of all the Unitex programs, which is included in the zip file you download. The LGPL is more permissive than the GPL, because it allows you to reuse Unitex's own code in non-free software.


Unitex is mainly developed by Sébastien Paumier at the Institut Gaspard-Monge (IGM), University of Paris-Est Marne-la-Vallée (France). You can consult the full list of contributors in the user manual.

Unitex also benefited from years of research, experiments and publications by Maurice Gross (1989, 1997), Dominique Revuz (1992), Emmanuel Roche (1992, 1997), Max Silberztein (1989, 1991, 1992, 1993, 1994, 1997) and other authors. Unitex would have been useless without the linguistic data (dictionaries and grammars) constructed by the laboratories of the RELEX network.

The locate pattern function was re-used from previous software known as AGLAE.

The adaptation for ancient Greek was made by Claude Devis (CENTAL), who also included new code pages (Windows & ISO) in the transcoding program. Claude Devis also introduced morphological filters into Unitex, using an open source regular expression library written by Ville Laurikari.

The RebuildTfst program (previously known as MergeTextAutomaton) was written by Olivier Blanc (IGM).

The Portuguese version of the manual was translated by Alexis Neme and Oto Araújo Vale (Projeto Relex - Brazil).

Integration of the ELAG program was made by Olivier Blanc.

The text editor integrated in Unitex was written by Julien Decreton, who also developed the UNDO and REDO functions in the graph editor.

The adaptation of the PolyLex program for Russian was made by Sebastian Nagel, who also developed a set of Perl programs that can be used to manipulate and visualize automatically generated graphs, along with other utilities around Unitex.

The Tokenize and Dico programs were substantially optimized by Alexis Neme.

A new sentence graph for French was created by Anne Dister (CENTAL), Nathalie Friburger and Denis Maurel (LI, University François-Rabelais). The French proper noun dictionaries come from the Prolex Project of the University François-Rabelais. Today, the package contains two dictionaries: Toponyms and Countries&Capitals. More details are available on the Unitex page of the TLN website.

Tools for generation of Korean dictionaries were designed by Hyun-Gue Huh.

The Dico program was modified by Alexis Neme in order to allow the use of dictionary graphs.

Related work

This software is used by many people with different goals. Here are a few projects that use Unitex (see this page for a longer list):


A.W. Appel, G.J. Jacobson. 1988. The world's fastest Scrabble program, Comm. ACM 31(5), pp. 572-578 & 585.

Dister, Anne. 1998. Problématique des fins de phrase en traitement automatique du français. In À qui appartient la ponctuation ? Actes du colloque international et interdisciplinaire de Liège (13-15 mars 1997), pp. 437-447, Bruxelles : Duculot, Champs linguistiques.

Friburger, Nathalie; Dister, Anne; Maurel, Denis. 2000. Améliorer le découpage en phrases sous Intex. In Actes des troisièmes journées Intex, Liège, 13-14 juin 2000 (Anne Dister Ed.), in Revue, Informatique et Statistiques dans les sciences humaines 36 n°1-4, pp. 181-200.

Maurice Gross. 1989. The Use of Finite Automata in the Lexical Representation of Natural Language. In Electronic Dictionaries and Automata in Computational Linguistics, Lecture Notes in Computer Science 377, pp. 34-50, Berlin/New York: Springer.

Maurice Gross. 1997. The Construction of Local Grammars, in E. Roche and Y. Schabes (eds.), Finite-State Language Processing, Cambridge, Mass./London, The MIT Press, pp. 329-352.

Cláudio L. Lucchesi, Tomasz Kowaltowski. 1993. Applications of finite automata representing large vocabularies. Software - Practice and Experience 23(1), pp. 15-30, Wiley & Sons.

Sébastien Paumier. 2000. Nouvelles méthodes pour la recherche d'expressions dans de grands corpus. In Actes des troisièmes journées Intex, Liège, 13-14 juin 2000 (Anne Dister Ed.), in Revue, Informatique et Statistiques dans les sciences humaines 36 n°1-4, pp. 289-295.

Dominique Revuz. 1992. Minimization of acyclic deterministic automata in linear time. Theoretical Computer Science 92(1), pp. 181-189.

Emmanuel Roche. 1992. Text disambiguation by finite-state automata: an algorithm and experiments on corpora. In COLING-92. Proceedings of the Conference, Nantes.

Emmanuel Roche. 1997. Parsing with finite state transducers. In E. Roche and Y. Schabes (eds.), Finite-State Language Processing, Cambridge, Mass./London, The MIT Press, pp. 241-281.

Max Silberztein. 1989. The lexical analysis of French, in Electronic Dictionaries and Automata in Computational Linguistics, Lecture Notes in Computer Science 377, Berlin/New York: Springer.

Max Silberztein. 1991. A new approach to tagging: the use of a large-coverage electronic dictionary, Applied Computer Translation 1(4).

Max Silberztein. 1992. Finite state descriptions of various levels of linguistic phenomena, Language Research 28(4), Seoul National University, pp. 731-748.

Max Silberztein. 1993. Dictionnaires électroniques et analyse automatique de textes. Le système INTEX, Paris, Masson, 234 p.

Max Silberztein. 1994. INTEX: a corpus processing system, in COLING 94 Proceedings, Kyoto, Japan.

Max Silberztein. 1997. The Lexical Analysis of Natural Languages, in Finite-State Language Processing, E. Roche and Y. Schabes (eds.), Cambridge, Mass./London, MIT Press, pp. 175-203.

Last modification on this page: March 27, 2016