Unitex/GramLab



LANGUAGE RESOURCES

The RELEX network

Unitex works with dictionaries built by the members of the RELEX network, an international network of laboratories specialized in Computational Linguistics that was created by Maurice Gross and his LADL team. Most of the universities listed in the link page are members of this network.

Members of the RELEX network have built and are building exhaustive dictionaries of simple and compound words for French, English, Greek, Portuguese, Russian, Thai, Korean, Italian, Spanish, Norwegian, Arabic, German, Polish and more. They also build lexicon-grammar tables.

For more information about RELEX resources, please consult our bibliography.

DELAF dictionaries

All the dictionaries conform to the DELAF formalism. A DELAF dictionary is a text file in which each line represents one entry. An entry contains the inflected form of the word, a comma, the lemma, a period, and then grammatical, semantic and inflectional information. When the lemma is identical to the inflected form, the lemma field is left empty. Here is a sample of the English simple word dictionary:

acidize,.V:W:P1s:P2s:P1p:P2p:P3p
acidizing,acidize.V:G
acidized,acidize.V:K:I1s:I2s:I3s:I1p:I2p:I3p
acidizes,acidize.V:P3s
acidizer,.N+Conc:s
acidizers,acidizer.N+Conc:p
acidizing,.N:s
acidizings,acidizing.N:p
acidly,.ADV
acidness,.N:s
acidnesses,acidness.N:p
acidoid,.A
acidolysis,.N:s
acidolyses,acidolysis.N:p
acidometer,.N+Conc:s
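
To make the line layout concrete, here is a minimal Python sketch that splits one entry into its fields. It is an illustration based only on the sample above, not code from Unitex: the names `DelafEntry` and `parse_delaf_line` are hypothetical, and the sketch assumes that forms contain no escaped commas or periods, which a complete reader would have to handle.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DelafEntry:
    inflected: str          # form as it appears in text
    lemma: str              # canonical form
    category: str           # grammatical category, e.g. "V", "N", "ADV"
    features: List[str]     # codes introduced by "+", e.g. ["Conc"]
    inflections: List[str]  # codes introduced by ":", e.g. ["P3s"]


def parse_delaf_line(line: str) -> DelafEntry:
    """Parse a line shaped like inflected,lemma.codes[:inflection...].

    Assumption: no escaped "," or "." inside the forms.
    """
    inflected, rest = line.strip().split(",", 1)
    # The codes never contain a period in the sample, so split on the last one.
    lemma, info = rest.rsplit(".", 1)
    codes, *inflections = info.split(":")
    category, *features = codes.split("+")
    # An empty lemma field means the lemma equals the inflected form.
    return DelafEntry(inflected, lemma or inflected, category, features, inflections)


if __name__ == "__main__":
    for sample in ("acidizes,acidize.V:P3s", "acidizer,.N+Conc:s", "acidly,.ADV"):
        print(parse_delaf_line(sample))
```

Splitting on the first comma and the last period keeps multi-word forms intact, so the same function also accepts compound entries such as those in the compound word dictionary.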

DELAF dictionaries can contain both simple and compound words. Here is a sample of the English compound word dictionary:

Chamber of Commerce,.N+NPN+z1:s
Chamber of Deputies,.N+NPN+z1:s
Chamber of Horrors,.N+NPN+z1:s
Chamber of trade,.N+NPN+z1:s
Chambers of Commerce,Chamber of Commerce.N+NPN+z1:p
Chambers of Deputies,Chamber of Deputies.N+NPN+z1:p
Chambers of Horrors,Chamber of Horrors.N+NPN+z1:p
Chambers of trade,Chamber of trade.N+NPN+z1:p
Champagne Charley,.N+XN+z1:s
Chancellor of The Exchequer,.N+NPN+Hum+z1:s
Channel Fleet,.N+XN+z1:s
Channel Fleets,Channel Fleet.N+XN+z1:p
Channel Islander,.N+XN+Hum+z1:s
Channel Islanders,Channel Islander.N+XN+Hum+z1:p
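
The lemma field is what links a plural compound to its singular. As a small sketch under the same assumptions as before (no escaped commas or periods; the function name `forms_by_lemma` is hypothetical and not part of Unitex), the following Python groups a few of the sample lines by lemma:

```python
from collections import defaultdict

# A few lines copied from the compound word sample above.
SAMPLE = """\
Chamber of Commerce,.N+NPN+z1:s
Chambers of Commerce,Chamber of Commerce.N+NPN+z1:p
Champagne Charley,.N+XN+z1:s
Channel Fleet,.N+XN+z1:s
Channel Fleets,Channel Fleet.N+XN+z1:p
"""


def forms_by_lemma(text):
    """Map each lemma to the list of inflected forms that point to it."""
    groups = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        inflected, rest = line.split(",", 1)
        lemma, _codes = rest.rsplit(".", 1)
        # An empty lemma field means the entry is its own lemma.
        groups[lemma or inflected].append(inflected)
    return dict(groups)


if __name__ == "__main__":
    for lemma, forms in forms_by_lemma(SAMPLE).items():
        print(lemma, "->", forms)
```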

To get more information about the DELAF formalism, please consult the manual.


Lexicon-Grammar tables

The lexicon-grammar methodology was developed by Maurice Gross according to the following principle: every verb has a specific set of arguments (i.e. subject and complements), to the point that this set is often unique. Hence the syntactic properties of verbs, or rather of the elementary sentences defined for each verb, have to be described systematically; no system predicting sentence forms from semantic features could exist. The systematic description consists of matrices whose rows are verbs (i.e. elementary sentences) and whose columns are the sentence forms that a verb may or may not enter. The sentence forms are the usual transformations of elementary sentences, often simple declarative forms. The matrices are binary: a "+" sign appears at the intersection of a row and a column when the verb in that row enters the structure represented by that column, and a "-" sign appears otherwise.
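
The binary matrix can be pictured with a short Python sketch. Everything in it is invented for illustration: the semicolon-separated layout, the entries `verb_a` to `verb_c`, the three column names and the "+"/"-" values do not come from any actual lexicon-grammar table, and the function names are hypothetical.

```python
# Invented toy table: rows are entries, columns are sentence forms.
TABLE = """\
entry;passive;pronominal;nominalization
verb_a;+;-;+
verb_b;-;-;+
verb_c;+;+;-
"""


def load_table(text):
    """Read the table into {entry: {column: bool}}, where True stands for "+"."""
    lines = [line for line in text.splitlines() if line.strip()]
    columns = lines[0].split(";")[1:]
    table = {}
    for line in lines[1:]:
        cells = line.split(";")
        table[cells[0]] = {col: cell == "+" for col, cell in zip(columns, cells[1:])}
    return table


def entries_accepting(table, column):
    """List the entries marked "+" for the given sentence form."""
    return [entry for entry, props in table.items() if props[column]]


if __name__ == "__main__":
    table = load_table(TABLE)
    print(entries_accepting(table, "passive"))
```

Storing each row as a mapping from column name to boolean keeps the lookup readable; a real table with several hundred columns might be better served by a more compact representation.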

A lexicon of the 12,000 main verbs of French has been subdivided into about 50 classes (C. Leclère 1991). Each class has its own matrix. There are about 400 sentence forms, including pronominalization, passivization, sentential complement reductions, and nominalizations with support verbs.

A lexicon of 25,000 elementary sentences with at least one frozen argument is also available. Their representation by binary matrices follows the same principles. Partial lexicons of sentences with support verbs (être, avoir, faire, etc.) and predicative nouns have also been built (J. Giry-Schneider 1978, 1987, A. Meunier 1977).


Resources distributed with Unitex

The resources included in Unitex are distributed under the LGPLLR license. Under this license, you can obtain readable versions of these resources. You can download them for English and French here. You can also use the Uncompress program included in Unitex 2.1 or later to get the text version of the binary dictionaries distributed with Unitex.

The latest Unitex package contains resources for many languages. Here is a brief presentation of these resources. THESE RESOURCES ARE NOT THE WHOLE DICTIONARIES. Please follow the links for more information.

  • ENGLISH:

    • Dictionaries: 296,606 simple words (150,145 distinct lemmas) and 132,990 compound words (69,912 distinct lemmas)

    • Corpus: Ivanhoe, by Sir Walter Scott (courtesy of Jim Manis)
  • FINNISH:

  • FRENCH:

    • Dictionaries: 683,824 simple words (102,073 distinct lemmas), 108,436 compound words (83,604 distinct lemmas), given name dictionaries (24,000 entries) and a profession dictionary (4,200 entries), and 2,700 Quebec simple words

    • Corpus: Le tour du monde en 80 jours, by Jules Verne
  • GEORGIAN (Ancient):

    • Dictionaries: 7,254 simple words

    • Corpus: Isaac of Nineveh (Isaacus Ninivita), first collection, unpublished Old Georgian text in two different translations, an old translation (9th century) and a new translation (11th century); 25,900 words, 7,180 forms.

    • More information in the "Apply Lexical Resources" window of Unitex
  • GERMAN:

    • Dictionaries: about 10% of the CISLEX developed at CIS, i.e. 300,000 word forms. There are also additional dictionaries, e.g. for numerals.

    • Corpus: Franz Kafka's "Proceß"
  • GREEK (Ancient):

    • Dictionaries: 280,733 simple forms (April 2006)

    • Corpus: Gregory of Nazianzus, Discourses X and XII (4th century AD). Migne's Patrologia Graeca, vol. 35, col. 828-832; 844-839 (1,905 words)

    • More information at http://tpg.fltr.ucl.ac.be or in the "Apply Lexical Resources" window of Unitex
  • GREEK (Modern):

    • Dictionaries: 360,000 simple words and 40,000 compound words (these resources represent about 30% of the whole dictionaries)

    • Corpus: journalistic corpus
  • ITALIAN:

    • Dictionaries: 118,000 simple words in DELAF and 32,000 compound words in DELACF. There are also 2 simple word dictionaries that include 630 toponyms and 3,255 proper names, and 2 compound word dictionaries that include 223 toponyms and 889 proper names.

    • Corpus: I Malavoglia, by Giovanni Verga
  • LATIN:

    • Dictionaries: 720,000 simple words in DELAF (Charlton Lewis, Charles Short, 1879), made available by the Perseus Project.

    • Corpus: De Bello Gallico, by Julius Caesar, made available by Project Gutenberg.
  • MALAGASY:

    • Dictionaries: 536 simple verbs in DEMA-VS; 53 invariable words in DEMA-INVflx.

    • Corpus: Diwersy, Sascha (2009-), Corpus journalistique du malgache contemporain, Romance Philology Department, University of Cologne.
  • NORWEGIAN:

    • Dictionaries: 51,000 simple words and 640 compound words

    • Corpus: Folkeeventyr
  • PORTUGUESE:

    • Dictionaries: 940,000 simple words and 11,000 compound words (Portugal) and 880,000 simple words and 4,100 compound words (Brazil)

    • Corpus: Os Pobres, by Raul Brandão (Portugal); A Senhora, by José Manuel de Alencar (Brazil)
  • RUSSIAN:

    • Dictionaries: 9,800 entries (260,000 distinct forms) are included in Unitex.

      The whole lexicon contains:

      140,000 simple entries (= 2.7 million distinct forms)
      160,000 proper nouns (= 840,000 distinct forms)
      500 compound words

    • Corpus: The Gambler, by Fyodor Dostoyevsky
  • SERBIAN (Latin and Cyrillic script):

    • Dictionaries: 9,510 simple word forms, 66 compound word forms.

      The whole dictionary contains:

      130,000 simple entries
      10,500 compound words

    • Corpus: the Serbian translation of Voltaire's Candide
  • SPANISH:

    • Dictionaries: 638,000 simple words

    • Corpus: Trafalgar, by Benito Pérez Galdós
  • THAI:

    • Dictionaries: 33,000 simple and 100 compound words

    • Corpus: extract from the novel Si Phan Din



Last modification on this page: December 03, 2013
