Last update: January 11, 2010

Graph Databases

The online tools provided for running experiments and for viewing or searching records are built on the data described on this page.

NLGbAse metadata

NLGbAse metadata can be downloaded here: http://www.nlgbase.org/base_stable/
Each set is made of two files:
  • LE.data.csv
  • LE.tfidf-label.csv
 
Information about the data files:
The file LE.data.csv contains the lexical networks and class labels. Each record is one line of tab-separated (\t) fields with the following structure, where LE is the code of the linguistic edition:
 
internal_number    class_label_(according to Ester rules)    name_key_in_reference_language
( [name_0    name_x] all written forms in the reference language ) ( [le:name_0    name_x] all written forms in the [le] linguistic edition, e.g. en: English, de: German, etc. )

 
Sample:
243767  loc.admi   Alabama Alabama (U.S. state)    Alabama, United States  The Yellowhammer State  Alabam  Alabama (state) The Heart of Dixie      22nd State      Alabahmu        State of Alabama        US-AL   de:Alabama      Alabama (Bundesstaat)   fr:Alabama      État de l'Alabama       it:Alabama      es:Alabama      Alabama (estado)        Ala (homonym)
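
A record such as the sample above could be read with a short parser along the following lines. This is only a sketch under stated assumptions: the function name parse_data_record and the "ref" key are hypothetical, empty fields are simply dropped, and a field is treated as the start of a new linguistic edition only when it begins with two lowercase letters and a colon (as in de:Alabama). Forms that follow a prefixed field are attributed to that edition until the next prefix appears, which is a guess based on the sample.

```python
import re

# Assumption: an edition prefix is two lowercase letters followed by ":".
LANG_PREFIX = re.compile(r"^([a-z]{2}):(.*)$")


def parse_data_record(line, ref_lang="ref"):
    """Parse one tab-separated LE.data.csv line into a dictionary."""
    fields = [f.strip() for f in line.rstrip("\n").split("\t")]
    fields = [f for f in fields if f]  # drop empty fields
    if len(fields) < 3:
        raise ValueError("expected at least 3 tab-separated fields")

    internal_number, class_label, name_key = fields[:3]
    forms = {ref_lang: []}
    current = ref_lang
    for field in fields[3:]:
        match = LANG_PREFIX.match(field)
        if match:
            # Switch to the edition named by the prefix.
            current = match.group(1)
            field = match.group(2)
            forms.setdefault(current, [])
        if field:
            forms[current].append(field)

    return {
        "id": internal_number,
        "label": class_label,
        "key": name_key,
        "forms": forms,
    }


if __name__ == "__main__":
    # Shortened version of the sample record, with explicit tabs.
    sample = ("243767\tloc.admi\tAlabama\tAlabama (U.S. state)\tUS-AL"
              "\tde:Alabama\tAlabama (Bundesstaat)\tfr:Alabama")
    print(parse_data_record(sample))
```

The identifier is kept as a string so that no assumption is made about its numeric format.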
 
The file LE.tfidf-label.csv contains the words and their respective tf-idf weights. Each record is one line of tab-separated (\t) fields with the following structure, where LE is the code of the linguistic edition:
 
internal_number (same as in the LE.data.csv file)   name_key_in_reference_language  class_label_(according to Ester rules)   [ word_0:tfidf_weight   word_n:tfidf_weight ]
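
The structure above could be read as follows. This is a minimal sketch, not a reference reader: the function name parse_tfidf_record is hypothetical, each word:weight field is split on its last colon (so that a word containing a colon is kept whole), and fields whose weight is not a valid number are skipped. The record in the usage example is invented for illustration and is not taken from the actual files.

```python
def parse_tfidf_record(line):
    """Parse one tab-separated LE.tfidf-label.csv line into a dictionary."""
    fields = [f.strip() for f in line.rstrip("\n").split("\t")]
    fields = [f for f in fields if f]  # drop empty fields
    if len(fields) < 3:
        raise ValueError("expected at least 3 tab-separated fields")

    internal_number, name_key, class_label = fields[:3]
    weights = {}
    for field in fields[3:]:
        # Split on the last colon: everything before it is the word.
        word, sep, weight = field.rpartition(":")
        if not sep or not word:
            continue  # no colon, or empty word: skip the field
        try:
            weights[word] = float(weight)
        except ValueError:
            continue  # weight is not a number: skip the field

    return {
        "id": internal_number,
        "key": name_key,
        "label": class_label,
        "weights": weights,
    }


if __name__ == "__main__":
    # Invented record, for illustration only.
    line = "243767\tAlabama\tloc.admi\tstate:0.5\tmontgomery:1.25"
    print(parse_tfidf_record(line))
```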
 

Metadata building and NER system implementation

The metadata files are described in the following paper:
Classification d’un contenu encyclopédique en vue d’un étiquetage par entités nommées (Eric Charton and Juan Manuel Torres-Moreno - Laboratoire Informatique d'Avignon, Université d'Avignon et des Pays de Vaucluse)(pdf)
 
The implementation of the NER system with a cosine similarity algorithm is described in:
Combinaison de contenus encyclopédiques multilingues pour une reconnaissance d’entités nommées en contexte (Eric Charton - Laboratoire Informatique d'Avignon, Université d'Avignon et des Pays de Vaucluse) (pdf)
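
For readers unfamiliar with the measure, cosine similarity between two sparse tf-idf vectors (for instance the word:weight lists of LE.tfidf-label.csv, loaded as dictionaries) can be computed as sketched below. This illustrates the general formula only; it is not the implementation from the paper, and the function name cosine_similarity is hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as {word: weight} dicts."""
    # Iterate over the smaller vector when computing the dot product.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # by convention, an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Invented weights, for illustration only.
    context = {"state": 1.0, "capital": 2.0}
    entity = {"state": 0.5, "river": 1.5}
    print(cosine_similarity(context, entity))
```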

Implementation of the NER system with CRF
Please note that the current NLGbAse NER system is not based on cosine similarity. It is a CRF-based NER system, similar to the one used for the Ester 2 evaluation campaign. This system will be described in forthcoming papers (ICASSP, accepted). A training corpus for CRF++ NER, based on the NLGbAse metadata, will be released in March 2010.


These data are freely available. For more information or any questions, feel free to send an email to:
eric[put a dot here]charton@univ-avignon[put a dot here]fr

Please note that our database format is currently changing quickly and substantially.