Graph Databases
The online tools provided for running experiments and for viewing or searching records are built on the data described here.
NLGbAse metadata
NLGbAse metadata can be downloaded here: http://www.nlgbase.org/base_stable/
Each set is made of two files:
- LE.data.csv
- LE.tfidf-label.csv
Information about the data files:
File LE.data.csv contains the lexical networks and class labels. Each record is one line of fields separated by \t, with the following structure (LE denotes the reference linguistic edition):
internal_number class_label_(according to ester rules) name_key_in_reference_language
( [name_0 name_x ] all written forms in the reference language ) ( [name:le_0 name_le_x ] all written forms in the [le] linguistic edition, e.g. en: English, de: German, etc. )
Sample:
243767 loc.admi Alabama Alabama (U.S. state) Alabama, United States The Yellowhammer State Alabam Alabama (state) The Heart of Dixie 22nd State Alabahmu State of Alabama US-AL de:Alabama Alabama (Bundesstaat) fr:Alabama État de l'Alabama it:Alabama es:Alabama Alabama (estado) Ala (homonym)
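The record layout above can be read with a few lines of Python. This is a minimal sketch, not the NLGbAse tooling: the names `DataRecord` and `parse_data_line` are hypothetical, and the grouping rule is an assumption drawn from the sample, namely that a field starting with a two-letter code and a colon (such as `de:`) opens a new linguistic edition and that the unprefixed fields after it belong to that edition.

```python
import re
from typing import Dict, List, NamedTuple

# Assumption: an edition prefix is two lowercase letters followed by a colon.
EDITION_PREFIX = re.compile(r"^([a-z]{2}):(.*)$")


class DataRecord(NamedTuple):
    internal_number: int
    class_label: str
    name_key: str
    reference_names: List[str]            # forms in the reference language
    edition_names: Dict[str, List[str]]   # forms grouped by edition code


def parse_data_line(line: str) -> DataRecord:
    """Parse one tab-separated record of an LE.data.csv file."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("expected at least 3 tab-separated fields")
    number, label, key, *rest = fields

    reference_names: List[str] = []
    edition_names: Dict[str, List[str]] = {}
    current = None  # edition code of the group being read, if any
    for field in rest:
        match = EDITION_PREFIX.match(field)
        if match:
            # A prefixed field opens (or continues) an edition group.
            current = match.group(1)
            edition_names.setdefault(current, []).append(match.group(2))
        elif current is None:
            # No prefix seen yet: a form in the reference language.
            reference_names.append(field)
        else:
            # Assumed to belong to the most recent edition group.
            edition_names[current].append(field)
    return DataRecord(int(number), label, key, reference_names, edition_names)
```

Under this assumption, trailing unprefixed fields such as "Ala (homonym)" in the sample are attached to the last edition seen, so the grouping should be checked against the real files before relying on it.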
File LE.tfidf-label.csv contains the words and their respective tfidf weights. Each record is one line of fields separated by \t, with the following structure (LE denotes the reference linguistic edition):
internal_number (same as LE.data.csv file) name_key_in_reference_language class_label_(according to ester rules) [ word_0:tfidf_weight word_n:tfidf_weight ]
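A record of LE.tfidf-label.csv can be turned into a word-to-weight mapping along the same lines. The sketch below is illustrative only: `parse_tfidf_line` is a hypothetical name, and it assumes that the `word:tfidf_weight` pairs follow the three leading fields, that pairs may be separated by tabs or spaces, and that the weight is whatever follows the last colon of a pair. Tokens that do not fit this shape, including any literal bracket tokens, are skipped.

```python
from typing import Dict, Tuple


def parse_tfidf_line(line: str) -> Tuple[int, str, str, Dict[str, float]]:
    """Parse one tab-separated record of an LE.tfidf-label.csv file.

    Returns (internal_number, name_key, class_label, weights).
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("expected at least 3 tab-separated fields")
    number, key, label, *rest = fields

    weights: Dict[str, float] = {}
    for field in rest:
        # Assumption: pairs may also be space-separated inside a field.
        for token in field.split():
            # Split on the last colon so a word may itself contain colons.
            word, sep, weight = token.rpartition(":")
            if not sep or not word:
                continue  # not a word:weight pair
            try:
                weights[word] = float(weight)
            except ValueError:
                continue  # weight is not numeric, skip the token
    return int(number), key, label, weights
```

Since the internal number is the same in both files, the result can be joined with the matching LE.data.csv record by indexing on that number.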
Metadata building and NER system implementation
The metadata files are described in the following paper: Classification d’un contenu encyclopédique en vue d’un étiquetage par entités nommées (Eric Charton and Juan Manuel Torres-Moreno, Laboratoire Informatique d'Avignon, Université d'Avignon et des Pays de Vaucluse) (pdf)
The implementation of the NER system with a cosine similarity algorithm is described in:
Combinaison de contenus encyclopédiques multilingues pour une reconnaissance d’entités nommées en contexte (Eric Charton - Laboratoire Informatique d'Avignon, Université d'Avignon et des Pays de Vaucluse) (pdf)
Implementation of NER system with CRF
Please note that the current NLGbAse NER system is not based on cosine similarity. It is a CRF NER system similar to the one used for the Ester 2 evaluation campaign. This system will be described in future papers (ICASSP, accepted). A training corpus for CRF++ NER, based on NLGbAse metadata, will be released in March 2010.
These data are freely available. For more information or any questions, feel free to send an email to:
eric[put a dot here]charton@univ-avignon[put a dot here]fr