(This page was last updated on October 13, 2009)

What is it?

NLGbAse is a set of graphs, metadata and resource files devoted to the Natural Language Generation (NLG) and Natural Language Understanding (NLU) components of information systems. The metadata are built from large, evolving, multilingual encyclopedic corpora such as Wikipedia or other wiki data. The main advantage of such learning material is that it keeps evolving: NLGbAse can automatically learn new entities and relations on a day-to-day basis. Because of its detailed cross-linguistic references, NLGbAse provides a wide range of possible written forms and synonyms for a single term.
* You can explore the metadata content using the Database View links.

NLGbAse is also a set of software applications that use the metadata for various NLG and NLU tasks, e.g. a named entity labeling task.
* You can use the entity labeling tool and the search and query tool to try it out.

Basic principle:

- You want to tag a microprocessor name, but it can be written in many different ways. This graph of the term AMD Athlon 64 demonstrates the range of written forms collected by our base. Since acronyms can be the same in many languages, this example shows how a cross-linguistic reference can improve tagging in a specific language.
- You need a wide range of queries to increase the coverage of a tourism search engine. This entry for Saudi Arabia gives many possible written forms.
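The first use case above can be sketched in a few lines of Python. This is only a minimal illustration of tagging through a table of written forms, not the NLGbAse implementation: the names VARIANTS, build_index and tag_terms are hypothetical, and the listed forms are made up for the example rather than extracted from the base.

```python
import re

# Hypothetical variant table: canonical term -> illustrative written forms.
# These forms are made up for the example, not extracted from NLGbAse.
VARIANTS = {
    "AMD Athlon 64": ["AMD Athlon 64", "Athlon 64", "Athlon64"],
    "Saudi Arabia": ["Saudi Arabia", "Arabie saoudite", "Arabia Saudita"],
}


def build_index(variants):
    """Map each lowercased written form to its canonical term."""
    return {
        form.lower(): canonical
        for canonical, forms in variants.items()
        for form in forms
    }


def tag_terms(text, index):
    """Return (matched text, canonical term) pairs found in the text."""
    # Try longer forms first so "AMD Athlon 64" wins over "Athlon 64".
    forms = sorted(index, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(form) for form in forms) + r")\b",
        re.IGNORECASE,
    )
    return [(m.group(0), index[m.group(0).lower()]) for m in pattern.finditer(text)]


INDEX = build_index(VARIANTS)
print(tag_terms("Le nouvel Athlon64 est vendu en Arabie saoudite.", INDEX))
```

Because every written form points back to one canonical term, a form seen in one language can be tagged even when the text is written in another.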

What's in the metadata?

We provide a database of 3 422 975 (*) graph sets, representing more than 14 million possible term writings (**), along with algorithms and tools to extract information from this base:
* Each set is a term with all its possible written forms in multiple languages (at the moment French, English, Italian, Spanish and German).
* Each set has a named entity tag (Enamex standard) such as "Location", "Person", "Place", "Date", "Organisation", or "Unknown" for encyclopedic terms. This represents (***):

  • 177 858 persons in French, 683 942 in English, 85 538 in Spanish
  • 145 982 places in French, 519 938 in English, 92 185 in Spanish
  • 74 417 organisations in French, 314 551 in English, 56 249 in Spanish
  • 86 579 products in French, 420 715 in English, 50 415 in Spanish

* Each set is associated with a list of "potential contextual words" to allow named entity disambiguation.
(*) 600 000 in French, 2 445 731 in English, 377 244 in Spanish [3 sept 08]
(**) 1 900 000 in French, 1 342 000 in Spanish, more than 10 million in English [3 sept 08]
(***) Evaluated with the sets generated on 2008-09-04. Some entities can be identical in two or more languages.
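To make the description above more concrete, here is a minimal Python sketch of how one graph set and its contextual words could be represented and used for disambiguation. Everything in it is an assumption made for illustration: the GraphSet class, its field names, the sample entries and the scoring rule (a plain count of shared words) are hypothetical and do not describe the actual NLGbAse storage format or algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class GraphSet:
    """Hypothetical in-memory view of one graph set (field names are assumptions)."""
    term: str
    tag: str  # e.g. "Person", "Location", "Unknown"
    writings: dict = field(default_factory=dict)  # language code -> written forms
    context_words: set = field(default_factory=set)  # potential contextual words


def disambiguate(sentence, candidates):
    """Pick the candidate sharing the most contextual words with the sentence."""
    words = {w.strip(".,;:!?").lower() for w in sentence.split()}
    # Assumed scoring: simple overlap count; ties go to the first candidate.
    return max(candidates, key=lambda c: len(words & c.context_words))


# Illustrative entries, written by hand for this example.
PARIS_CITY = GraphSet(
    "Paris", "Location",
    {"en": ["Paris"], "fr": ["Paris"]},
    {"france", "city", "seine", "capital"},
)
PARIS_HERO = GraphSet(
    "Paris (mythology)", "Person",
    {"en": ["Paris"], "fr": ["Pâris"]},
    {"troy", "helen", "prince", "iliad"},
)

best = disambiguate("Paris abducted Helen and took her to Troy.", [PARIS_CITY, PARIS_HERO])
print(best.term, best.tag)
```

The same written form ("Paris") belongs to two sets here, and only the surrounding words decide which tag is returned.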

http://www.nlgbase.org/fr_sample.png
Sample of a conceptual graph for a [person] named entity. This graph of a singer's pseudonym (Akhenaton) also gives his real name (Philippe Fragione). Used in a music search engine, it can help increase the coverage of a query.
Graph representation
Sample of a conceptual graph for an encyclopedic entity

Applications 

This database of graph sets can be used for robust named entity tagging, query enrichment in search engines, cross-language search queries, semantic queries, and NLG applications.
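As an illustration of query enrichment, the following Python sketch expands a search term with its other known written forms. It is a hedged example only: the function expand_query, the SYNONYMS table and the quoted OR syntax are assumptions about a generic search engine, not part of the NLGbAse tools. The Akhenaton / Philippe Fragione pair is the one shown in the sample graph on this page.

```python
def expand_query(query, synonyms):
    """Return an OR query covering every known written form of the query term."""
    extra = [
        form for form in synonyms.get(query.lower(), [])
        if form.lower() != query.lower()  # do not repeat the original term
    ]
    # Quote each form so multi-word names are searched as phrases (assumed syntax).
    return " OR ".join(f'"{form}"' for form in [query] + extra)


# Hypothetical synonym table: lowercased term -> other written forms.
SYNONYMS = {"akhenaton": ["Philippe Fragione"]}

print(expand_query("Akhenaton", SYNONYMS))
```

A term with no entry in the table is returned unchanged, apart from the quoting, so the expansion never narrows the original query.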

A website is now available to try out the demonstration search engine. Please visit www.nlgbase.net or try the experimental modules here.

Did you enjoy it?

All the tools provided on this website are free to use for academic and research purposes. If you find these tools useful, please cite the published papers related to this system, or this website:

@misc{NLGbAse,
author = {Eric Charton},
title = {{NLGbAse, the Wikipedia based statistical ontology.}},
note = "Wikipedia as a based statistical ontology, is an effort to extract structured information from Wikipedia in 5 linguistic versions and to make this information available on the Web.",
organization = " Laboratoire Informatique d'Avignon, Université d'Avignon et des Pays de Vaucluse",
howpublished = "\url{http://www.nlgbase.org}"
}