
Three-year PhD research grant proposal

Title: Text mining and web crawling dedicated to the construction of comparable corpora

Context
The VALORIA research group (http://www-valoria.univ-ubs.fr/), in association with the LICORN team of the HCTI research group (http://web.univ-ubs.fr/corpus/publi.html), currently has a vacancy for a 3-year PhD research grant in computer science, funded by the ANR METRICC project (http://www.metricc.com/). The project lies at the interface of web technologies and Natural Language Processing.

The research work centres on collecting documents from the web with a view to reducing the cost of constructing comparable corpora. This task requires the development of focused (thematic) crawling that is able, for example, to decide whether or not following a hyperlink is profitable. It will be necessary to test and evaluate the drift of the crawled documents with respect to pre-defined comparability criteria. The theme could be defined by means of comparable lexical cartographies (as defined by J. Veronis), aligned thesauri or ontologies, or a set of aligned documents.
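As a rough illustration of what deciding whether a hyperlink is "profitable" and measuring drift could look like, here is a minimal sketch. It assumes the theme is given as a simple bag-of-words profile and that profitability is a cosine-similarity threshold on the text surrounding a link. The function names, the threshold value and the drift definition are all hypothetical choices made for this example, not the project's method.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_link_profitable(link_context, topic_profile, threshold=0.2):
    """Assumed rule: follow a link if the text around it (anchor text,
    surrounding sentence) is similar enough to the topic profile."""
    return cosine(bag_of_words(link_context), topic_profile) >= threshold


def drift(crawled_texts, topic_profile):
    """Assumed drift measure: 1 minus the mean similarity of the crawled
    documents to the topic profile (0 = on topic, 1 = fully off topic)."""
    if not crawled_texts:
        return 0.0
    sims = [cosine(bag_of_words(t), topic_profile) for t in crawled_texts]
    return 1.0 - sum(sims) / len(sims)


# Hypothetical theme, for illustration only.
topic = bag_of_words("breast cancer screening treatment cancer")
print(is_link_profitable("new treatment for breast cancer", topic))
print(is_link_profitable("football results this weekend", topic))
```

In a bilingual setting, one such profile per language (built from an aligned thesaurus or aligned documents, for instance) would be needed; the sketch deliberately leaves that part out.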

Lexicographic analyses of the crawled documents will also be needed in order to tackle potential semantic drift due to context change. The suggested approach is to develop detailed collocational and colligational analyses of some key concepts, so as to detect whether equivalent concepts can be extracted in translation and, if so, whether these translated concepts are stable with respect to their lexical environment. This would allow comparisons to be drawn between the structure of a natural ontology and that of a constructed one.
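To make the idea of a collocational analysis more concrete, the sketch below ranks the words that co-occur with a key term inside a small window, using a simplified window-based pointwise mutual information (PMI) score. The choice of PMI, the window size, the frequency cut-off and the function name are assumptions made for illustration; other association measures could be used instead.

```python
import math
from collections import Counter


def collocates(tokens, node, window=2, min_count=2):
    """Rank words found within +/- `window` tokens of `node` by a
    simplified PMI score: log2(c(node, w) * N / (c(node) * c(w)))."""
    freq = Counter(tokens)
    pair = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pair[tokens[j]] += 1
    n = len(tokens)
    scores = {}
    for word, count in pair.items():
        # Drop rare pairs: PMI is unreliable for very low counts.
        if count >= min_count:
            scores[word] = math.log2(count * n / (freq[node] * freq[word]))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy token list, for illustration only.
tokens = "strong tea and strong coffee but weak tea with strong tea".split()
print(collocates(tokens, "tea", window=1))
```

Running the same analysis on a key concept and on its translation equivalent in the other language, then comparing the two collocate lists, is one possible way to check whether a translated concept keeps a stable lexical environment.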

Contacts: VALORIA: Pierre-François Marteau, Gildas Ménier, Jeanne Villaneau
Contact: LICORN: Geoffrey Williams
Mail: firstname.name@univ-ubs.f

 

Last Updated on Wednesday, 10 February 2010 16:58