In-context definition generation models for a reading assistant software prototype

  • Contract duration: 36 months
  • Starting date: September or October 2023
  • Location: LIS laboratory, TALEP team, Marseille, France
  • Advisors: Carlos Ramisch and Alexis Nasr
  • Net salary: ~1650€/month (optional: bonus for 64h/year teaching)
  • Application: Application files should be sent before June 20, 2023 to the advisors at first.last (AT) univ-amu.fr. They should comprise:
    • a CV including information about transcripts/grades, degrees, internships
    • a cover letter describing the candidate’s motivation and relevance for the project
    • the names and contacts of two referees

Pre-selected candidates will be interviewed (remotely) by the end of June 2023.

Motivation
Reading a text in a foreign language can be difficult for non-native speakers and language learners. Similarly, native speakers may struggle to read a text on a specialised (scientific, technical) topic. This happens because text comprehension involves accessing and combining the meanings of words and phrases. When these are unfamiliar to the reader, comprehension is reduced or prevented, which in turn hinders learning the language or accessing specialised knowledge.

Reading assistant software could guide readers by providing information about hard words and phrases, notably dictionary-style definitions for user-selected text fragments. These definitions, written in simple words (possibly complemented by grammar explanations, synonyms, examples, images, etc.), could support the reading process, lifting the language barrier that prevents speakers and learners from accessing the information, while improving their language skills in the process.

Goals
The goal of this PhD project is to develop original definition retrieval and generation models and to evaluate them in the context of a reading assistant prototype. The models must be able to retrieve and predict simple definitions for potentially ambiguous words and phrases in a text. Phrases may consist of several words, combined compositionally or idiomatically. The recruited person may develop cross-lingual or multimodal (text+image) models, depending on their interests.

Research questions
The development of these models poses research questions that combine challenges from two traditional NLP tasks: word sense disambiguation (WSD) and natural language generation (NLG).

  1. If the target words or phrases are present in a dictionary, how can we (a) select the definitions that are compatible with the context and (b) combine several definitions, e.g. redundant or of different granularities, into a single useful (simple, specific enough) definition?
  2. How can we obtain definitions for phrases composed of multiple words (e.g. w1 w2) from the definitions of their components (e.g. the definitions of w1 and those of w2)?
  3. How can we evaluate the models beyond traditional WSD and NLG protocols? In particular, how can we assess (a) the generalisation of the predicted definitions in low-resource scenarios, and (b) the usefulness of the generated definitions to end users (human readers)?
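
As a point of reference for question 1(a), the sketch below ranks dictionary definitions by their word overlap with the context, in the spirit of a simplified Lesk baseline. It is only an illustration and not the approach the project will take: the toy dictionary, the stopword list and the function names are all assumptions invented for this example.

```python
import re

# Hypothetical mini-dictionary, invented for illustration only.
TOY_DICTIONARY = {
    "bank": [
        "a financial institution that accepts deposits and lends money",
        "the sloping land alongside a river or lake",
    ],
}

# Small ad hoc stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "of", "or", "and", "that", "to", "in", "on", "at", "we", "is"}


def content_words(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}


def rank_definitions(target, context, dictionary):
    """Rank the definitions of `target` by word overlap with `context`."""
    # The target itself is removed so that it does not count as evidence.
    context_words = content_words(context) - {target}
    scored = [
        (len(content_words(definition) & context_words), definition)
        for definition in dictionary.get(target, [])
    ]
    # Highest overlap first; ties keep the dictionary order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


ranking = rank_definitions("bank", "We walked along the river bank at sunset", TOY_DICTIONARY)
best_score, best_definition = ranking[0]
print(best_score, best_definition)
```

A baseline of this kind only works when the definition and the context happen to share words, which is one reason why the questions above call for models based on contextualised representations.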

Methodology
The definition generation methods will rely on open and reasonably sized pre-trained large language models such as BART (Lewis et al. 2020) and BLOOM (Le Scao et al. 2022). They will be used as such and/or fine-tuned on definitions to obtain the desired definition characteristics (length, simplicity, context relevance). The models will be evaluated intrinsically using existing datasets such as Hei++ and extrinsically with readers. In both scenarios, new evaluation protocols will be designed to assess the generalisation of the models and the usefulness of the definitions.

Contextualisation
Definition modelling, that is, the task of generating a definition for an in-context target text fragment, can be seen as a generative variant of WSD, and is a relatively recent task (Noraset et al. 2017). Pre-trained transformer-based language models fine-tuned on definitions were shown to be able to generate relevant disambiguated definitions for words and phrases (Bevilacqua et al. 2020). Specific models have been recently proposed, addressing the quality of the generated definitions (Huang et al. 2021) and the generation of definitions in specialised scientific domains (August et al. 2022). Most of this work was evaluated on English only.

The PhD will be funded by the French National Research Agency (ANR) via the SELEXINI project, which aims at inducing and evaluating semantic lexicons (word senses and frames) from raw text using pre-trained contextualised embeddings, explicit linguistic structure (parse trees, lemmas, etc.), and weak supervision from Wiktionary. This PhD will develop definition prediction models relying on Wiktionary and on the semantic lexicon induction techniques currently being developed by another PhD student in the project. The recruited person will interact with other project members, participate in project meetings, co-author project-related papers, etc.