Contribution of semantic lexicons to diversity coverage in NLP tasks

Post-doctoral position, 2024-2025

  • Domain: natural language processing (NLP)
  • Location: Université Paris-Saclay (LISN lab), Gif-sur-Yvette, France; with visits to the ATILF laboratory in Nancy
  • Research teams: STL (Linguistics and Language Technology) at LISN; MRI (Modelling, resources and automatic processing) at ATILF
  • Supervisors:
  • Funding: ANR SELEXINI project
  • Duration: 18-20 months
  • Salary: depends on the professional experience of the candidate; for less than 2 years of experience: between 2992 EUR gross (about 2405 EUR net) and 3404 EUR gross (about 2736 EUR net)

Motivation and context

Diversity of naturally occurring phenomena is a vital heritage to be preserved in the current progress- and optimization-driven globalisation era. Diversity has been quantified in many domains (ecology, economy, information science, etc.) but less so in linguistics and natural language processing (NLP). Most linguistic phenomena follow Zipf’s law, i.e. few items are frequent and there is a long tail of rare ones. These few frequent items tend to be less diverse than the numerous items in the “Zipfian tail”. Current models often favour the former and underperform on the latter, as they rely heavily on annotated data and are tuned for optimal global performance on biased benchmarks. Hence, quality is overestimated while generalisation and robustness are rarely assessed (Wisniewski & Yvon 2019). Although awareness of diversity is rising (Narayan & Cohen 2015; Yang et al. 2018; Palumbo et al. 2020), diversity is still largely neglected when building and evaluating NLP systems.

This position is part of the ANR SELEXINI project, whose objective is to develop weakly supervised methods to induce semantic lexicons which will then be seamlessly integrated with neural text processing models. The lexicon will contain both continuous and symbolic representations, will cover both single-word (e.g. voler) and multiword (e.g. voler au secours) entries, and will provide information about the semantic arguments of these entries, their lexical senses and their semantic frames. Since such lexicons will be induced from very large non-annotated corpora, they are expected to contain more diverse information than annotated corpora, which are necessarily much smaller. Thus, lexicons, when used with supervised methods, might help such methods better account for diversity.

A sample NLP task which might benefit from this effect is automatic identification of multiword expressions (MWEs). MWEs, such as casser sa pipe ‘to die’ (literally to break one’s pipe) or sortir du lot ‘to be better than others’ (literally to quit the batch), are groups of words which exhibit unpredictable properties (Baldwin & Kim 2010). Most prominently, their meaning does not straightforwardly derive from the meanings of their components. Language resources dedicated to MWEs include MWE lexicons and MWE-annotated corpora (Savary et al. 2023), while a major computational task is to automatically identify MWEs in running text. The PARSEME network has been addressing the MWE identification task via a series of shared tasks on automatic identification of verbal MWEs (Ramisch et al. 2020). MWEs, like most other phenomena in human language, follow Zipf’s law (Williams et al. 2015): few items are frequent and there is a long tail of rare ones. Current models, including those for MWE identification, often favour the former and underperform on the latter. Hence, quality is overestimated and diversity is weakly accounted for.

To meet this challenge, our recent work (Lion-Bouton et al. 2022; Savary et al. 2023) is explicitly dedicated to quantifying diversity in MWE language resources and MWE identification systems. We have adapted measures of variety (the number of types in a system), balance (how evenly items are distributed across types) and disparity (how different the types are from one another), stemming notably from ecology and information theory (Morales et al. 2021).
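The three notions of variety, balance and disparity can be illustrated with a minimal Python sketch. The formulas chosen here (type count, normalised Shannon entropy, mean pairwise Jaccard distance over component words) and all function names are assumptions made for illustration only; they are common textbook choices and are not necessarily the measures adopted in the cited work.

```python
import math
from collections import Counter
from itertools import combinations


def variety(counts):
    """Variety: number of distinct types that occur at least once."""
    return sum(1 for c in counts.values() if c > 0)


def balance(counts):
    """Balance as normalised Shannon entropy (assumed measure), in [0, 1]."""
    total = sum(counts.values())
    n = variety(counts)
    if n <= 1:
        return 0.0  # a single type leaves nothing to balance
    entropy = -sum(
        (c / total) * math.log(c / total) for c in counts.values() if c > 0
    )
    return entropy / math.log(n)


def jaccard_distance(a, b):
    """Toy distance between two MWE types: 1 minus the overlap of their words."""
    words_a, words_b = set(a.split()), set(b.split())
    return 1.0 - len(words_a & words_b) / len(words_a | words_b)


def disparity(types, distance=jaccard_distance):
    """Disparity as the mean pairwise distance between types (assumed measure)."""
    pairs = list(combinations(types, 2))
    if not pairs:
        return 0.0
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy sample: one frequent MWE type and two rare ones.
occurrences = (
    ["casser sa pipe"] * 8 + ["sortir du lot"] + ["voler au secours"]
)
counts = Counter(occurrences)

print("variety:  ", variety(counts))
print("balance:  ", round(balance(counts), 3))
print("disparity:", round(disparity(list(counts)), 3))
```

In this toy sample the skewed, Zipf-like distribution yields a balance well below 1, whereas a sample with the same three types occurring equally often would reach the maximum balance of 1.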

Diversity quantification and its integration into the evaluation of NLP tools is also one of the major challenges addressed by the COST Action UniDive, an international scientific network devoted to universality, diversity and idiosyncrasy in language technology. The successful post-doctoral candidate will be integrated into this project (in addition to ANR SELEXINI) and will notably benefit from the UniDive networking facilities (meetings, workshops, scientific missions, training schools, etc.).

Objectives

The objective of this post-doctoral position is to perform an extrinsic evaluation, in terms of diversity, of the semantic lexicon of French induced in the SELEXINI project. More precisely, such a lexicon will be integrated into a state-of-the-art MWE identification system and the resulting increase in the diversity of the results will be measured. Extensions of this work include applying and adapting this experiment to a variety of languages and of linguistic phenomena:

  • predicates, both simple and multiword, and their contexts of occurrence
  • semantic slots
  • semantic frames

Candidate’s profile

  • PhD in computer science, computational linguistics or a related field
  • Skills in supervised and semi-supervised machine learning, including deep learning
  • Good command of English, both spoken and written
  • Capacity to work independently and as part of a team

Important dates

  • Application deadline: Tuesday 30 Jan 2024
  • Interviews and notification: 5 February 2024
  • Position starts: 1 April 2024
  • Position ends: September-November 2025

How to apply

Submit your CV, a cover letter, reviews of your PhD thesis, and a letter of reference (e.g. from your PhD supervisor) via the CNRS portal: [https://emploi.cnrs.fr/Offres/CDD/UMR9015-AGASAV-001/Default.aspx?lang=EN]

References

  • Baldwin, T. and Kim, S. N. (2010) Multiword Expressions, in Nitin Indurkhya and Fred J. Damerau (eds.) Handbook of Natural Language Processing, Second Edition, CRC Press, Boca Raton, USA, pp. 267-292.
  • Matthieu Constant, Gülşen Eryiğit, Johanna Monti, Lonneke van der Plas, Carlos Ramisch, Michael Rosner, and Amalia Todirascu. 2017. Multiword expression processing: A survey. Computational Linguistics, 43(4):837–892.
  • Adam Lion-Bouton, Yagmur Ozturk, Agata Savary and Jean-Yves Antoine (2022) Evaluating Diversity of Multiword Expressions in Annotated Text, In Proceedings of the 29th International Conference on Computational Linguistics (COLING 2022), Gyeongju, Republic of Korea.
  • Morales P. L., Lamarche-Perrin R., Fournier-S’niehotta R., Poulain R., Tabourier L., Tarissan F. (2021) Measuring Diversity in Heterogeneous Information Networks, in Theoretical Computer Science, Elsevier.
  • Narayan S., Cohen S. (2015) Diversity in Spectral Learning for Natural Language Parsing. In EMNLP 2015.
  • Palumbo E., Mezzalira A., Marco C., Manzotti A., Amberti D. (2020). Semantic Diversity for Natural Language Understanding Evaluation in Dialog Systems. In COLING 2020, pp. 44-49.
  • Carlos Ramisch, Agata Savary, Bruno Guillaume, Jakub Waszczuk, Marie Candito, Ashwini Vaidya, Verginica Barbu Mititelu, Archna Bhatia, Uxoa Iñurrieta, Voula Giouli, Tunga Güngör, Menghan Jiang, Timm Lichte, Chaya Liebeskind, Johanna Monti, Sara Stymne, Abigail Walsh, Renata Ramisch, Hongzhi Xu (2020) Edition 1.2 of the PARSEME Shared Task on Semi-supervised Identification of Verbal Multiword Expressions, in the Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons (MWE-LEX 2020), 13 December 2020, Barcelona, Spain (online).
  • Agata Savary, Cherifa Ben Khelil, Carlos Ramisch, Voula Giouli, Verginica Barbu Mititelu, Najet Hadj Mohamed, Cvetana Krstev, Chaya Liebeskind, Hongzhi Xu, Menghan Jiang, Sara Stymne, Tunga Güngör, Thomas Pickard, Bruno Guillaume, Archna Bhatia, Alexandra Butler, Marie Candito, Apolonija Gantar, Uxoa Iñurrieta, Albert Gatt, Jolanta Kovalevskaite, Simon Krek, Timm Lichte, Nikola Ljubešic, Johanna Monti, Carla Parra Escartín, Mehrnoush Shamsfard, Ivelina Stoyanova, Veronika Vincze, Abigail Walsh (2023) “PARSEME Corpus Release 1.3”, in the Proceedings of the 19th Workshop on Multiword Expressions (MWE 2023), 6 May 2023, Dubrovnik, Croatia.
  • Savary, A., Öztürk, Y., Lion-Bouton, A. (2023) Quantifying intra-linguistic diversity: Case study of multiword expressions, in UniDive 1st General Meeting, Paris-Saclay University, Orsay, France, 16-17 March 2023.
  • Williams J. R., Lessard P. R., Desu S., Clark E. M., Bagrow J. P., Danforth C. M., Dodds P. S. (2015). Zipf’s law holds for phrases, not words. Scientific Reports, 5.
  • Wisniewski G., Yvon F. (2019). How Bad are PoS Taggers in Cross-Corpora Settings? Evaluating Annotation Divergence in the UD Project. In NAACL 2019, pp. 218–227.
  • Yang Z., Qi P., Zhang S., Bengio Y., Cohen W., Salakhutdinov R., Manning C. (2018). HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In EMNLP 2018.