Quantifying diversity of language phenomena: Case study of multiword expressions

  • Research teams:
    • ILES (Written and Sign Language Processing) of the LISN
    • BdTln (Data Bases and Natural Language Processing) of the LIFAT
  • Supervisors:
  • Duration: 5-6 months
  • Location: Blois campus of the University of Tours
  • Application deadline: December 8, 2022 (or until filled)
  • Position starts: February-March 2023
  • Position ends: July-August 2023
  • Remuneration: about 610 EUR/month (net)
  • A PhD position, co-supervised by the same team, is planned to start in October 2023 at the LISN lab (Paris-Saclay University) on an extended version of this topic. A successful intern will be invited to apply.
  • This position is funded by the ANR SELEXINI project.

Motivation and context

The diversity of naturally occurring phenomena is a vital heritage to be preserved in the current era of progress- and optimization-driven globalization. Diversity has been quantified in many domains (ecology, economics, information science, etc.), but less so in Natural Language Processing (NLP).

Recently, we have been addressing this aspect with respect to a particular linguistic phenomenon: multiword expressions (MWEs). MWEs, such as casser sa pipe ‘to die’ (literally ‘to break one’s pipe’) or sortir du lot ‘to be better than others’ (literally ‘to quit the batch’), are groups of words which exhibit unexpected properties (Baldwin & Kim, 2010; Constant et al. 2017). Most prominently, their meaning does not straightforwardly derive from the meanings of their components. Language resources dedicated to MWEs include MWE lexicons and MWE-annotated corpora (Savary et al., 2018), while a major computational task is to automatically identify MWEs in running text. The PARSEME network has been addressing the MWE identification task via a series of shared tasks on automatic identification of verbal MWEs (Ramisch et al. 2020).

MWEs, like most other phenomena in human language, follow Zipf’s law (Williams et al. 2015): a few items are frequent and there is a long tail of rare ones. Current models, including those for MWE identification, often favor the former and underperform on the latter. Hence, quality is overestimated and diversity is poorly accounted for. To meet this challenge, our recent work (Lion-Bouton, 2021; Lion-Bouton et al. 2022) is explicitly dedicated to quantifying diversity in MWE language resources and MWE identification systems. We have adapted measures of variety (number of types in a system), balance (evenness of the distribution of items across types) and disparity (differences between types), stemming notably from ecology and information theory (Morales et al. 2021).
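
To make the notions of variety and balance concrete, here is a minimal Python sketch. It is an illustration only: the function names are hypothetical, and Shannon evenness is assumed here as one common balance measure, which is not necessarily the one used in the cited work. Disparity is left out because it requires a distance between types.

```python
import math
from collections import Counter


def variety(items):
    """Variety: the number of distinct types among the observed items."""
    return len(set(items))


def shannon_evenness(items):
    """Balance, sketched as Shannon entropy divided by its maximum, log(number of types).

    Returns a value in [0, 1]; 1 means all types are equally frequent.
    By convention (an assumption of this sketch), 0.0 is returned when
    there are fewer than two types.
    """
    counts = Counter(items)
    n_items = sum(counts.values())
    n_types = len(counts)
    if n_types < 2:
        return 0.0
    entropy = -sum((c / n_items) * math.log(c / n_items) for c in counts.values())
    return entropy / math.log(n_types)


# Toy, made-up list of MWE occurrences (not real corpus data).
toy_mwes = ["casser sa pipe"] * 3 + ["sortir du lot"] * 2 + ["prendre la fuite"]
print(variety(toy_mwes))
print(round(shannon_evenness(toy_mwes), 3))
```

On a Zipf-like distribution, a few frequent types dominate, so the evenness value drops below 1 even when variety is high.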


The objective of this internship is to extend the formalisation of diversity by drawing on Good-Turing frequency estimation. Already used successfully to estimate biomass, Good-Turing frequency estimation is a statistical technique for estimating the probability of encountering an object of an unseen species, given a set of past observations of objects from different species (Good, 1953). Following the same principle, the idea would be to estimate the number of unseen MWEs from the MWEs observed in a corpus. It would then be possible to correct the diversity measures so that they take the unseen MWEs into account, and to evaluate the possible selection bias of the corpus.
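
As an illustration of the principle, the simplest Good-Turing estimate gives the total probability mass of all unseen types as N1 / N, where N1 is the number of types observed exactly once and N is the total number of observations. The Python sketch below applies it to a toy list of MWE occurrences. The function name and the data are hypothetical, and this is not the method to be developed during the internship.

```python
from collections import Counter


def good_turing_missing_mass(observations):
    """Estimate the probability that the next observation is of an unseen type.

    Uses the basic Good-Turing estimate N1 / N, where N1 is the number of
    types seen exactly once and N is the total number of observations.
    """
    type_counts = Counter(observations)
    n_observations = sum(type_counts.values())
    if n_observations == 0:
        raise ValueError("at least one observation is required")
    n_singletons = sum(1 for count in type_counts.values() if count == 1)
    return n_singletons / n_observations


# Toy, made-up list of MWE occurrences (not real corpus data).
toy_mwes = (
    ["casser sa pipe"] * 3
    + ["sortir du lot"] * 2
    + ["prendre la fuite", "tourner la page"]
)
# 7 observations, 2 of which belong to types seen only once.
print(good_turing_missing_mass(toy_mwes))
```

Note that this quantity is a probability mass, not a count: turning it into an estimated number of unseen MWE types requires further modelling assumptions.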


  1. Study the state of the art on diversity measures and on frequency estimators
  2. Adapt the application of the frequency estimator to a textual corpus
  3. Carry out an experimental study on MWE language resources to compare the results with our recent work (Lion-Bouton, 2021; Lion-Bouton et al. 2022)

Candidate’s profile

  • MSc graduate in computer science, computational linguistics or a related field
  • Interests in linguistics and familiarity with language technology
  • Good programming skills
  • Good command of English, both spoken and written
  • Capacity to work independently and as part of a team

How to apply

Send your CV and a transcript of Bachelor and Master grades to arnaud dot soulet at univ-tours dot fr, agata dot savary at universite-paris-saclay dot fr and thomas dot lavergne at universite-paris-saclay dot fr.


References

  • Baldwin, T. and Kim, S. N. (2010) Multiword Expressions, in Nitin Indurkhya and Fred J. Damerau (eds.) Handbook of Natural Language Processing, Second Edition, CRC Press, Boca Raton, USA, pp. 267-292.
  • Matthieu Constant, Gülşen Eryiğit, Johanna Monti, Lonneke van der Plas, Carlos Ramisch, Michael Rosner, and Amalia Todirascu. 2017. Multiword expression processing: A survey. Computational Linguistics, 43(4):837–892.
  • Irving J. Good (1953). “The population frequencies of species and the estimation of population parameters”. Biometrika. 40 (3–4): 237–264. doi:10.1093/biomet/40.3-4.237. JSTOR 2333344. MR 0061330.
  • Adam Lion-Bouton (2021) Multi-criterion optimisation for multiword expression lexicon design promoting linguistic diversity, Technical report, University of Tours.
  • Adam Lion-Bouton, Yagmur Ozturk, Agata Savary and Jean-Yves Antoine (2022) Evaluating Diversity of Multiword Expressions in Annotated Text, In Proceedings of the 29th International Conference on Computational Linguistics (COLING 2022), Gyeongju, Republic of Korea.
  • Morales P. L., Lamarche-Perrin R., Fournier-S’Niehotta R., Poulain R., Tabourier L., Tarissan F. (2021) Measuring Diversity in Heterogeneous Information Networks, in Theoretical Computer Science, Elsevier.
  • Carlos Ramisch, Agata Savary, Bruno Guillaume, Jakub Waszczuk, Marie Candito, Ashwini Vaidya, Verginica Barbu Mititelu, Archna Bhatia, Uxoa Iñurrieta, Voula Giouli, Tunga Güngör, Menghan Jiang, Timm Lichte, Chaya Liebeskind, Johanna Monti, Sara Stymne, Abigail Walsh, Renata Ramisch, Hongzhi Xu (2020) Edition 1.2 of the PARSEME Shared Task on Semi-supervised Identification of Verbal Multiword Expressions, in the Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons (MWE-LEX 2020), 13 December 2020, Barcelona, Spain (online).
  • Agata Savary, Marie Candito, Verginica Barbu Mititelu, Eduard Bejček, Fabienne Cap, Slavomir Čéplö, Silvio Ricardo Cordeiro, Gülşen Eryiğit, Voula Giouli, Maarten van Gompel, Yaakov HaCohen-Kerner, Jolanta Kovalevskaitė, Simon Krek, Chaya Liebeskind, Johanna Monti, Carla Parra Escartín, Lonneke van der Plas, Behrang QasemiZadeh, Carlos Ramisch, Federico Sangati, Ivelina Stoyanova, Veronika Vincze (2018) PARSEME multilingual corpus of verbal multiword expressions, in Stella Markantonatou, Carlos Ramisch, Agata Savary, Veronika Vincze (Eds.) “Multiword expressions at length and in depth: Extended papers from the MWE 2017 workshop”, Language Science Press, Berlin, pp. 87-147.
  • Williams J. R., Lessard P. R., Desu S., Clark E. M., Bagrow J. P., Danforth C. M., Dodds P. S. (2015). Zipf’s law holds for phrases, not words. Scientific Reports, 5.