Gyri Smørdal Losnegaard

Multiword Expressions in Norwegian

In this project I will work with the automatic recognition, analysis and classification of Norwegian multiword expressions (MWEs) and their representation within the Lexical Functional Grammar (LFG) framework. The principal motivation behind the project is the need for a description of methodically compiled multiword units in Norwegian, and the overall aim is to increase knowledge about Norwegian MWEs and, as far as possible, to make them available for research, development and teaching purposes.

MWEs are lexical units that exceed word boundaries, e.g. idioms ('gå den veien høna sparker' the goose is cooked), complex verbs ('finne ut av [noe]' work [something] out), fixed expressions ('som regel' normally; 'lys levende' alive and kicking; 'til syvende og sist' ultimately, at the end of the day), conventions and collocations ('takk for sist' when you meet someone again; 'salt og pepper' but seldom 'pepper og salt'; 'holde en tale' give a speech but not 'gjøre en tale').

MWEs are a complex phenomenon, and their definition and classification depend on linguistic theory, disciplinary perspective and purpose. However, one common denominator is the anomaly factor: MWEs are exceptions to the regular properties and structures of a language. We may thus informally define them as multiword units that deviate from other combinations of words lexically, syntactically, semantically, pragmatically and/or statistically. Linguistically deviant MWEs are often referred to as lexicalized MWEs, while statistically significant co-occurrences of words are called institutionalized MWEs. Other related concepts are idiomaticity, metaphor, semantic transparency, literal vs. figurative meaning, semantic and syntactic non-compositionality, co-occurrence and convention.
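To make the statistical criterion concrete, the following minimal sketch scores adjacent word pairs by pointwise mutual information (PMI). PMI is only one of several possible association measures and is an assumption here, not a method named in the project description. The tiny token list is invented purely for illustration, and the function name `pmi` is hypothetical.

```python
import math
from collections import Counter

def pmi(bigram, unigram_counts, bigram_counts):
    """Pointwise mutual information of an adjacent word pair, in bits."""
    w1, w2 = bigram
    pair_count = bigram_counts[bigram]
    if pair_count == 0:
        # The pair never occurs together in the sample.
        return float("-inf")
    n_unigrams = sum(unigram_counts.values())
    n_bigrams = sum(bigram_counts.values())
    p_pair = pair_count / n_bigrams
    p_w1 = unigram_counts[w1] / n_unigrams
    p_w2 = unigram_counts[w2] / n_unigrams
    # How much more often the pair occurs than chance would predict.
    return math.log2(p_pair / (p_w1 * p_w2))

# Invented toy sample (an assumption for illustration, not corpus data).
tokens = ("som regel går det bra og som regel kommer det flere "
          "men det går ikke alltid").split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

for pair in [("som", "regel"), ("det", "går")]:
    print(pair, round(pmi(pair, unigrams, bigrams), 2))
```

In this sample the fixed expression 'som regel' scores higher than the free combination 'det går'. A real study would need a large corpus and frequency thresholds, since PMI is known to overrate rare pairs.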

Recent work suggests that MWEs are just as common as single words in the mental lexicon of the language user. This, along with their unpredictability, makes them a particular challenge in second language acquisition: MWEs cannot be inferred from the regular lexical and grammatical rules of a given language; they have to be learned. Who would guess that 'gå bort' means die, while 'gå seg bort' means get lost (not in the 'dra dit pepper'n gror' sense, but in the sense of being unable to find one's way)? There is also a high degree of cross-lingual asymmetry: an MWE in one language may correspond to a lexically unrelated MWE in another language ('frisk som en fisk' fit as a fiddle), or perhaps to no MWE at all. Furthermore, MWEs may vary in terms of syntactic flexibility, being morphosyntactically restricted in one or more ways ('de tar rotta på han' but perhaps not 'han tas rotta på', 'som regel' but not 'som regelen'). With lexical properties on the one hand and morphosyntactic variation on the other, MWEs challenge the traditional distinction between lexicon and grammar.

For the same reasons, MWEs are a major bottleneck in natural language processing (NLP), which is the field of research concerned with computational treatment of human language. Unless NLP systems, such as grammar and spell checkers, dialogue systems, information retrieval, speech processing, and machine translation systems, are explicitly told that a particular sequence of words is to be analysed as one unit and not word by word, MWEs will often yield erroneous analyses or translations (sometimes with very amusing results) or even bring the automatic analyses to a halt.
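As a rough illustration of what it means to tell a system that a sequence of words is one unit, the sketch below joins known fixed expressions into single tokens before any further analysis. It is a simplified assumption for illustration, not a description of how NorGram or any particular system works. The names `merge_mwes` and `MWE_LEXICON` are hypothetical, and the lexicon holds only a few fixed expressions cited in this text.

```python
def merge_mwes(tokens, lexicon):
    """Greedily join the longest lexicon entry starting at each position."""
    max_len = max((len(entry) for entry in lexicon), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, down to two words.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in lexicon:
                merged.append(" ".join(candidate))
                i += n
                break
        else:
            # No multiword entry starts here: keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged

# Toy lexicon of fixed expressions (hypothetical, for illustration only).
MWE_LEXICON = {
    ("som", "regel"),
    ("til", "syvende", "og", "sist"),
    ("takk", "for", "sist"),
}

sentence = "det går som regel bra til syvende og sist".split()
print(merge_mwes(sentence, MWE_LEXICON))
# prints ['det', 'går', 'som regel', 'bra', 'til syvende og sist']
```

A lookup of this kind only covers contiguous, invariable expressions. MWEs that inflect or allow intervening material, such as 'finne ut av [noe]', cannot be handled by string matching alone, which is one reason a grammar-based representation is needed.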

MWEs are relevant to several disciplines: lexicography, terminology, philology, general linguistics and computational linguistics, to mention a few. However, lexical resources for Norwegian have so far either paid little attention to MWEs or lacked information about them altogether. This project is intended to remedy this situation. Importantly, I will collect and account for Norwegian MWEs on a larger scale than has been possible prior to this project, building a database that will be available for educational and NLP purposes. In particular, this work will be a valuable contribution to the computational grammar NorGram, which is currently used to create a large database of linguistically annotated sentences for Norwegian in the INESS project. The project is also affiliated with the COST network PARSEME, which will coordinate nationally funded research on MWEs at a European level, concentrating on their role in deep parsing and its applications.

Smørdal Losnegaard works at the Department of Linguistic, Literary and Aesthetic Studies.