Research Group Language Models and Resources (LaMoRe)
LingPhil Research Course

Research Course on Data Search and Preparation

This research course aims to provide PhD researchers with the competence and skills to use language data in their research. The emphasis is on search tools and on basic manipulation of plain text and tabular data. The course is offered with support from the National Research School on Linguistics and Philology, in cooperation with CLARINO.

Language data (Photo: © Koenraad De Smedt)


Why this course?

Raw or annotated language data from corpora, archives, interviews or other sources often play an important role in linguistic research. This course aims to provide early-stage researchers with an introduction to various methods and tools for searching, manipulating, applying and citing data.

Who can participate?

PhD candidates or postdoctoral researchers currently enrolled or employed at a university are eligible for this course. It is primarily targeted at linguistics and language studies, but may also be of interest to other humanities disciplines working with language data.

How many credits does the course offer?

The course is not registered as offering credits, but participants who attend the full course and complete the assignment will receive a certificate stating that they have performed work equivalent to about 25 to 30 study hours. There will be non-obligatory exercises during the course.

What is the content and who are the teachers?

  1. Finding data through catalogs; metadata, usage rights and licenses; citing data. Simple manipulation of plain text and tabular data (Koenraad De Smedt, UiB). Starting from plain text, it is easy to perform simple operations such as counting tokens and types. In tables, columns and rows can be counted, extracted, etc. An introduction to the basics of scripting in R and Python will be given. Jupyter Notebooks will be used.
  2. Searching in corpora with Glossa (Anders Nøklestad and Joel Priestly, UiO). Glossa will be used to search for simple phrases, grammatical expressions and corpus-specific annotation, and to explore the search results through concordances (search words in context), frequency lists, audio and video clips, and distributions in geographical maps. Searches can be restricted to certain groups of texts or speakers via metadata selections, and pseudo-random subsets of results can be extracted to more easily handle very frequent phenomena.
  3. Searching in Corpuscle and INESS (Paul Meurer, UiB). Corpuscle will be used to search for words or phrases and corpus-specific annotation, and to explore corpora through concordances, collocations, word lists with frequencies and distributions. INESS will be used to search for syntactic patterns in dependency and LFG treebanks.
  4. Accessing the National Library (Lars Johnsen, NB). While many texts at the library are available for online reading, the APIs give full access to concordances and collocations, frequency counts and various text statistics for individual texts, covering books, newspapers and periodicals. We will use Jupyter Notebooks for accessing the APIs.
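To give a flavour of the kind of plain-text manipulation mentioned in item 1, here is a minimal sketch in Python of counting tokens (running words) and types (distinct words). The function name and the naive whitespace tokenization are illustrative choices, not part of the course materials; the course will introduce more robust methods.

```python
from collections import Counter

def token_type_counts(text):
    """Count tokens (running words) and types (distinct words) in a text.

    Uses naive lowercasing and whitespace splitting for illustration;
    real corpus work would use a proper tokenizer.
    """
    tokens = text.lower().split()
    counts = Counter(tokens)          # maps each type to its frequency
    return len(tokens), len(counts)   # (number of tokens, number of types)

n_tokens, n_types = token_type_counts("the cat sat on the mat")
print(n_tokens, n_types)  # 6 tokens, 5 types ("the" occurs twice)
```

The same `Counter` object can also be sorted to produce a frequency list, one of the standard views of corpus data discussed in the course.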

What are the teaching methods and necessary preparations?

Most of the teaching and exercises require only a web browser. The Jupyter Notebooks can be run in cloud environments (Binder or Google Colaboratory), or locally if you have installed, for instance, Visual Studio Code and Python 3. Installing R and RStudio is also recommended. The course may be given in English or Norwegian, depending on the language competencies of the participants. Most teaching materials will be in English, but Norwegian examples will also be given. We will try to stream the course for people who cannot attend in person. Additional information will be provided to registered participants.