Recap: Workshop on Mining Nasjonalbiblioteket
Last week, special guest Lars Johnsen gave a workshop for the DHNetwork on how to use data from the National Library of Norway as a source for computational analysis. The varied group of participants from the humanities, social sciences, and the university library learned to use different tools to analyze textual data in Jupyter Notebooks.
Lars Johnsen starts out by explaining the 2016 mass digitization project of Nasjonalbiblioteket that we will be working with: 500,000 books and 1.5 million newspapers have been digitized, and online access is open to everyone in Norway. Although the digitized material can be read online, it is copyrighted and can therefore not be downloaded. Luckily, Johnsen points out, Digital Humanities scholars do not want to read the texts; they want to extract certain properties of the texts for analysis. And this can be done on the website of Nasjonalbiblioteket itself and in Jupyter Notebooks. For example, this N-gram function shows the distribution of word use throughout history, revealing historical changes such as countries changing names and linguistic changes such as “paa” being replaced by “på”. Sometimes words stay the same but their meaning shifts, which can be researched via the list of occurrences in texts.
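The idea behind such an N-gram view can be sketched in a few lines of plain Python. This is not the library used in the workshop, just a minimal illustration with an invented two-document toy corpus: for each year, we compute the share of tokens matching a word, which is how the “paa” → “på” spelling shift would surface in the plot.

```python
from collections import Counter

# Toy corpus keyed by year (hypothetical data, standing in for the
# digitized texts of Nasjonalbiblioteket).
corpus = {
    1880: "vi reiser paa landet og ser paa fjellet",
    1950: "vi reiser på landet og ser på fjellet",
}

def relative_frequency(word, text):
    """Share of tokens in `text` equal to `word` (case-insensitive)."""
    tokens = text.lower().split()
    return Counter(tokens)[word.lower()] / len(tokens)

# The spelling reform shows up as "paa" fading out and "på" taking over.
for year, text in corpus.items():
    print(year, relative_frequency("paa", text), relative_frequency("på", text))
```

Plotting these relative frequencies per year for a real corpus gives exactly the kind of curve the N-gram viewer shows.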
After this demonstration, the hands-on part of the workshop starts. Everyone is invited to either download this zip file or to open the folder here in Binder. The folder contains a set of tutorials that remain available for anyone who wants to practice applying the different tools to the data. We play with the concordance tool in the command-based interface: first we build a base corpus by specifying the number of works, the language, and other requirements, and then we apply concordance commands such as Nb.word_paradigm('vite'), which finds 'vite' in all its grammatical forms. The result shows each occurrence together with its immediate context.
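Underneath, a concordance is a keyword-in-context (KWIC) listing. The sketch below is not the workshop's Nb library, just a hedged stand-alone illustration of the concept: it scans a tokenized text and returns each match with a window of context on either side.

```python
def concordance(tokens, target, window=3):
    """Return (left context, keyword, right context) for each match."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

# Invented example sentence containing 'vite' twice.
text = "for å vite mer må man lese og vite hvor man skal lete"
rows = concordance(text.split(), "vite")
for left, kw, right in rows:
    print(f"{left:>20} | {kw} | {right}")
```

A paradigm-aware command like word_paradigm would additionally match inflected forms; that could be approximated here by testing membership in a set of forms instead of comparing against a single target.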
Afterwards, we dip into the text analysis tool. Instead of a large corpus, here we choose one specific book to analyze. We create visualizations that cluster words together based on where they appear in the text. The data on word occurrences can also be downloaded as a CSV file for further analysis outside the notebooks.
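One simple basis for such clusterings is co-occurrence counting. The sketch below, with invented sentences standing in for a book, counts how often word pairs share a sentence and writes the result as CSV; the actual tool's method and export format may differ.

```python
import csv
import io
from collections import Counter
from itertools import combinations

# Hypothetical sentences standing in for one book's text.
sentences = [
    "skipet seiler over havet",
    "havet er stort og skipet er lite",
    "fjellet er stort",
]

# Count how often two words appear in the same sentence; words that
# co-occur frequently would end up near each other in a cluster plot.
pairs = Counter()
for s in sentences:
    for a, b in combinations(sorted(set(s.split())), 2):
        pairs[(a, b)] += 1

# Write the counts to CSV, as one might before analyzing them elsewhere.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["word_a", "word_b", "count"])
for (a, b), n in pairs.most_common():
    writer.writerow([a, b, n])
print(buf.getvalue())
```

The resulting table can be loaded into any spreadsheet or statistics package, which is the point of offering a CSV export in the first place.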
Johnsen concludes the workshop by explaining that developing the program always means wearing three hats: he is a programmer, a data analyst, and a user at the same time. Different parts of the program work for different people, and it is important to look to the research community for further development. So take a look using the links above, and if you have specific wishes for new tools, contact Johnsen!