DIGSSCORE Tuesday seminar

Mikael Johannesson: The amount of text for topic models

Mikael Johannesson, PhD candidate at the Department of Comparative Politics and DIGSSCORE, will present today on topic models, which allow researchers to make meaningful measurements of text.

Topic models such as Latent Dirichlet Allocation (LDA) and the Structural Topic Model (STM) allow researchers to make meaningful measurements of text. They were developed with many, fairly long texts in mind, but in practice they are frequently applied to few and short ones. A typical example is open-ended survey responses. The implications of using such short or few texts have received hardly any attention. Here, we fill this gap by showing what happens to topic models when either the number of documents or their length decreases. For several published applications where the amount and length of text were ample, we re-fit the original topic model thousands of times on smaller subsamples of the original corpus. We do this using a full-factorial experimental replication design: both the number of sampled documents and the proportion of sampled words within documents, as well as the number of topics and other model parameters, are randomized. We then interactively align the resulting topics with the topics in the original model in order to assess stability. In addition, we measure the quality of the topics using measures of semantic coherence and exclusivity, as well as the sensitivity of the inferences made with them in the original paper. The results show that topic models can perform fairly well even with little text, but that reducing the text causes systematic shifts in topic definitions and the loss of certain topics. Given the accelerating use of these models, this paper makes an important contribution by highlighting pitfalls and offering advice for working with smaller corpora.


Lunch is served on a first-come, first-served basis.