Using the N highest-probability tokens for each topic, calculate the topic coherence for each topic

topic_coherence(topic_model, dtm_data, top_n_tokens = 10, smoothing_beta = 1)

Arguments

topic_model

a fitted topic model object from the topicmodels package, e.g. an LDA or CTM model

dtm_data

a document-term matrix of token counts coercible to a simple_triplet_matrix

top_n_tokens

an integer indicating the number of top tokens to consider; the default is 10

smoothing_beta

a numeric indicating the value used to smooth the document frequencies in order to avoid taking the log of zero; the default is 1

Value

A vector of topic coherence scores with length equal to the number of topics in the fitted model
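
Details

The coherence score follows Mimno et al. (2011): for a topic's top M tokens ordered by probability, it sums log((D(v_m, v_l) + beta) / D(v_l)) over all pairs l < m, where D(v) is the number of documents containing token v and D(v, v') is the number containing both. Below is a minimal sketch of that computation for a single topic, assuming a dense documents-by-tokens count matrix dtm and a character vector top_tokens naming the topic's top terms (both hypothetical stand-ins; this illustrates the metric, not the package's internal implementation).

# A sketch of the per-topic coherence score, assuming `dtm` is a dense
# count matrix and `top_tokens` names a subset of its columns
coherence_sketch <- function(dtm, top_tokens, smoothing_beta = 1) {
  # Binary incidence matrix restricted to the topic's top tokens
  present <- (as.matrix(dtm)[, top_tokens, drop = FALSE] > 0) * 1
  doc_freq <- colSums(present)   # D(v): documents containing each token
  co_freq <- crossprod(present)  # D(v, v'): documents containing both tokens
  score <- 0
  for (m in 2:length(top_tokens)) {
    for (l in seq_len(m - 1)) {
      # smoothing_beta keeps the numerator positive when a pair never co-occurs
      score <- score + log((co_freq[m, l] + smoothing_beta) / doc_freq[l])
    }
  }
  score
}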

References

Mimno, D., Wallach, H. M., Talley, E., Leenders, M., & McCallum, A. (2011, July). "Optimizing semantic coherence in topic models." In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 262-272). Association for Computational Linguistics.

McCallum, A. K. (2002). "MALLET: A Machine Learning for Language Toolkit." https://mallet.cs.umass.edu

Examples


# Using the example from the LDA function
library(topicmodels)
data("AssociatedPress", package = "topicmodels")
lda <- LDA(AssociatedPress[1:20,], control = list(alpha = 0.1), k = 2)
topic_coherence(lda, AssociatedPress[1:20,])
#> [1] -33.72954 -31.55650
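
# The score can also be computed over a different number of top tokens;
# note that coherence values are only comparable for the same top_n_tokens,
# since the score sums over all pairs of top tokens
topic_coherence(lda, AssociatedPress[1:20,], top_n_tokens = 5)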