Much of biodiversity is discovered in museum collections, sometimes years after the specimen has been collected. Ploughing through expedition notes and logs is then required and therefore having a way to summarize the contents of a large text corpus can be very interesting. In this example, I will graphically summarize “On the Origin of Species” by Charles Darwin (it seemed a suitable choice) to demonstrate this technique.
These are the libraries I will be using:
library(dplyr) library(tidytext) library(tidyverse) library(topicmodels) library(RColorBrewer)
I obtained the text of the 1859 edition from Wikisource. The book has 14 chapters which I saved each in its own text file, labeled “chapter01.txt” to “chapter14.txt”. First, I loaded every chapter of the book and stored it in a data frame, with one row for each chapter and 3 columns: book, chapter and text.
#read a text from path x read <- function(x) {paste(readLines(x), collapse=" ")} #read origin of species origin.num.ch <- 14 origin <- character(origin.num.ch) chapter <- c(1:origin.num.ch) for (c in chapter) { path = paste0( "../darwin/origin/chapter", sprintf("%02d", c),".txt") origin[c] <- read(path) } origin.book <- data.frame( c("origin of species"), c(1:origin.num.ch), unlist(origin), stringsAsFactors = FALSE) colnames(origin.book ) <- c("book", "chapter", "text")
Next, cast the corpus to a document term matrix (dtm), that will be used as input for the topic model:
dtm <- origin.book %>% unnest_tokens(input="text", output="word") %>% anti_join(stop_words) %>% count(chapter, word) %>% cast_dtm(document=chapter, term=word, value=n)
Finding the best number of topics
In order to summarize the book, I have to know how many topics it deals with. This is an interesting problem, since for this book as for most texts, there will not be an obvious answer. I used the measure of “perplexity” provided by the “topicmodels” package.
I will try to fit a model, here Latent Dirichlet allocation (LDA) provided by the package “topicmodels”, to a number of topics from 2 to 50.
num.tries = 50
# k=1 doesn't make sense,
# so there are num.tries-1 entries
# in the vector holding perplexities
mod.per = numeric(num.tries-1)
for (i in 2:num.tries) {
mod <- LDA(
x=dtm,
k=i, # k is the number of topics.
method="Gibbs",
control=list(alpha=1, seed=10005))
mod.per[i-1] = perplexity(mod, dtm)
}
# store the result in a data frame for further use
mod.per.df <- data.frame(c(2:num.tries), mod.per)
colnames(mod.per.df) <- c("k", "perplexity")
After having fitted a LDA model for a number of topics between 2 and 50, and calculated its “perplexity” (which takes 20 minutes on my computer), I can now search for the smallest number of topics after which adding new topics does not decrease perplexity.
#set number of topics to #the first k with lower perplexity than all following k's min.p <- mod.per.df$perplexity[2] num.topics <- num.tries for(i in mod.per.df %>% pull(k)) { k.current <- mod.per.df %>% filter(k==i) %>% pull(perplexity) k.following <- max( mod.per.df %>% filter(k>i) %>% pull(perplexity) ) if (k.current < k.following) { num.topics <- i min.p <- k.current break } } num.topics
In this case, 12 topics seems appropriate. This can be seen on the following plot:
Fit a model, here Latent Dirichlet allocation (LDA) provided by the package “topicmodels”, using the best number of topics as the “k” parameter (here 12).
mod <- LDA( x=dtm, k=num.topics, method="Gibbs", control=list(alpha=1, seed=10005) )
The LDA model return two matrices. The “beta” matrix is the document-term matrix, which describes the words that characterize a topic. Each word is given a value (phi). For each topic, find the words in the document-term matrix where the phi is in the 99.9% quantile, to get a grasp of what each topic is about.
#grab the topic-term matrix beta <- tidy(mod, matrix="beta") #add a column with the upper 99.9% quantile of beta for each topic q <- beta %>% select("topic", "beta") %>% group_by(topic) %>% summarize(quants = quantile(beta, probs = c(0.999))) %>% mutate(quants = quants[[1]]) #list the terms with beta in the upper 99.9% quantile for each topic topics <- beta %>% select(c("topic", "term", "beta")) %>% group_by(topic) %>% arrange(topic, desc(beta)) %>% filter(beta > q[topic,]$quants)
I can now print the most characteristic words for each of the 12 topics:
topic.terms = c() for (t in 1:num.topics) { word_frequencies <- tidy(mod, matrix="beta") %>% mutate(n = trunc(beta * 10000)) %>% filter(topic == t) topic.string <- paste(topics %>% filter(topic == t) %>% arrange(term) %>% pull(term), collapse=", ") print(paste("Topic", t, ":", topic.string)) topic.terms[t] <- topic.string }
This gives the following topic characterizations:
Topic 1 : animals, distinct, forms, life, nature, species, varieties
Topic 2 : flower, natural, pollen, selection
Topic 3 : breed, breeds, domestic, pigeon, pigeons, selection, wild
Topic 4 : bees, cells, instinct, instincts, nest, wax
Topic 5 : crossed, crosses, fertile, fertility, hybrids, pollen, species, sterility, varieties
Topic 6 : habits, natural, organ, organs, selection, structure
Topic 7 : islands, plants
Topic 8 : increase, plants, struggle
Topic 9 : characters, selection, variable
Topic 10 : characters, classification, descent, organs, rudimentary
Topic 11 : doubtful, genera, larger, naturalists, ranked, species, varieties
Topic 12 : formation, formations, forms, geological, intermediate, species, time”
Now for each chapter, plotting the document-topic probabilities gives the diagram at the top of this post. To do this, I have to use the other matrix returned by the LDA model: the document-topic matrix “gamma”, and plot the chapters on the horizontal axis, the probability that a topic describes a chapter (which is also called “gamma” as is the matrix) is plotted as a color.
# set the size of the diagram options(repr.plot.width=15, repr.plot.height=8) # make a palette for the number of topics mycolors <- colorRampPalette(brewer.pal(8, "Dark2"))(num.topics) #grab the document-topic matrix gamma <- tidy(mod, matrix="gamma") ggplot(gamma, # has to be numeric to sort the chapters in the right order aes(as.factor(as.numeric(document)), gamma, # has to be factor to use a discrete color scale fill=as.factor(topic)) ) + geom_col() + scale_fill_manual(values = mycolors, labels = topic.terms) + scale_color_manual(values = mycolors) + theme_classic() + labs(x="Chapter", y="Document-topic probability", fill="Topic")