Useful Apps for Text Mining

Text mining is one of the most powerful tools in the modern historian's toolkit. Now that truly vast collections of digitized text are freely available, perusing them effectively requires more than simple reading. Here, I'll present a few tools that I and others have found useful for such tasks.

The steps in text mining are fairly straightforward. First, one must detect and remove inconsistencies in the data through pre-processing and cleaning. Then, it's necessary to convert all the relevant information extracted from unstructured data into structured formats. Afterward, one runs the data through one of the many text mining applications to extract usable information on the patterns exhibited within.
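To make those three steps concrete, here is a minimal sketch in Python using only the standard library. The stop-word list, the function names, and the two sample sentences are all invented for illustration; a real project would use a fuller stop-word list and a proper tokenizer.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "was", "is"}

def clean(text):
    """Step 1: lowercase, drop punctuation and digits, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def to_records(documents):
    """Step 2: turn raw strings into structured records (id plus token list)."""
    records = []
    for doc_id, raw in enumerate(documents):
        tokens = [t for t in clean(raw).split() if t not in STOP_WORDS]
        records.append({"id": doc_id, "tokens": tokens})
    return records

def term_counts(records):
    """Step 3: a very simple 'mining' pass that counts terms across records."""
    counts = Counter()
    for record in records:
        counts.update(record["tokens"])
    return counts

# Invented sample documents, not drawn from any real archive.
documents = [
    "The Senate debated the tariff in 1832.",
    "Tariff protests spread; the Senate was divided!",
]
records = to_records(documents)
counts = term_counts(records)
print(counts.most_common(3))
```

The counting step here stands in for whatever application one actually uses; the point is only that cleaning and structuring come before analysis.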

The first, and by far the most popular, of these applications is Google's Ngram Viewer. It allows users to plot line graphs of word usage over time. Here's an example for the word 'slave':

As one can see, usage of the word slave peaked in the early 1860s, around the time of the Civil War. This fact is perhaps unsurprising, but it still showcases the effect that historical events can have on word usage.
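For readers curious about what such a graph measures, the rough idea can be sketched in a few lines of Python: for each year, divide the number of times a word appears by the total number of words from that year. This is not Google's code, and the function name and the dated snippets below are invented toy data; it only illustrates the kind of calculation involved.

```python
from collections import defaultdict

def yearly_relative_frequency(dated_texts, word):
    """Return {year: share of that year's tokens that equal `word`}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    target = word.lower()
    for year, text in dated_texts:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += tokens.count(target)
    # Skip years with no tokens to avoid dividing by zero.
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}

# Invented toy corpus of (year, text) pairs, for illustration only.
dated_texts = [
    (1850, "the question of the slave trade"),
    (1861, "slave and free states"),
    (1861, "the slave question"),
    (1900, "a new century of industry"),
]
series = yearly_relative_frequency(dated_texts, "slave")
for year, share in series.items():
    print(year, round(share, 3))
```

Normalizing by each year's total matters because far more text survives from some years than from others, so raw counts alone would mostly track the size of the archive.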

There are also abundant open-source NLP tools for analyzing large text corpora. Among these, the one that seems most interesting is Apache OpenNLP, a machine-learning-based toolkit that allows scholars and laymen alike to carry out "language detection, tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, and parsing." Similarly, the tm package in R allows users to carry out many of the same operations on a flexible, multipurpose platform. Word cloud generators, meanwhile, allow users to create "clouds" that depict word frequencies in a given corpus as the sizes of the words. This lets researchers examine patterns of word frequencies, much as the Ngram Viewer does, but without the temporal axis.
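The sizing logic behind a word cloud is simple enough to sketch. The Python below assumes a linear mapping from frequency to font size; the function name, the point sizes, and the sample text are all hypothetical, and actual word cloud tools may scale words differently.

```python
from collections import Counter

def cloud_sizes(text, min_pt=10, max_pt=48, top_n=5):
    """Map the top_n most frequent words to font sizes between min_pt and max_pt."""
    counts = Counter(text.lower().split()).most_common(top_n)
    if not counts:
        return {}
    highest = counts[0][1]
    lowest = counts[-1][1]
    span = highest - lowest
    sizes = {}
    for word, n in counts:
        # If every word is equally frequent, give them all the largest size.
        scale = (n - lowest) / span if span else 1.0
        sizes[word] = round(min_pt + scale * (max_pt - min_pt))
    return sizes

# Invented sample text, for illustration only.
sample = "cotton cotton cotton trade trade river"
sizes = cloud_sizes(sample)
print(sizes)
```

A drawing library would then place each word at its computed size; that layout step is omitted here.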

This is only a brief introduction; many more applications exist for historians concerned with text mining. Each of them is a powerful tool for analyzing enormous quantities of historical texts. Without them, the tsunami of digitized archival material would be nigh impossible for professionals to digest.

