One curious side-effect of the work to digitize books and historical texts is the ability to search the resulting databases for words, to see when they first appeared and how their frequency of use has changed over time.
The Google Books n-gram corpus is a good example (an n-gram is a sequence of n words). Enter a word or phrase and it’ll show you its relative usage frequency since 1800. For example, the word “Frankenstein” first appeared in the late 1810s and has grown in popularity ever since.
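To make the definition concrete, here is a minimal Python sketch of pulling n-grams out of a piece of text. It assumes that words are simply separated by whitespace, which is a simplification, and the function name `ngrams` is invented for illustration; it is not taken from Google Books or any other tool mentioned here.

```python
def ngrams(text, n):
    """Return every run of n consecutive words in text, in order."""
    # Assumption: words are whatever str.split() yields (whitespace-separated).
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "Harry Potter is a wizard"
print(ngrams(sentence, 1))  # single words (1-grams)
print(ngrams(sentence, 2))  # adjacent pairs (2-grams)
print(ngrams(sentence, 3))  # adjacent triples (3-grams)
```

A five-word sentence yields five 1-grams, four 2-grams and three 3-grams, so "Harry Potter" is counted as a 2-gram in its own right rather than as two unrelated words.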
By contrast, the phrase “Harry Potter” appeared in the late 1990s and quickly gained popularity, but it never overtook Frankenstein or, for that matter, Dracula. That may be something of a surprise given the unprecedented global popularity of J.K. Rowling’s teenage wizard.
And therein lies the problem with a database founded on an old-fashioned, paper-based technology. The Google Books corpus records “Harry Potter” once for each novel, article and text in which it appears, not for the millions of times it is printed and sold. There is no way to account for this level of fame or how it leaves others in the shade.
Today that changes, thanks to the work of Thayer Alshaabi at the Computational Story Lab at the University of Vermont and a number of colleagues. This team has created a searchable database of over 100 billion tweets in more than 150 languages containing over a trillion 1-grams, 2-grams and 3-grams. That’s about 10 per cent of all Twitter messages since September 2008.
The team has also developed a data visualization tool called Storywrangler that reveals the popularity of any words or phrases based on the number of times they have been tweeted and retweeted. The database shows how this popularity waxes and wanes over time.
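The difference between counting a phrase once per text and counting every tweet and retweet can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the four messages and their retweet numbers are invented toy data, the function name `count_terms` is hypothetical, and matching by lowercase substring is a crude stand-in for real n-gram parsing. It is not the Storywrangler team's method.

```python
from collections import Counter

# Invented toy data, not real Twitter figures: (message text, number of retweets).
messages = [
    ("harry potter marathon tonight", 120),
    ("reading frankenstein again", 3),
    ("dracula is underrated", 5),
    ("harry potter forever", 40),
]

terms = ["harry potter", "frankenstein", "dracula"]


def count_terms(messages, terms, include_retweets):
    """Count how often each term appears across the messages.

    With include_retweets=False every message counts once, like a book in a
    library. With include_retweets=True a message counts once for the original
    post plus once for every retweet.
    """
    counts = Counter()
    for text, retweets in messages:
        weight = 1 + retweets if include_retweets else 1
        for term in terms:
            if term in text:  # simplification: plain substring match
                counts[term] += weight
    return counts


once_per_text = count_terms(messages, terms, include_retweets=False)
with_retweets = count_terms(messages, terms, include_retweets=True)

print(once_per_text.most_common())  # ranking when each text counts once
print(with_retweets.most_common())  # ranking when retweets are included
```

In this toy example "harry potter" leads either way, but its margin over the other two terms grows from 2 to 1 when each text counts once to well over 20 to 1 when retweets are included, which is the kind of amplification a once-per-text count cannot show.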
“In building Storywrangler, our primary goal has been to curate and share a rich, language-based ecology of interconnected n-gram time series derived from Twitter,” say Alshaabi and co.
Storywrangler immediately reveals the “story” associated with a wide range of events, individuals and phenomena. For example, it shows the annual popularity of words associated with religious festivals such as Christmas and Easter. It tells how phrases associated with new films burst into the Twittersphere and then fade away, while TV series tend to live on, at least throughout the series’ lifetime. And it reveals the emergence of politico-social movements such as Brexit, Occupy, #MeToo and Black Lives Matter.
The storylines can also be compared with other databases to provide more fine-grained insight and analysis. For example, the popularity of film titles on Twitter can be compared with the film’s takings at the box office; the emergence of words associated with disease can be compared with the number of infections recorded by official sources; and words associated with political unrest can be compared with incidents of civil disobedience.
That’s useful because this kind of analysis provides a new way to study society, potentially with predictive results. Indeed, computer scientists have long suggested that social media can be used to predict the future.
These storylines have social and cultural significance too. “Our collective memory lies in our recordings — in our written texts, artworks, photographs, audio and video — and in our retellings and reinterpretations of that which becomes history,” say Alshaabi and colleagues.
Now anyone can study it with Storywrangler. Try it; it’s interesting.
As for Harry Potter, Frankenstein and Dracula, the tale that Storywrangler tells is different from the one told by the Google Books n-gram corpus. On Twitter, Harry Potter is significantly more popular than his grim-faced predecessors and always has been. In 2011, Harry Potter was the 44th most popular term on Twitter, while Dracula has never risen higher than 2653rd. Frankenstein’s best rank is 3560th.
Of course, fame is a fickle friend, and an interesting question is whether Harry Potter will fare as well as Frankenstein two hundred years after publication. Storywrangler, or its future equivalent, would certainly be able to help answer it.
Ref: Storywrangler: A massive exploratorium for sociolinguistic, cultural, socioeconomic, and political timelines using Twitter. arxiv.org/abs/2007.12988