Friday, May 24, 2013

Notes on an unwritten paper -- Naive Bayesian Classifiers and Order of Composition

[Update: I've got some more thoughts on Gutenberg-based research in my latest post.]

I'm planning on writing some posts on the potential of open data and the concerns it raises (possibly even getting Joseph to join in), so I thought I'd dust off a somewhat relevant idea I had a few years back. If anyone wants to see if they can get something publishable out of this, feel free. In the meantime, I plan on getting some mileage out of it as an example.

A few years ago, I wrote some code for text mining. It was really basic, standard stuff -- using naive Bayesian classifiers and n-grams (normally techniques for assigning authorship) -- but it worked well and was fun to play around with. I used various books from Project Gutenberg as test data and selected authors with styles and backgrounds ranging from close (Dickens and Trollope) to out there (Thorstein Veblen) with a translation of Verne as someone neutral. The two Victorians also had the advantage of having written lots of books over many years.

The idea was to approach this less as a classification problem and more as a question of distance between points in a literary space. Here the "likelihood score" was more a measure of similarity. As you would expect, Great Expectations was more similar to Nicholas Nickleby than to Barchester Towers, more similar to Barchester Towers than to a translated Master of the World, and more similar to Master of the World than to Theory of the Leisure Class. It also worked as expected when you compared works of the same author written at different points in his career: Great Expectations (1860 to 1861) was more similar to Our Mutual Friend (1864 to 1865) than to Nicholas Nickleby (1838 to 1839).
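For anyone who wants to play with the idea, here is a minimal sketch of that kind of similarity score. It is not the code described above; it assumes word bigrams, add-one smoothing, and a per-n-gram average of the log-likelihood so that books of different lengths can be compared. The function names (`ngrams`, `train`, `similarity`) and the toy strings are made up for illustration.

```python
import math
from collections import Counter


def ngrams(text, n=2):
    """Return the lowercase word n-grams of `text` as a list of tuples."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def train(text, n=2):
    """Count the n-grams in a reference text (the 'model' for that book)."""
    return Counter(ngrams(text, n))


def similarity(model, text, vocab_size, n=2):
    """Average log-likelihood per n-gram of `text` under `model`.

    Uses add-one smoothing (an assumption of this sketch). `vocab_size`
    should be the number of distinct n-grams across all texts being
    compared, so that scores for different candidates are comparable.
    Higher (less negative) means more similar.
    """
    grams = ngrams(text, n)
    if not grams:
        raise ValueError("text is too short to contain any n-grams")
    total = sum(model.values())
    log_lik = sum(math.log((model[g] + 1) / (total + vocab_size)) for g in grams)
    return log_lik / len(grams)


# Toy stand-ins for full Project Gutenberg texts.
reference = "the cat sat on the mat the cat sat"
candidates = {
    "near": "the cat sat on the mat",
    "far": "stocks fell sharply on monday morning",
}

model = train(reference)
shared = set(model)
for body in candidates.values():
    shared |= set(ngrams(body))

# Rank candidates from most to least similar to the reference text.
for name in sorted(candidates, key=lambda k: similarity(model, candidates[k], len(shared)), reverse=True):
    print(name, similarity(model, candidates[name], len(shared)))
```

Dividing by the number of n-grams is a design choice of the sketch: without it, longer texts would always look less similar simply because they have more terms in the sum.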

Obviously this was a tiny trial run, but it did suggest that there's something out there, as did a recent literature search, which turned up at least one related paper from 2011 ("Predicting the Date of Authorship of Historical Texts" by A. Tausz) that used NBCs to determine absolute rather than relative dates. Even with Tausz's paper (which is very interesting, by the way), there should still be room for research into intra-author questions and, more importantly, into lots of other questions using data from Project Gutenberg.

And on top of that you can apparently find interesting stuff to read at the site as well.
