What pleas they may fuck out of such books: Google Ngrams vs Long-S

Ever seen an old printed book with a letter S that looks like an F? That character is ſ; it’s called the ‘long s’, and it has largely fallen out of use in modern typography. John Bell is widely credited with the demise of the long s, which is why we rarely see it any more, but it is often seen in European books printed between the 1400s and 1790s.

The Google Ngram Viewer relies heavily on optical character recognition (OCR) software to make its books searchable; OCR software strives to match each printed character in a text to a recognized typographic character. Even human readers can have difficulty reading text which heavily uses the ſ, as seen in this 1739 printed example of Ben Jonson’s The Alchemist:

[Image: Ben Jonson’s The Alchemist: A Comedy, first performed in 1610 and published 1739, from the Internet Archive]
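For the curious, here is a minimal Python sketch of what a *correct* transcription looks like to a computer. It assumes the text has been transcribed with the real long-s character (Unicode U+017F) rather than misread as f; the helper name `modernize` is made up for this illustration. Once the ſ is captured as ſ, folding it into an ordinary s is trivial. The hard part is getting OCR to emit ſ instead of f in the first place.

```python
import unicodedata

LONG_S = "\u017f"  # ſ, a distinct character from both "s" and "f"


def modernize(text: str) -> str:
    """Fold long s (and other compatibility characters) into modern forms.

    NFKC is Unicode compatibility normalization; it maps ſ to s.
    Note that it also rewrites other compatibility characters,
    such as the ﬁ ligature, so it is a blunt instrument.
    """
    return unicodedata.normalize("NFKC", text)


print(unicodedata.name(LONG_S))   # LATIN SMALL LETTER LONG S
print(modernize("\u017fuck"))     # suck
print(modernize("Fir\u017ft"))    # First
print(LONG_S == "f")              # False
```

None of this helps if the OCR output already contains an f where the page had an ſ: at that point the information is gone and has to be guessed back.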

The Google Books Ngram project is a thoroughly imperfect resource for studying linguistic change in English-language print, mostly because out-of-fashion typographic conventions such as long-S completely throw off searches. To the untrained eye, or to a computer doing its very best to apply modern rules to anachronistic text, the word ‘suck’ using the long-S looks an awful lot like ‘fuck’. Google seems to know about it, too, as they make their default search dates 1800-2000, but you can easily change that to 1500-2000 and observe the difference in usage between ‘suck’ and ‘fuck’. The primary difference is that between 1650 and 1790, ‘fuck’ appears to be printed far more often than ‘suck’, with a noticeable switch around 1665:


[Screenshot: Google Ngram results for ‘suck’ vs ‘fuck’]

Most of these examples involve breasts or blood, as these frequently collocate with, and are semantically linked to, the act of sucking, but then there are others…

[Images: suck:fuck examples 1, 3, 5, 6, 7, 8 and 9]
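
To see why an f-for-ſ misreading is hard to undo after the fact, here is a rough Python sketch of a dictionary check. Everything in it is an assumption for illustration: the tiny `LEXICON` is a toy stand-in for a real word list, and `readings` is a hypothetical helper, not anything Google is known to run. It tries every way of turning an f back into an s and keeps the spellings that are real words.

```python
from itertools import product

# Toy lexicon, purely illustrative; a real check would need a
# period-appropriate word list.
LEXICON = {"suck", "fuck", "such", "first"}


def readings(token: str, lexicon: set = LEXICON) -> list:
    """Return every lexicon word reachable by reading some f's as s."""
    # Each f may be a genuine f or a misread long s; other letters are fixed.
    options = [("f", "s") if ch == "f" else (ch,) for ch in token]
    variants = {"".join(choice) for choice in product(*options)}
    return sorted(variants & lexicon)


print(readings("fuch"))   # ['such']
print(readings("firft"))  # ['first']
print(readings("fuck"))   # ['fuck', 'suck']
```

‘fuch’ and ‘firft’ each have exactly one plausible reading, so a word list alone can repair them. ‘fuck’ has two, so a word list is not enough; choosing between them needs context, which is where the trouble starts.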

Of course, this makes print prior to the 1800s seem even more concerned with sex and sexuality than the period already was! Sadly for us, before the Victorians and their complex relationship to sexuality came to the forefront, long-s fell out of popularity, and our modern use of ‘fuck’ re-enters print around the 1960s:

[Screenshot: Google Ngram results for ‘fuck’, 1960s]

Fuck me, how times have changed!

5 thoughts on “What pleas they may fuck out of such books: Google Ngrams vs Long-S”

  1. John Cowan April 3, 2015 / 3:36 pm

    Man, this is a load of fucking crap. Firſt of all, OCR can diſtinguiſh eaſily between f and ſ if it’s programmed to; the difference is hardly more ſubtle than between vv and w (fortunately, there are few words with “vv” in them). For that matter, the diſtinctions between i/j and u/v, which were coming in when ſ was going out, are alſo pretty ſmall. Google ſimply didn’t bother fixing its OCR.

    And Google defaults to diſplaying Ngrams after 1800 becauſe its databaſe of books before 1800 is quite limited. It has zero to do with ſ.

    Moral: Verify your fucking facts, or your articles will ſuck.

    • heatherfroehlich April 3, 2015 / 8:21 pm

      If you have an OCR package which can easily distinguish between f and ſ, I’d love to know about it, as I’m not aware of any and neither are any of the people I consulted. I know of several ongoing efforts to improve OCR quality, especially on older printed texts, which often introduce noise through the aging process (and is highly dependent on the quality of paper printed on: the quality of print on vellum is very different than that on parchment). The less old a book is, the higher the likelihood of getting a good OCR match.

      But if you mean that it would be easy enough to program OCR to address long-S and f in old texts as well as it addresses other letters (e/c or a/o, among others) – i.e. not very well, despite the possibility of building a lexicon and correcting everything from there – then I have no qualms on that front. Similar problems are found where the Latin eunt becomes cunt, as a relevant sweary example, and you could feasibly correct all cunt strings into eunt. OCR has made significant advances since the first iteration of the Google Ngrams searcher arrived on the scene, but the only people in a position to run this kind of solution would be Google, and as far as I can tell they haven’t done this.

      As for my guess that Google doesn’t offer pre-1800 texts due to messy data, it seems that it would behoove them to not show off where their system is less elegant as a default. But thanks for the correction.

      • John Cowan April 4, 2015 / 4:15 am

        It’s hard for me to be sweary without coming off as rude, for which I’m sorry. I’m not normally this peppery.

  2. bmschmidt April 3, 2015 / 10:24 pm

    I doubt it’s truly as “easy” to fix this problem as the first commenter suggests. One of the important differences between 2009 and 2012 Google Ngrams is that Google does seem to have tried to improve their OCR on the long-s problem, as you can see by comparing fuck/suck in the two corpora. But they still ended up missing about half of them. For fuch/such, on the other hand, it’s almost always correct before 1800 now and never was in the 2009 version. Since “fuck” is a word and “fuch” isn’t, that would imply to me that it’s only with the addition of a stochastic language model or dictionary check or something that they were able to address the issue.

  3. John Cowan April 4, 2015 / 4:16 am

    As far as I know, all OCR is dependent on predictive models that have a good idea what to expect.
