Generative Dictionaries:
WordHack Open Projector Talk

The following is a textual reproduction of an open mic lightning talk I gave about different generative dictionary projects and techniques, along with the slides I used for it.

The original talk was given on November 19, 2020 at WordHack, via Babycastles.

A video recording is available on YouTube.

Hello everyone! My name is Robin Hill.

Today I’m going to talk about generative dictionaries – programs that generate dictionary entries, with words and meanings.

A few years ago I made one called Lyre’s Dictionary in the form of a Twitter bot, and afterward I did some searching online to find similar projects that other people had done.

I found a few interesting things that I want to talk about! People have taken different approaches to this, and I’ll describe a few.

The most common approach in the projects I’ve seen is using neural networks.

I won’t go into the details of how these work since I’m by no means an expert and it’s not that important.

One thing that’s attractive about them as a tool is that they can be applied to different tasks without being adapted much. So you don’t have to build a whole system for making words or definitions; you just take this general-purpose tool, feed it real dictionary entries, and it will try to copy them.

These examples are from Thomas Dimson’s This Word Does Not Exist.

So taking a closer look at one of these, we can see that they can produce some good-looking results, including some extra flourishes like pronunciation and example sentences. But sometimes the definitions are a little garbled, since there’s no internal structure of meaning behind them. They’re fundamentally imitative, stringing letters and words together to make things that resemble the entries you fed in.

So a different approach is to build words and definitions up out of smaller discrete elements.

This example is from my bot, Lyre’s Dictionary. It came out of a love of etymology, so I was very interested in the particular parts of words and how they fit together.

The building blocks here are roughly equivalent to what in linguistics are called 'morphemes' – indivisible units of meaning that make up a word: roots, suffixes, etc.

The goal is to combine the components we see in actual English and see what else we could make with them. Procedural generation is sometimes described as exploring a possibility space; here, it’s the possibility space of the history of the English language.
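To make this concrete, here is a minimal sketch of the morpheme-combination idea, not the actual code behind Lyre’s Dictionary; the morpheme lists and glosses are a tiny illustrative sample, and a real system would use a much larger, structured lexicon with sound-change rules:

```python
import random

# Hypothetical mini-inventory of Latin-derived English morphemes.
# Each entry pairs a form with a rough gloss.
prefixes = [("re-", "again"), ("trans-", "across"), ("de-", "away")]
roots = [("ject", "to throw"), ("fract", "to break"), ("duc", "to lead")]

def coin_word():
    """Combine a random prefix and root into a candidate word and gloss."""
    prefix, prefix_gloss = random.choice(prefixes)
    root, root_gloss = random.choice(roots)
    word = prefix.rstrip("-") + root
    gloss = f"{root_gloss} {prefix_gloss}"
    return word, gloss

word, gloss = coin_word()
print(word, "-", gloss)
```

Even this toy version explores a small possibility space: three prefixes times three roots already yields nine candidate words, only some of which (like "reject" or "transduce") happen to exist in English.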

Let’s look at another example of this type: Fantastic Vocab by Greg Borenstein.

The unit here is also existing English word components, but they're combined in a more free-form way, corresponding less tightly to historical patterns.

So the sensibility is a little different here; while they’re less “plausible” in a sense, combining with fewer restrictions increases the number of possibilities.

One advantage of working with etymological units is that they are often recognizable. So here we have a few words that all have the “trans-” prefix, meaning “across”, and some with the “ject” root, which comes from a Latin word meaning “to throw”.

If we create a new word using these elements, “transject”, it has the advantage of resembling these existing words, and so you might have some intuition about its meaning – to throw across.

Here, etymology aids comprehension.

Now let's look at a different example.

There is a Latin word "frangere", meaning to break or shatter, which is the ancestor of several modern English words, including "refract" and "frangible", a not-so-common word meaning "able to be broken".

If we wanted to create a word that means "able to be refracted", strictly following historical patterns would give us "refrangible". But if we said "refractable" instead, it would be more widely understood, since it looks more like other words people know.

So here, etymology hurts comprehension.

Let’s look at one more technique. This example comes from Power Vocab Tweet by Allison Parrish.

The definitions here are generated using Markov chains, which, like neural nets, recombine existing material. But one interesting product of the way this approach works is that it sometimes “welds” together two existing definitions, joining them at a point where they share a word or phrase.

Doing some searching, I think I found the real dictionary entries that were combined to make this example (or at least plausible candidates).

So if we look at the two phrases being combined, each one is a meaningful but not quite complete idea. We have the idea of “a European goose that’s smaller than something”, and “something that is smaller than the common garden nasturtium”. And that “smaller than” comparison is the place where they’re joined into a new, whole idea.
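The welding effect can be sketched in a few lines of code. This is not Parrish’s implementation: a real Markov chain builds text probabilistically from n-gram counts over a whole corpus, whereas this sketch just splices two definitions deterministically at a shared word. The two definitions below are hypothetical stand-ins resembling the example, not the actual dictionary entries:

```python
def weld(def_a, def_b, pivot):
    """Join two definitions at a shared word (Markov-style splice).

    Keeps def_a up to and including the pivot word, then continues
    with everything in def_b that follows the pivot.
    """
    a_words = def_a.split()
    b_words = def_b.split()
    cut_a = a_words.index(pivot) + 1  # keep the pivot from the first definition
    cut_b = b_words.index(pivot) + 1  # resume after the pivot in the second
    return " ".join(a_words[:cut_a] + b_words[cut_b:])

# Hypothetical definitions sharing the word "smaller".
goose = "a European goose smaller than the graylag"
cress = "a plant smaller than the common garden nasturtium"
print(weld(goose, cress, "smaller"))
# → "a European goose smaller than the common garden nasturtium"
```

In an actual Markov chain the splice point isn’t chosen by hand; it emerges whenever two source texts happen to share an n-gram, which is exactly why these welds feel both accidental and strangely coherent.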

So that’s about all I have time for here. You can read the full list of projects on my web site, at this link.

My bot Lyre’s Dictionary is an ongoing project which I hope to continue working on, and you can follow it at the Twitter and Mastodon links.

And you can contact me and see other things I’ve made on my web site and on Twitter. Feel free to reach out if you want to talk about generating dictionaries!

Thank you so much!