What a Crossword AI Reveals About Humans' Way With Words

At last week’s American Crossword Puzzle Tournament, held as a virtual event with more than 1,000 participants, one impressive competitor made news. (And, despite my 143rd-place finish, it unfortunately wasn’t me.) For the first time, artificial intelligence managed to outscore the human solvers in the race to fill the grids with speed and accuracy. It was a triumph for Dr. Fill, a crossword-solving automaton that has been vying against carbon-based cruciverbalists for nearly a decade.

For some observers, this may have seemed like just another area of human endeavor where AI now has the upper hand. Reporting on Dr. Fill’s achievement for Slate, Oliver Roeder wrote, “Checkers, backgammon, chess, Go, poker, and other games have witnessed the machines’ invasions, falling one by one to dominant AIs. Now crosswords have joined them.” But a look at how Dr. Fill pulled off this feat reveals much more than merely the latest battle between humans and computers.

When IBM’s Watson supercomputer outplayed Ken Jennings and Brad Rutter on Jeopardy! just a little more than 10 years ago, Jennings responded, “I, for one, welcome our new computer overlords.” But Jennings was a bit premature to throw in the towel on behalf of humanity. Then as now, the latest AI advances show not only the potential for the computational understanding of natural language, but also its limitations. And in the case of Dr. Fill, its performance tells us just as much about the mental arsenal humans bring to bear in the peculiar linguistic challenge of solving a crossword, matching wits with the inventive souls who devise the puzzles. In fact, a closer look at how a piece of software tries to break down a fiendish crossword clue provides fresh insights into what our own brains are doing when we play with language.

Dr. Fill was hatched by Matt Ginsberg, a computer scientist who is also a published crossword constructor. Since 2012, he has been informally entering Dr. Fill in the ACPT, making incremental improvements to the solving software each year. This year, however, Ginsberg joined forces with the Berkeley Natural Language Processing Group, made up of graduate and undergraduate students overseen by UC Berkeley professor Dan Klein.

Klein and his students began working on the project in earnest in February, and later reached out to Ginsberg to see if they could combine their efforts for this year’s tournament. Just two weeks before the ACPT kicked off, they hacked together a hybrid system in which the Berkeley group’s neural-net methods for interpreting clues worked in tandem with Ginsberg’s code for efficiently filling out a crossword grid.

(Spoilers ahead for anyone interested in solving the ACPT puzzles after the fact.)

The new and improved Dr. Fill fills the grid in a flurry of activity (you can see it in action here). But in reality, the program is deeply methodical, analyzing a clue and coming up with an initial ranked list of candidates for the answer, and then narrowing down the possibilities based on factors like how well they fit with other answers. The correct response may be buried deep in the candidate list, but enough context can allow it to percolate to the top.
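How can a lower-ranked candidate percolate to the top? A toy sketch in Python, not Dr. Fill's actual code: the words, the scores, and the `best_fill` helper are all invented for illustration. It scores every pairing of one across entry and one down entry, discards pairings whose shared square disagrees, and keeps the best of what remains.

```python
from itertools import product

def best_fill(across, down, a_idx, d_idx):
    """Pick the best-scoring (across, down) pair that agrees on the shared square.

    across, down: lists of (answer, score) pairs; scores are hypothetical.
    a_idx, d_idx: position of the shared square within each answer.
    """
    best = None
    for (a, a_score), (d, d_score) in product(across, down):
        if a[a_idx] != d[d_idx]:
            continue  # the two answers put different letters in the same square
        joint = a_score * d_score
        if best is None or joint > best[0]:
            best = (joint, a, d)
    return best

# Made-up candidates: taken alone, CAT outranks COT for the across clue.
across = [("CAT", 0.6), ("COT", 0.4)]
down = [("OWL", 0.7), ("ANT", 0.3)]

# The across answer's second letter is the down answer's first letter.
result = best_fill(across, down, a_idx=1, d_idx=0)
print(result[1], result[2])
```

Here COT, the second choice on its own, wins once the strong down candidate OWL is taken into account. A real grid has dozens of interlocking entries, so trying every combination this way would not scale; this only illustrates the idea of letting context reorder the list.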

Dr. Fill is trained on data gleaned from past crosswords that have appeared in various outlets. To solve a puzzle, the program refers to clues and answers it has already “seen.” Like humans, Dr. Fill must rely on what it has learned in the past when faced with a fresh challenge, seeking out connections between new and old experiences. For instance, the second puzzle of the competition, constructed by Wall Street Journal crossword editor Mike Shenk, relied on a theme in which long answers had the letters -ITY added to form new fanciful phrases, such as OPIUM DENS becoming OPIUM DENSITY (clued as “Factor in the potency of a poppy product?”). Dr. Fill was in luck, since despite the unusual phrases, a few of the answers had appeared in a similarly themed crossword published in 2010 in The Los Angeles Times, which Ginsberg included in his database of more than 8 million clues and answers. But the tournament crossword’s clues were sufficiently different that Dr. Fill was still challenged to come up with the correct answers. (OPIUM DENSITY, for instance, was clued in 2010 as “Measure of neighborhood drug traffic?”)

For all the answers, whether part of the puzzle’s theme or not, the program works through thousands of possibilities to generate candidates that would best match the clues, ranking them by likelihood and checking them against the constraints of the grid, such as how across and down entries interlock. Sometimes the top candidate is the right one: For the clue “imposing groups,” for example, Dr. Fill ranked the correct answer, ARRAYS, as the preferred word. The word “imposing” had never appeared in previous clues for the word, but other synonymous words like “impressive” had, allowing Dr. Fill to infer the semantic connection.

Crossing letters often help narrow down the candidates, so that knowing the second letter is O in a five-letter answer clued as “Aw, that’s a shame!” helps the correct answer, SO SAD, bubble up to the top of the list.
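The filtering step can be sketched in a few lines of Python. This is an illustrative assumption about the general technique, not a description of Dr. Fill's internals: apart from SO SAD, the candidate answers are invented, as are all of the scores and the `filter_by_crossings` helper.

```python
def filter_by_crossings(candidates, known):
    """Keep candidates consistent with known letters, best score first.

    candidates: list of (answer, score) pairs, answers written without spaces.
    known: dict mapping a 0-based position to the letter already in the grid.
    """
    fits = [
        (answer, score)
        for answer, score in candidates
        if all(answer[i] == letter for i, letter in known.items())
    ]
    return sorted(fits, key=lambda pair: pair[1], reverse=True)

# Hypothetical ranked list for "Aw, that's a shame!" (5 letters).
candidates = [("ALACK", 0.30), ("OHWOW", 0.25), ("SOSAD", 0.20), ("NOFUN", 0.10)]

# A crossing entry has fixed the second letter as O.
remaining = filter_by_crossings(candidates, {1: "O"})
print(remaining[0][0])
```

With the second letter pinned down, the two higher-scoring guesses drop out and SOSAD moves from third place to first.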

The crossword solver is a closed system: it can’t just Google the answers. As a result, there are gaps in its knowledge base. In this regard, too, the program mimics our own imperfect mental capacities, even if its storage capacity and processing speed dwarf those of puny human brains. The clue “Poet who wrote ‘Jellicle Cats are merry and bright’” (5 letters) might be obvious to fans of T.S. Eliot, but Dr. Fill initially liked KEATS and YEATS ahead of ELIOT as the poet in question. (Since the Berkeley team’s clue-solving system employs a “black box” approach rather than something more interpretable after the fact, it can be difficult to say why it favors one poet or another.)

And things get particularly tricky with clues involving puns or other wordplay, typically indicated with a question mark. In this puzzle, PERISCOPE got the clue “Sub standard?,” which flummoxed Dr. Fill at first—its top guesses figured “sub” had to do with sandwiches, so it came up with candidates like TUNA ON RYE. Even those bad hunches are illuminating, though: Berkeley’s neural-net system was able to discern that something unusual was going on with a question-mark clue, even if it got stuck on the wrong kind of submarine. The program hasn’t been explicitly taught that a question mark signals some sort of semantic shenanigans, Klein explains, but through machine learning it can gradually surmise that it needs to look for less straightforward options than it would for a regular clue.

Ultimately, however, Dr. Fill was able to solve the crossword correctly in under a minute—a full two minutes faster than any of the human competitors. But, unlike more than 200 human solvers, it wasn’t perfect on all of the puzzles: It got waylaid on two of them and finished with errors. Despite the scoring penalties, Dr. Fill’s blazing speed was enough for it to cling to the top of the leaderboard after seven puzzles, finishing ahead of the fastest human competitor by the narrowest of margins.

New York Times crossword editor Will Shortz, who has overseen the annual tournament since founding it in 1978, noted that this year’s tournament puzzles may have played to Dr. Fill’s strengths, since “every answer was understandable English reading left to right and top to bottom.” (Some years have fiendish puzzles that play with how answers are entered into the grid.) Shortz says that he is “in awe of the ingenuity in programming Dr. Fill to solve tough, sometimes tricky crosswords so well,” but he thinks Team Carbon still has an edge in many ways. “For now humans are still better at dealing with messy, nonlogical, real-world problems like crosswords,” he said, pointing to the fact that even on puzzles that lack an extra level of slipperiness, Dr. Fill still can get tripped up in ways that humans never would.


While the race to the top of the tournament scoreboard garnered the most attention, the joint effort of Ginsberg and the Berkeley team may have other, less headline-grabbing payoffs. For one thing, Dr. Fill is likely to have cleaner finishes in the years to come, as machine learning progresses and the program is fed more puzzles and training data. But Klein sees many challenges ahead, ones that often pop up in the field of natural language processing. For instance, the human mind often navigates what’s called “multi-hop inference,” in which different bits of knowledge are combined in a chain of reasoning. Teaching an AI to follow such leaps of logic points to the subtle ways that people find meaning in language that may be oblique or downright deceptive. Similarly, as Dr. Fill’s confusion over the “sub” clue demonstrated, its brain still struggles to recognize alternative, less common meanings. Consider the misdirection in this clue for a New York Times crossword that I recently collaborated on: “King-like, in a way.” The answer is MACABRE, because “King” here refers to the novelist Stephen King. If an AI could figure out how to solve a clue like that, I might be ready to welcome our new computer overlords.

Klein sees Dr. Fill’s performance as just the first step in appreciating how we are able to unlock meaning from the most recondite of crossword clues. And when it comes to particularly crafty linguistic specimens, such as those involving chains of inferences, Klein says that “the ones that stump people are likely to stump this kind of system even more.” Crosswords will continue to present a unique AI challenge, as they demonstrate that language isn’t just about straightforward communication. It’s a quintessential human trait to be pleasingly puzzled by language at play.
