At last week’s American Crossword Puzzle Tournament, held as a virtual event with more than 1,000 participants, one impressive competitor made news. (And, despite my 143rd-place finish, it unfortunately wasn’t me.) For the first time, artificial intelligence managed to outscore the human solvers in the race to fill the grids with speed and accuracy. It was a triumph for Dr. Fill, a crossword-solving automaton that has been vying against carbon-based cruciverbalists for nearly a decade.
For some observers, this may have seemed like just another area of human endeavor where AI now has the upper hand. Reporting on Dr. Fill’s achievement for Slate, Oliver Roeder wrote, “Checkers, backgammon, chess, Go, poker, and other games have witnessed the machines’ invasions, falling one by one to dominant AIs. Now crosswords have joined them.” But a look at how Dr. Fill pulled off this feat reveals much more than merely the latest battle between humans and computers.
When IBM’s Watson supercomputer outplayed Ken Jennings and Brad Rutter on Jeopardy! just a little more than 10 years ago, Jennings responded, “I, for one, welcome our new computer overlords.” But Jennings was a bit premature to throw in the towel on behalf of humanity. Then as now, the latest AI advances show not only the potential for the computational understanding of natural language, but also its limitations. And in the case of Dr. Fill, its performance tells us just as much about the mental arsenal humans bring to bear in the peculiar linguistic challenge of solving a crossword, matching wits with the inventive souls who devise the puzzles. In fact, a closer look at how a piece of software tries to break down a fiendish crossword clue provides fresh insights into what our own brains are doing when we play with language.
Dr. Fill was hatched by Matt Ginsberg, a computer scientist who is also a published crossword constructor. Since 2012, he has been informally entering Dr. Fill in the ACPT, making incremental improvements to the solving software each year. This year, however, Ginsberg joined forces with the Berkeley Natural Language Processing Group, made up of graduate and undergraduate students overseen by UC Berkeley professor Dan Klein.
Klein and his students began working on the project in earnest in February, and later reached out to Ginsberg to see if they could combine their efforts for this year’s tournament. Just two weeks before the ACPT kicked off, they hacked together a hybrid system in which the Berkeley group’s neural-net methods for interpreting clues worked in tandem with Ginsberg’s code for efficiently filling out a crossword grid.
(Spoilers ahead for anyone interested in solving the ACPT puzzles after the fact.)
The new and improved Dr. Fill appears to fill the grid in a flurry of activity. But in reality, the program is deeply methodical: It analyzes a clue, comes up with an initial ranked list of candidates for the answer, and then narrows down the possibilities based on factors like how well they fit with other answers. The correct response may be buried deep in the candidate list, but enough context can allow it to percolate to the top.
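To make that "percolating" concrete, here is a minimal sketch in Python. It is not Dr. Fill's actual code: the candidate lists, the scores, and the `best_fill` helper are all invented for illustration, and the brute-force search over pairs stands in for whatever grid-filling method the real program uses. The idea is only that a guess ranked second on the clue alone can win once a crossing entry is taken into account.

```python
from itertools import product

# Hypothetical candidates for one across slot and one down slot.
# Each pair is (answer, score from the clue alone); the numbers are made up.
across = [("STAGES", 0.5), ("ARRAYS", 0.3), ("ARMIES", 0.2)]
down = [("YOGA", 0.6), ("TOGA", 0.4)]

# Assumed geometry: the across entry's fifth letter (index 4) shares a
# square with the down entry's first letter (index 0).
ACROSS_IDX, DOWN_IDX = 4, 0


def best_fill(across, down):
    """Return the highest-scoring (across, down, score) pair that agrees
    on the shared square, or None if no pair is compatible."""
    best = None
    for (a, a_score), (d, d_score) in product(across, down):
        if a[ACROSS_IDX] != d[DOWN_IDX]:
            continue  # the letters clash in the shared square
        joint = a_score * d_score  # treat the two scores as independent
        if best is None or joint > best[2]:
            best = (a, d, joint)
    return best


a, d, _ = best_fill(across, down)
print(a, d)
```

In this toy grid, STAGES is the top guess from the clue alone, but its fifth letter matches neither down candidate, so the lower-ranked ARRAYS ends up as the chosen entry.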
Dr. Fill is trained on data gleaned from past crosswords that have appeared in various outlets. To solve a puzzle, the program refers to clues and answers it has already “seen.” Like humans, Dr. Fill must rely on what it has learned in the past when faced with a fresh challenge, seeking out connections between new and old experiences. For instance, the second puzzle of the competition, constructed by Wall Street Journal crossword editor Mike Shenk, relied on a theme in which long answers had the letters -ITY added to form new fanciful phrases, such as OPIUM DENS becoming OPIUM DENSITY (clued as “Factor in the potency of a poppy product?”). Dr. Fill was in luck, since despite the unusual phrases, a few of the answers had appeared in a similarly themed crossword published in 2010 in The Los Angeles Times, which Ginsberg included in his database of more than 8 million clues and answers. But the tournament crossword’s clues were sufficiently different that Dr. Fill was still challenged to come up with the correct answers. (OPIUM DENSITY, for instance, was clued in 2010 as “Measure of neighborhood drug traffic?”)
For all the answers, whether part of the puzzle’s theme or not, the program works through thousands of possibilities to generate candidates that would best match the clues, ranking them by likelihood and checking them against the constraints of the grid, such as how across and down entries interlock. Sometimes the top candidate is the right one: For the clue “imposing groups,” for example, Dr. Fill ranked the correct answer, ARRAYS, as the preferred word. The word “imposing” had never appeared in previous clues for that answer, but synonymous words like “impressive” had, allowing Dr. Fill to infer the semantic connection.
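A toy version of that inference can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the three "seen" clue-and-answer pairs are invented, the hand-written `SYNONYMS` table stands in for the word associations a neural network would learn from data, and the word-overlap score is a deliberately crude substitute for the Berkeley group's methods.

```python
# Hypothetical stand-in for learned word similarity: a tiny synonym table.
SYNONYMS = {"imposing": {"impressive"}}

# Hypothetical "seen" clue/answer pairs; the real database holds millions.
SEEN = [
    ("musical groups", "OCTETS"),
    ("imposing structure", "CASTLE"),
    ("impressive groups", "ARRAYS"),
]


def expand(words):
    """Add any known synonyms to a set of clue words."""
    expanded = set(words)
    for word in words:
        expanded |= SYNONYMS.get(word, set())
    return expanded


def score(new_clue, old_clue):
    """Fraction of the old clue's words matched by the new clue,
    counting synonyms as matches."""
    new_words = expand(new_clue.lower().split())
    old_words = set(old_clue.lower().split())
    return len(new_words & old_words) / len(old_words)


def rank(clue, length):
    """Rank previously seen answers of the right length for a new clue."""
    fits = [(ans, score(clue, old)) for old, ans in SEEN if len(ans) == length]
    return sorted(fits, key=lambda pair: pair[1], reverse=True)


print(rank("imposing groups", 6)[0][0])
```

In this made-up data, “imposing” never appears in a clue for ARRAYS, yet ARRAYS still ranks first, because “impressive groups” matches both words of the new clue once the synonym is counted.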