This presentation explores the word-building boardgame Scrabble using the English Open Word List (EOWL), an open source word list that may be freely used to build electronic and traditional word games. Importantly, this project does not use the official Scrabble dictionary, which does not have an open license.
The presentation is structured as an interactive slide show. The show has two primary sections: a dynamic section where each slide of the show is rendered and a static section which persists between slides showing the shows navigation controls. Each slide contains four distinc sections. The top section contains the slide's title and description. The left section contains user input controls which are bound to parameters specific to each slide. The central section contains the D3, Vega, or Vega-Lite driven visualization for the slide. Lastly, a floating, collaptible panel titled "Explorations" contains suggested, guided advice for users as they progress through the show. The first slide, shown as the landing page, explains each of the sections in more detail and provides a simple example visualization to guide the user through each section.
The following is an outline of the annotations, parameters, triggers, and UI events available on each slide of the presentation.
The presentation is built using a combination of the d3.js, vega, and vega-lite libraries. Some basic styling is provided by the PureCSS framework, and icon fonts by FontAwesome. The general font is Dosis, provided via Google Web Fonts The slideshow framework and GADDAG implementations were written from scratch using only D3 and native HTML5 browser capabilities. The full source code is available via GitHub at axiomabsolute/gaddag, and is hosted on my personal statically served website via Netlify at axioms.me.
Library | License |
---|---|
D3.js | BSD 3-clause "New" or "Revised" |
Vega | BSD 3-clause "New" or "Revised" |
Vega-Lite | BSD 3-clause "New" or "Revised" |
PureCSS | BSD 3-clause "New" or "Revised" |
FontAwesome | SIL OFL 1.1 and MIT for the font and code, respectively |
Dosis Font | Open Font License |
English Open Word List (EOWL) | The EOWL is derived from the UK Advanced Cryptics Dictionary and features the following license restriction: UK Advanced Cryptics Dictionary Licensing Information: Copyright © J Ross Beresford 1993-1999. All Rights Reserved. The following restriction is placed on the use of this publication: if the UK Advanced Cryptics Dictionary is used in a software package or redistributed in any form, the copyright notice must be prominently displayed and the text of this document must be included verbatim. There are no other restrictions: I would like to see the list distributed as widely as possible. |
At its core, Scrabble is a game of anagrams. Each player blindly draws letter tiles from a bag to build up a hand of 7 tiles and tries to build up a crossword style board of interconnected words. A key skill for players of the game is the ability to find valid words within an otherwise jumbled hand. This presentation explores the challenges of finding valid Scrabble moves and some of the techniques players can use to efficiently find valid words. The visualizations use the English Open Word List (EOWL), an open source word list that may be freely used to build electronic and traditional word games.
Below is a sample visualization that describes the structure and setup used in the rest of the presentation. The floating "Explorations" panel at the bottom right of the screen provides more details about the structure of the slide show and the interactions supported by each slide.
Throughout this presentation, helpful information will appear in this explorations panel. If you find the panel is in your way, you may dismiss the panel by clicking on the title or Schevron icon at the top of the panel.
Each visualization is designed to be interactive and parameter driven. Parameters to the visualization will appear on the left-hand side of the screen. Try changing the value in the input field and clicking the update button. This updates the visualization to display the entered letters in a jumbled order. After a brief delay, the letters unscramble themselves to match the order of the input.
Any interactions that are not obvious will be described here. For example, try clicking on one of the colored circles in the visualization area to scramble the letters.
To move on to the next visualization, use the arrow and page controls on the right-hand side of the screen.
Two-letter words are the smallest allowable plays in Scrabble. Given that the English alphabet has 26 letters and that the game includes at least two of every letter (by including blank tiles) there are a total of 26*26=676 possible two-letter plays. Of these, only are valid words. The invalid words are called phonies. These words may still be played, but the player risks losing their turn if the word is challenged by an opponent. A player who incorrectly challenges a word, that is who attempts to challenge a valid word, loses their turn instead.
Even in the simple case of two letters, finding valid words and correctly calling out an opponent's phonies can be challenging. There are many more phony words than valid words, and many of the valid words are not common in everyday use, like the very first two-letter word, aa, which is a particular type of lava.
Click the horizontal axis title to collapse the data along the first letter of each word. The resulting visualization shows the number of valid words
per starting letter. Note that the horizontal axis has a fixed length of 26, since there are 26 possible letters to pair with each starting letter. Some letters, like
a
may be paired with many other letters to form two-letter words. Others, like c
, have very few two-letter combinations. The letter
v
is never used as the first letter of a two letter word!
Hover over marks to show details about each item, either the individual word's validity or the percentage of two letter words starting with the given letter that are valid words, depending on whether the horizontal categorical axis is collapsed.
Correctly identifying valid and invlaid two letter words can be challening for casual players, but even novice competitive Scrabble players simply memorize the relatively small two-letter word list.
As length increases, the number of valid words of that length generally increases, peaking at words with 8 letters. On the one hand, this makes finding longer words easier since there are simply more options. On the other hand, however, the number of invalid words grows at a much higher rate. This notion of sparseness makes finding longer playes more difficult, especially for new players, who are often simply overwhelmed by the number of invalid words. This visualization illustrates the inverse relationship between the number of words by length and the sparseness of valid words by length.
Note that restrictions on the number of letters available for gameplay are not taken into account when computing the total set of possible plays of each word length. Computing the exact number of possible plays for longer length words presents an interesting combinatorial challenge, complicated significantly by the presence of blank tiles, that is beyond the scope of this presentation. Instead, a simple upper-bound estimate which assumes that there are always enough letter tiles such that a word can have any letter in any position is used.
Use the controls on the left to explore the relationship between the number of words of each length and the relative sparseness of valid words of that length.
The button toggles the chart between displaying the count of words of each length and the percentage of words of that length that are valid words. Notice that while there are many more 8-letter than 2-letter words, valid 8-letter words are in fact much more sparse.
The scale of both the counts and percentages of valid words is fairly large, spanning several orders of magnitude. To help ensure the smaller values are still visible, use the Y-Axis Power Scale Exponent slider control to adjust the exponent used to generate the power scale in the y-axis. When the slider is set to 1 the scale behaves linearly, while lower exponents help emphasize the smaller values. Note that the marks along the y-axis compress more and more as they reach the top of the chart.
Players often rely on simple linguistic patterns to help more efficiently identify valid plays. These patterns take the form of prefixes, suffixes, and other common combinations of letters.
The suffix -ing
is commly used in English to form the present participle of a verb, such as the form of
the verb "swim" in the sentence "I like swimming". It may also be added to a verb to make it behave like a noun, creating what is known as the gerund
form of the verb, such as the word "programming" in the sentence "Programming is fun". It is also quite commonly used independently in
both nouns and adjectives, such as "ceiling" and "morning".
There are several ways that players can use patterns to efficiently find valid words. Once a player has identified that they have a pattern in their hand,
they can ignore those letters and focus on the remaining letters to find a smaller compatible word to
add the prefix or suffix to. For example, if the player has the letters ing
in their hand, they can focus on looking for a verb within the
remaining 4 letters of their hand, since most verbs can take the suffix -ing
. Further, some combinations of letters are used almost exclusively
together in a particular pattern. For example, given that a word contains the letters "i", "n", and "g", the probability that those letters are used in the sequence "ing"
is nearly 80%.
Lastly, prefixes and suffixes are often used together. If a player can identify both a prefix and suffix in their hand they can sometimes find
existing words on the board that can simply be augmented to create larger plays. For example, with the letters deer
in hand,
the player can form the prefix re-
and suffix -ed
, allowing them to play off of many English verbs simply by prepending and
appending the prefix and suffix to the existing play.
Using the default pattern parameter ing
, note the first bar of the left hand chart of the visualization. This bar represents
the probability of a word containing the exact sequence ing
given that it contains each letter of the sequence individually. Thus,
the word singer
would be counted, but gin
would not. Hovering over the bar displays a tooltip with relevant values. Note that
nearly 79% of words with "i", "n", and "g" in them contain the exact sequence "ing". This means that if a player has the letters "i", "n", and "g" in hand
then most valid plays containing those letters contain the exact pattern "ing", allowing them to focus their search efforts. Also note from the second bar
of the same chart that nearly 60% of words with "n" and "g" also contain an "i". This means that most valid words containing "n" and "g" also contain an "i",
implying that if a player has an "n" and "g" in hand but does not have an "i" then they are less likely to have a valid word in hand.
The second chart highlights the relationship between different patterns. For example, the most common pattern to pair with ing
is the suffix
-s
(as in the word fittings
), with 1922 valid words. This chart suggests companion prefixes and suffixes which may likely be combined
with the user entered pattern to form a valid word. Words constructed by combining a stem with a prefix or suffix are known as
affix morphemes.
Try updating the pattern parameter to qu
, a two letter pattern with an exremely high coocurrence reate in the English language. Given that a word
contains the letters "q" and "u", how likely is it that the word contains the exact sequence "qu"? Given that a word contains the letter "q", how likely is it
that it contains a "u"? Does that relationship hold the other way around?
Try an unusual combination of letters, such as syr
, which shows up in words such as syringe
. Given that a word contains "s", "y", and "r"
independently, how likely is it that it contains the exact sequence "syr"?
Try a technical suffix, like nym-
. What words might be formed by combining that suffix with the suggested prefixes?
A GADDAG is a data structure created by Steven Gordon to enable Scrabble playfinding and board analysis programs.
The data structure is a specialized form of a trie which stores multiple paths for each word for more
efficient query capabilities. For each position in each word, a path is stored starting at the position and walking backwards to the beginning
of the word. This path is known as the prefix. At that point, a special token called the turn token, represented by the
character >
, is inserted. Following the turn token, the remainder of the word is inserted in the original order that it appeared for the word.
This portion of the path is known as the suffix. These prefixes and suffixes are not to be confused with their linguistic counterparts, and
are used solely to refer to portions of a word relative to how they're stored in the GADDAG.
To read an individual path for a word, start at a turn node, represented in orange, and follow the edges backward towards the root node on the
left. These letters, in the order they are traversed from the turn node, form the prefix of the words stored along that path. Returning to the same turn node,
follow the edges leading away from the root node towards a green node, which represents the end of a valid word. This portion forms the suffix of a word stored
along that path. Combining the computed prefix and suffix forms a complete valid word. For example, starting at the top-most turn node in the default visualization,
follow the edges to the left that lead back to the root node. This process yield the prefix c
. Returning to the same root node at the top, follow the
edges leading to the right towards a complete word. In this case, there is only one path, which yields the suffix all
. Since the final node is a
complete word (represented by a green dot), the combination of the prefix and suffix yields a complete word, call
.
This layout may seem overly verbose and unintuitive, but there are two benefits to this trie layout. First, it enables a deterministic, efficient algorithm to generate all valid words given a lexicon, a player's hand, and optionally an existing board position, allowing the system to search for words by starting at any point within the word. Second, it enables a significant amount of the trie to be minimized during construction by re-using the nodes used to represent the suffixes of a word. This partial minimization of the trie, which suffix nodes reused within a word's subtree, restructures the trie as a directed acyclic graph (DAG).
The GADDAG is one of two primary data structures used in Scrabble playing AIs for move generation, and in particular is used in the Quackle system. The visualization below illustrates some of the basic move finding and work validation operations supported by the GADDAG.
By default, the GADDAG in the visualization contains a single word, call
. Starting at the root, there is a path
for each possible prefix for the word, in reverse order. That is, for the word "call", there is a path for llac
,
lac
, ac
, and c
. Each such reversed prefix path is followed by a special separator node
indicated by the >
character, called the turn token. Following the turn token for each path is
the remainder of the word. For instance, the prefix path ac
is followed by the turn token >
and
then by the suffix path ll
. Traversing the structure from the root to a turn token yields a reversed prefix for a word.
Traversing from a turn token to a complete word node yields a word suffix (in the order it originally appeared in the word). Combining
these two yields a complete word.
The control panel contains several input boxes and associated buttons. Each input and button combination corresponds to one of the basic operations supported by the GADDAG. Words that match the given query are displayed below the legend in a list. Each input and button combination is described below:
?
character to represent a blank tile (that is, a tile which
matches any letter, e.g. ?all
would match both call
and fall
).
Each operation highlights relevant nodes traversed during the operation. Nodes that are not traversed are grayed out. The legend
on the right side of the visualization describes the meaning of each color. The Clear
button clears the current path
highlighting.
Add the words care
, card
, and falls
. How many paths are searched to look for the prefix
cal
? How about just ca
? What does the path look like when searching for the suffix s
?
Clear the path highlighting by clicking the "clear" button (you may have to scroll the page to reach it). Pay very close attention to
the graph when performing this next operation: add the word fall
. What changed? Did the structure of the graph change?
It shouldn't! This operation shows the efficiency of how overlapping words are encoded in the GADDAG. When one word completely overlaps
another, (as is the case for falls
and the newly added word fall
) the strucutre of the GADDAG does not change?
at all, rather existing nodes are simply marked as representing complete words. In this case, two nodes change color from blue to green.
How many words match the hand acdeflrs?
? Node that the ?
character in the hand represents a blank tile for this
operation only. Note that this operation highlights many more paths within the GADDAG, producing many more false starts (with more nodes
highlighted in red to indicate a halted search traversal path).
Identifying valid Scrabble moves presents a challenge to both human and AI players. This presentation explored one small aspect of that challenge, namely the task of finding valid words within a jumbled set of letters.
For human players, the overwhelming sparseness of valid words among the vast set of playable (though potentially phony) words becomes the bottleneck for play finding. One technique to aid human players is to focus on common linguistic patterns within the host language to build compound words and affix morphemes out of prefixes, suffixes, and smaller valid words. Furthermore, considering the relationships among the letters within a particular pattern can help human guide how human players search for valid words, prioritizing the most common letter configurations. Lastly, combining multiple affixes and knowing which affixes may be commonly combined can help players construct even larger plays. These techniques provide a useful starting point, but necessarily make a trade-off as the player's focus shifts away from the valuable set of words which do not contain such patterns.
Computer AIs can evaluate plays much more quickly than human players, and have the benefit of perfect recall. The GADDAG data structure provides one method of storing a lexicon in such a way that valid plays may be efficiently searched for by prefix, suffix, substring containment, or with hand restrictions. While this gives computer AIs some advantages, there are still many aspects of advanced Scrabble gameplay that are not covered by the techniques demonstrated here. Evaluating and choosing the best play is much more complicated, and involves considering the active board shape, tracking the remaining letters based on the currently visible letters, managing the player's remaining letters after a play to enable better subsequent plays (a process called rack management), and much more. The top tier Maven Scrabble playing AI incorporates these aspects and many more. Interestingly, though the GADDAG offers better performance, Maven uses a more memory compact, less runtime efficient data structure known as a Directed Acyclic Word Graph (DAWG) to enable smaller build artifact sizes for downloaded games.
Despite advances in game playing AI systems, Scrabble continues to be a fun and challenging game for players and a rich, interesting topic of study for data scientists, statisticians, and mathematicians.