Designing spaced repetition systems for play, not work

Lab Diary 001

Jan 29, 2024

Spaced repetition systems are a lot of effort. Creating a card today schedules a unit of work tomorrow, next week, and so on in perpetuity. There's a natural progression system at work, but engaging in it requires too much tedium and overhead for an unclear payoff.

In video game lingo, we call that a grind - the act of doing repetitive tasks over time to achieve results. But just as we can motivate players to grind in video games, we can use memory systems to encourage users to remember meaningful information.

This research diary serves as a reference for Pine's scoring system and its initial development. The first section cites grand strategy games as an influence on the memory score. The second section details how memory scores are calculated. The third section explores how to convey a sense of progression while discouraging system exploits. The fourth section discusses how tooling can tie into broader memory culture. The final section concludes with the main takeaways.

Grand strategies

A gameplay screenshot of Crusader Kings 3 that previews the character sheet of my councillor, Toirdelbach Briain. Being a title claimant and having a low opinion score towards me increases the likelihood that he will join a claimant faction.

Grand strategy games (GSGs) are a genre that I've sunk hundreds of hours into. These are relatively obscure titles that involve dealing with various political, economic, and military systems in a historical context. These games have the challenging task of being both a historical simulation and an engaging experience that makes players want to return.

Crusader Kings 3 (CK3), for example, expresses a theory of history that focuses on personalistic rule and puts you in the position of a medieval ruler. All its intricate mechanics, from dynasty and realm management to warfare and diplomacy, must ultimately feed back to you, the ruler.

CK3's genius solution to intertwining these features is to give every playable character a visible opinion score toward every other character. This matters because opinion dictates how an otherwise opaque character will behave. Characters with high opinions of you make alliances and realm centralization more likely, while those with low opinions might scheme or raise vassal factions against you. By rooting all this complexity in a single score, CK3 anchors the player to one simple, repeatable task - making the opinion score go up.

It's no surprise, then, that our exploration of spaced repetition systems begins with a memory score that anchors users to a theory of memory processing.

Memory scoring

In-application screenshot of a Pine memory score that is shown during new user onboarding.

As shown above, a card's memory score ranges from -100 to 100 and attempts to simulate the encoding, storage, and retrieval phases of memory processing.

Encoding refers to how well you capture information and can be simulated by historical card performance and current card quality. The following modifiers dictate the score:

Mastery is a +20 modifier that is applied to all cards and is increased by surpassing card creation and review thresholds (more on this in the progression systems section).
Question quality is a +15 modifier that is increased by asking “good questions”.
Answer quality is a +15 modifier that is increased by providing “good answers”.

What makes a good question or answer is subjective but incredibly important. “Bad” prompts fail to activate the memories that you want to reinforce, while “good” prompts test knowledge to produce learning[1]. For now, Pine takes an all-or-nothing approach when you input information, but future versions will experiment with a Grammarly-like interface for offering dismissible corrections.

Storage involves the consolidation of information into short- and long-term memory and is simulated through associations, connections, and demonstrated retention:

Associations is a +12 modifier that rewards attaching notes and cues to information.
Connections is a +6 modifier that rewards cross-linking information.
Retention is a +32 modifier that rewards recalling cards over gated review thresholds (1, 3, 7, 14, 30, 90, 180, and 365 days).

While associations and connections make minor contributions to your total memory score, retention accounts for nearly a third of all positive weighting. The retention modifier has been adapted from prior work on tools for thought and rewards the time between a successful review of the card and the preceding review of that card.

For example, if I recalled a card on Day 1 and again on Day 2, I would be awarded +4 for satisfying the first gated review threshold of one day. If I recalled the card again on Day 5, I'd earn an additional +4 for the three days that had elapsed between Day 2 and Day 5. The “gating” mechanism is vital for demonstrating memory progression and discouraging score exploits, because gates are passed in order, at most one per review. If the one-day gate had not yet been passed, then reviewing a card on Day 1 and skipping to Day 5 for the following review would award only +4 instead of +8, even though the four elapsed days exceed both the one-day and three-day thresholds.
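Here's a minimal Python sketch of the gating rule as I've described it: each successful review can pass at most the next unpassed gate, and only if the time since the preceding review meets that gate's threshold. The function name and the list-of-review-days input are hypothetical, not Pine's actual code.

```python
# Gate thresholds (days) from the retention modifier; each gate is worth +4,
# so passing all eight gives the full +32.
RETENTION_GATES = [1, 3, 7, 14, 30, 90, 180, 365]
POINTS_PER_GATE = 4


def retention_modifier(review_days: list[int]) -> int:
    """Retention modifier for a card, given the days of its successful reviews.

    Hypothetical helper: assumes review_days is sorted and that each review
    can pass at most the next unpassed gate.
    """
    gates_passed = 0
    for previous, current in zip(review_days, review_days[1:]):
        if gates_passed == len(RETENTION_GATES):
            break  # every gate has already been passed
        # Only the time since the preceding review counts towards the next gate.
        if current - previous >= RETENTION_GATES[gates_passed]:
            gates_passed += 1
    return gates_passed * POINTS_PER_GATE


print(retention_modifier([1, 2, 5]))  # 8: passes the 1-day and 3-day gates
print(retention_modifier([1, 5]))     # 4: a 4-day gap still passes only one gate
```

Counting gates rather than raw elapsed days is what stops a single long gap from cashing in several thresholds at once.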

Whereas encoding and storage contribute positive modifiers to a card's score, retrieval contributes negative ones. Retrieval is the process of recalling stored information when needed and is simulated by penalties that build up as a card approaches and then passes its review date:

Coming due is a modifier that linearly decreases to -10 until a card is ready for review.
Overdue is a modifier that exponentially decreases to -90 when a card is overdue.

If a card that I created on Day 1 is due on Day 3, then we can track what happens as we approach Day 5 without review:

On Day 1, both modifiers start at 0.
On Day 2, coming due decreases to -5 as the card approaches review.
On Day 3, coming due decreases to -10 as the card becomes ready for review.
On Day 4, coming due stays at -10, while overdue decreases towards -35.
On Day 5, overdue continues its exponential decrease towards -90.
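The two retrieval modifiers can be sketched like this. The linear coming-due ramp follows the example directly, but the exact overdue curve isn't specified here, so this sketch assumes a saturating exponential that approaches -90, with a hypothetical rate (`OVERDUE_RATE`) picked only so that one day overdue lands at -35. Function names are also hypothetical.

```python
import math

COMING_DUE_FLOOR = -10.0
OVERDUE_FLOOR = -90.0
# Hypothetical rate, picked only so that one day overdue gives exactly -35.
OVERDUE_RATE = math.log(90 / 55)


def coming_due_modifier(day: float, scheduled_on: float, due_on: float) -> float:
    """Linear ramp from 0 when the card was scheduled to -10 on its due day."""
    if day <= scheduled_on:
        return 0.0
    if day >= due_on:
        return COMING_DUE_FLOOR
    return COMING_DUE_FLOOR * (day - scheduled_on) / (due_on - scheduled_on)


def overdue_modifier(day: float, due_on: float) -> float:
    """Assumed saturating exponential: 0 until the due day, then towards -90."""
    if day <= due_on:
        return 0.0
    return OVERDUE_FLOOR * (1 - math.exp(-OVERDUE_RATE * (day - due_on)))


# The worked example: created on Day 1, due on Day 3, never reviewed.
for day in range(1, 6):
    print(f"Day {day}: coming due {coming_due_modifier(day, 1, 3):.1f}, "
          f"overdue {overdue_modifier(day, 3):.1f}")
```

A saturating curve front-loads the damage right after the due date, which is the forgetting-curve shape; a different real curve would only change `overdue_modifier`.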

Together, these negative modifiers model a hypothesized decline in memory retention, similar to Ebbinghaus’ Forgetting Curve[2].
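Putting the modifiers together, a card's score is just their sum. Below is a minimal sketch, assuming each modifier has already been computed and only needs clamping to the range listed above; the dictionary keys and function name are hypothetical.

```python
# (minimum, maximum) for each modifier, as listed above. Keys are hypothetical names.
MODIFIER_RANGES = {
    "mastery": (0, 20),
    "question_quality": (0, 15),
    "answer_quality": (0, 15),
    "associations": (0, 12),
    "connections": (0, 6),
    "retention": (0, 32),
    "coming_due": (-10, 0),
    "overdue": (-90, 0),
}


def memory_score(modifiers: dict[str, float]) -> float:
    """Sum the card's modifiers after clamping each one to its allowed range.

    The maxima sum to +100 and the minima to -100, so the total always stays
    within the card's -100 to 100 score range without a separate clamp.
    """
    total = 0.0
    for name, (low, high) in MODIFIER_RANGES.items():
        total += min(max(modifiers.get(name, 0.0), low), high)
    return total


# Hypothetical card: Level 3 user, good question and answer, two retention
# gates passed, and one day overdue as in the worked example above.
card = {"mastery": 3, "question_quality": 15, "answer_quality": 15,
        "retention": 8, "coming_due": -10, "overdue": -35}
print(memory_score(card))  # -4.0
```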

By attempting to simulate the encoding, storage, and retrieval mechanisms of memory processing, we arrive at a scoring system that is encouragingly dynamic. Cards that are nurtured bloom and reach higher score peaks, while those that are neglected wither and have scores that demand attention. The next step in this process is to convey score progression at both a user and community level.

Progression systems

In-application screenshot of a user increasing their mastery level as they satisfy different card and review thresholds.

Progression systems are at the core of most video games. As games advance, they introduce features that allow players to advance with them. These come in the form of experience points (XP), skill trees, equipment upgrades, and challenging scenarios.

Pine adopts a “mastery system” when considering user progression, with your “mastery levels” increasing once you reach specific card creation and review thresholds:

The card thresholds are 5, 75, 225, 400, 625, 900, 1225, 1600, 2025, 2500, 3025, 3600, 4225, 4900, 5625, 6400, 7225, 8100, 9025, and 10000 cards created.
The review thresholds are 5, 300, 900, 1600, 2500, 3600, 4900, 6400, 8100, 10000, 12100, 14400, 16900, 19600, 22500, 25600, 28900, 32400, 36100, and 40000 reviews completed.

For example, reaching Level 1 requires creating 5 cards and completing 5 reviews, while reaching Level 2 requires creating 75 cards and completing 300 reviews.
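A sketch of the level calculation, assuming your level is the highest one whose card and review thresholds you both meet, recomputed from your current totals. The function name is hypothetical; `bisect_right` comes from Python's standard library.

```python
from bisect import bisect_right

CARD_THRESHOLDS = [5, 75, 225, 400, 625, 900, 1225, 1600, 2025, 2500,
                   3025, 3600, 4225, 4900, 5625, 6400, 7225, 8100, 9025, 10000]
REVIEW_THRESHOLDS = [5, 300, 900, 1600, 2500, 3600, 4900, 6400, 8100, 10000,
                     12100, 14400, 16900, 19600, 22500, 25600, 28900, 32400,
                     36100, 40000]


def mastery_level(cards_created: int, reviews_completed: int) -> int:
    """Highest level whose card AND review thresholds are both met (0 to 20)."""
    # bisect_right counts how many thresholds are <= the current total.
    levels_by_cards = bisect_right(CARD_THRESHOLDS, cards_created)
    levels_by_reviews = bisect_right(REVIEW_THRESHOLDS, reviews_completed)
    return min(levels_by_cards, levels_by_reviews)


print(mastery_level(5, 5))        # 1
print(mastery_level(75, 300))     # 2
print(mastery_level(10_000, 40))  # 1: reviews hold the level back
```

Because the level is derived from current totals rather than stored, deleting cards drops it again. With +1 per level, the returned value can double as each card's mastery modifier.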

As the section image suggests, a user's sense of progression is broken down into a rank and a score component:

Increasing your rank decorates your border to signal mastery across six ranks: novice (grey), apprentice (green), practitioner (blue), expert (purple), master (orange), and grandmaster (gold).
Increasing your score adds +1 to the mastery modifier applied to all your cards. Reaching higher levels demonstrates encoding proficiency, which is why it feeds into each card's mastery modifier.

To make progression feel meaningful in a broader community context, Pine ties memory scores into a leaderboard system where each user can make progress alongside other users. Making this work requires averaging a user's card scores to produce an overall score. At midnight GMT, this score determines how many points you gain or lose on the global leaderboard.

For example, a user with an average card score of 47 would earn 0.47 points on the global leaderboard. By using averaged values instead of summed ones, we can reduce the discrepancy between the points earned by experts and novices. It's worth noting that, because of mastery levels, users with more extensive card libraries should still earn more points than users with smaller ones. However, a disciplined user with a smaller card library can still outcompete an expert user if that expert user accrued a large review backlog (because the negative weighting of the retrieval modifiers in the backlog would bring down their card average).
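The daily leaderboard update reduces to an average divided by 100. A minimal sketch follows; the function name is hypothetical, and the two card libraries in the usage lines are made-up numbers chosen only to illustrate the backlog effect.

```python
from statistics import fmean


def daily_leaderboard_points(card_scores: list[float]) -> float:
    """Points gained (or lost, if negative) at the midnight GMT update.

    Averaging rather than summing means library size alone doesn't decide
    the result.
    """
    if not card_scores:
        return 0.0
    return fmean(card_scores) / 100


print(daily_leaderboard_points([47]))         # 0.47
disciplined = [60, 55, 70]                    # small library, reviews up to date
backlogged = [80] * 50 + [-60] * 50           # large library, half deeply overdue
print(round(daily_leaderboard_points(disciplined), 2))  # 0.62
print(round(daily_leaderboard_points(backlogged), 2))   # 0.1
```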

Creating any leaderboard ranking introduces the possibility of exploits or bad-faith participation. Fortunately, the combination of memory scores, mastery levels, and averaged card values makes it inherently difficult to “game the system”. Let's look at a few examples:

A user immediately creates a large card library to boost their mastery level. This exploit overlooks that mastery increases require passing both card creation and review thresholds. Additionally, you can decrease your mastery level if you fall below the thresholds established for your current level. For example, creating 10,000 cards on Day 1 and deleting them on Day 2 would reset your mastery level back to 0.
A user boosts their mastery level with a large card library but then changes their review interval so that cards stop coming up for review. This exploit takes advantage of Pine's custom scheduling and effectively works around the negative retrieval modifiers. The problem with this approach is that memory score values are averaged, not summed, and gaining high score averages requires passing gated review thresholds. By ignoring reviews of created cards, you're forfeiting the retention modifier that helps them achieve high scores.
A user boosts their mastery level by creating a large card library, reviews them to pass gated review thresholds, and then changes their review interval so that they stop coming up for review. This… isn't an exploit, as it mimics the desired behaviour that spaced repetition systems try to promote. By the time you pass the later gated review thresholds and change your card intervals to never come up, your cards would have already been operating at multi-year intervals.

In short, gaming the progression system amounts to using the application as designed.

Memory culture

In-application screenshot of the Pine leaderboard and its top 3 participants.

In his essay on how to make memory systems widespread, Michael Nielsen imagines that a strong memory culture would involve forming communities of practice where individuals engage in discussions, share techniques, and push the boundaries of memory skills. These communities would exist online and offline, with some members sharing their practices through various platforms such as books, videos, and conferences. In his depiction, a mature culture would lead to a better understanding of memory systems and their effective use in multiple domains. More critically for this essay, he believes that a tool's ability to influence memory culture is secondary to the development of a strong memory culture itself.

I’m not sure I agree with this.

I believe that better tooling can significantly contribute to the development of a strong memory culture and that the right tools can expand and deepen niche memory practices.

Speaking of niches, another genre of video games I enjoy is computer role-playing games (cRPGs), which are rooted in tabletop Dungeons and Dragons (DnD). Similar to GSGs, these games feature a degree of complexity that deters many newcomers. They have in-depth character customization, intricate combat systems, strategic party management, complex storytelling, extensive dialogue options, etc.

However, the communities that develop around these games share many similarities with nascent memory culture. Basement groups gather to play tabletop DnD games, forum discussions focus on improving character builds, and books and videos help spread the culture. And yet, in parallel to these grassroots efforts, a cRPG called Baldur’s Gate 3 swept The Game Awards, won Game of the Year, and introduced millions to a community they would never have thought themselves interested in. By retaining the strengths of cRPGs and emphasizing their qualities to wider audiences (personal storytelling, stellar voice acting, cinematic presentation, accessible difficulty, and so on), Baldur’s Gate 3 validated what the cRPG community has always known to be true - that these games provide incredible experiences that deserve to be shared and expanded upon.

Memory tools can have a similar impact. By emphasizing the strengths of memory systems and correcting their weaknesses, we can grow memory culture tenfold and fan the flames that lead to further improvements. This essay has covered one such direction in which memory systems can improve, but I would also like to see improvements in other areas:

Improved interface for creating prompts - novice users are most likely affected by bad prompts, leading to frustration that may bounce them off spaced repetition systems altogether. I'd like to see more Grammarly-like interfaces that nudge users towards good prompt-writing practices, offering fast, targeted feedback when they slip into anti-patterns.
Using AI for personalized card creation - I'm always torn about embracing AI during card creation. On one hand, writing cards is an essential skill that leads to a deeper connection with the material. On the other hand, developing writing skills isn't the point of memory systems - the point is to test knowledge to produce learning. Thus, I would like to see more tools offer AI-assisted card creation options. For example, if I'm reading a passage of text, I’d like to highlight it, trigger a keyboard shortcut, and get suggested prompts that help encode that information. This way, I still play a personal role in underlining the material, and I also act as the final point of review before those prompts are added to my library.
Promoting serendipity during card reviews - I’m interested in ideas that encourage users to develop a meaningful connection to the review process itself. One example would be to treat card reviews not only as a place to test knowledge but also as a place to promote serendipity and connect experiences.
Encouraging contextual card reviews - I’ve recently started surfacing related card reviews when creating new cards. For example, if I'm writing a card on distributed systems, I’ll see backlogged computer science reviews tucked underneath the current card. Providing more contextual review opportunities lets reviews happen in a way that feels more natural and doesn't require allocating dedicated time.
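The contextual review idea can be sketched roughly as follows. This is not Pine's implementation: the `Card` type, the broadest-first topic path, matching on the top-level topic, and the most-overdue-first ordering are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Card:
    prompt: str
    topic: tuple[str, ...]  # hypothetical topic path, broadest first
    days_overdue: int       # 0 or less means the card isn't overdue yet


def contextual_reviews(new_topic: tuple[str, ...], cards: list[Card],
                       limit: int = 3) -> list[Card]:
    """Overdue cards sharing the new card's top-level topic, most overdue first."""
    related = [card for card in cards
               if card.days_overdue > 0 and card.topic[:1] == new_topic[:1]]
    related.sort(key=lambda card: card.days_overdue, reverse=True)
    return related[:limit]


cards = [
    Card("What is a B-tree?", ("computer science", "data structures"), 4),
    Card("What is the capital of Peru?", ("geography",), 10),
    Card("What does CAP stand for?", ("computer science", "distributed systems"), 0),
    Card("What is a mutex?", ("computer science", "concurrency"), 9),
]
# Writing a new distributed systems card surfaces the two overdue CS cards.
for card in contextual_reviews(("computer science", "distributed systems"), cards):
    print(card.prompt)
```

Capping the list with `limit` keeps the suggestions tucked underneath the card rather than turning card creation into a full review session.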

Summary and takeaways

In-application screenshot highlighting contextual review suggestions at the bottom of a card.

Okay, we've covered a lot here, so these are the main takeaways:

Video games can be a rich source of inspiration for designing systems that require repeated effort or “grind”. The first section demonstrated this by highlighting how grand strategy games anchor players to a single score that they work to increase.
Memory scores can provide a simulated overview of the health of your memory system. The second section demonstrated how the encoding, storage, and retrieval components of memory processing can be broken down into a dynamic memory score that the user is encouraged to increase.
Intrinsic and extrinsic motivation systems can convey a sense of progression at a user and community level. The third section explained how Pine's mastery system increases a user's rank and score and how memory scores can tie into community leaderboards in an exploit-resistant way.
Memory applications can help seed the development of a vibrant memory culture. The fourth section connected the development of memory culture to that of other niche communities, such as cRPGs, which have grown in popularity thanks to video games like Baldur’s Gate 3. Memory systems can follow a similar trajectory, provided they have supporting tooling that emphasizes the strengths and counteracts the weaknesses of such systems.

[1] See Andy Matuschak’s writing on how to write good prompts for more.

[2] Pine’s hypothesized memory decline is actually more forgiving because it starts with a linear decay before transitioning to an exponential one. From a game perspective, watching a card’s score decay so quickly, without being able to do anything about it, stripped away too much user agency.