From Language Learning

[Research] Help build the first public dataset on personalized vocabulary complexity (Anki users)

Our take

Are you ready to dive into the depths of language learning like never before? We invite Anki users to partake in groundbreaking research that aims to create the first public dataset on personalized vocabulary complexity. Why does this matter? Because existing data often misses the nuance of how learners interact with words and how their memory responds, leaving a gaping hole in both research and practical tools. In just ten minutes, you can contribute to a resource that promises to spawn innovative learning tools tailored to your unique vocabulary journey—think smarter spaced repetition and personalized word recommendations. Your participation is not only vital; it’s fully privacy-compliant and allows you to control what you share. Curious to learn more? Check out the survey [here](https://nekear.me/research) and join a community eager for smarter language learning solutions.

In a world where language learning often feels like navigating through a vast, uncharted sea of vocabulary, the emergence of a public dataset that captures the unique interplay between what learners study and how their memory responds could be a game-changer. The recent call for Anki users to contribute their decks and review logs is not merely a survey; it's a clarion call for a community-driven initiative that could redefine the landscape of language acquisition. Currently, the research domain operates with a frustrating bottleneck: existing datasets either capture the words learners encounter without any insight into memory patterns or vice versa. This gap stymies innovation in learner-facing tools, leaving potential breakthroughs languishing in the shadows. Imagine a world where your language learning experience adapts to your individual memory responses — that’s the promise this initiative holds.

The implications of this research extend far beyond the immediate benefit of creating a valuable resource for language learners. As Anki users contribute their data, they are not just participating in an academic exercise; they are actively shaping the future of language learning technology. The potential outcomes are tantalizing: we could see the advent of AI-driven vocabulary sequencers that adapt to individual learning styles, or smarter spaced repetition systems that tailor schedules based on personal memory patterns rather than generalized averages. Such developments could revolutionize how language learners approach their studies, making the process more intuitive and effective. It’s a little like the conversation around balancing language learning with other disciplines, as seen in discussions like "How do you guys balance language learning with learning other things?", where the challenge is often about personalization and adaptation.

Moreover, the commitment to GDPR compliance and meticulous attention to privacy in this dataset collection process speaks volumes about the ethical considerations that underpin modern research. By ensuring that participants can control what they share and how their data is used, this initiative sets a precedent for responsible data handling in the age of digital learning. It acknowledges the concerns surrounding privacy while simultaneously fostering a culture of collaboration that can lead to innovative breakthroughs. The nuanced approach to data collection mirrors ongoing conversations about the intersection of language and technology, as explored in articles like "Language in Botany and Math.", where the complexity of language is juxtaposed with the precision of scientific inquiry.

As this public dataset takes shape, one can't help but ponder the broader significance of what it means to learn a language in the 21st century. It highlights a shifting paradigm where individual experiences and data become a resource for collective advancement in language education. Is this the dawn of a new era in which personalized learning becomes not just a buzzword but a tangible reality for learners worldwide? The ways in which we engage with language could transform fundamentally, making it more accessible, tailored, and effective. The question worth watching is: How will this collaborative dataset influence the tools and methodologies we use for language learning in the coming years? As the contours of this initiative unfold, it invites us all to stay curious and, more importantly, to stay spooty.

Note: this survey is for people who use Anki to study a language. If you don't use Anki, this one's not for you, but feel free to read on if you're curious.

TL;DR:

The problem: there's no public dataset of what real language learners actually study and how their memory responds to it. Existing data captures either the words without the memory patterns, or the memory patterns without the words. This bottlenecks both research in this area and the learner-facing tools & apps that could come out of it.

What this survey does: it collects both the words people study and how their memory responds to them, from Anki users learning any language - specifically your Anki cards (the words) and review logs (the memory data). Participation takes ~10 minutes, and the survey runs entirely on your device before submission for privacy. You review every card and exclude anything you don't want to share. It is fully GDPR-compliant. The dataset will be released openly so anyone - not just commercial platforms - can build on it.

Survey link: https://nekear.me/research

Below is more information on why this may matter to you, participation, privacy, the purpose of this research, and its novelty - in that order.

Why this can matter to you as a learner

The most immediate benefit is that in just 10 minutes you're directly contributing to research that hasn't been done before, and to a dataset that will become a permanent public resource for the entire language-learning research community.

Longer term, this same research makes a new generation of learning tools possible:

- deck recommenders that know which words you're actually ready for;
- vocabulary sequencers tuned to your prior knowledge;
- smarter spaced repetition schedulers built on personal memory patterns instead of population averages.

And because the dataset will be public, anyone will be able to build them, not just one company.

Who can participate

To make the research outcomes meaningful, the dataset requires its content to follow specific rules.

You're welcome to participate if:

- You actively use Anki for language learning;
- You have reviewed at least some cards in your decks more than 5 times (this is when review patterns start to reflect actual memory rather than early-stage half-random answers).

But submissions below that threshold still help.
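If you want to check the review-count guideline against your own collection, here is a minimal sketch. It assumes you already have a list with one card ID per review-log entry (Anki keeps review entries keyed by card ID in its review log); the function name and the toy data are hypothetical and not part of the survey tool.

```python
from collections import Counter

# Threshold taken from the post: "more than 5 times".
MIN_REVIEWS = 5


def cards_past_threshold(review_card_ids, min_reviews=MIN_REVIEWS):
    """Return the sorted IDs of cards reviewed more than `min_reviews` times.

    `review_card_ids` is an iterable with one card ID per review-log entry.
    """
    counts = Counter(review_card_ids)
    return sorted(cid for cid, n in counts.items() if n > min_reviews)


# Toy review log (made-up IDs): card 101 has 6 reviews, card 202 has 2.
toy_log = [101] * 6 + [202] * 2
print(cards_past_threshold(toy_log))
```

A deck with even a handful of cards past the threshold fits the guideline, and decks below it are still accepted.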

What participation looks like

The survey takes about 10 minutes, and the steps are pretty straightforward:

1. Export your Anki deck (.apkg) with the following checkboxes ticked: "Include scheduling information" (the review logs), "Include deck presets" (the scheduler configuration) and "Support older Anki versions";
2. Open the survey link - it includes a built-in utility that opens your decks fully locally and lets you decide what to submit;
3. Fill out your language proficiency (your known languages affect how you learn new ones) and pick your domains of interest (they shape which words you've likely been exposed to);
4. Review your cards in a preview UI. The utility flags potential personal info (emails, phone numbers, names) for your attention. Exclude anything you don't want shared;
5. Click submit. Nothing leaves your device until this step.
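To give a feel for what the flagging in the review step might look like, here is a rough sketch of pattern-based detection for emails and phone numbers. The patterns and the function name are assumptions made for illustration; the survey's actual rules are not described in this post, and name detection (which needs more than a regular expression) is left out.

```python
import re

# Deliberately simple patterns: an assumption for illustration,
# not the survey tool's actual rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def flag_personal_info(card_text):
    """Return a list of (kind, matched_text) pairs found in one card's text."""
    flags = [("email", m) for m in EMAIL_RE.findall(card_text)]
    flags += [("phone", m) for m in PHONE_RE.findall(card_text)]
    return flags


# Made-up card texts: the first should raise a flag, the second should not.
for text in ["Schreib mir: anna@example.com", "der Hund - the dog"]:
    print(text, "->", flag_personal_info(text))
```

Flagging like this only draws attention to a card. The decision to exclude it stays with you.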

You'll receive a one-time withdrawal token in case you change your mind later.
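The post does not say how the withdrawal token works internally. One common pattern, shown here purely as an assumption, is to hand the participant a random token and keep only its hash on the server, so the stored value cannot be turned back into the token. All names below are hypothetical.

```python
import hashlib
import hmac
import secrets


def issue_withdrawal_token():
    """Create a random token for the participant and the hash a server could store."""
    token = secrets.token_urlsafe(32)
    stored_hash = hashlib.sha256(token.encode("utf-8")).hexdigest()
    return token, stored_hash


def token_matches(presented_token, stored_hash):
    """Check a presented token against the stored hash in constant time."""
    candidate = hashlib.sha256(presented_token.encode("utf-8")).hexdigest()
    return hmac.compare_digest(candidate, stored_hash)


# The participant keeps `token`; only `stored_hash` would be kept with the submission.
token, stored_hash = issue_withdrawal_token()
print(token_matches(token, stored_hash))
```

Under this scheme a lost token cannot be recovered, so it is worth saving it somewhere safe right after submitting.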

What's collected and how it's protected

In plain terms: you choose what to submit (and can exclude anything), the survey's built-in tool flags sensitive info to help you catch it, all identifying details about you are removed so you can't be identified as a learner, your data is stored in the EU, and you can withdraw any time after submitting.

A more technical TL;DR:

- Local-first review. The survey allows you to see every card/note before submission and exclude any of them individually should you deem it necessary. The tool also flags potential personal information (emails, phone numbers, names). Everything runs locally;
- Identifiers stripped or randomized. Your deck names are replaced with meaningless artificial names, all timestamps (e.g., when your card was created) are offset by a random value, and Anki internal IDs are replaced with synthetic counters;
- GDPR-compliant. Data is stored in the EU, and is encrypted at rest, with a withdrawal mechanism via a one-way token you keep;
- Special-category check. Cards mentioning health, religious, or political content trigger an additional explicit notice under GDPR Article 9.
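The identifier handling described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the survey's actual code: the record layout, the `deck_N` naming, the one-year offset range, and the choice of a single offset per submission (which keeps the gaps between timestamps intact) are all hypothetical.

```python
import random

SECONDS_PER_DAY = 86_400


def pseudonymize(deck_names, cards, seed=None):
    """Return copies of `cards` with deck names, IDs and timestamps disguised.

    `cards` is a list of dicts with 'id', 'deck' and 'created' (epoch seconds).
    """
    rng = random.Random(seed)
    # Assumption: one random offset per submission, so intervals are preserved.
    offset = rng.randint(-365 * SECONDS_PER_DAY, 365 * SECONDS_PER_DAY)
    # Replace each real deck name with a meaningless artificial one.
    alias = {name: f"deck_{i}" for i, name in enumerate(deck_names, start=1)}
    result = []
    for counter, card in enumerate(cards, start=1):
        result.append({
            "id": counter,  # synthetic counter instead of the internal ID
            "deck": alias[card["deck"]],
            "created": card["created"] + offset,
        })
    return result


# Made-up example data.
cards = [
    {"id": 1700000000123, "deck": "Spanish::Verbs", "created": 1_700_000_000},
    {"id": 1700000000456, "deck": "Spanish::Verbs", "created": 1_700_003_600},
]
for row in pseudonymize(["Spanish::Verbs"], cards):
    print(row)
```

Shifting every timestamp by the same amount hides when you actually studied while leaving the spacing between reviews usable for memory modeling.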

The full technical schema (every field, what's collected and why, what's transformed, and what's dropped) is accessible here: https://nekear.me/research/data-handling.

About me and the research

My name is Michael. I'm a Master's student in AI at the University of Galway, Ireland, working on my thesis at the intersection of AI and language learning.

Simply put, the research involves training an AI model that predicts how hard a specific word is for you, given the words you already know and your learning patterns. The model is trained on three inputs:

- The word's morphological features (what parts it's built from) and distributional features (how often it appears in real-world usage) - that's the reason your cards are collected;
- Your performance history on similar words - the reason your review logs are requested;
- Your language proficiency profile, because your native and other known languages directly affect how you learn new ones - the reason your language profile is asked.

You can read more here: https://nekear.me/research/data-handling#what-is-collected or ask directly.

Why the research is novel

There's prior work on word-difficulty modeling: Duolingo has published a couple of important datasets in this area (HLR in 2016, SLAM in 2018), but both capture learning within Duolingo's own curriculum: platform-chosen words, platform-formatted exercises, platform scheduling. The publicly missing part is data on what learners themselves chose to study, in any language, scheduled by a memory-faithful algorithm like FSRS, with the full card content intact. As for existing log datasets like open-spaced-repetition (which FSRS was built on), they strip the content out for privacy, while other public vocabulary research datasets don't include memory data. So each half exists only in separate datasets, and no public dataset currently combines the two.

This survey is building the first dataset that has both. Once released publicly, it removes a real bottleneck for anyone working on personalized vocabulary learning.

Beyond the dataset, the research contributes a model that predicts word difficulty by combining two things usually studied separately: the linguistic properties of a word (its morphology and how it's distributed across real usage) and an individual's own memory patterns from their review history. Most prior work treats word difficulty as a fixed, population-level property, while this approach makes it personalized.

Questions / concerns

Comment below, DM me, or email me at hi@nekear.me. I'm genuinely happy to discuss methodology, privacy specifics, or anything else.

Cross-posting note

You may also come across this post in r/Anki, Anki Forums and the Anki Discord #language-learning channel, where I posted / will post with mod coordination. Apologies if you see it more than once. And I appreciate any help spreading the word, as I hope we can make a huge contribution to language learning.

Survey link: https://nekear.me/research

submitted by /u/Nekear_x