Transforming user-generated content into writing hints in Polygloss

Etiene Dalcol, October 24, 2022

Quick introduction

If you are already familiar with Polygloss, you can skip this section :)

Polygloss is a language practice app for learners at the intermediate level. Let’s say you’ve done Duolingo for a while, tried reading a few comic books or watched some TV shows with audio and subtitles in your target language, and now have some vocabulary of your own. Congratulations! Perhaps now you feel it’s time to use it and start “getting out there”. Like riding a bike, the secret to getting better at something is doing it more, so communicating with others until you’re comfortable having full conversations is exactly the way to go. Polygloss exists to help you take your first steps at this stage.

It works like this: You get 4 images, pick one, and write something about it to another player. Then they are shown your text with the images, and have to guess which image you picked. Then the tables are turned: they write, you guess, and the match is over, giving you points to unlock more images. That’s it!

Besides the short interactions, the app offers extra features to facilitate learning: getting a translation with Google Translate, saving a text to consult later, sending and receiving corrections or awards, etc. One such feature, which is actually my favorite, is writing hints that you can unlock for a few energy points that refill daily. In the next sections, I’ll describe how they are implemented, including how I use Natural Language Processing to create them.

Writing hints

Right from the first user tests, I noticed a few interesting things. The first is that coming up with what you want to write is very hard, even with an image to provide a prompt, and even if your vocabulary is enough for something that will do the trick. Maybe you don’t know how to say “I need a bowl” but you know how to say “I need that thing we use for eating cereal”. But first you need to figure out that what you want to say is “I need a bowl”. When users struggled a bit too much to come up with something, they closed the app, stopping the practice session. A big blocking point was that, previously, Polygloss used to choose an image for you. But maybe you don’t know what to say about that specific image, while something pops into your head immediately for the image right next to it! So one of the early changes was allowing people to pick any image they wanted in the lesson, and then I started working on adding writing hints.

German word hints for this image: to serve, food, ready.

Some languages in Polygloss have word hints that you can unlock during a match. The words are usually shown in three categories: the first word is a verb, the second is a noun, and the last one belongs to some other category, like adjectives or adverbs. And they are tailored for you: Polygloss looks at your past matches and prioritises showing you words that you haven’t used yet. This helps you have enough components to form a sentence and teaches you something new.
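
As a minimal sketch of that prioritisation (the function and variable names here are mine for illustration, not Polygloss’s actual code), the idea is just a stable sort that pushes already-used words to the back:

def prioritise_hints(candidate_hints, past_match_words):
	# Words the player hasn't used yet sort first (False < True);
	# Python's sort is stable, so the original hint order is otherwise kept
	used = set(past_match_words)
	return sorted(candidate_hints, key=lambda word: word in used)

print(prioritise_hints(['to serve', 'food', 'ready'], ['food']))
# ['to serve', 'ready', 'food']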

But how do I define the hints for each image? Currently, there are 120 lesson topics in Polygloss, each with an average of 53 images. That’s over 6,000 images in total. Now consider that the hints would have to be defined for each language. Because a Polygloss match is image-based, the content itself is language-independent: you can play in ANY language, minority language, dialect or conlang, as long as there is someone else to play with you. In the past 12 months, players wrote texts in 122 languages. If I wanted to add hints for all of them, that would be over 700,000 image-language combinations to cover, including languages I don’t speak. Oof! To solve this, user-generated content and natural language processing come to the rescue.

Natural Language Processing

The word hints are compiled from all the texts that players wrote in the past, for each image in a language. I take this dataset and pass it through a natural language processing pipeline. This creates the hints dataset that I can import back into the app, which is then shown to the user once they pick an image. I’ll explain each step of this pipeline below.

1. Ranking the texts

Since most of the texts are written by people learning the language, they might contain orthographic mistakes. Given their limited vocabulary, it’s also very common for beginners to misuse otherwise correct words, picking ones that aren’t the most appropriate for a given object or scenario. Because of this, the texts are ranked according to the proficiency declared by their writers.

Proficiency     Points
native          4
advanced        3
intermediate    2
beginner        1
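
In code, this ranking is little more than a lookup table. A minimal sketch, with illustrative names:

PROFICIENCY_POINTS = {'native': 4, 'advanced': 3, 'intermediate': 2, 'beginner': 1}

def rank_texts(texts):
	# texts: list of (text, declared_proficiency) pairs
	# Highest-proficiency writers come first
	return sorted(texts, key=lambda t: PROFICIENCY_POINTS[t[1]], reverse=True)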

2. Tokenization

At this step, the texts are split into words, and then we can remove information that is not useful, such as symbols, emojis and punctuation. While this sounds like something that could be solved by simple string replacement, there are several complications once we hit corner cases. Consider a text that contains it’s: if I treat the apostrophe as punctuation and remove it, I’m left with it and s, but s is not a word. The easiest way to get the individual words in a text is usually to split the string on the space character. However, many languages don’t use spaces between words! Some examples are Lao, Cantonese, Mandarin, Burmese, Thai, Khmer and, in most cases, Japanese. In those languages, people typically know which word is which through context. Vietnamese, for example, uses spaces between syllables instead of between words, so you have to know which blocks belong together. What about “I live in New York”? Do I really want to treat new and york separately? This means that algorithms will often need some other information about the text, and word segmentation can get endlessly complex depending on your needs. Luckily, an open-source Python library for this called spaCy exists.

import spacy

nlp = spacy.load('en_core_web_sm') # Loads the small English language model
doc = nlp("It's a potato sentence.") # Processes a sentence
for token in doc:
	print(token)
	# It
	# 's
	# a
	# potato
	# sentence
	# .

Besides tokenization, spaCy can do many other language processing tasks, and supports multiple languages. Working with it has been bliss and I thoroughly recommend it. As you can see, spaCy has not removed the punctuation yet, but in the next sections I’ll show how we can resolve this with the information it adds to each token during the tokenization process.

3. Part-of-Speech classification & lemmatization

I want to keep the words separated into verbs, nouns, and a third group with all the other kinds of words. Knowing which is which is not a trivial process because many words are ambiguous. For example, take water in I’d like to drink water, please and I should water the plants today. In the first sentence, water is a noun, and in the second it refers to the verb to water. One way to disambiguate this is to check the PoS (this is how I will refer to part-of-speech from now on) of the words around it, and create a dependency parse tree for the phrase. For example, this verb requires an object: I can’t just end the phrase there, I always need to water something. In this case, that is the noun phrase the plants, so that is a big hint that water is acting as a verb here. Sometimes, even after doing this, there are multiple possibilities, and the final decision is made statistically based on the most common uses of a word. Luckily, spaCy has already implemented this for many languages and I can use it to tag the words.
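
Here is spaCy disambiguating the two uses of water from the example above (output from the small English model; other models might tag differently):

import spacy

nlp = spacy.load('en_core_web_sm')
for text in ("I'd like to drink water, please.", "I should water the plants today."):
	for token in nlp(text):
		if token.text == 'water':
			print(token.text, token.pos_)
			# water NOUN (first sentence)
			# water VERB (second sentence)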

Lemmatization is the process of transforming words to their base form, called the lemma. For example: going → to go; knives → knife. This is desirable because, if I really want to know how often a word is used, it helps reduce duplicates. So if, for example, going appeared 13 times for an image, but ate appeared 12 times and eaten 11, I might want to consider that the verb to eat appeared 23 times, and suggest it to users before suggesting to go. As you may have noticed from the previous example, this isn’t a trivial transformation either. You can use a series of rule-based transformations, but exceptions will often come out wrong. Another way of doing this is using a lookup table, which, while correct, will always be incomplete, since many words belong to open classes*.

Neither of these processes is 100% accurate, so manually reviewing the final hints is still necessary. For example, spaCy often transformed glasses into glass, which is correct when we are talking about drinking containers, but not for the thing we use on our faces to see stuff. Still, in most cases, it gives me exactly the information I need:

import spacy

nlp = spacy.load('en_core_web_sm') # Loads the small English language model
doc = nlp("It's a potato sentence flying.") # Processes a sentence
for token in doc:
	print(token, token.lemma_, token.pos_)
	# It it PRON
	# 's be AUX <- we can use the lemma instead of the original token here
	# a a DET
	# potato potato NOUN
	# sentence sentence NOUN
	# flying fly VERB <- same for the lemma
	# . . PUNCT <- we can use this PoS info to remove this token

* Open class vs. closed class words: The ‘class’ of a word means its category, or ‘part-of-speech’, for example: verb, noun, determiner, adjective, preposition, conjunction, etc.

Some classes are closed, meaning the words in that category change very rarely, for example article determiners: a, an, the. Some other kinds of words are open class, which means they keep getting new words and are potentially infinite. Some of these words are just very niche, while others are pretty commonly used but very novel and still to be added to a dictionary. The adjective wireless, for example, probably only started being used when people started conceiving wireless technologies.

4. Word removal

Stop words

Not all words people write in the texts are words I want to account for and suggest back to other writers. Stop words are words typically considered not very important, which we can ignore. There isn’t a general rule for defining them, but they can be, for example, very frequent words that don’t carry a lot of meaning, such as: a, the, but, or, and. spaCy has a list of stop words for some languages, and it also adds an is_stop flag to each token.

import spacy
nlp = spacy.load('en_core_web_sm') # Loads the small English spaCy model
stopwords = nlp.Defaults.stop_words
print(stopwords)
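
Continuing from the snippet above, the is_stop flag makes filtering straightforward:

doc = nlp("This is a sentence about a dog.")
# Keeps only tokens that are neither stop words nor punctuation
content = [token.text for token in doc if not token.is_stop and not token.is_punct]
print(content)
# ['sentence', 'dog']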

OOV (Out of vocabulary)

Because some words are open class, words with orthographic mistakes might have been treated as novel words, and everything would seem fine up to this point. So I might want to check against a dictionary and remove any words that aren’t there. The models for some languages in spaCy have word vectors and, consequently, a vocabulary. During the tokenization stage, the tokens will come with an is_oov flag that I can use to determine whether I want to keep that word or not. If you want to use a smaller model that doesn’t have word vectors for some reason, or are processing a language that doesn’t have them, it’s possible to use external dictionaries such as hunspell. This works because spaCy supports adding more stages to its processing pipeline:

import spacy
import hunspell
from spacy.language import Language
from spacy.tokens import Token

# Registers the custom flag so it can be set on tokens
Token.set_extension("hunspell_spell", default=None)

class spaCyHunSpell:
	def __init__(self, dic_path, aff_path): # Loads the dictionary files
		self.hobj = hunspell.HunSpell(dic_path, aff_path)
	def __call__(self, doc):
		for token in doc: # Adds the new flag to each token
			token._.hunspell_spell = self.hobj.spell(token.text)
		return doc

@Language.factory("spacy_hunspell")
def create_spacy_hunspell(nlp, name): # Creates the spaCy pipe
	return spaCyHunSpell('en_US.dic', 'en_US.aff')

# Unlike the en_core_web_md model, the en_core_web_sm model
# doesn't have word vectors, so `.is_oov` is always true
nlp = spacy.load('en_core_web_sm') 
nlp.add_pipe("spacy_hunspell")

doc = nlp("I will contact attorney general if you do not stop. Thsnks")
last_token = doc[-1]
print(last_token._.hunspell_spell) # Checks if the spelling is correct
# False

Profanity

The other kind of words I’d like to remove, even though they are correct, are slurs and other profanities. I don’t forbid users from typing “Oh shit, I forgot my keys”, but I also don’t want to hand them such words as a writing suggestion. At the time, I did not find any standard list of such words, so I curated one manually, and used a process similar to the one above to add it to spaCy and tag the dataset.
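
A minimal sketch of what that tagging component could look like, assuming a hand-curated set called PROFANITY (the names here are illustrative, not my actual implementation):

import spacy
from spacy.language import Language
from spacy.tokens import Token

PROFANITY = {'shit'} # stand-in for the manually curated list

Token.set_extension('is_profane', default=False)

@Language.component('profanity_tagger')
def profanity_tagger(doc):
	for token in doc: # Flags tokens found in the curated list
		token._.is_profane = token.lower_ in PROFANITY
	return doc

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('profanity_tagger')

doc = nlp("Oh shit, I forgot my keys")
print([token.text for token in doc if token._.is_profane])
# ['shit']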

5. Final steps

Finally, I count the remaining words in each category for each image, multiply each count by the initial rank I gave to the text it belonged to, and order the results.
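
A minimal sketch of this final aggregation, assuming each text arrives with its rank from step 1 and the tokens that survived the removal steps (the names are illustrative):

from collections import Counter

def compile_hints(ranked_texts):
	# ranked_texts: list of (rank, tokens) pairs for one image
	scores = {'VERB': Counter(), 'NOUN': Counter(), 'OTHER': Counter()}
	for rank, tokens in ranked_texts:
		for token in tokens:
			category = token.pos_ if token.pos_ in ('VERB', 'NOUN') else 'OTHER'
			scores[category][token.lemma_] += rank # Weighted by proficiency rank
	# Highest weighted counts first, per category
	return {cat: [lemma for lemma, _ in counter.most_common()]
			for cat, counter in scores.items()}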

Text examples

Unfortunately, processing the word hints is quite a lot of work. Besides the fact that this process needs to be slightly customized for each language, what happens when I add a new lesson with new images? What if there aren’t language processing tools for a specific language? Or if learners are using a language for which I don’t have a lot of data, or for which I haven’t done this process yet? In those cases, word hints are usually not available.

Another thing to consider is that the pipeline above is compiled from time to time from a versioned dataset. It is not run in real time every time a user writes something new in Polygloss. While that is an improvement I could consider adding, in the meantime, especially given all the other open questions, there are alternatives I can employ. For example, I can also straight-up show a whole text from the database in its original form:

In Catalan: Do you want to dance? Of course I do!

With this feature, as soon as someone writes about any image in any language, that information can be useful to someone else. This raises another set of questions: What if the text is inappropriate? What if it contains mistakes? Which criteria do I use to select which one to show?

No matter how I decide to go about this, the important thing to know is that no process will be perfect. That means it is essential that the interface is clear about the origin of the text. First of all, it’s an example, not something we’re going to copy word for word. If it gives us some idea of what to write or teaches us anything new, that’s already good enough. The app also shows the writer’s username with their proficiency level and the date the text was written. That should reinforce the idea that this is a direct human-to-human process, subject to some level of unreliability.

For imperfection to be really OK, though, users must be able to act upon the examples they receive. Through a pop-up menu, they can flag the example as incorrect, report it if the content doesn’t seem alright, or simply replace it with the next one.

Naturally, I want the initial example to have a reasonable quality, so this is how this feature works:

Appropriateness

I ignore texts that have been previously reported, and texts where one of the users involved has blocked the other. I haven’t yet added a setting to stop your own texts from being shown as examples outside the context of a match, but I’ll be adding this option to the main app settings in the very next version, since this is a privacy concern. While interactions on Polygloss are pretty short and impersonal, it’s still better practice to let users control this behavior, especially if they aren’t expecting this type of usage.

Selecting

Some images might have been chosen by many players before. So I start by ranking the candidate texts by the writer’s proficiency level, picking examples written by native speakers first, and gradually reduce my standards if I can’t find any.

The last criterion I use is length. From previous experiments, I learned that, balancing simplicity and effectiveness, length is the best metric for separating texts produced by beginners from texts written by more advanced speakers. While I want to display good examples, I don’t want to accidentally set expectations way too high for what the user is going to do next. Displaying shorter texts first is my way of saying “texts this long are OK!” and avoiding unnecessary stress for learners. In the future, I also want to rank the examples by their award, if they received one, so people can be delighted by something funny, for example.

Correctness

Texts by natives might still be incorrect, and they aren’t always available. And I don’t have a running service to check the orthography of every language in existence. My approach right now is to depend on metadata registered by the users themselves: I ignore texts that received a correction during the match that elicited them, or that have been marked as incorrect a certain number of times when shown as examples.
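
Putting the three criteria together, the selection boils down to a filter followed by a sort. A rough sketch, with hypothetical field names rather than Polygloss’s actual schema:

PROFICIENCY_POINTS = {'native': 4, 'advanced': 3, 'intermediate': 2, 'beginner': 1}
MAX_INCORRECT_FLAGS = 3 # illustrative threshold

def pick_example(candidates, blocked_user_ids):
	eligible = [t for t in candidates
				if not t['reported'] # Appropriateness
				and t['writer_id'] not in blocked_user_ids
				and not t['was_corrected'] # Correctness
				and t['incorrect_flags'] < MAX_INCORRECT_FLAGS]
	# Native speakers first, then shorter texts first
	eligible.sort(key=lambda t: (-PROFICIENCY_POINTS[t['proficiency']], len(t['text'])))
	return eligible[0] if eligible else None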

At this point, one might ask: if we can mark a text as incorrect now, why can’t we send the writer a correction? I can do that during a running match, when I have to guess the image based on my partner’s text, so why not now? It would also mean more opportunities to receive corrections! But think about the purpose of this feature to begin with: these are examples for writers. Who are the writers that are going to need them? While we can play a match with people of any level (the only forbidden pairing is native-native), the people who unlock hints during their writing round are not going to be the ones with an advanced or native command of the language. Actually, I’m tracking this menu choice mostly out of curiosity; I don’t even expect that many players will flag texts as incorrect. The main benefit of having this option in the menu is that people know that they can, and know that what they are seeing might not be 100% correct, without having to go through some specific training or tutorial to learn it. This is why, when managing a product, choosing the metrics that define the success of a feature will always be tricky. Sometimes it is just as fine to stand by intuition.

The other answer to the question of allowing corrections here is that programmers are lazy and more things = more work. The more complex I made this feature, the longer I would take to ship it. Having the simplest scope possible means fewer possibilities for bugs and getting feedback from users sooner. I’ve also started working on a separate feature for reviewing texts, which will be a better opportunity to let native speakers correct texts from matches they didn’t participate in. But if the text example feature is successful, I’ll also consider adding more interactions with the examples, like sending awards to the writer.

Wrapping up

If you liked this post, please share it! It would help me a lot as an indie developer 💖 

Word hints have been available for a while in English, Spanish, French, German, Portuguese and Italian. The text examples are available for all languages at Polygloss for all Android users, and are currently behind an A/B test flipper on iOS as of our latest version, 1.2. If you are learning a new language and haven’t checked us out yet, you can download the app here. I’d love to hear your thoughts about it, just drop a comment below 👇 Cheers! 👋


Etiene is a Software Engineer who is passionate about languages. She recently finished her MSc in Computer Science, researching Computational Linguistics and Computer-Assisted Language Learning.