How many Chinese characters do you need to learn to be able to understand normal texts written for native speakers?
2,000? 3,500? 5,000? 7,000?
This question is hard to answer for two reasons. First, in modern Chinese, meaning is conveyed mostly through words that have two or more characters, so simply knowing a number of single characters is not enough. Second, it depends on what you mean by “understand” and “normal”.
3,500 to 5,000 characters will enable you to read most texts written for native speakers
The answer lies somewhere between 3,500 and 5,000, depending on what you read and how much you want to use a dictionary to look up unknown characters. This assumes broad knowledge of words that can be built out of the characters you know, as well as grammar, of course.
In this article, I will probe the upper bound for practically useful character knowledge from the perspective of a second language learner. If you are still working your way through standard textbooks or the HSK, this article probably isn’t for you! If you’re into characters, at least partly for their own sake, and want to move beyond learner materials, then this article is for you.
A closer look at the question of how many characters you need to know
There are several statistical analyses of how many characters you need to know floating around on the internet, but none of them are very helpful. For example, some say that learning 2,000 characters will enable you to recognise 98% of written text. This sounds great, but if you have actually learnt 2,000 characters, you will know that this does not mean you understand 98% of normal texts for native speakers (newspapers, novels and so on).
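To make concrete what a coverage figure like this measures, here is a minimal Python sketch. It counts character tokens in a short made-up text (not real corpus data; the function name `coverage` is just for illustration) and reports what share of them you would recognise if you knew only the most frequent characters.

```python
from collections import Counter

def coverage(counts, known):
    """Share of all character tokens in the text that belong to `known`."""
    total = sum(counts.values())
    covered = sum(n for ch, n in counts.items() if ch in known)
    return covered / total

# Toy text for illustration only, not real corpus data.
text = "我们的学校很大我们的老师很好"
counts = Counter(text)

# Pretend we only know the four most frequent characters in this text.
known = {ch for ch, _ in counts.most_common(4)}
print(f"Coverage with {len(known)} known characters: {coverage(counts, known):.0%}")
```

Note that this only measures recognition of individual characters, not understanding of the words they form. Even at 98% coverage, one character in fifty is still unknown, which works out to about ten unknown characters on a page of 500.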
What it feels like to know 3,500 characters vs. knowing 5,000 characters
Let’s ditch statistics and go with actual experience. I have kept close track of how many characters I know throughout my fifteen years of learning Chinese, so I can share what knowing a certain number of characters feels like when reading texts in Chinese. Obviously, I didn’t only learn single characters, but also words.
Here are a couple of subjective and very rough reference points:
- 3,500 characters is a good goal for being able to read most texts without having to use a dictionary all the time. Having learnt thousands of characters, you’ll often be able to guess the meaning of unknown characters in context. They are rare enough and embedded in text you otherwise understand well. If you can’t guess, it’s unlikely to impede your reading significantly.
- 5,000 characters is a good upper limit for students focusing on modern Chinese. Characters beyond this tend to be used very rarely, often in names of people, places, animals or plants. I originally pushed to 6,000 characters, but have since gotten rid of most beyond 5,000 simply because they really don’t feel useful at all.
Learning more characters than is good for you
Focusing on the sheer number of characters you know is not advisable for most students. There isn’t much practical value in learning characters beyond 3,500 unless your Chinese is already very advanced and you pick them up through copious amounts of reading. There is no practical value in learning characters beyond 5,000. For most students, the best approach is to learn anything you come across at least three times. You will have to read a lot to get to 5,000 characters this way!
Still, some people just like learning characters. That’s okay! To my knowledge, the person who has learnt the most characters in Skritter is Emil Persson, whom I interviewed about his feat of learning 10,000 characters in Skritter back in 2015. You might not want to go that far, but what if you want to go well beyond HSK or any other learning material created for foreigners?
Learning a lot of characters in Skritter
As usual, if you’re aiming for practical relevance, you should get your characters and words from the Chinese you read. If you read a lot of Classical Chinese, for example, you will find tons of characters beyond the 5,000 range. Even if you read lots of modern Chinese, you will find many characters you don’t know. Learn those first.
If, for some reason, you just want a long list of characters sorted by frequency, I have published a deck with almost 10,000 of them. This deck is based on 现代汉语单字字频: Character frequency list of Modern Chinese by Jun Da at Middle Tennessee State University.
Check out the list here: The 9933 most common Chinese characters in order of frequency
Before you dive in and start collecting the rarest butterflies, it’s important to have some idea of how the list was created.
Five things to keep in mind when going beyond 5,000 characters
First and foremost, this deck works really well for anything up to 5,000 characters
These occur often enough in modern Chinese that meaningful frequency data can be found. Thus, you can use the deck to find characters you maybe should know, but don’t. If you know 3,800 characters, say, but there are a bunch of characters in the 2,500 to 3,000 range that you don’t know, then learning those will have a positive effect on your reading ability.
This is one of the few cases where I think learning single characters directly from a list like this is advisable for most students, i.e. when you use it to plug holes in your character knowledge well within your current level.
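Finding such holes is easy to automate if you have the frequency list and your own known characters as plain text. Here is a minimal sketch in Python; `find_gaps` is a hypothetical helper, and the ten-character ranking below is a toy example, not the actual order from the frequency list.

```python
def find_gaps(ranked_chars, known, start=1, end=3000):
    """Return (rank, char) pairs for unknown characters with start <= rank <= end.

    `ranked_chars` is a sequence of characters ordered from most to least
    frequent; ranks are counted from 1.
    """
    return [
        (rank, ch)
        for rank, ch in enumerate(ranked_chars, start=1)
        if start <= rank <= end and ch not in known
    ]

# Toy ranking and known set for illustration only.
ranked = list("的一是不了人我在有他")
known = set("的一是了我有")

for rank, ch in find_gaps(ranked, known, start=1, end=10):
    print(f"#{rank}: {ch}")
```

With real data, you would load the full list into `ranked`, export your known characters into `known`, and set `start` and `end` to a range well below the number of characters you already know.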
Second, the order of characters beyond 7,000 is not reliable
This is because the corpus is not big enough to sort these characters among themselves. You need a truly humongous set of texts to find hundreds or thousands of occurrences of 岍, the name of a mountain in Shaanxi, which is ranked #9783 in the frequency list.
As it happens, this character was only found once in the corpus, and that’s true for many of the rarest characters. If the single text this character occurred in had been left out and another included instead, the list would have contained a different character used to point to a different mountain somewhere else.
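The effect of such ties can be sketched in Python. The counts below are invented for illustration (they are not taken from the actual frequency list), and `tie_blocks` is a hypothetical helper name. It shows which ranks share the same occurrence count; within such a block, the order tells you nothing.

```python
from itertools import groupby

def tie_blocks(counts_desc):
    """Group a descending list of occurrence counts into runs of equal counts.

    Returns (count, first_rank, last_rank) for each run, with ranks from 1.
    """
    blocks = []
    rank = 1
    for count, group in groupby(counts_desc):
        size = len(list(group))
        blocks.append((count, rank, rank + size - 1))
        rank += size
    return blocks

# Made-up occurrence counts, sorted from most to least frequent.
toy_counts = [50, 20, 20, 3, 1, 1, 1, 1]

for count, first, last in tie_blocks(toy_counts):
    if first == last:
        print(f"rank {first}: {count} occurrence(s)")
    else:
        print(f"ranks {first}-{last}: all tied at {count} occurrence(s)")
```

In a real frequency list, the last block (characters seen only once) can span a very long stretch of ranks, so a difference of a few hundred places there carries no information.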
Third, the corpus used here is modern, simplified Chinese, but it still contains traditional characters
It’s important to realise that only a couple of thousand characters were simplified, which means that when you get to very rare characters, talking about simplified and traditional characters becomes a bit pointless.
You will also find traditional characters in simplified texts, although at a much lower rate. For example, 门, “door; gate”, is very common at rank #185, but the traditional version of the same character, 門, is also included in the list at rank #6028. While I can’t say for sure, it’s not unreasonable to think that simplified texts talking about the origins of Chinese characters or about character simplification will mention the traditional form. This is enough to make the traditional 門 more common than some really obscure characters, even though all the texts are in simplified Chinese. Take 铑, “rhodium”, at rank #6031: it is less common than the traditional character 門, even though it’s clearly simplified.
Fourth, we don’t have manually audited data for the rarest characters in Skritter
This means that we have to rely on open-source information regarding pronunciation and meaning. As you can see in the original list (which uses CEDICT), lots of characters are missing pronunciation, definition or both.
Please note that Skritter uses the syllable “chua4” as a placeholder for unknown pronunciation. You will see this reading a lot if you go through the later sections of the deck, but now at least you know this isn’t a weird fluke from Middle Chinese phonology or something. Feel free to submit corrections for characters if you research them and want to add their actual pronunciations and/or definitions.
Fifth, some characters are rare enough that they aren’t covered by standard CJK fonts
You need a font that can handle Unicode extensions (extension B in particular). These characters can’t currently be displayed in Skritter and so they are missing from the deck, which is why there aren’t exactly 200 characters in each section. There are 29 missing characters in total, listed here:
, , , , , , , , , , , , , , , , , , , , , , , , , , , ,
Note: If you see squares, question marks or nothing at all, this is because your browser font can’t handle these characters.
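If you want to check which characters in a list of your own fall into Extension B, a rough Python sketch might look like this. The block range U+20000 to U+2A6DF comes from the Unicode standard; the function names are made up for this example, and whether a given app or font fails on exactly these characters is an assumption.

```python
# CJK Unified Ideographs Extension B occupies U+20000..U+2A6DF.
EXT_B_START = 0x20000
EXT_B_END = 0x2A6DF

def in_extension_b(ch):
    """True if the single character `ch` lies in the Extension B block."""
    return EXT_B_START <= ord(ch) <= EXT_B_END

def split_by_block(chars):
    """Split characters into (outside Extension B, inside Extension B)."""
    outside = [ch for ch in chars if not in_extension_b(ch)]
    inside = [ch for ch in chars if in_extension_b(ch)]
    return outside, inside

# Two common characters plus the first code point of Extension B.
sample = ["门", "門", "\U00020000"]
outside, inside = split_by_block(sample)

# Print code points rather than glyphs, since the glyphs may not render.
print("Outside Extension B:", [f"U+{ord(ch):04X}" for ch in outside])
print("Inside Extension B:", [f"U+{ord(ch):04X}" for ch in inside])
```

This only covers Extension B; other extension blocks have their own ranges and would need to be added separately.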
Learning a lot of traditional characters in Skritter
If you want to focus on traditional characters instead, you can use the lists I created for this purpose. These cover 5568 characters and are separated into nine decks roughly matching the nine grades in compulsory education in Taiwan:
- 台灣通用字彙(第一級)
- 台灣通用字彙(第二級)
- 台灣通用字彙(第三級)
- 台灣通用字彙(第四級)
- 台灣通用字彙(第五級)
- 台灣通用字彙(第六級)
- 台灣通用字彙(第七級)
- 台灣通用字彙(第八級)
- 台灣通用字彙(第九級)
I actually went through and learnt every single character in these lists seven or eight years ago, arriving at a total of 6,000 characters in Skritter, counting some character components and extra characters I learnt elsewhere as well. I have since cut down the number to about 5,000, as most of the latter half of level 9 has little practical benefit for someone like me who reads mostly modern Chinese.
How many characters do you know? What did you learn beyond the characters themselves?
If you know of other interesting decks in Skritter or lists outside of Skritter, please leave a comment! Likewise, if you’ve learnt many characters (you can choose your own definition of what that means), it would be great to hear what you think and if you have any suggestions for other learners who want to follow in your footsteps!