Readability List

Anime difficulty

The main idea behind the readability list was to create files that make it easy to spot the next anime (or manga, game...) whose vocabulary is closest to what you already know, making the transition to a new title easier. Using Morphman, I was able to create benchmarks for popular resources.

If you're looking for a show to watch, you can simply use the Readability List tab on my spreadsheet. Titles recommended for beginners are in bold.

However, these suggestions are only based on the usual recommendations and are subjective. I think they're still valid, because no amount of data can assign a single objective difficulty grade to a show. So I think it's still the best way to go, but I would point out that you should watch something you enjoy regardless of difficulty.

But if you're interested in data like me, I'll go over the numbers that we can use to measure the difficulty of a show.

Issues

Different times, different results

Ideally, you could simply give each show a grade between 1 and 10 based on the data, and call it a day. Unfortunately, it just doesn't work.

The main problem is that shows have different runtimes. More time means more sentences, which means more words. The problem is that this has a huge impact on the data you collect because of how languages work. Think about the first 20 minutes of a show: a lot of new words will show up. Now add another 20 minutes: you'll get more new words, but fewer than during the first 20 minutes, because some of the earlier words are reused.

The point is that you can't simply divide the number of new words by 2 to get an average, because the number of new words is not proportional to the runtime.
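To make this concrete, here is a minimal sketch that counts how many previously unseen words show up in each consecutive chunk of a text. The function name and the tiny token list are made up for illustration, and the text is assumed to be already split into words (which is what Morphman does for you).

```python
def new_words_per_chunk(tokens, chunk_size):
    """Return the number of previously unseen words in each consecutive chunk."""
    seen = set()
    counts = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        new_words = {word for word in chunk if word not in seen}
        counts.append(len(new_words))
        seen.update(new_words)
    return counts


# Toy example: two "halves" of six words each (invented data).
first_half = ["これ", "は", "猫", "です", "これ", "は"]
second_half = ["これ", "は", "犬", "です", "猫", "です"]
print(new_words_per_chunk(first_half + second_half, 6))  # [4, 1]
```

Both halves are the same length, but the second one only brings a single new word, which is why halving the total would be misleading.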

Using category

So my solution has been to organize shows by category. I took the number of episodes of each show and multiplied it by the runtime of each episode to get the total runtime of the show. I then used that column to divide shows into categories with similar running times. So category 1 is for less than 180 minutes (basically movies), 2 is for shows around 10 episodes, then the ones around 20...

Now, when using other columns to sort the data, you can at least put things in perspective because you know which category the show is in.
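As a rough sketch, the categories could be assigned like this. Only the 180-minute limit for category 1 comes from the description above; the other limits in CATEGORY_BOUNDS, as well as the function names, are placeholder assumptions and not the values used in the spreadsheet.

```python
from bisect import bisect_right

# Upper limits in minutes for each category.
# 180 comes from the text; 360 and 600 are placeholder assumptions.
CATEGORY_BOUNDS = [180, 360, 600]


def total_runtime(episodes, minutes_per_episode):
    """Total runtime of a show in minutes."""
    return episodes * minutes_per_episode


def runtime_category(runtime_minutes, bounds=CATEGORY_BOUNDS):
    """Return 1 for runtimes below the first limit, 2 below the second, and so on."""
    return bisect_right(bounds, runtime_minutes) + 1


print(runtime_category(total_runtime(1, 90)))    # a 90-minute movie
print(runtime_category(total_runtime(12, 24)))   # 12 episodes of 24 minutes
```

With these assumed limits, the movie lands in category 1 and the 12-episode show in category 2.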

Imperfect sources

Each title is based on an export of Anki decks, or on subtitles merged into one file. For example, exports will have one line per sentence, but some subtitles split a sentence into two lines so it's easier to read on screen. So the number of lines is less accurate.

Take Dragon Ball: "んっ!はーっ!はーっ! んーっ!んー…。 んっ!". That's 4 lines of text in the source used. Those lines and words have no impact on your comprehension whatsoever, yet they influence the results. So take the whole process with a grain of salt.

There's no number for accent

Measuring the vocabulary used in a show is one thing, but you simply can't measure how characters speak. Do they speak clearly, with good pronunciation? It's one of the most important factors in my ability to understand a show. Vocabulary difficulty is relative too: even if a show uses really rare words, if you happen to know those rather than the common ones, there's no added difficulty.

Target %

Using Morphman, we can analyze how many words we need to know to achieve x% comprehension of a show. In this case, comprehension means knowing the vocabulary used in the show. So Target=85% means that you want to know 85% of the Total Instances (TI). Those words are of course ranked by their frequency in that show.

To study

So let's say you have a frequency list for each show. You learn each word, starting with the most frequent, and go down the list: how many words does it take to reach your target of x%? That's what the column tells you. Grave of the Fireflies needs 405 words, and Tsuredure Children needs 422. So between the two you may want to pick Grave of the Fireflies, because you will reach 85% comprehension faster.
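The idea behind that column can be sketched in a few lines. This is not Morphman's own code, just an illustration: the function name and the toy data are made up, and the show is assumed to be already split into morphs.

```python
from collections import Counter


def words_needed(tokens, target=0.85):
    """Count how many words, taken from most to least frequent,
    are needed to cover `target` of all word instances."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= target:
            return rank
    return len(counts)


# Toy show with 10 instances: "a" x5, "b" x3, "c" x1, "d" x1 (invented data).
toy_show = ["a"] * 5 + ["b"] * 3 + ["c", "d"]
print(words_needed(toy_show, target=0.85))  # 3
```

Here "a" and "b" only cover 80% of the instances, so a third word is needed to pass 85%.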

Average Frequency

But what about those words? How useful are they when it comes to the next show? After all, they're based on the frequency within that show only. That's why we use a Master frequency list that combines the frequency of each word across all shows. This column takes each word you need to reach your target, looks up its frequency on the master frequency list, and averages the results.

Grave of the Fireflies has 405 words with an average of 19,461. Tsuredure Children has 422 but an average of 20,233. That means that the words you learn for Tsuredure Children will be more useful in the long term when watching other shows, because on average those words are more common than the 405 of Grave of the Fireflies.

This is the data that I picked to sort shows by because I think it's the most valuable, but take a look at the recommendations too: the characters in Tsuredure Children speak really, really fast, so it's not a good choice to start with.
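Here is a small sketch of how such an average could be computed. It assumes that "frequency" means the raw number of occurrences across all shows combined, which may differ from how the spreadsheet counts it; the function names and the toy data are invented for the example.

```python
from collections import Counter


def build_master_counts(shows):
    """Combine the word counts of every show into one master frequency table."""
    master = Counter()
    for tokens in shows.values():
        master.update(tokens)
    return master


def average_master_frequency(words, master):
    """Average master-list frequency of the words needed to reach the target."""
    if not words:
        return 0.0
    return sum(master[word] for word in words) / len(words)


# Toy data: two tiny "shows" already split into words.
shows = {
    "Show A": ["a"] * 5 + ["b"] * 3 + ["c", "d"],
    "Show B": ["a"] * 4 + ["e"] * 4 + ["f"] * 2,
}
master = build_master_counts(shows)
print(average_master_frequency(["a", "b"], master))  # 6.0
print(average_master_frequency(["a", "e"], master))  # 6.5
```

In this toy case the words to learn for Show B are slightly more common overall, so they would pay off a bit more in other shows.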

Which Target to choose?

I picked 85% because you'd know enough to enjoy a show. You can use 90% when you know a bit more, and ideally you could settle for 95%, which is what you need to follow a show without difficulty. 100% means you'll put even the rare words from the show in the mix, which lowers the average frequency as a result. Shows with specific vocabulary will be at a disadvantage.

Other Factors

New Data

There are 4 numbers that we can collect from a show:

  • Runtime: number of episodes × duration of each episode

  • Total Morph: number of unique words (morphs) used

  • Total Instances: total number of words used, counting repeats (morphs are used more than once)

  • Lines: number of lines

Sentence length

Number of instances divided by number of lines = average instances per line. The shorter the sentence, the easier it should be to understand.

Repetition

Number of unique words (morphs) divided by number of instances: a measure of how often words are repeated. This one is tricky. For low instance counts (movies), the lower the ratio the better, because if only a few morphs make up the whole text, then they're repeated a lot. But for high instance counts, the higher the ratio the better, because a high instance count means you'll see a lot of morphs showing up only once.

Basically, don't use this: it relies too much on the number of instances to be relevant data.

Dialogue Frequency

Runtime divided by lines. It doesn't necessarily reflect the speed at which characters talk, because a show can have long stretches where nobody speaks, but when they do, they speak really fast, like in Totoro. But it can be a good indication. If anything, the lower the number, the more time is spent with characters speaking.
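The three ratios above can be derived from the 4 collected numbers as in this sketch. The class and field names are my own for the example, and the numbers in the usage part are invented, not taken from a real show.

```python
from dataclasses import dataclass


@dataclass
class ShowStats:
    runtime_minutes: float  # number of episodes x duration of each episode
    total_morphs: int       # unique words
    total_instances: int    # all words, counting repeats
    lines: int              # number of lines

    @property
    def sentence_length(self):
        """Average number of instances per line."""
        return self.total_instances / self.lines

    @property
    def repetition(self):
        """Unique words divided by instances (lower means more repetition)."""
        return self.total_morphs / self.total_instances

    @property
    def dialogue_frequency(self):
        """Minutes of runtime per line (lower means more talking)."""
        return self.runtime_minutes / self.lines


# Invented numbers for a 90-minute movie.
movie = ShowStats(runtime_minutes=90, total_morphs=1500,
                  total_instances=6000, lines=1200)
print(movie.sentence_length, movie.repetition, movie.dialogue_frequency)
```

Keep in mind that the repetition ratio is only comparable between shows with a similar number of instances, as explained above.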

Conclusion

I think the most important factor is which show will teach you the most useful words. It means that if you follow that order, it's the most efficient way to reach comprehension of the next show by learning the smallest number of words. Since your skill will increase more by doing immersion than by doing Anki, spending more time in immersion and less time in Anki seems like a good idea to me.

The other factor would be sentence length, I think. Other than that, simply use the recommendations.

Use the readability files and the Morphman tutorial to get your personalized data, and enjoy.