Readability List

Studying

Combined with Morphman, the readability files will help you do a lot of things for your studies:

  • Figure out how much of the vocabulary you know for each anime, game, etc.

  • Find out how many words you need to learn to reach x% readability (comprehension).

  • Automatically prioritize those words by frequency in Anki.

To do this, you'll need the files that I've put together, here's the link.

HOW IT WORKS

Comparing vocabulary

The idea is simple: you compare the vocabulary you know with the vocabulary you want to analyze, and Morphman tells you how much they have in common.

There's a lot of data, but there are basically 3 numbers that you want to look at. Hopefully this picture will help you visualize the results.

  1. The text is 5 lines of dialogue.

  2. The text is 15 words long (these are called instances).

  3. The text has 6 different words (only 6 words are used to make the 15 words of the text).
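
The three numbers above can be sketched in a few lines of Python. This is only an illustration: it assumes the text is already split into words with spaces, whereas Morphman relies on a morphological analyzer to split Japanese. The sample text and the function name `text_stats` are made up for this example.

```python
def text_stats(text):
    """Return (lines, instances, different words) for a space-separated text."""
    lines = [line for line in text.splitlines() if line.strip()]
    words = [word for line in lines for word in line.split()]
    return len(lines), len(words), len(set(words))


# Hypothetical sample: 5 lines, 15 words, 6 different words.
sample = """a b c
a b d
a c e
b c f
a a b"""

print(text_stats(sample))  # (5, 15, 6)
```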

How it looks in Morphman

Keep in mind that your results depend on your Morphman settings. In my case, I checked the options to ignore what's in brackets and to treat proper nouns as known.

In this case, my database is still at 0, meaning I haven't reviewed any cards yet and I don't know any words.

  • The movie A Silent Voice has a total of 1141 different morphs. I know 8 of them, so that's 0.70%.

    • It's 8 rather than 0 (my database is empty) because in my settings I checked the option to treat proper nouns as known.

  • The text has 7214 morphs in it, and I know 14 of them. That means 0.19% of the text is just names. (I put most of the names in brackets, so they're not counted.)

    • Of that 0.19%, 0% is counted as Young (words you've just seen) and 0.19% is counted as Mature (fully known).

  • As you can see, Proper noun % shows the same result, 0.19%, because names are the only words I know in this text.

  • There is no line where I know all the words, so that's 0%.

    • i+1 lines are lines where there is only one word that I don't know. Here it's 7.80%.

If I want to know every word of the movie, I need to learn 1141 words minus the 8 that I already know, so 1133.
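
To make the percentages above concrete, here is a rough sketch of how they can be computed. It is not Morphman's code: the function names and the toy data are hypothetical, and it assumes an "unknown word" in a line means a distinct word that is not in the known set.

```python
from collections import Counter


def readability(counts, known):
    """Compare word counts for one text with a set of known words."""
    total_unique = len(counts)
    total_instances = sum(counts.values())
    known_unique = sum(1 for word in counts if word in known)
    known_instances = sum(n for word, n in counts.items() if word in known)
    return {
        "known_unique_pct": 100 * known_unique / total_unique,
        "known_instances_pct": 100 * known_instances / total_instances,
        "words_to_learn": total_unique - known_unique,
    }


def line_stats(lines, known):
    """Return (% of fully known lines, % of lines with exactly one unknown word)."""
    unknown_per_line = [len({w for w in line if w not in known}) for line in lines]
    fully_known = sum(1 for n in unknown_per_line if n == 0)
    plus_one = sum(1 for n in unknown_per_line if n == 1)
    return 100 * fully_known / len(lines), 100 * plus_one / len(lines)


# Hypothetical toy data, not taken from the movie.
lines = [["a", "b"], ["a"], ["b", "c"], ["a", "a"]]
counts = Counter(word for line in lines for word in line)
known = {"a"}

print(readability(counts, known))
print(line_stats(lines, known))  # (50.0, 25.0)

# The same arithmetic with the figures quoted above.
print(round(100 * 8 / 1141, 2), round(100 * 14 / 7214, 2), 1141 - 8)
```

With the toy data, one of the three different words is known, and two of the four lines are fully known.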

What are the readability files?

Titles

  • There's one file for each anime, each game, and so on. As you can see in the pictures, I sort them by type of media. Each folder then contains a list of files.

  • You can use each file to create a frequency list, so you'll prioritize learning the words from the files you analyzed.

Corpus

  • Analyzing all the "titles" files at once takes a very long time. Corpus files are basically pre-analyzed titles: the information about which words are in which files is already stored. Morphman only needs to compare it to your database to give you the readability reports. What could take more than an hour takes a few seconds.

  • Use this to know the readability % of each title. They're in different folders depending on the medium (anime, games...).
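
The reason a corpus is faster can be sketched as follows: the word counts are computed once and saved, and later runs only compare the saved counts with your known words. This is an illustration under assumptions. The real corpus file format is not described here, so JSON is used as a stand-in, and all names and data are hypothetical.

```python
import json
from collections import Counter


def build_corpus(titles):
    """Slow step, done once: count the words of every title."""
    return {title: dict(Counter(words)) for title, words in titles.items()}


def known_instances_report(corpus, known):
    """Fast step: compare the stored counts with the known words."""
    report = {}
    for title, counts in corpus.items():
        total = sum(counts.values())
        hits = sum(n for word, n in counts.items() if word in known)
        report[title] = round(100 * hits / total, 2)
    return report


# Hypothetical titles, already split into words.
titles = {"title_a": ["a", "a", "b"], "title_b": ["a", "c", "c", "c"]}

saved = json.dumps(build_corpus(titles))  # what would be written to disk
corpus = json.loads(saved)                # what would be read back later

report = known_instances_report(corpus, {"a"})
ranked = sorted(report.items(), key=lambda item: item[1], reverse=True)
print(ranked)  # [('title_a', 66.67), ('title_b', 25.0)]
```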

Database

This folder was created for those who don't use Anki and Morphman but other resources like Wanikani or Genki. It contains premade databases (K and V numbers) based on the vocabulary you learn from those resources.

Master Frequency Lists

This is to know how common a word is in Japanese. With Morphman you can make sure that you don't spend time learning words that are used in only a few shows. It's useful considering that the first few hundred words represent a huge percentage of all the words used. So learning those few hundred first will improve your comprehension of a lot of titles at once. The idea is that you then specialize in one genre, like slice of life or science fiction.

How the files were created

Settings

Input Directory

  • Input directory is the directory you're going to analyze. Morphman will analyze anything that's in it, so I made an empty one called Analyse within the readability archive. I usually copy & then paste files in that directory depending on what I need to do, and delete them when I'm done.

  • Select Japanese MecabUnidic as your morphemizer.

General Settings

  • In the general settings you can pick your known database. By default, it's called known.db. My files include premade databases if you don't use Anki, and this is where you choose which one to use.

  • You can also select the directory for the files you're going to produce. By default, they'll be in your db directory.

Outputs

  • By default, any files will be analyzed in alphabetical order, but you can also use the option "Group by directory".

  • Line stats will give you data for the Line readability % and i+1 Lines % columns seen above. They really slow the process down, and they're not the most relevant data, so I usually uncheck that box. The corpus files were made without that data, so if you analyze the corpus files, the line columns (Line readability %, i+1 Lines %) will be empty.

Create a corpus

Corpus files are much faster to analyze than individual files, because they save the data of which words are inside each file. But to create the corpus, you still have to analyze the individual files at least once.

To create a corpus file, go to the advanced settings and check the option "Save Readability Corpus DB". Click Analyze and the file will be created in your Output directory. You can rename it if you want; it won't change the data. You can also analyze two corpus files to create a new one that combines the data of both.

Create a Frequency List or a Master Frequency List

Create the reports

A master frequency list is used to filter out uncommon words, so you can prioritize the most common and useful words.

When you create a frequency list, each word will receive points depending on how often it appears in the title or corpus you're analyzing, but also on its frequency number in the master frequency list.

Check the "Write Word Report" boxes. You'll create two files:

  • instance_freq_report is how many times a word appears.

  • morph_freq_report is how many files a word appears in.

When using a master frequency list for your study plan, it's probably best to select the instance one. There are premade master frequency lists in my archive.

Below, you can see that words are ordered by frequency. The most common word in "A Silent Voice" is だ, which appears 384 times. Because we only analyzed that one movie / file, you can see on the right that the frequency for each word is 1: each word appears in only 1 file.
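
The difference between the two counts can be sketched like this. The function name and the two tiny "files" are hypothetical, and the sketch does not reproduce the layout of the real report files.

```python
from collections import Counter


def frequency_counts(files):
    """files: dict of file name -> list of words.

    Returns (instance_freq, file_freq): instance_freq counts every
    occurrence of a word, file_freq counts the files that contain it.
    """
    instance_freq = Counter()
    file_freq = Counter()
    for words in files.values():
        instance_freq.update(words)
        file_freq.update(set(words))  # each file counts at most once per word
    return instance_freq, file_freq


# Hypothetical episodes, already split into words.
files = {"ep1": ["だ", "だ", "の"], "ep2": ["だ", "が"]}
instance_freq, file_freq = frequency_counts(files)

print(instance_freq["だ"], file_freq["だ"])  # 3 2
```

With a single file, every word in the second count is 1, which is what the report for one movie shows.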

Reading the report

Each file has the same columns.

  • Frequency: how many times the word appears.

  • The word, written in kanji, common form, katakana.

  • Nature of the word (noun, verb ...)

  • Order. You can see in the last line on the left that both が and の have the same frequency of 98, so they're tied for 12th. The next column is the regular count.

  • % of the text it represents. The total number of instances was 7214, so 384/7214 = 0.0532, or 5.32%. だ represents 5.32% of the whole text.

  • The next column is the same but cumulative. Adding the second row gives 5.32 + 3.70 = 9.02%, so だ and た together represent 9.02% of the whole text.

  • The last column simply tells you whether the morph is in your known.db: matches 0 means no, matches 1 means yes.
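
As a rough sketch, the numeric columns can be rebuilt from plain word counts. The column names in the code are hypothetical and the toy counts are made up. The last line redoes the percentage for だ using the count of 384 quoted earlier and the total of 7214 instances.

```python
def build_report(counts, known):
    """Return one row per word, most frequent first."""
    total = sum(counts.values())
    rows = []
    cumulative = 0
    rank = 0
    previous_freq = None
    ordered = sorted(counts.items(), key=lambda item: -item[1])
    for position, (word, freq) in enumerate(ordered, start=1):
        if freq != previous_freq:
            rank = position  # tied words share the rank of the first one
            previous_freq = freq
        cumulative += freq
        rows.append({
            "freq": freq,
            "word": word,
            "rank": rank,          # order, with ties
            "position": position,  # regular count
            "pct": round(100 * freq / total, 2),
            "cum_pct": round(100 * cumulative / total, 2),
            "matches": 1 if word in known else 0,
        })
    return rows


# Hypothetical counts for a tiny text of 10 instances.
rows = build_report({"だ": 5, "た": 3, "が": 1, "の": 1}, known={"だ"})
for row in rows:
    print(row)

print(round(100 * 384 / 7214, 2))  # 5.32
```

In the toy output, が and の share rank 3 while their positions are 3 and 4, which mirrors the tie described above.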

Why you should use a Master Frequency List

Hope you like math... Check out these curves. These are the graphs for A Silent Voice and Anime instances (250 titles).

It shows how many words you need to achieve Y% of comprehension.

As you can see, learning just a few words will improve your comprehension greatly at first, because the most common words are very common. But after a while, each newly learned word will have less of an impact on your comprehension.

That's why you should absolutely start with the most common words.

To be specific (with anime instances)

  • 50%: 53 words

  • 75%: 634 words

  • 80%: 1156 words

  • 85%: 2142 words

  • 90%: 4322 words

  • 95%: 9956 words

What you can learn from this is that it's beneficial to focus on different anime that will teach you the same vocabulary, so that learning words from one show will help you understand the next one. Since slice of life anime usually use everyday vocabulary and the characters speak "normally" and slowly, they're usually recommended as your way in.
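
The "how many words for Y%" figures come from a cumulative count over the frequency list. Here is a minimal sketch with made-up frequencies (the real numbers above come from the anime instances list, which is not reproduced here). The function name is hypothetical.

```python
def words_needed(frequencies, target_pct):
    """Number of most frequent words needed to cover target_pct of all instances."""
    ordered = sorted(frequencies, reverse=True)
    total = sum(ordered)
    covered = 0
    for count, freq in enumerate(ordered, start=1):
        covered += freq
        if 100 * covered / total >= target_pct:
            return count
    return len(ordered)


# Hypothetical frequency list: 6 words, 100 instances in total.
frequencies = [50, 20, 10, 10, 5, 5]

for target in (50, 75, 90, 95):
    print(target, words_needed(frequencies, target))
```

Even in this tiny list, the first word alone covers half of the text, while the last few percent require the rarest words.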

Using the Files

Find your next title to enjoy / study

Let's finally get down to business. The first thing we're going to do is find a title that's close to what we already know.

In my spreadsheet, I have two tabs: one for the readability list and one for anime difficulty. So which one should you use? Neither. The anime difficulty tab is simply there out of curiosity, and to find some shows that could be easy. As explained on the next page, putting a single grade on difficulty simply doesn't make sense (no offense to other websites that do this). One thing we can actually measure, though, is how useful the most frequent vocab of a show is, and that's what the anime difficulty tab shows. But Shirokuma Cafe, for example, is a good anime to start with, and it sits at rank 180 at the time I'm writing this. So look for a combination of 3 factors: useful, recommended, and close to your knowledge.

After you've watched / studied a couple, I'd advise simply going for what's close to your level and what you're interested in.

So, simply select the corpus you want and analyze it. Below are the results after finishing the Core Anime Deck. You can click on the headers to reorder titles.

You should order the results by Known Instances %.

I'm going to pick A Silent Voice, since the characters speak more slowly than in Toki wo Kakeru Shoujo and it's also a movie, so it's short.

Make a Study Plan

Since I'm using A Silent Voice (Koe no Katachi) for my next study, I want to learn the most frequent words of that anime first, so I can increase my understanding of it faster. So I copy the individual text file and paste it into the Analyse directory to analyze it.

Check BOTH boxes, Study Plan and Set Frequency List, or it won't work.

Use the Target %

The target % is the number you need to reach with Known Instances %. Basically, it's how much of the text you'll be able to understand. It's up to you to decide which number to go for.

You can see below that my current readability is at 79.51%, and when I'm done with my study plan, after learning 786 new words, I'll be at 100.00%.

  • Avg Study Freq is the average frequency of those 786 words on the master frequency list.

  • New Master Freq % is how much of the Master Frequency List I'll know by the end of the study plan.

This is using the Anime-Instance Master Frequency.
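
The idea behind the target % can be sketched as a simple loop: learn the most frequent unknown words first and stop once the target is reached. This is a simplification, not Morphman's actual algorithm (which also weighs the master frequency list), and the function name and data are hypothetical.

```python
def study_plan(counts, known, target_pct):
    """Pick unknown words, most frequent first, until target_pct of instances is known.

    Returns (words to learn, readability % before, readability % after).
    """
    total = sum(counts.values())
    known_instances = sum(n for word, n in counts.items() if word in known)
    before = 100 * known_instances / total
    unknown = sorted(
        (item for item in counts.items() if item[0] not in known),
        key=lambda item: -item[1],
    )
    plan = []
    for word, freq in unknown:
        if 100 * known_instances / total >= target_pct:
            break
        plan.append(word)
        known_instances += freq
    return plan, before, 100 * known_instances / total


# Hypothetical text with 100 instances; only "a" is known.
counts = {"a": 50, "b": 20, "c": 15, "d": 10, "e": 5}
plan, before, after = study_plan(counts, {"a"}, target_pct=85)

print(plan, before, after)  # ['b', 'c'] 50.0 85.0
```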

Use the master frequency list

There are a few, depending on what you want to prioritize. Pick one.

If you want the most diversity, use the Netflix one. If you don't care about words used in reality TV and just want to focus on words used in anime, pick the anime one.

Without a master frequency list, you'll just learn the most common words of the titles you analyze, but some of those words may not be very common in anime in general. Let's say you want to study with Pokemon. A lot of Pokemon names will be frequent in Pokemon but never used in other anime, so you want to use a Master Frequency List to filter those words out of your study plan.

Which number for the master frequency list?

*The numbers used in those examples may change with updates.

Open one of your lists. Let's say you want to focus on the 2000 most common words. Go to line 2000 (I use Notepad++ for this).

Sitting in that position is "カツラ". It has a frequency of 486. That's the number you use in Minimum Master Frequency.

Any word that has a frequency of 486 or above will be available for the study plan, but you won't learn words that have a frequency below 486.

If we analyze again, the study plan is down to 451 words to learn instead of 786.

You can also see that we'll only reach 93.68% comprehension. That's because the remaining words have a frequency lower than 486 on the master frequency list. To reach a higher readability %, we'll need to pick a number below 486.
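
Looking up the threshold and applying it can be sketched as follows. The function names are hypothetical, and the tiny master list is made up: only カツラ and its frequency of 486 come from the example above, and the sketch assumes the master list is simply a mapping from word to frequency.

```python
def minimum_master_frequency(master_counts, top_n):
    """Frequency of the word sitting at position top_n (1-based) in the master list."""
    ordered = sorted(master_counts.values(), reverse=True)
    return ordered[top_n - 1]


def filter_plan(plan, master_counts, minimum):
    """Keep only the words whose master frequency is at least `minimum`."""
    return [word for word in plan if master_counts.get(word, 0) >= minimum]


# Hypothetical master list with 4 words instead of thousands.
master_counts = {"する": 900, "行く": 700, "カツラ": 486, "ピカチュウ": 3}

minimum = minimum_master_frequency(master_counts, top_n=3)
print(minimum)  # 486

plan = ["カツラ", "ピカチュウ", "行く"]
print(filter_plan(plan, master_counts, minimum))  # ['カツラ', '行く']
```

Words below the threshold drop out of the plan, which is why the plan gets shorter and the reachable readability % gets lower at the same time.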

Which number for the target %?

Research indicates that we need 95% readability for good comprehension. However, you don't need to fully understand everything at first.

Based on my experience, 75% was enough for a good experience. I would pick up a few sentences here and there, and recognize and reinforce vocabulary that I already knew. I needed subtitles to follow along, but I think 75% is already beneficial. My method was to reach 75%, then watch the next movie or episode at 76%, increasing the target a little each time.

You can start at 75% at first, but ideally you want to be between 85 and 90% to really understand what's going on.

That's just me though. I had practically never watched more than 10 anime before learning Japanese, so I really needed to train my ear to the sounds of the language anyway. I focused on learning a few words per anime and then moving on, rather than continuing to study the same material. It's up to you to find the right balance.

If you want to reach a higher target %, you need to lower your Minimum Master Frequency number.

Finding the right numbers is up to you: you need to strike a balance between the readability you need for a pleasant viewing experience and a focus on the most common words.

Study plan for multiple titles

So far we've only analyzed one title, but a good idea is to analyze a few at once, like the episodes of a TV show.

To do this, download subtitles from https://www.kitsunekko.net and place them in your usual folder.

What's great about this is that Morphman will order cards just like before, but now it will also take into account which episode each word appears in. So you also learn words in order of appearance.

I picked the subtitles for Natsume, because I found it to be easy. It's slice of life, but with some fantasy that kept my interest a bit. It's lovely.

Because we don't know much yet, most of what you'll learn will be in episode 1. Some like to set a high target % on shows. It makes understanding each following episode really easy. It's called deep-diving.

  • You can see that the morph readability starts at 3.74%. I need to learn 128 new words for my new morph readability to reach 75%.

  • It then goes to episode 2, where you'll need to learn 35 new words to reach 75%. Morphman takes into account that you learned the previous 128 at this point.

  • 6 new morphs for episode 3. You can also see the cumulative morphs, letting you know how many morphs in total you'll learn across all the episodes.

  • For episode 4, you already have 75% comprehension, 76% even. So you're not learning any words from that episode.

  • Maybe once you've reached episode 3, you increase the target to 77% for episode 4 and learn a few more words. It's up to you.
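
The episode-by-episode behavior described above can be sketched like this: words learned for one episode count as known for the next, so later episodes need fewer new words. Again, this is a simplified illustration with hypothetical names and data, not Morphman's implementation.

```python
def episode_plan(episodes, known, target_pct):
    """episodes: list of (name, word counts). Returns (name, new words, cumulative total) rows."""
    known = set(known)  # copy, so the caller's set is not modified
    cumulative = 0
    report = []
    for name, counts in episodes:
        total = sum(counts.values())
        known_instances = sum(n for w, n in counts.items() if w in known)
        new_words = []
        for word, freq in sorted(counts.items(), key=lambda item: -item[1]):
            if 100 * known_instances / total >= target_pct:
                break  # target reached for this episode
            if word in known:
                continue
            known.add(word)
            new_words.append(word)
            known_instances += freq
        cumulative += len(new_words)
        report.append((name, new_words, cumulative))
    return report


# Hypothetical episodes with 10 instances each, starting from an empty database.
episodes = [
    ("ep1", {"a": 6, "b": 3, "c": 1}),
    ("ep2", {"a": 5, "d": 4, "c": 1}),
    ("ep3", {"a": 8, "b": 2}),
]

for row in episode_plan(episodes, known=set(), target_pct=75):
    print(row)
```

In this toy run, the third episode is already above the target thanks to the words learned earlier, so it adds nothing to the plan.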

Don't forget that after using the analyzer, you must recalc your database for the files (Study plan + frequency) to take effect on your cards.

Final word

This is the study plan text file. You can see that you have the list of morphs, listed by episode.

Each line is also the order in which you're supposed to learn your cards.

Morphman needs to have a 1T card available for you to learn these words in this order. So run recalc to update your database at least once a day at first, to make new cards available.

If there's no 1T card available, Morphman will simply use the frequency of the cards you currently have in your collection before another +1 card becomes available.

Consider using recalc after each study session.