Japanese at your level

Introduction

We’ve learned how to count the number of words we know, and now we’re going to use that to find an anime, game, or manga that matches our level.

The idea is simple: we compare vocabulary lists. We look at the words you already know and compare them to the words used in a specific title.

There are several ways to measure difficulty, but the most important factor is whether you’re already familiar with the vocabulary being used. For example, an anime might use rare or unusual words. That might seem harder, but if those are exactly the words you’ve learned, it will actually be easier for you to understand than something more “common” that uses vocabulary you haven’t studied yet.
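
To make the comparison concrete, here is a minimal Python sketch. It is not MorphMan's code: the function name `known_share` and the word lists are made up for illustration, and the words are assumed to be already split apart.

```python
def known_share(known_words, title_words):
    """Return the fraction of a title's unique words found in known_words."""
    unique = set(title_words)
    if not unique:
        return 0.0
    return len(unique & set(known_words)) / len(unique)


# Hypothetical learner who mostly studied fantasy vocabulary.
known = {"魔法", "勇者", "剣", "王国"}

# Hypothetical word lists for two titles.
fantasy_title = ["魔法", "勇者", "剣", "王国", "竜"]
everyday_title = ["学校", "電車", "仕事", "魔法", "会社"]

fantasy_share = known_share(known, fantasy_title)    # 4 of 5 unique words known
everyday_share = known_share(known, everyday_title)  # 1 of 5 unique words known
print(f"fantasy: {fantasy_share:.0%}, everyday: {everyday_share:.0%}")
```

For this learner, the title full of "rare" fantasy words scores higher than the one using everyday vocabulary.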

Analyze

Set Up

Open MorphMan’s preferences (Ctrl + A).

There are three things to pick:

Input directory: The folder you want to analyze. This folder can contain subtitle files (srt, ass), text files, or corpus files.
Morphemizer: Pick MeCab for better accuracy.
Known Morph DB: Your database of known words. It will be selected by default.

Use the data

Here are the results.

You can click any column header to sort the list, or copy and paste everything into a spreadsheet if you want to sort, filter, or analyze the data further.

Let's go over each column.

Input: The name of the title being analyzed.

A "morph" is an individual word.

Total Morphs: The total number of unique words used in the title.
Known Morphs: How many of those unique words you already know.
Known Morphs %: The percentage of known unique words (Known Morphs / Total Morphs).

You’ll see the same three columns again, but with Instances instead of Morphs. The difference is that a single morph can appear many times in a text. Even if you know many unique words, if they don’t appear frequently, your overall comprehension will still be low. This is why it’s recommended to learn high-frequency vocabulary—the words that show up the most.

Known Instances % is the key column to look at. This percentage shows how much of the text you can understand overall.

Total Instances: The total number of words used in the title, counting repeats. You can think of it as the length of the text.
Known Instances: How many of those words you already know.
Known Instances %: The percentage of known words (Known Instances / Total Instances).
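
The difference between the Morphs columns and the Instances columns can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not MorphMan's implementation. The function name `readability` and the sample words are assumptions, and the text is assumed to be already split into words.

```python
from collections import Counter


def readability(known_words, tokens):
    """Compute unique-word and instance-based percentages for one text."""
    counts = Counter(tokens)  # word -> number of times it appears

    total_morphs = len(counts)
    known_morphs = sum(1 for word in counts if word in known_words)

    total_instances = sum(counts.values())
    known_instances = sum(n for word, n in counts.items() if word in known_words)

    return {
        "total_morphs": total_morphs,
        "known_morphs": known_morphs,
        "known_morphs_pct": 100 * known_morphs / total_morphs if total_morphs else 0.0,
        "total_instances": total_instances,
        "known_instances": known_instances,
        "known_instances_pct": 100 * known_instances / total_instances if total_instances else 0.0,
    }


# Hypothetical data: the learner knows three words that repeat often.
known = {"私", "は", "を"}
tokens = ["私", "は", "猫", "を", "見る", "私", "は", "犬", "を", "見る"]

stats = readability(known, tokens)
print(stats["known_morphs_pct"], stats["known_instances_pct"])
```

Here 3 of the 6 unique words are known, but they cover 6 of the 10 words in the text, so the instance percentage is higher than the morph percentage.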

Young / Mature: Anki divides cards into young and mature. A card you’ve reviewed only once is still young, and after repeated review it becomes mature. MorphMan applies the same idea to words. You can adjust the settings in the config file to decide how long it takes for a word to transition from young to mature. In my case, since I used tags for my cards without reviewing them, all my known words are marked as mature. Personally, I wouldn’t worry too much about these columns. If two titles have the same Known Instances %, the one with a higher Mature Instances % might be easier, but I don’t think it’s a major factor.

Proper Nouns %: The percentage of the text made up of proper nouns. This usually doesn’t matter much when judging difficulty.

Line Readability %: This can be useful. If you analyzed a text file or subtitle file, this tells you how many lines are composed of 100% known words.
I+1 Lines %: “i+1” means a line that contains only one unknown word.
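
The two line-based columns can be sketched like this. It is a rough model, not MorphMan's code: `line_stats` is a hypothetical name, each line is assumed to be already split into words, and "one unknown word" is assumed to mean one distinct unknown word per line.

```python
def line_stats(known_words, lines):
    """Return (line readability %, i+1 lines %) for a list of tokenized lines."""
    non_empty = [line for line in lines if line]
    if not non_empty:
        return 0.0, 0.0

    fully_known = 0
    i_plus_one = 0
    for line in non_empty:
        unknown = {word for word in line if word not in known_words}
        if len(unknown) == 0:
            fully_known += 1   # every word in the line is known
        elif len(unknown) == 1:
            i_plus_one += 1    # exactly one distinct unknown word

    n = len(non_empty)
    return 100 * fully_known / n, 100 * i_plus_one / n


# Hypothetical known words and lines.
known = {"私", "は", "猫", "が", "好き"}
lines = [
    ["私", "は", "猫", "が", "好き"],    # fully known
    ["私", "は", "犬", "が", "好き"],    # one unknown word: 犬
    ["犬", "は", "魚", "を", "食べる"],  # several unknown words
    ["猫", "が", "好き"],                # fully known
]

readable_pct, i_plus_one_pct = line_stats(known, lines)
print(readable_pct, i_plus_one_pct)
```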

Look at the subtitle on the left. Even though it contains a line break, MorphMan will still analyze it as a single line if your file is in srt format.
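
As an illustration of what "one line" means for an srt file, here is a small Python sketch that joins the text of each subtitle entry into a single string. This is an assumption about the general idea, not MorphMan's actual parser. The function name `srt_cues` and the sample subtitle are made up, and the parts are joined without a space because Japanese does not use spaces between words.

```python
import re


def srt_cues(srt_text):
    """Return the text of each subtitle entry, with its lines joined into one string."""
    cues = []
    normalized = srt_text.replace("\r\n", "\n").strip()
    # Entries in an srt file are separated by blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            if "-->" in line:  # the timestamp line; the text comes after it
                text = "".join(part.strip() for part in lines[i + 1:])
                if text:
                    cues.append(text)
                break
    return cues


sample = """1
00:00:01,000 --> 00:00:03,000
今日はいい天気
ですね

2
00:00:04,000 --> 00:00:05,000
そうですね
"""

cues = srt_cues(sample)
print(cues)
```

The first entry is displayed on two lines on screen, but it ends up as one line for the analysis.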

More Info

Now that you have found a title that matches your level, you can move on to the next page, but here is some additional information.

Difference between Corpus and text file

Analyzing all the title files at once takes a very long time. Corpus files are basically pre-analyzed titles: the work of recording which words appear in which files is already done, so MorphMan only needs to compare the results to your database to produce the readability report.
The corpus files were made without the line data, so if you analyze corpus files, the line columns (Line Readability %, I+1 Lines %) will be empty.

Group Directory

You can also enable the “Group by directory” option in the output settings. This creates a summary for each folder instead of listing the details for every individual file (text or corpus).

For example, if you download subtitles for several new shows, you can view the detailed results for each episode, or you can group them to get an overall summary for the entire season or series at once.
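
Conceptually, grouping means adding up the word counts of every file in a folder before computing the percentages. The sketch below shows that idea in Python. How MorphMan combines the files internally is not confirmed here. The function name `group_by_directory`, the file paths, and the counts are all hypothetical.

```python
from collections import Counter, defaultdict
from pathlib import PurePosixPath


def group_by_directory(per_file_counts, known_words):
    """Return {folder: Known Instances %} from {file path: Counter of words}."""
    grouped = defaultdict(Counter)
    for path, counts in per_file_counts.items():
        folder = str(PurePosixPath(path).parent)
        grouped[folder].update(counts)  # add this file's counts to its folder

    summary = {}
    for folder, counts in grouped.items():
        total = sum(counts.values())
        known = sum(n for word, n in counts.items() if word in known_words)
        summary[folder] = 100 * known / total if total else 0.0
    return summary


# Hypothetical per-episode word counts.
files = {
    "ShowA/ep01.srt": Counter({"猫": 3, "犬": 1}),
    "ShowA/ep02.srt": Counter({"猫": 1, "鳥": 3}),
    "ShowB/ep01.srt": Counter({"魚": 3, "猫": 1}),
}

summary = group_by_directory(files, known_words={"猫"})
print(summary)
```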

Create a corpus

Corpus files are much faster to analyze than individual files because they store which words appear in each file. But to create a corpus, you still have to analyze the individual files at least once.

To create a corpus file, go to the advanced settings and check the "Save Readability Corpus DB" option. Click Analyze, and the file will be created in your output directory.

You can rename it if you want; it won't change the data. You can also analyze two corpus files together to create a new one that combines the data of both.
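
To show why a corpus is faster and how two of them can be combined, here is a conceptual Python model. MorphMan's real corpus file format is not described on this page, so everything here is an assumption made for illustration: the JSON storage, the function names, and the rule that the second corpus wins when both contain the same title.

```python
import json
from collections import Counter


def build_corpus(tokenized_titles):
    """Count the words of each title once; the line structure is not kept."""
    return {title: dict(Counter(tokens)) for title, tokens in tokenized_titles.items()}


def merge_corpora(corpus_a, corpus_b):
    """Combine two corpora; if a title is in both, corpus_b's entry is kept."""
    return {**corpus_a, **corpus_b}


def known_instances_pct(counts, known_words):
    """Compare stored counts against the known words, without re-reading any text."""
    total = sum(counts.values())
    known = sum(n for word, n in counts.items() if word in known_words)
    return 100 * known / total if total else 0.0


# Hypothetical titles, already split into words.
corpus_a = build_corpus({"ShowA": ["猫", "猫", "犬", "鳥"]})
corpus_b = build_corpus({"ShowB": ["魚", "猫"]})
merged = merge_corpora(corpus_a, corpus_b)

# Round-trip through JSON to mimic saving the corpus and loading it later.
reloaded = json.loads(json.dumps(merged, ensure_ascii=False))

report = {title: known_instances_pct(counts, {"猫"}) for title, counts in reloaded.items()}
print(report)
```

Since only word counts are stored, there is nothing left to compute the line columns from, which matches the empty line data mentioned earlier.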
