Generate Anki decks from Japanese EPUB files using LLMs

This document describes a method for generating Anki vocabulary decks from Japanese books using LLMs. Traditionally, readers create new Anki cards one at a time as they stumble upon new words and expressions in the books they read, which is tedious. Instead, an LLM can skim through the book first and automatically generate Anki cards that should cover about 90% of one's needs.

Extract the XHTML chapters from the EPUB file (e.g. using Calibre)
Extract the raw text from the XHTML. Note that Japanese EPUBs may contain furigana. These are normally found within <rt> or <html:rt> elements and should be removed from the raw text. For example, with sed:

sed -e 's/<rt>[^<]*<\/rt>//g' -e 's/<html:rt>[^<]*<\/html:rt>//g' -e 's/<[^>]*>//g' chapter01.xhtml > chapter01.txt
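
If sed is not available, or if the <rt> tags carry attributes (which the sed expressions above do not match), the same clean-up can be sketched in Python. This is a minimal sketch rather than a real XHTML parser: the name strip_furigana is my own, and HTML entities such as &amp; are left untouched.

```python
import re

# Matches <rt>...</rt> and <html:rt>...</html:rt>, with or without attributes.
RT_RE = re.compile(r"<(?:html:)?rt\b[^>]*>.*?</(?:html:)?rt>", re.DOTALL)
# Matches any remaining tag.
TAG_RE = re.compile(r"<[^>]*>")


def strip_furigana(xhtml: str) -> str:
    """Remove the furigana readings first, then every remaining tag."""
    without_readings = RT_RE.sub("", xhtml)
    return TAG_RE.sub("", without_readings)


if __name__ == "__main__":
    sample = "<p><ruby>鼓膜<rt>こまく</rt></ruby>が<ruby>震<rt>ふる</rt></ruby>える。</p>"
    print(strip_furigana(sample))  # 鼓膜が震える。
```

The readings have to be removed before the generic tag pass, otherwise the kana inside <rt> would end up mixed into the raw text.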

Feed the raw text files and the following prompt to your favorite LLM (customize it as you see fit):

From the attached Japanese short novel text, extract and list vocabulary and expressions for an intermediate Japanese learner (JLPT N3 and N2). Specifically:

- Output the vocabulary list only
- Write a single entry (word or expression) per line
- Only include words and expressions found in the short novel text
- List the items in the order in which they appear in the book
- Ignore names (e.g. 田中さん or 札幌)
- Do not include vocabulary or expressions from JLPT N4 and N5
- Ignore common loan words, such as ペットボトル or カフェオレ
- Write verbs in their dictionary form (e.g. 食べる instead of 食べている)
- Ignore duplicates. For example, 確認する and 確認 are duplicates

Write the vocabulary list above using the following format:
<japanese>;<japanese with furigana>;<english>

Regarding this format:

- There should always be exactly 3 items per line, separated by semicolon characters
- Furigana are specified with brackets []. For example: 長[なが]い
- Kanji annotated with furigana should be prefixed by a space, except at the very start of the field. For example: 巻[ま]き 込[こ]む
- If there is no kanji (so no furigana), the <japanese with furigana> field should be the same as the <japanese> field (see the examples below)

For example:

一分;一[いっ] 分[ぷん];one minute
五分五分;五[ご] 分[ぶ] 五[ご] 分[ぶ];fifty-fifty, even
ゴロゴロ;ゴロゴロ;rolling sound

If the raw text is too long, some LLMs will refuse to work. You may have to split the files into smaller chunks. For example, using GNU split:

split -d -l200 --additional-suffix=.txt chapter01.txt chapter01-
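
The --additional-suffix option is specific to GNU split, so on other systems a few lines of Python can do the same job. A minimal sketch, assuming UTF-8 text files; the function names chunk_lines and split_file are hypothetical, and the 200-line default simply mirrors the command above:

```python
from pathlib import Path


def chunk_lines(lines: list[str], size: int = 200) -> list[list[str]]:
    """Split a list of lines into consecutive chunks of at most `size` lines."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def split_file(path: str, size: int = 200) -> list[Path]:
    """Write <name>-00.txt, <name>-01.txt, ... next to the input file."""
    source = Path(path)
    lines = source.read_text(encoding="utf-8").splitlines(keepends=True)
    written = []
    for index, chunk in enumerate(chunk_lines(lines, size)):
        target = source.with_name(f"{source.stem}-{index:02d}.txt")
        target.write_text("".join(chunk), encoding="utf-8")
        written.append(target)
    return written
```

Calling split_file("chapter01.txt") should produce chapter01-00.txt, chapter01-01.txt and so on, like the split command.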

You may also simply copy-paste chunks of text directly into the prompt. Using DeepSeek, I could extract thousands of words and expressions from a short novel:

満ちる;満[み]ちる;to be filled with
混ざり合う;混[ま]ざり 合[あ]う;to mix together
鼓膜;鼓[こ] 膜[まく];eardrum
売れ;売[う]れる;to be sold
物色;物[ぶっ] 色[しょく];to look for, to search for
確認;確[かく] 認[にん];confirmation
納品;納[のう] 品[ひん];delivery of goods
[...]

The length and content of the output list will of course depend greatly on the LLM and prompt you use. Keep in mind that LLMs make mistakes too (for example, 売れ;売[う]れる above, where the expression and its reading do not match). You could, for example, ask several LLMs to generate such a vocabulary list, then merge and clean up the results. Here is an example prompt you may use to clean up the vocabulary list:

Process the input vocabulary list below as follows:
- Preserve the input CSV-like format
- Always preserve item order
- Remove duplicates. Nouns, verbs and adjectives directly derived from the same base word are considered duplicates. For example: 確認する and 確認, 売れ and 売れる, 製造 and 製造的な are duplicates. Conversely, compound words are not duplicates. For example, 総合計算, 計算 and 総合 should not be considered duplicates. When removing duplicated items, keep only a single occurrence, in the same place as the first occurrence (as order shall be preserved)
- Remove any word or expression that is JLPT level N4 or N5
- Remove common loan words, such as ペットボトル or カフェオレ
- Remove names (e.g. 田中さん or 札幌)
- Make sure verbs are written in their dictionary form (e.g. 食べる instead of 食べている)
- Make sure the furigana-annotated Japanese expression (second field) corresponds to the Japanese expression (first field). Fix the furigana if necessary.

Input List:

INSERT YOUR RAW LIST HERE
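
Part of this clean-up does not need an LLM at all. Because the format is strict, malformed lines and exact duplicates can be caught mechanically. Here is a minimal sketch with hypothetical function names. It only checks that the second field, once its bracketed readings and spaces are removed, spells the first field; it cannot tell whether a reading is actually correct, and it does not detect derived duplicates such as 確認する and 確認.

```python
import re
from typing import Optional

# "漢[かん]" -> "漢": drop the bracketed reading.
READING_RE = re.compile(r"\[[^\]]*\]")


def strip_reading(annotated: str) -> str:
    """Turn '巻[ま]き 込[こ]む'-style text back into plain '巻き込む'."""
    return READING_RE.sub("", annotated).replace(" ", "")


def check_line(line: str) -> Optional[str]:
    """Return a description of the problem, or None if the line looks fine."""
    fields = [field.strip() for field in line.split(";")]
    if len(fields) != 3:
        return f"expected 3 fields, got {len(fields)}"
    expression, reading, meaning = fields
    if not expression or not meaning:
        return "empty field"
    if strip_reading(reading) != expression.replace(" ", ""):
        return "reading does not match expression"
    return None


def clean(lines: list[str]) -> tuple[list[str], list[tuple[str, str]]]:
    """Keep each valid expression once, in order; collect rejected lines."""
    kept, rejected, seen = [], [], set()
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        problem = check_line(line)
        if problem:
            rejected.append((line, problem))
            continue
        expression = line.split(";")[0].strip()
        if expression not in seen:
            seen.add(expression)
            kept.append(line)
    return kept, rejected
```

On the sample output above, this rejects 売れ;売[う]れる because the reading spells 売れる rather than 売れ. The rejected lines can then be fixed by hand or sent back to the LLM.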

Save the final CSV list to a .txt file and import it into Anki, using the semicolon as the field separator. For this example, my deck has the following fields:

1: Expression
2: Reading
3: Meaning

And the following card templates:

front:

<div class="center">
  <span style="font-family: irohamaru mikami; font-size: 50px;">{{Expression}}</span>
</div>

back:

<div class="card-content center">
  <span style="font-size: 50px;">{{furigana:Reading}}</span>

  <hr id="answer" class="separator" />

  <div class="left">
    <span style="font-size: 30px;">{{furigana:Meaning}}</span>
  </div>

</div>

styling:

@font-face {
    font-family: irohamaru mikami; 
    src: url('_irohamaru mikami.ttf');
}

.card-content {
    display: flex;
    flex-wrap: wrap;
    flex-direction: column;
    font-family: irohamaru mikami;
}

.separator {
    border: 0;
    height: 1px;
    background-color: #ccc;
    width: 100%;
    margin: 16px 0;
}

.center {
    text-align: center;
}

.bottom {
    font-size: 12px;
    text-align: left;
    padding: 4px;
    overflow: hidden;
}

.left {
    text-align: left;
}

.card.nightMode .card-tags-container {
    background-color: #333;
}
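
The space rule in the first prompt exists because of how the {{furigana:...}} filter appears to pick the text a reading applies to: everything between the previous space (or the start of the field) and the opening bracket. The Python sketch below imitates that behaviour to make the rule concrete. It is an approximation written for illustration (the function to_ruby and its regular expression are my own, not Anki's code), so treat its output as indicative only.

```python
import re

# Optional leading space, then the base text (no spaces or brackets),
# then the reading in square brackets.
FURIGANA_RE = re.compile(r" ?([^ \[\]>]+?)\[([^\]]+)\]")


def to_ruby(text: str) -> str:
    """Convert '漢[かん]'-style annotations into <ruby> markup."""
    return FURIGANA_RE.sub(r"<ruby>\1<rt>\2</rt></ruby>", text)


if __name__ == "__main__":
    # With the space, こ is attached to 込 only.
    print(to_ruby("巻[ま]き 込[こ]む"))
    # Without it, こ ends up above "き込", which is wrong.
    print(to_ruby("巻[ま]き込[こ]む"))
```

This is why an entry such as 巻[ま]き込[こ]む, without the space, would display the reading こ over both き and 込.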

Example card: (image not available)

Is it perfect? Of course not. First, because selecting words worth studying can be quite subjective. Second, because LLMs do make mistakes. However, being able to kick-start a deck in just a couple of minutes is quite nice!

Bonus - You may of course use LLMs to generate different or more data. For example, I used the prompt below to generate a sample sentence for each expression:

Find below a vocabulary list stored in a CSV-like table. The table contains the following columns, separated by semicolon characters:
1. Japanese word or expression
2. Same but with furigana
3. English translation of the word/expression
4. Example Japanese sentence using the word/expression
5. English translation of the Japanese example

The first 3 columns are already filled. You need to fill the last 2 by generating example sentences and their translations. Specifically:
- Examples should be idiomatic and highlight the most common usage of the word/expression
- Examples should not be too simplistic or too short. For example, 猫が餌を食べました or 耳が痛い are too simple. Prefer longer sentences. For example: 忙しいからといって連絡してくれないのは怠慢だ, ソースコードを見つけられさえしないのに、開発するなんてもってのほかだ and 最後に出て行ったのは背の高い男で、顔は青白く、つややかな黒い髪をしていた are good examples.
- Examples should be a single sentence as much as possible.
- Examples should be as diverse as possible. Try not to use the same grammar/patterns over and over.
- Verbs may be conjugated.
- Do not add furigana to the examples
- Avoid using rare or complex words in the examples (besides the one the example is about)
- Since the semicolon character is used as the column separator, it should not appear in the examples.

Input data:

INSERT RAW CSV DATA HERE

Example output:

満ちる;満[み]ちる;to be filled with;夜の森は蛍の光で満ちていた。;The night forest was filled with the light of fireflies.
混ざり合う;混[ま]ざり 合[あ]う;to mix together;異なる文化がこの都市で見事に混ざり合っている。;Different cultures are wonderfully mixed together in this city.
鼓膜;鼓[こ] 膜[まく];eardrum;ロックコンサートの爆音が鼓膜を震わせた。;The blast of sound at the rock concert vibrated my eardrums.
物色;物[ぶっ] 色[しょく];to look for, to search for;彼は古本屋で絶版の小説を物色していた。;He was searching for an out-of-print novel in the used bookstore.
確認;確[かく] 認[にん];confirmation;書類に誤りがないか最終確認を行ってください。;Please perform a final confirmation to ensure there are no errors in the documents.
納品;納[のう] 品[ひん];delivery of goods;注文した事務用品の納品は明日の午前中になる予定だ。;The delivery of the office supplies we ordered is scheduled for tomorrow morning.
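
Since the LLM rewrites the whole table, it is worth checking that it only added the two new columns. Below is a minimal sketch with a hypothetical function name; it compares the 3-column input with the 5-column output row by row, and reports rows that have the wrong number of fields (typically because a semicolon slipped into an example), rows whose first three columns changed, and rows with an empty example or translation. It does not check that the example actually contains the expression, since verbs may be conjugated.

```python
def check_examples(source_rows: list[str], output_rows: list[str]) -> list[str]:
    """Compare the 3-column input with the 5-column output, row by row."""
    problems = []
    if len(source_rows) != len(output_rows):
        problems.append(
            f"row count differs: {len(source_rows)} in, {len(output_rows)} out"
        )
    pairs = zip(source_rows, output_rows)
    for number, (source, output) in enumerate(pairs, start=1):
        fields = output.strip().split(";")
        if len(fields) != 5:
            problems.append(f"row {number}: expected 5 fields, got {len(fields)}")
            continue
        if fields[:3] != source.strip().split(";"):
            problems.append(f"row {number}: first three columns were modified")
        if not fields[3].strip() or not fields[4].strip():
            problems.append(f"row {number}: missing example or translation")
    return problems
```

An empty list means the output is at least structurally consistent with the input; the sentences themselves still deserve a quick read.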

VRichardJP/jpepub2anki.md
