Kono Yo no Hate de Koi wo Utau Shoujo YU-NO:Frequency statistics

From TLWiki

Below is a compilation of frequency statistics on all of YU-NO's text. The statistics were computed over the entire script, with no special handling of copy-pasted (duplicated) text.

yuno_kanji.txt
This is a simple count of how often each kanji occurs, and should be self-explanatory.
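
As an illustration only, a kanji count of this kind could be sketched in Python as follows. The script that actually produced yuno_kanji.txt is not shown here, so the function names and the definition of "kanji" (CJK Unified Ideographs plus Extension A) are assumptions made for this example.

```python
from collections import Counter

def is_kanji(ch):
    # Assumption: "kanji" means the CJK Unified Ideographs block
    # (U+4E00..U+9FFF) plus Extension A (U+3400..U+4DBF).
    # Kana, punctuation, and the iteration mark are excluded.
    return "\u4e00" <= ch <= "\u9fff" or "\u3400" <= ch <= "\u4dbf"

def count_kanji(text):
    # Tally every kanji character; everything else is ignored.
    return Counter(ch for ch in text if is_kanji(ch))

if __name__ == "__main__":
    sample = "この世の果てで恋を唄う少女"
    for kanji, n in count_kanji(sample).most_common():
        print(kanji, n)
```

Sorting with most_common() gives the frequency-ordered listing one would expect in a file like this.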

yuno_morph.txt
This is a count of morphemes, obtained by running MeCab on the entire script. The results were filtered by part of speech so that only nouns, pronouns, verbs, adjectives, adverbs, conjunctions, and interjections are counted.
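
A rough sketch of the filtering step, in Python, might look like the following. It assumes MeCab was run with an IPADIC-style dictionary, whose default output is one line per morpheme in the form surface, tab, comma-separated features, with the part of speech in field 0 and the base form in field 6. The page does not say which dictionary was used or whether surface or base forms were counted, so both choices here are assumptions. Note that under IPADIC pronouns are a subtype of 名詞 (noun), so they need no separate tag.

```python
from collections import Counter

# Assumption: IPADIC top-level part-of-speech tags.
# noun (incl. pronouns), verb, adjective, adverb, conjunction, interjection
KEPT_POS = {"名詞", "動詞", "形容詞", "副詞", "接続詞", "感動詞"}

def count_morphemes(mecab_output):
    # mecab_output: text in MeCab's default IPADIC output layout.
    counts = Counter()
    for line in mecab_output.splitlines():
        if line == "EOS" or "\t" not in line:
            continue  # sentence terminator or blank line
        surface, feature = line.split("\t", 1)
        fields = feature.split(",")
        if fields[0] not in KEPT_POS:
            continue  # particles, auxiliaries, symbols, etc.
        # Assumption: count by base (dictionary) form when available.
        has_base = len(fields) > 6 and fields[6] != "*"
        counts[fields[6] if has_base else surface] += 1
    return counts

# Hand-written sample lines in the IPADIC layout (not real MeCab output).
sample = (
    "恋\t名詞,一般,*,*,*,*,恋,コイ,コイ\n"
    "を\t助詞,格助詞,一般,*,*,*,を,ヲ,ヲ\n"
    "唄っ\t動詞,自立,*,*,五段・ワ行促音便,連用タ接続,唄う,ウタッ,ウタッ\n"
    "た\t助動詞,*,*,*,特殊・タ,基本形,た,タ,タ\n"
    "EOS\n"
)

if __name__ == "__main__":
    print(count_morphemes(sample))
```

In the sample, the particle を and the auxiliary た are dropped, and the inflected 唄っ is counted under its base form 唄う.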

yuno_vocab.txt
This is a count of words, where a word is anything that appears in EDICT. To generate it, MeCab is first run on the entire text, primarily to get part-of-speech information and verb/adjective de-inflections. Because MeCab breaks the text down into morphemes, they need to be recombined into full words for better matching against EDICT. For each morpheme encountered, it is first combined with the next 4 morphemes (5 in total) and looked up in EDICT; if there is no match, it is combined with the next 3 instead, and so on until a match is found. The starting depth of 5 was chosen arbitrarily. The same part-of-speech filtering used for yuno_morph.txt is applied to the first morpheme in each sequence to trim down the number of results. When combining, only the last morpheme is de-inflected; the rest keep the exact form in which they appear in the text.
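
The recombination procedure described above amounts to a greedy longest-match lookup. A minimal Python sketch follows, assuming the morphemes have already been extracted as (surface, base form, part of speech) tuples and that EDICT headwords have been loaded into a set. The names count_words, KEPT_POS, and MAX_DEPTH are made up for this example. One detail the description leaves open is what happens after a match; this sketch assumes the scan skips past the matched morphemes.

```python
from collections import Counter

# Assumption: IPADIC-style top-level part-of-speech tags.
KEPT_POS = {"名詞", "動詞", "形容詞", "副詞", "接続詞", "感動詞"}
MAX_DEPTH = 5  # starting window size; arbitrary, as in the description

def count_words(morphemes, edict, max_depth=MAX_DEPTH):
    # morphemes: list of (surface, base, pos) tuples in text order.
    # edict: set of dictionary headwords.
    counts = Counter()
    i = 0
    while i < len(morphemes):
        # The part-of-speech filter applies to the first morpheme only.
        if morphemes[i][2] not in KEPT_POS:
            i += 1
            continue
        matched = 0
        # Try the longest window first, then shrink down to 1 morpheme.
        for n in range(min(max_depth, len(morphemes) - i), 0, -1):
            window = morphemes[i:i + n]
            # Only the last morpheme is de-inflected (base form);
            # the others keep their surface form.
            candidate = "".join(m[0] for m in window[:-1]) + window[-1][1]
            if candidate in edict:
                counts[candidate] += 1
                matched = n
                break
        # Assumption: skip past the match; otherwise move on by one.
        i += matched or 1
    return counts

# Hand-made example data: 気になった = 気 + に + なっ + た
morphemes = [
    ("気", "気", "名詞"),
    ("に", "に", "助詞"),
    ("なっ", "なる", "動詞"),
    ("た", "た", "助動詞"),
]
edict = {"気", "なる", "気になる"}

if __name__ == "__main__":
    print(dict(count_words(morphemes, edict)))
```

With this data the four-morpheme candidate 気になった is not in the set, but the three-morpheme candidate 気になる is, so it is counted once in preference to the shorter entries 気 and なる.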