word-cumulative

Using NLTK, the natural language processing toolkit, we plot the cumulative frequency of the words that appear in the Wall Street Journal.

A "frequency distribution" records how often each vocabulary item occurs in a text.

Adding up the per-word frequencies one after another yields the distribution called the "cumulative frequency".
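As a rough illustration of the two definitions above, the sketch below counts words in a toy sentence and keeps a running total over the most frequent ones. It uses only the Python standard library rather than NLTK, and the helper name `cumulative_counts` and the sample sentence are made up for this example.

```python
from collections import Counter
from itertools import accumulate


def cumulative_counts(tokens, n):
    """Return (word, count, running_total) for the n most frequent tokens.

    Hypothetical helper for illustration; not part of NLTK.
    """
    # Frequency distribution: how often each distinct token occurs.
    top = Counter(tokens).most_common(n)
    # Cumulative frequency: running sum of the counts, most frequent first.
    running = accumulate(count for _, count in top)
    return [(word, count, total) for (word, count), total in zip(top, running)]


tokens = "the cat sat on the mat and the dog sat".split()
for word, count, total in cumulative_counts(tokens, 2):
    print(word, count, total)
```

The last running total is the number of tokens covered by the listed words, which is the quantity a cumulative plot shows on its vertical axis.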

Starting Python

$ python

Importing NLTK

>>> import nltk

Loading the corpora (books)

>>> from nltk.book import *

*** Introductory Examples for the NLTK Book ***

Loading text1, ..., text9 and sent1, ..., sent9

Type the name of the text or sentence to view it.

Type: 'texts()' or 'sents()' to list the materials.

text1: Moby Dick by Herman Melville 1851

text2: Sense and Sensibility by Jane Austen 1811

text3: The Book of Genesis

text4: Inaugural Address Corpus

text5: Chat Corpus

text6: Monty Python and the Holy Grail

text7: Wall Street Journal

text8: Personals Corpus

text9: The Man Who Was Thursday by G . K . Chesterton 1908

Creating a frequency distribution from text7 (Wall Street Journal)

>>> dist = FreqDist(text7)

Displaying the number of words in the text

>>> dist

<FreqDist with 100676 outcomes>

The text contains roughly 100,000 words in total.

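Plotting the cumulative frequency distribution (that is, accumulating the counts of the 50 most frequent words and displaying the result)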

>>> dist.plot(50, cumulative=True)

The plot shows that, out of roughly 100,000 words in total, the 50 most frequent words account for about 45% of the text.
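The share read off the plot can also be computed directly. The following is a minimal sketch using only the standard library. The function name `top_n_coverage` and the sample data are assumptions made for this example, and the figure it prints is for the toy data only.

```python
from collections import Counter


def top_n_coverage(tokens, n):
    """Fraction of all tokens covered by the n most frequent distinct tokens.

    Hypothetical helper for illustration; not part of NLTK.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # empty input: nothing to cover
    covered = sum(count for _, count in counts.most_common(n))
    return covered / total


# Toy data: 'a' (3 times) and 'b' (2 times) cover 5 of 10 tokens.
sample = "a a a b b c d e f g".split()
print(top_n_coverage(sample, 2))
```

In the session above, calling `top_n_coverage(text7, 50)` should give the corresponding share for the Wall Street Journal text, assuming `text7` can be iterated token by token in the same way it is passed to `FreqDist`.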
