Getting File Information

The information on this page is for the older version (1.9.x). For the new version (2.0), check the manual on the download page. A text-only manual (Help) is also available within the application.

With the Corpus File Info tool, you can easily create a summary of corpus files or corpora/databases and count specific words/phrases in them.

To use the Corpus File Info tool, select Corpus File Info on the selector.

This tool has four modes.

Basic Info

In Basic Info mode, you can obtain the basic information of each file or corpus/database.

The results table includes:

    • Type: different words

    • Token: running words

    • TTR (type-token ratio): type-token ratio within a file or corpus/database

    • STTR (standardized type-token ratio): TTR calculated per 1,000 words; * indicates that the total number of tokens is less than 1,000

    • Ave w Lgth (average word length): average number of letters per word within a file or corpus/database

    • frequencies of 1 ~ 15 letter and 16+ letter words: tokens of n letter words (this can be switched to types of n letter words in Preferences -> File Info)

The counting options are Tokens and Types.

If Types is selected, Ave w Lgth (average word length) is not calculated properly.
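The measures in the Basic Info table can be illustrated with a minimal Python sketch. This is not CasualConc's own code: the tokenizer (runs of letters, lowercased), the function name basic_info, and the handling of texts shorter than one chunk (falling back to the plain TTR and setting a flag, in the spirit of the * marker) are all assumptions made for illustration. STTR is computed here as the mean TTR over consecutive full chunks of 1,000 tokens, which is one common definition.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: a word is a run of letters; case is ignored.
    return re.findall(r"[^\W\d_]+", text.lower())


def basic_info(text, chunk_size=1000, max_len=16):
    tokens = tokenize(text)
    n_tokens = len(tokens)
    n_types = len(set(tokens))
    ttr = n_types / n_tokens if n_tokens else 0.0

    # STTR: mean TTR over consecutive full chunks of chunk_size tokens.
    chunks = [tokens[i:i + chunk_size]
              for i in range(0, n_tokens - chunk_size + 1, chunk_size)]
    if chunks:
        sttr = sum(len(set(c)) / chunk_size for c in chunks) / len(chunks)
        sttr_short = False
    else:
        # Fewer tokens than one chunk: fall back to plain TTR and flag it.
        sttr = ttr
        sttr_short = True

    avg_len = sum(len(t) for t in tokens) / n_tokens if n_tokens else 0.0

    # Token counts for 1..15-letter words; key max_len (16) means "16+".
    by_length = Counter(min(len(t), max_len) for t in tokens)
    length_freq = {n: by_length.get(n, 0) for n in range(1, max_len + 1)}

    return {
        "types": n_types,
        "tokens": n_tokens,
        "ttr": ttr,
        "sttr": sttr,
        "sttr_short": sttr_short,
        "avg_word_length": avg_len,
        "length_freq": length_freq,
    }
```

For example, basic_info("the cat sat on the mat") reports 5 types and 6 tokens, and sets sttr_short because the text is shorter than one chunk.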

Word Freq Info

In the Word Freq Info mode, you can create a full word/n-gram list for each file or corpus/database, or a word frequency table for specified words. In this mode, CasualConc counts all the words in the files/corpora/databases, and then, if some words are specified, the frequency counts for those words are extracted and displayed on the table. This process is slow, but thorough.
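The "count everything, then extract" approach described above can be sketched in Python. The function names, the simple letter-based tokenizer, and the dictionary of file names to texts are assumptions for illustration only; they do not reflect how CasualConc is implemented internally.

```python
import re
from collections import Counter


def count_words(text):
    # Assumption: words are runs of letters, compared case-insensitively.
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


def frequency_table(corpus, target_words=None):
    """corpus maps a file name to its text.

    Returns (columns, rows): columns is the word list ordered by total
    frequency, rows maps each file name to its counts for those words.
    """
    # Step 1: count every word in every file.
    counts = {name: count_words(text) for name, text in corpus.items()}
    total = Counter()
    for c in counts.values():
        total.update(c)

    # Step 2: keep all words, or only the specified ones.
    if target_words is None:
        words = list(total)
    else:
        words = list({w.lower() for w in target_words})

    # Order columns by total frequency (ties broken alphabetically).
    columns = sorted(words, key=lambda w: (-total[w], w))
    rows = {name: [c[w] for w in columns] for name, c in counts.items()}
    return columns, rows
```

Calling frequency_table(corpus) without target words yields the full list, while passing a list such as ["cat", "the"] extracts only those columns. A specified word that never occurs simply gets a count of 0 in every row.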

First, you need to specify the word/n-gram list from which you create the frequency information table.

From Word Count: words/n-grams are imported from the left table of the Word Count tool. You need to create a word/n-gram list on the left table of the Word Count tool.

When you click the Import button, the first and the last numbers of the word/n-gram list appear in the boxes. You can specify other numbers to use only a part of the list. By default, the upper limit is set to 300. You can change this in Preferences -> File Info.

Also, even though the frequencies are counted for all words/n-grams, you can set a specific number of columns to show on the table. The default value is 300. You can change this in Preferences -> File Info. This limit exists because a large number of columns on the table slows down the display dramatically. With thousands of columns, it might take a few minutes to move even one column or row.

From File: you can import the list from a plain text file.

You can only import a list from a tab-delimited plain text file. You can specify a character encoding and the rows/columns to ignore, in case you want to import an exported Word Count list.

If you are not sure how the file is structured, you can simply press the space key to Quick Look at the file content. You can also check how the list will be imported by clicking the Check button.

Once the list is imported, you can specify the first and the last words/n-grams to use for the frequency count.
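As a rough illustration of what importing from a tab-delimited file involves, the sketch below reads one column of a tab-delimited list, skips leading rows, and then keeps only a first-to-last range. The function name read_word_list and its parameters (skip_rows, word_col, first, last) are hypothetical and only mirror the options described above; they are not CasualConc settings.

```python
import csv


def read_word_list(lines, skip_rows=0, word_col=0, first=1, last=None):
    """Read words from tab-delimited lines (an open text file or a list).

    skip_rows: number of leading rows to ignore (e.g. a header row)
    word_col:  zero-based index of the column that holds the word
    first/last: 1-based, inclusive range of list entries to keep
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    words = []
    for i, row in enumerate(reader):
        if i < skip_rows or len(row) <= word_col:
            continue  # ignored row, or row too short to hold the word
        word = row[word_col].strip()
        if word:
            words.append(word)
    return words[first - 1:last]
```

With a real file, the character encoding is chosen when the file is opened, for example open(path, encoding="utf-8", newline=""), and the resulting file object is passed as lines.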

From Import Panel: you can import a list by copy&paste from other applications or type words on the panel.

If you have already imported a list, a warning message will appear.

After you import a word/n-gram list, you can check what is imported by clicking the Check button.

You can delete selected words from the list.

By default, the words/n-grams appear in the order of the total frequency.

Some options are available for the sorting (sorting by the frequencies in each file/corpus/database) and displaying (normalized frequencies/relative frequencies) the results. See Preferences - File Info for more information.

TF-IDF

TF-IDF is an index to show the importance of individual words in each file/corpus/database (see Wikipedia - TF-IDF for more information).

To calculate TF-IDF properly, uncheck the "Limit the number of words to import to" option in Preferences -> File Info. You might want to set a limit on the number of columns displayed on the table.

First, you need to import a word list. You have the same options in this mode as in the Word Freq Info mode, but to calculate TF-IDF, it is desirable to import the word list created with the same files/corpora/databases in Word Count (left table).

In Preferences -> File Info, you can set the sort order of the results.

Sort by Sum of all files: the results are sorted by the sum of the TF-IDF values for all the files/corpora/databases

Sort by Each file: the results are sorted by the TF-IDF values in each file/corpus/database; the values in parentheses are the TF-IDF values for each word in each file/corpus/database
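For readers unfamiliar with the measure, here is a minimal Python sketch of one common TF-IDF formulation, together with the two sort orders described above. The exact formula CasualConc uses is not stated on this page, so the weighting below (relative term frequency multiplied by the natural log of the number of files divided by the number of files containing the word) is an assumption, as are the function names.

```python
import math
import re
from collections import Counter, defaultdict


def tf_idf(corpus):
    """corpus maps a file name to its text; returns {file: {word: score}}."""
    counts = {name: Counter(re.findall(r"[^\W\d_]+", text.lower()))
              for name, text in corpus.items()}
    n_docs = len(counts)

    # Document frequency: in how many files does each word occur?
    df = Counter()
    for c in counts.values():
        df.update(c.keys())

    scores = {}
    for name, c in counts.items():
        total = sum(c.values())
        scores[name] = {w: (f / total) * math.log(n_docs / df[w])
                        for w, f in c.items()}
    return scores


def sort_by_sum(scores):
    # Words ordered by the sum of their TF-IDF values over all files.
    sums = defaultdict(float)
    for per_file in scores.values():
        for w, s in per_file.items():
            sums[w] += s
    return sorted(sums, key=lambda w: (-sums[w], w))


def sort_by_file(scores, name):
    # (word, value) pairs for one file, highest TF-IDF first.
    return sorted(scores[name].items(), key=lambda kv: (-kv[1], kv[0]))
```

Under this formulation a word that occurs in every file gets a value of 0, which is one reason the word list should cover all words from the same set of files rather than a truncated list.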

Word Group Freq Table

This mode is only available in the Advanced Corpus Handling mode.

You can group the results by each file or by each corpus/database, or choose file or corpus/database grouping separately for each corpus/database (Mix).

If you choose Mix, you can specify how to group for each corpus/database.

Click Setting.

On this panel, select either File or Corpus/Database for Grouping. If you select File, you can assign a label to each file: select a corpus/database and click Labeling. The labels will be reflected in the results.

Then, click Import to import a word list.

On the Word Importing Panel, simply type or copy&paste a list.

You can easily create a frequency table.

You can specify word groups on the Word Importing Panel.

The format is:

LABEL->word1,word2,word3,...

The result will look like this:
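To make the LABEL->word1,word2,... format concrete, here is a minimal Python sketch that parses such lines and sums the frequencies of each group's members per file. It is an illustration only, not CasualConc's implementation or its on-screen output. The function names and the decision to treat a line without "->" as a one-word group labeled by the word itself are assumptions.

```python
import re
from collections import Counter


def parse_groups(lines):
    """Parse 'LABEL->word1,word2' lines into {label: [words]}."""
    groups = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if "->" in line:
            label, members = line.split("->", 1)
            words = [w.strip().lower() for w in members.split(",") if w.strip()]
            groups[label.strip()] = words
        else:
            # Assumption: a plain word forms its own single-word group.
            groups[line] = [line.lower()]
    return groups


def group_frequencies(corpus, groups):
    """corpus maps a file name to its text; returns {file: {label: count}}."""
    table = {}
    for name, text in corpus.items():
        counts = Counter(re.findall(r"[^\W\d_]+", text.lower()))
        table[name] = {label: sum(counts[w] for w in words)
                       for label, words in groups.items()}
    return table
```

For instance, the line BE->is,are,was produces a single BE column whose value in each row is the combined count of "is", "are", and "was" in that file.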

If you turn on Lemmatization in Preferences -> Lemma and a lemma list is created, you can lemmatize a simple word list on the Word Importing Panel (see Lemma for more information).

Select words to lemmatize and click Lemmatize.

Sorting and normalization options are available in Preferences -> File Info.

Frequencies per 100,000 words

Percent (%) of the TOTAL
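The arithmetic behind these two normalization options can be sketched as follows. The function names are hypothetical, and what CasualConc treats as the TOTAL for the percentage is not specified here; the sketch simply takes the total as an argument supplied by the caller.

```python
def per_100k(count, total_tokens):
    # Frequency normalized to a text length of 100,000 words.
    return count * 100_000 / total_tokens if total_tokens else 0.0


def percent_of_total(count, total):
    # Share of the given total, expressed as a percentage.
    return count * 100 / total if total else 0.0
```

For example, a word that occurs 25 times in a 50,000-word file has a normalized frequency of 50 per 100,000 words.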