Managing Parallel Corpora

To use CasualPConc, you need to create a database of parallel corpora in File view. To switch to File view, click the File tab at the top. The top two tables are the file list tables, and the bottom one is the database table.

Create a database from separate text files for two corpora

    1. First, click the Add button and select one or more files from one corpus to add to the table. You can also drag and drop files onto the table.

    1. If you add a plain text (.txt) file, you can select its text encoding in the table. Newly added files initially use the Default Encoding.

    1. If you add multiple files, the matched files must be in the same order in both tables. To change the order, click and drag the file(s) within the table.

    2. You can Delete a file, Clear the table, and Open the selected file with an Application.

    3. You can check whether the sentences/paragraphs in the matched files are aligned. Click a file in one table, which also selects the matched file in the other table. Then click the Preview button in the upper-left corner of the File view.

    4. The Match Preview window appears.

    1. You can scroll down to check whether the two files contain the same number of sentences/paragraphs. If you click a matched sentence/paragraph pair in the table, the full text of both appears in the text views. If you find a mismatch, you can open the original file by clicking the Open button at the bottom. The file opens with the assigned application (see Preferences).

    2. After you have added corpus files to both tables, click the Add to DB button to add the files to the database. If the numbers of files in the two tables do not match, the files cannot be added to the database.

    1. The files will be added to a temporary database named Temp. You can delete any matched pair or clear the database.

    1. Now you are ready to explore the parallel corpus.

    2. You can also add plain text files to the database using Aligner.

    3. If you want to create a new blank database, click the Create New Blank DB button.
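
The alignment check performed in the Match Preview window can also be approximated outside the application before files are added. The following Python sketch is only an illustration: it assumes each corpus file holds one sentence or paragraph per non-empty line (the manual does not state this), and the function names and sample texts are hypothetical, not part of CasualPConc.

```python
def split_units(text):
    """Return the non-empty lines of a text.

    Assumption: one sentence/paragraph per line.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


def find_mismatch(text_a, text_b):
    """Compare the unit counts of two texts.

    Returns None when the counts match; otherwise returns the 1-based
    position of the first unit that has no counterpart in the shorter text.
    """
    units_a = split_units(text_a)
    units_b = split_units(text_b)
    if len(units_a) == len(units_b):
        return None
    return min(len(units_a), len(units_b)) + 1


# Hypothetical sample texts standing in for two matched corpus files.
english = "Hello.\nHow are you?\nGoodbye.\n"
french = "Bonjour.\nComment allez-vous ?\n"

print(find_mismatch(english, french))
```

A count check like this only detects a difference in length; it cannot tell where inside the files the alignment first drifts, so the Match Preview window remains the place to inspect the actual pairs.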

Saving/opening a database file

Save

You can save a temp database file for later use. Simply click the Save as... button to save the file. The file extension is .cpdb.

Open

Click the Open DB File button to open an existing database file. When you open a database file, the files in the unsaved temp database are discarded (a warning message appears).

If you want to add the files in a database file to the current database, use the Import function (see below).

Importing text file(s) with aligned texts

You can import the following file types into the existing database.

    • CasualPConc database file - to merge an existing database file into the current database

    • Aligned text file (.txt) - matched pairs of text from the two corpora are stored in a single file; the two texts of a pair are separated by a single line break, and successive pairs are separated by two line breaks (see the example below)

        text from one corpus 1
        text from another corpus 1

        text from one corpus 2
        text from another corpus 2

    • CSV (.csv) separated by commas (UTF-8) - each matched pair of texts is on one line, separated by a comma

    • CSV (.csv) separated by tabs (UTF-8) - each matched pair of texts is on one line, separated by a tab character (a tab-delimited text file with the .csv extension)

    • Note: CSV files should be encoded in ASCII or UTF-8.

When you import a .csv file, you will be asked whether the format is comma-separated or tab-delimited.

This process might be re-worked in the future.
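
The import formats described above can be produced with a short script. The following Python sketch is a minimal illustration, not part of CasualPConc: the sample pairs, function names, and output file name are hypothetical. It builds the aligned text layout and both CSV variants from a list of matched pairs.

```python
import csv
import io

# Hypothetical matched pairs: (text from one corpus, text from the other).
pairs = [
    ("Good morning.", "Bonjour."),
    ("Thank you, friend.", "Merci, mon ami."),
]


def to_aligned_text(pairs):
    """Join each pair with one line break and successive pairs with two."""
    return "\n\n".join(f"{a}\n{b}" for a, b in pairs) + "\n"


def to_delimited(pairs, delimiter):
    """Write one pair per line.

    The csv module quotes any field that contains the delimiter.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(pairs)
    return buffer.getvalue()


aligned = to_aligned_text(pairs)
comma_csv = to_delimited(pairs, ",")
tab_csv = to_delimited(pairs, "\t")

# To save a file for import, write it with UTF-8 encoding, for example:
# with open("pairs.txt", "w", encoding="utf-8") as f:
#     f.write(aligned)
print(aligned)
print(comma_csv)
```

This manual does not say whether CasualPConc honors standard CSV quoting, so for text that itself contains commas, the tab-delimited variant may be the safer choice.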

Exporting database content

You can export the text in the database. Click the Export button.

The available formats are CSV (.csv), separated by commas or by tabs (tab-delimited), and Parallel Aligned Text (.txt). You can also swap the order of the two corpora.

Parallel Aligned Text is the format CasualPConc can import (see above). Each pair of matched sentences/paragraphs from the two corpora is kept together and is separated from the previous and next pairs by a blank line.

text from corpus A 1
text from corpus B 1

text from corpus A 2
text from corpus B 2
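
An exported Parallel Aligned Text file can be read back into pairs for further processing. The following Python sketch is a minimal illustration that assumes each pair consists of exactly two lines and that pairs are separated by at least one blank line; the function names are hypothetical and not part of CasualPConc.

```python
import re


def parse_aligned_text(text):
    """Split Parallel Aligned Text into (corpus A, corpus B) pairs.

    Assumption: each pair is exactly two lines, and pairs are separated
    by one or more blank lines.
    """
    pairs = []
    normalized = text.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", normalized):
        lines = [line for line in block.split("\n") if line.strip()]
        if not lines:
            continue
        if len(lines) != 2:
            raise ValueError(f"expected 2 lines per pair, got {len(lines)}: {block!r}")
        pairs.append((lines[0], lines[1]))
    return pairs


def swap_corpora(pairs):
    """Reverse the corpus order within every pair."""
    return [(b, a) for a, b in pairs]


sample = (
    "text from corpus A 1\n"
    "text from corpus B 1\n"
    "\n"
    "text from corpus A 2\n"
    "text from corpus B 2\n"
)

pairs = parse_aligned_text(sample)
print(swap_corpora(pairs))
```

Raising an error on a block that does not contain exactly two lines makes a misaligned export visible immediately instead of silently shifting every later pair.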