CasualConc

© 2008-2009 Yasu Imao

CasualTextractor


CasualTextractor is a utility program that extracts text data from PDF files (.pdf), webpages (with simple browsing), web page files (.html/.htm/.webarchive), MS Word (.doc/.docx), Rich Text (.rtf/.rtfd), OpenOffice Text (.odt), and plain text files (.txt) in the supported encodings, and exports the extracted text as Plain Text (.txt) or Rich Text Format (.rtf) files.  CasualTextractor can also process all of these file types in batch.

This program is also still experimental.  Please back up your original files before you process them and use this at your own risk.  Any feedback is welcome.

System Requirements: Any Mac with Mac OS X Leopard 10.5.5 or later

The current version of CasualTextractor is 0.6.


How to use


PDF

PDF mode has two view panes: PDF view (left) and Text view (right).

  1. Click the Open button to open a PDF file in the PDF view on the left.  The embedded text will be displayed in the Text view on the right.  If you want to keep the font information in the PDF file (to save the extracted text as .rtf), check the Keep info checkbox.

    You have basic control of the PDF view: Right Click or Control + Click the PDF image to see the choices.
  2. Click the Format button to delete some line feed characters and put the text in paragraph format, which is the default format for CasualConc.  The program looks for lines that end with a period (.) and joins the other lines with a single space.  After this process, you will need to check the formatting because page headers/footers, text headers, etc. are not recognized as separate lines.  Also, some abbreviations will be recognized as the end of a paragraph.  If you do not want this, simply skip this step.  You have basic control of the Text view: Right Click or Control + Click the text.  You can change the font and font settings, but these are not saved (the saved file will be plain text).
  3. Click the Save button, select a folder, and give the text file a name.  You can select Plain Text (.txt) or Rich Text (.rtf) as the format.  If you select Plain Text (.txt), you can choose an encoding (see the list below).
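
As a rough illustration of what the Format step does, here is a minimal Python sketch of the rule described in step 2: a line ending with a period closes a paragraph, and every other line is joined to the next one with a single space.  This is not CasualTextractor's actual code; the function name join_lines_into_paragraphs and the choice to skip blank lines are assumptions made for the example.

```python
def join_lines_into_paragraphs(text: str) -> str:
    """Join hard-wrapped lines; a line ending in '.' closes a paragraph."""
    paragraphs = []
    current = []  # lines collected for the paragraph being built
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # assumption: blank lines from the page layout are dropped
        current.append(line)
        if line.endswith("."):
            # a trailing period is treated as the end of a paragraph
            paragraphs.append(" ".join(current))
            current = []
    if current:
        # leftover lines without a closing period still form a paragraph
        paragraphs.append(" ".join(current))
    return "\n".join(paragraphs)


sample = (
    "This is a line that was\n"
    "wrapped by the PDF layout.\n"
    "See Fig.\n"
    "2 for details.\n"
)
formatted = join_lines_into_paragraphs(sample)
print(formatted)
```

In this sample the abbreviation "Fig." ends a line, so the sketch splits the second sentence into two paragraphs.  This is the kind of result you need to check for by hand after using Format.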

Web

Web mode has two view panes.  Web view (left) and Text view (right).

  1. Enter a web address in the box above the Web view and hit the Return/Enter key.  You need to add "http://" at the beginning of the address.  Currently, web navigation is very limited.  Also, JavaScript that opens a new window does not run.
  2. Alternatively, you can drag & drop a Safari webarchive file, an HTML file, or even a PDF file or a text file onto the Web view.  However, if you use the Adobe Acrobat Plug-in to view PDF files in Safari, you cannot extract text from a PDF file in Web mode.  Use PDF mode instead.
  3. Click the Extract button to extract text to the Text view, or copy & paste text from the Web view to the Text view.  You can edit the extracted text before you save it.  If a text-embedded PDF file is in the Web view, the embedded text will be extracted.  If you want to keep text info, links, etc., check the Keep info checkbox (to save the extracted text as .rtf).  Images will also be extracted, but only the links to the images will be saved in the .rtf file.
  4. Click the Save button, select a folder, and give the text file a name.  You can select Plain Text (.txt) or Rich Text (.rtf) as the format.  The page title will be the default file name.  If you select Plain Text (.txt), you can choose an encoding (see the list below).
  5. CasualTextractor imports your current Safari Bookmarks every time it starts up.

 

Document

Document mode has a single view.

  1. Click the Open button to open a text file in the Text view.  You need to select the encoding if you open a plain text file (.txt).  Your choices are:

    Unicode (UTF-8) - 8-bit Unicode (Mac Standard)
    Unicode (UTF-16) - 16-bit Unicode (little endian)
    Windows Latin 1
    Windows Latin 2
    MacRoman - Mac OS 9 standard
    ASCII
    Shift-JIS (Japanese, Windows Standard)
    EUC-JP (Japanese, Unix Standard)
    ISO-2022-JP (Japanese, email standard)
    ISO Latin 1
    ISO Latin 2


    Other supported file formats (experimental) are:

    Plain Text (.txt)
    Rich Text Format (.rtf)
    Rich Text Format with attachment (.rtfd)
    MS Word (.doc)
    MS Word [Open XML] (.docx) Word 2007 (Win) & 2008 (Mac)
    PDF (.pdf)
    HTML (.html, .htm)
    Web Archive (.webarchive) from Safari [WebKit]
    OpenOffice (.odt)

    I haven't fully tested all the file formats, but mostly they seem to be working.  If not, you can use the PDF mode for PDF files and Web mode for HTML and Web Archive files.

    If you want to keep text info, links, etc. of the file, check the Keep info checkbox (to save the extracted text as .rtf).  Images will also be extracted, but only the links to the images will be saved in the .rtf file.

    You can edit the imported text before you save it.
  2. Click the Save button, select a folder, and give the text file a name.  You can select Plain Text (.txt) or Rich Text (.rtf) as the format.  If you select Plain Text (.txt), you can choose an encoding (see the list above).
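
To make the role of the encoding choice concrete, the following Python sketch decodes text with one of the encodings listed above and re-encodes it with another.  It only illustrates the idea and says nothing about how CasualTextractor is implemented.  The PYTHON_CODECS table and the transcode function are hypothetical, and the mapping from the names in the list to Python codec names is an assumption (for example, "utf-16-le" is used for UTF-16 little endian and writes no byte order mark).

```python
# Assumed mapping from the encoding names listed above to Python codec names.
PYTHON_CODECS = {
    "Unicode (UTF-8)": "utf-8",
    "Unicode (UTF-16)": "utf-16-le",  # little endian, no byte order mark
    "Windows Latin 1": "cp1252",
    "Windows Latin 2": "cp1250",
    "MacRoman": "mac_roman",
    "ASCII": "ascii",
    "Shift-JIS": "shift_jis",
    "EUC-JP": "euc_jp",
    "ISO-2022-JP": "iso2022_jp",
    "ISO Latin 1": "iso8859_1",
    "ISO Latin 2": "iso8859_2",
}


def transcode(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes with the source encoding and re-encode with the target."""
    text = data.decode(PYTHON_CODECS[source])
    return text.encode(PYTHON_CODECS[target])


# Example: a Shift-JIS text converted to UTF-8.
shift_jis_bytes = "日本語のテキスト".encode("shift_jis")
utf8_bytes = transcode(shift_jis_bytes, "Shift-JIS", "Unicode (UTF-8)")
print(utf8_bytes.decode("utf-8"))
```

If the wrong source encoding is chosen, decoding either raises an error or produces garbled characters, which is why the encoding has to be selected when a plain text file is opened.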

 

Batch

Batch mode lets you extract text from multiple files and save the results in one run.  A Plain Text or Rich Text file is created for each original file with the same name (the extension is replaced).  PDF text formatting is not implemented in Batch mode yet because the formatting is still far from perfect.

  1. Click the Open button to add files to the table.  You can add multiple files/folders.  Supported file types are the same as in Document mode.
  2. Alternatively, you can drag & drop supported files onto the table.
  3. If you do not want to process some files, simply select them in the table and click the Delete button.  You can clear the table by clicking the Clear button.
  4. You need to select the encoding if you process a plain text file (.txt).  The choices are the same as in Document mode.  You can change the encodings in the table.
    If you want to keep text info, links, etc. of the original files, check the Keep info checkbox (to save the extracted text as .rtf).  Images will also be extracted, but only the links to the images will be saved in the .rtf file.
  5. If you want to save the files with extracted text in the same folder as the original files, check Save to Original Folder.
  6. Select the file format of the exported files.  You can select Plain Text (.txt) or Rich Text (.rtf).  If you select Plain Text (.txt), you can choose an encoding (see the list above).
  7. If you are sure you want to process the selected files, click the Process button.  If Save to Original Folder is not checked, CasualTextractor prompts you to select a folder to save the files.
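
The naming rule used in Batch mode (same name, extension replaced, saved either next to the original or in a folder you choose) can be sketched in Python as follows.  This is only an illustration under stated assumptions: the function output_path and the example paths are hypothetical and are not taken from CasualTextractor.

```python
from pathlib import PurePosixPath
from typing import Optional


def output_path(
    source: PurePosixPath,
    extension: str,
    output_folder: Optional[PurePosixPath] = None,
) -> PurePosixPath:
    """Return where the extracted text for `source` would be saved.

    With no output_folder, the file goes next to the original
    (the "Save to Original Folder" case).
    """
    folder = output_folder if output_folder is not None else source.parent
    # keep the base name, replace only the extension
    return folder / (source.stem + extension)


# Hypothetical example files.
originals = [
    PurePosixPath("/Users/me/corpus/report.pdf"),
    PurePosixPath("/Users/me/corpus/notes.docx"),
]

same_folder = [output_path(p, ".txt") for p in originals]
chosen_folder = [
    output_path(p, ".rtf", PurePosixPath("/Users/me/extracted")) for p in originals
]

for path in same_folder + chosen_folder:
    print(path)
```

Note that in this sketch a .txt original exported as .txt to its own folder maps to the same path as the original, so it is worth backing up your files before a batch run.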