CasualTextractor is a utility program that extracts text
from PDF files (.pdf), webpages (with simple browsing), web page
files (.html/.htm/.webarchive), MS Word (.doc/.docx), Rich Text
(.rtf/.rtfd), OpenOffice Text (.odt), and plain text files (.txt)
encoded in the supported encodings, and exports the extracted text as Plain Text (.txt) or Rich Text Format (.rtf) files. CasualTextractor
can also process all of these file types in batch.
This program is still experimental. Please back up your original files before you process them, and use it
at your own risk. Any feedback is welcome.
System Requirement: Any Mac with Mac OS X Leopard 10.5.5 or later
The current version of CasualTextractor is 0.6.
How to use
PDF
PDF mode has two view panes: PDF view and Text view.
- Click the Open button
to open a PDF file in the PDF view on the left. The embedded
text will be displayed in the Text view on the right. If you want to
keep the font information in the PDF file (to save the extracted text as
.rtf), check the Keep info checkbox.
You have basic
control of the PDF view: Right Click or Control + Click the PDF image to see the choices.
- Click the Format
button to delete some line feed characters and put the text in paragraph
format, which is the default format for CasualConc. The program looks
for lines that end with a period (.) and joins all other lines to the
line that follows with a single space. After this process, you will need to check the formatting
because page headers/footers, text headers, etc. are not recognized as
separate lines. Also, some abbreviations will be recognized as the end of a
paragraph. If you do not want this, simply skip this process.
You have basic control of the Text view: Right Click or Control + Click the text.
You can change the font/font settings, but these are not saved (the saved
file will be plain text).
- Click the Save button, select a folder, and give the text file a name. You can select Plain Text (.txt) or Rich Text (.rtf) as the format. If you select Plain Text (.txt), you can choose the encoding (see the list below).
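The Format step described above can be illustrated with a short Python sketch. This is not CasualTextractor's actual code: the function name `reflow_paragraphs` is hypothetical, and skipping blank lines is an assumption that the description does not state. The sketch joins consecutive lines with a single space and closes a paragraph whenever a line ends with a period.

```python
def reflow_paragraphs(text: str) -> str:
    """Join hard-wrapped lines into paragraphs.

    A line that ends with a period closes the current paragraph;
    every other line is joined to the next one with a single space.
    Blank lines are skipped (an assumption made for this sketch).
    """
    paragraphs = []
    current = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # assumption: blank lines carry no content
        current.append(line)
        if line.endswith("."):
            # End of paragraph: emit the collected lines as one line.
            paragraphs.append(" ".join(current))
            current = []
    if current:
        # Trailing lines with no final period (for example a page footer).
        paragraphs.append(" ".join(current))
    return "\n".join(paragraphs)
```

With this rule, `reflow_paragraphs("See e.g.\nthe manual.")` leaves the two lines separate because the first line ends with a period, which is the abbreviation problem mentioned above. Headers and footers without a final period are likewise merged into the following text, so the result still needs a manual check.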
Web
Web mode has two view panes: Web view (left) and Text view (right).
- Enter
a web address in the box above the Web view and hit the Return/Enter key.
You need to add "http://" at the beginning of the address. Currently,
you cannot do much with web navigation. Also, JavaScript that opens a
new window does not run.
- Alternatively,
you can drag & drop a Safari webarchive file, an HTML file, or
even a PDF file or a text file onto the Web view. However, if you use the Adobe
Acrobat Plug-in to view PDF files in Safari, you cannot extract text
from a PDF file in Web mode. Use PDF mode instead.
- Click the Extract
button to extract text to the Text view, or copy & paste text from
the Web view to the Text view. You can edit the extracted text before
you save it. If a text-embedded PDF file is in the Web view, the
embedded text will be extracted. If you want to keep text info, links, etc., check the Keep info
checkbox (to save the extracted text as .rtf). Images will also be
extracted, but only the links to the images will be saved in the .rtf
file.
- Click the Save button, select a folder, and give the text file a name. You can select Plain Text (.txt) or Rich Text (.rtf)
as the format. The page title will be the default file name. If you
select Plain Text (.txt), you can choose the encoding (see the list below).
- CasualTextractor imports your current Safari Bookmarks every time it starts up.
Document
Document mode has a single view.
- Click the Open
button to open a text file in the Text view. You need to select the
encoding if you open a plain text file (.txt). Your choices are:
Unicode (UTF-8) - 8-bit Unicode (Mac Standard)
Unicode (UTF-16) - 16-bit Unicode (little endian)
Windows Latin 1
Windows Latin 2
MacRoman - Mac OS 9 standard
ASCII
Shift-JIS (Japanese, Windows Standard)
EUC-JP (Japanese, Unix Standard)
ISO-2022-JP (Japanese, email standard)
ISO Latin 1
ISO Latin 2
Other supported file formats (experimental) are:
Plain Text (.txt)
Rich Text Format (.rtf)
Rich Text Format with attachment (.rtfd)
MS Word (.doc)
MS Word [Open XML] (.docx) Word 2007 (Win) & 2008 (Mac)
PDF (.pdf)
HTML (.html, .htm)
Web Archive (.webarchive) from Safari [WebKit]
OpenOffice (.odt)
I haven't fully tested all the file formats, but most of them seem to be
working. If one does not work, you can use PDF mode for PDF files and Web mode
for HTML and Web Archive files.
If you want to keep text info, links, etc. of the file, check the Keep info
checkbox (to save the extracted text as .rtf). Images will also be
extracted, but only the links to the images will be saved in the .rtf
file.
You can edit the imported text before you save it.
- Click the Save button, select a folder, and give the text file a name. You can select Plain Text (.txt) or Rich Text (.rtf) as the format. If you select Plain Text (.txt), you can choose the encoding (see the list above).
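The encoding choices listed for plain text files can be related to codec names in Python's standard library. The sketch below is only an illustration of opening a file in one encoding and saving it in another. It is not how CasualTextractor works internally: the `CODECS` table and the `transcode` function are hypothetical names, and the pairing of each menu label with a Python codec is an assumption.

```python
# Assumed mapping from the encoding labels in the list to Python codec names.
CODECS = {
    "Unicode (UTF-8)": "utf-8",
    # Python's "utf-16" codec reads the byte order from the BOM when decoding.
    "Unicode (UTF-16)": "utf-16",
    "Windows Latin 1": "cp1252",
    "Windows Latin 2": "cp1250",
    "MacRoman": "mac_roman",
    "ASCII": "ascii",
    "Shift-JIS": "shift_jis",
    "EUC-JP": "euc_jp",
    "ISO-2022-JP": "iso2022_jp",
    "ISO Latin 1": "latin_1",
    "ISO Latin 2": "iso8859_2",
}


def transcode(data: bytes, source: str, target: str = "Unicode (UTF-8)") -> bytes:
    """Decode bytes using the source encoding label, re-encode using the target label."""
    text = data.decode(CODECS[source])   # raises UnicodeDecodeError on a wrong choice
    return text.encode(CODECS[target])   # raises UnicodeEncodeError if unrepresentable
```

Choosing the wrong source encoding either raises an error or silently produces garbled characters, which is why the encoding has to be selected by hand when a plain text file is opened.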
Batch
Batch mode lets you extract text and save the resulting files in batch.
A Plain Text or Rich Text file will be created for each original file with
the same name (the extension will be replaced). PDF text formatting is not
implemented in Batch mode yet because it is far from perfect.
- Click the Open button to add files to the table. You can add multiple files/folders. The supported file types are the same as in Document mode.
- Alternatively, you can drag & drop supported files onto the table.
- If you do not want to process some files, simply select the files in the table and click the Delete button. You can clear the table by clicking the Clear button.
- You need to select the encoding if you process a plain text file (.txt).
The choices are the same as in Document mode. You can change the encodings in the table.
- If you want to keep text info, links, etc. of the original files, check the Keep info
checkbox (to save the extracted text as .rtf). Images will also be
extracted, but only the links to the images will be saved in the .rtf
file.
- If you want to save the files with extracted text in the same folder in which the original files are placed, check Save to Original Folder.
- Select the file format of the exported files. You can select Plain Text (.txt) or Rich Text (.rtf). If you select Plain Text (.txt), you can choose the encoding (see the list above).
- If you are sure you want to process the selected files, click the Process
button. If Save to Original Folder is not checked, CasualTextractor prompts you to select a folder to save the files.
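The naming rule for Batch mode (same name, extension replaced, saved either next to the original or in a chosen folder) can be sketched in Python as follows. This is an illustration under stated assumptions, not the program's own logic: `output_path` and its parameters are hypothetical names.

```python
from pathlib import Path
from typing import Optional


def output_path(
    source: Path,
    rich_text: bool,
    save_to_original_folder: bool,
    output_folder: Optional[Path] = None,
) -> Path:
    """Return where the extracted text for `source` would be written."""
    # The export format decides the new extension.
    extension = ".rtf" if rich_text else ".txt"
    if save_to_original_folder:
        folder = source.parent
    else:
        if output_folder is None:
            # Mirrors the prompt for a destination folder in the description.
            raise ValueError("an output folder is required")
        folder = output_folder
    # Same base name as the original, with the extension replaced.
    return folder / (source.stem + extension)
```

In this sketch, a .txt source exported as Plain Text into its original folder maps to the same path as the source. Whether CasualTextractor overwrites the file in that case is not stated here, so backing up the originals first is the safe choice.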