Sitescooper Guide

Jumpstart Guide to using Sitescooper with Plucker or IsiloX


Disclaimer: All contents provided are solely for instruction and reference. The author will not bear responsibility for any loss of data, hardware and software problems, emotional and mental distress or trauma, and anything else that may result from its use. More site files may be found at the sitescooper website or in your Sitescooper site samples directory.

1. Preface

Main programs required - Sitescooper & Perl
Recommended document readers - Isilo (requires IsiloXC for content conversion) or Plucker

This jumpstart guide is non-exhaustive and covers the main procedures for getting Sitescooper to work with Plucker or IsiloX. You should have a basic working knowledge of Plucker or IsiloX before proceeding with this guide. Installation of Sitescooper and Perl is not covered here, since it is straightforward and well detailed in the accompanying documentation.

There are two basic building blocks for scooping: the site files and the Sitescooper command line arguments. For further details on these topics, please refer to the Sitescooper documentation. The scooping process begins with webpages being downloaded and parsed by Sitescooper, then stored on local storage (i.e. the hard disk) as standard HTML files. The HTML files are subsequently converted by a document converter application (Plucker or IsiloX). The resulting .pdb file can then be transferred to the Palm by Hotsyncing.

2. Why use Sitescooper?

It is FREE! That is only one of the good reasons.

Although document converters such as Plucker or Isilo can retrieve webpages by themselves, Sitescooper provides more flexibility and more precise control over the contents. It can selectively retrieve the relevant information from pages, and can include or skip images, links and other objects so that redundant items are not downloaded. It is driven by Perl and hence provides increased control when parsing webpages.

Since Sitescooper is executed at the command line prompt, it is possible to open multiple instances to optimise the scooping process. Refer to "Concurrent Scooping for Sitescooper".

All the features within Sitescooper work together to greatly reduce the download time of the webpages and processing time for the converters.

3. Quick-Start Procedure

This section is a basic walkthrough that shows how to prepare and set up Sitescooper for content retrieval for Plucker or IsiloX.

You need to download the Quickstart files before beginning.

1. Install Sitescooper (full version) and Perl. For this example, use the default Sitescooper folder name "sitescooper-3.1.2" on the C drive and the default Perl folder at c:\perl.
2. Create a sub-folder named "sites" (ignore quotes) in the Sitescooper folder.
3. Extract the Quick Setup files and place them as follows:

i. site file (sti.site) into the "sites" folder. All other site files should also be placed in this folder.
ii. batch files - scoop_plucker.bat (for Plucker users) or scoop_isilo.bat (for Isilo users) into the Sitescooper folder.

Note: Check that the folder paths defined within the files reflect your current Plucker or Isilo location.

4. Execute the scoop_plucker.bat or scoop_isilo.bat batch file. The "Straits Times Interactive" pages will be downloaded to the folder "tmp\txt\Straits_Times_Interactive" within the Sitescooper folder and converted to the respective reader's format.

5. The scooped data should be transferred to your Palm upon the next hotsync operation.

Note for IsiloX users: Ensure that IsiloXC is installed. Execute the batch file at least once (disregard the IsiloX command error), then set up the IsiloX desktop to convert the index.html file in the Sitescooper tmp\txt folder, where the downloaded pages are stored.

Note for Plucker DESKTOP users: All instructions provided here are for the non-desktop version. To use the downloaded pages, create new channels in the Plucker Desktop and point them to the respective tmp\txt\<site_name> folder. Alternatively you may also use Sunrise to convert the downloaded pages.
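Before running the batch file for the first time, it can help to confirm that the files from the steps above ended up in the right places. The following is a minimal Python sketch of such a check, not part of Sitescooper or the Quickstart files; the function name check_layout is hypothetical, and the folder and file names are the ones used in this walkthrough.

```python
from pathlib import Path


def check_layout(root, batch_name="scoop_plucker.bat"):
    """Return a list of problems found in a Sitescooper folder.

    An empty list means the layout matches the Quick-Start steps.
    `check_layout` is a hypothetical helper, not a Sitescooper feature.
    """
    root = Path(root)
    problems = []

    # Step 2: a "sites" sub-folder holding at least one .site file.
    sites = root / "sites"
    if not sites.is_dir():
        problems.append("missing 'sites' sub-folder")
    elif not any(sites.glob("*.site")):
        problems.append("no .site files in 'sites'")

    # Step 3: the batch file sits directly in the Sitescooper folder.
    if not (root / batch_name).is_file():
        problems.append(f"missing batch file {batch_name}")

    return problems


if __name__ == "__main__":
    # The folder name below is the default used in this guide.
    for problem in check_layout(r"C:\sitescooper-3.1.2"):
        print("Problem:", problem)
```

Isilo users would pass batch_name="scoop_isilo.bat" instead.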


4. Inside the Quick Setup Files

scoop_plucker.bat

perl.exe sitescooper.pl -mhtml -mplucker -color -maxcolors 256 -nodates -noheaders -nofooters -sites sites\*.site

Note: The -color -maxcolors 256 arguments can be omitted for monochrome Palm users.

scoop_isilo.bat

perl.exe sitescooper.pl -mhtml -color -maxcolors 256 -nodates -noheaders -nofooters -sites sites\*.site

"C:\Program Files\iSilo\iSiloX\iSiloXC.exe" -x "C:\Program Files\iSilo\iSiloX\index.ixl" -u -v

Note: Ensure that the path to IsiloXC.exe and to the .ixl file is correct for your installation.
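The two batch files differ only in the -mplucker switch, and the colour switches are optional. As an illustration, the Python sketch below assembles the same Sitescooper command line from those choices. It uses only the switches shown above; the function name build_scoop_command and its parameters are hypothetical and are not part of Sitescooper.

```python
def build_scoop_command(reader="plucker", color=True, sites_glob=r"sites\*.site"):
    """Assemble the Sitescooper command line used in the batch files.

    reader: "plucker" adds -mplucker; "isilo" leaves it out.
    color:  False drops -color -maxcolors 256 (monochrome Palm).
    """
    if reader not in ("plucker", "isilo"):
        raise ValueError("reader must be 'plucker' or 'isilo'")

    args = ["perl.exe", "sitescooper.pl", "-mhtml"]
    if reader == "plucker":
        args.append("-mplucker")
    if color:
        args += ["-color", "-maxcolors", "256"]
    args += ["-nodates", "-noheaders", "-nofooters", "-sites", sites_glob]
    return args


if __name__ == "__main__":
    # Command line for a monochrome Palm running Plucker.
    print(" ".join(build_scoop_command("plucker", color=False)))
```

The sketch only builds and prints the command; the iSiloXC conversion step in scoop_isilo.bat is a separate command and is not covered here.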

sti.site, dilbert.site, mediacorptv.site

Included site files for Straits Times Interactive, the Dilbert comic and the Mediacorp TV Guide.

5. In-depth topic

5.1 Site files Creation Primer

Within the Sitescooper folder, there is a sub-folder named site_samples with a large collection of site files contributed by Sitescooper users. These may be used as they are or modified to suit your needs. The standard format for a site file is as follows (comments are preceded by a # symbol):

URL: http://straitstimes.asia1.com.sg/avantgo/index/ # URL of the site to be scooped
Name: Straits Times Interactive # Title to identify the webpage (also used by Sitescooper as the folder name for scooped data)
Levels: 2 # the depth of levels for scooping (value from 1 to 3)

Optional parameters

StoryDiff: 1 # compares data with cache from previous scooping to avoid repeat downloads.
ImageURL: http://straitstimes.asia1.com.sg/avantgo/.* # URL of images to be included *
StoryURL: http://straitstimes.asia1.com.sg/avantgo/story/.* # URL of links to be included *
StoryStart:
StoryEnd:

* denotes selective parameters. These provide additional control over the scooped data and may include Perl regular expressions (refer to "Writing a Sitescooper .site File - StoryURL and Regular Expressions" in the Sitescooper documentation).
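To get a feel for how a pattern such as the StoryURL above selects links, the following Python sketch reads the "Key: value" lines of a site file and tests a few URLs against the pattern. It is an illustration only and not Sitescooper's own parser: the function names are hypothetical, comments are assumed to sit on their own lines, the whole URL is assumed to have to match, and Python's re module stands in for Perl regular expressions (the two agree for a simple pattern like this one). The two test URLs are made up.

```python
import re

# A cut-down site file using the values from this guide (comments removed).
SAMPLE = """\
URL: http://straitstimes.asia1.com.sg/avantgo/index/
Name: Straits Times Interactive
Levels: 2
StoryURL: http://straitstimes.asia1.com.sg/avantgo/story/.*
"""


def parse_site(text):
    """Collect 'Key: value' pairs; skip blank lines and lines starting with #."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so "http://" in the value survives.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_story(url, fields):
    """True if the whole URL matches the StoryURL pattern (an assumption)."""
    pattern = fields.get("StoryURL")
    return bool(pattern) and re.fullmatch(pattern, url) is not None


if __name__ == "__main__":
    site = parse_site(SAMPLE)
    for url in (
        "http://straitstimes.asia1.com.sg/avantgo/story/example.html",
        "http://straitstimes.asia1.com.sg/avantgo/index/",
    ):
        print(url, "->", "story" if is_story(url, site) else "skipped")
```

Note that an unescaped dot in the pattern matches any character, so the pattern is slightly more permissive than it looks.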

5.2 Scooping Sites

The site files and the batch file are the main Sitescooper elements required to perform the scooping. Note that Sitescooper only retrieves, parses and stores the downloaded pages in a directory on your hard disk as raw HTML pages. You will still need Plucker or IsiloX to convert them to a proper .pdb file and transfer it to your Palm via hotsync.

The folder structure for scooped sites should consist of a single index.html file, together with one folder for each scooped site. Sitescooper checks all the folders under \tmp\txt, finds the corresponding HTML files inside and links them all in index.html. If the result looks different, check the command line you use to scoop and the configuration you are using (i.e. the \sites folder, the site_choices.txt file, or both).
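The layout described above (one index.html plus one folder per site) can be pictured with a short Python sketch that walks a tmp\txt style folder and produces a page of links to the HTML files it finds. This is not how Sitescooper itself builds index.html; the function name build_index is hypothetical and the sketch only illustrates the expected folder structure.

```python
from html import escape
from pathlib import Path


def build_index(txt_root):
    """Return HTML linking every .html file found one level below txt_root.

    Each sub-folder of txt_root is treated as one scooped site.
    Illustration only; Sitescooper generates its own index.html.
    """
    txt_root = Path(txt_root)
    items = []
    for site_dir in sorted(p for p in txt_root.iterdir() if p.is_dir()):
        for page in sorted(site_dir.glob("*.html")):
            # Links are relative to txt_root, where index.html lives.
            href = f"{site_dir.name}/{page.name}"
            label = f"{site_dir.name}: {page.name}"
            items.append(f'<li><a href="{escape(href)}">{escape(label)}</a></li>')
    return "<html><body><ul>\n" + "\n".join(items) + "\n</ul></body></html>\n"


if __name__ == "__main__":
    # Path assumes the default folder name used in this guide.
    print(build_index(r"C:\sitescooper-3.1.2\tmp\txt"))
```

If a site folder is missing from the output, the pages for that site were probably not downloaded.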

5.3 IsiloX Setup

IsiloX Reference Manual
IsiloXC Reference Manual

Tip for IsiloX with expansion support

If the auto-install feature in iSiloX is used to install scooped files to the Memory Stick (MS) automatically, the next Hotsync will take eons to complete. This is due to a "feature" (it cannot exactly be called a bug either) in Hotsync Manager: for a given prc/pdb file, the syncing time depends on the number of records in the file rather than on the file size itself. For example, a 1000-record 1 MB file will take much longer to sync than a 10-record 1 MB file.

To overcome this problem, set iSiloX NOT to install the file; instead, have it save the pdb file directly to your MS via MS Import. This means that every time you convert your files with iSiloX, your Clie must be running MS Import before you start the conversion. A bit of a hassle, but it beats waiting an eternity for your next Hotsync.

start "" /min "C:\Program Files\iSilo\iSiloX\iSiloXC.exe" -x "C:\Program Files\iSilo\iSiloX\index.ixl" -u -v


Guide brought to you by
Molife@blogspot

Related Topics 

How to perform Concurrent Scooping with Sitescooper

Sitescooper Essentials

Sitescooper

Activestate Perl

Plucker

Isilo and IsiloX

Sitescooper Quickstart Files 

Contact

Forward all feedback or comments to Molife@blogspot