Disclaimer: All contents provided are solely for instruction and reference. The author will not bear responsibility for any loss of data, hardware or software problems, emotional or mental distress or trauma, or anything else that may result from their use. More site files may be found at the Sitescooper website or in your Sitescooper site samples directory.

1. Preface

Main programs required - Sitescooper & Perl

This jumpstart guide is not exhaustive; it covers the main procedures for getting Sitescooper to work with Plucker or iSiloX. You should have a basic working knowledge of Plucker or iSiloX before proceeding with this guide. Installation of Sitescooper and Perl is not covered here, since it is straightforward and well detailed in the documentation that accompanies each program.

There are 2 basic building blocks for scooping - the site files and the Sitescooper command line arguments. For further details regarding these topics, please refer to the Sitescooper documentation.

The scooping process begins with webpages being downloaded and parsed by Sitescooper, then stored on the local storage media (i.e. the hard disk drive) as standard HTML files. The HTML files are subsequently converted by a document converter application (Plucker or iSiloX). The resulting .pdb file can then be transferred to the Palm by HotSync.

2. Why use Sitescooper?

It is FREE! And that is only one of the good reasons. Although document converters such as Plucker or iSilo can retrieve webpages by themselves, Sitescooper offers greater flexibility and more precise control over the contents. It can selectively retrieve the relevant information from pages, and it can include or omit redundant images, links or other objects from the download. It is driven by Perl, which provides increased control when parsing webpages.
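To make the idea of selective retrieval concrete, here is a minimal sketch in Python. It is not Sitescooper's own code (Sitescooper is written in Perl and is configured through site files); the function names, the marker comments and the sample page are all hypothetical and serve only to illustrate keeping the story part of a page and dropping its images.

```python
import re


def extract_between(html, start_marker, end_marker):
    """Return the text between the two markers, or None if they are not found."""
    pattern = re.escape(start_marker) + r"(.*?)" + re.escape(end_marker)
    match = re.search(pattern, html, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def strip_images(html):
    """Remove every <img ...> tag from the given HTML fragment."""
    return re.sub(r"<img\b[^>]*>", "", html, flags=re.IGNORECASE)


# Hypothetical page: only the part between the two comments is wanted.
page = """<html><body>
<div>navigation bar, adverts</div>
<!-- story start -->
<h1>Headline</h1><img src="banner.gif"><p>Story text.</p>
<!-- story end -->
<div>footer links</div>
</body></html>"""

story = extract_between(page, "<!-- story start -->", "<!-- story end -->")
print(strip_images(story))
```

The smaller the fragment that is kept, the less data the converter has to process afterwards.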
Since Sitescooper is executed at the command line prompt, it is possible to open multiple instances to optimise the scooping process. Refer to "Concurrent Scooping for Sitescooper". All the features within Sitescooper work together to greatly reduce the download time of the webpages and the processing time for the converters.

3. Quick-Start Procedure

This section is a basic walkthrough that shows how to prepare and set up Sitescooper to retrieve content for Plucker or iSiloX. You need to download the Quickstart files before beginning.

1. Install Sitescooper (full version) and Perl. (For this example, use the default Sitescooper folder name "sitescooper-3.1.2" on the C drive and the default Perl folder at c:\perl.)

2. Place the site file (sti.site) into the "sites" folder. All other site files should also be placed in this folder.

3. Execute the scoop_plucker.bat or scoop_isilo.bat batch file. The "Straits Times Interactive" pages will be downloaded to the folder "tmp\txt\Straits_Times_Interactive" within the Sitescooper folder and converted to the respective reader's format.

Note for iSiloX users: Ensure that iSiloXC is installed. Execute the batch file at least once (disregard the iSiloX command error), then set up the iSiloX desktop to convert the index.html file in the Sitescooper tmp\txt folder, where the downloaded pages are stored.

Note for Plucker DESKTOP users: All instructions provided here are for the non-desktop version. To use the downloaded pages, create new channels in Plucker Desktop and point them to the respective tmp\txt\<site_name> folder. Alternatively, you may use Sunrise to convert the downloaded pages.
scoop_plucker.bat

perl.exe sitescooper.pl -mhtml -mplucker -color -maxcolors 256 -nodates -noheaders -nofooters -sites sites\*.site

Note: The -color -maxcolors 256 arguments can be omitted for monochrome Palm users.

scoop_isilo.bat

perl.exe sitescooper.pl -mhtml -color -maxcolors 256 -nodates -noheaders -nofooters -sites sites\*.site
"C:\Program Files\iSilo\iSiloX\iSiloXC.exe" -x "C:\Program Files\iSilo\iSiloX\index.ixl" -u -v

Note: Ensure that the path to iSiloXC is correct.

sti.site, dilbert.site, mediacorptv.site

Included site files for Straits Times Interactive, Dilbert Comic and Mediacorp TV Guide.

5. In-depth topic

5.1 Site files Creation Primer

Within the Sitescooper folder, there is a sub-folder named site_samples with a large collection of site files contributed by Sitescooper users. These may be used as they are or modified to suit your needs.

The standard format for a site file is as follows (comments are preceded by a # symbol):
Optional parameters
* denotes selective parameters. These provide additional control over the scooped data and may include Perl regular expressions (refer to "Writing a Sitescooper .site File - StoryURL and Regular Expressions" in the Sitescooper documentation).

5.2 Scooping Sites

The site files and the batch file are the main Sitescooper elements required to perform the scooping. Note that Sitescooper only retrieves, parses and stores the downloaded pages in a directory on your hard disk as raw HTML pages. You still need Plucker or iSiloX to convert them to a proper .pdb file, which is then transferred to your Palm via HotSync.

The folder structure for scooped sites should consist of a single index.html file together with one folder per scooped site. Sitescooper checks all the folders under \tmp\txt, finds the HTML files inside them and links them all in index.html.

When reporting a problem, post the command line that you use to scoop, together with the configuration you are using (i.e. whether you are using the \sites folder, the site_choices.txt file, or both).

5.3 iSiloX Setup

iSiloX Reference Manual

Tip for iSiloX with expansion support

If the auto-install feature in iSiloX is used to install scooped files to the Memory Stick (MS) automatically, the next HotSync will take a very long time to complete. This is due to a "feature" (it cannot exactly be called a bug) in HotSync Manager: for a given prc/pdb file, the syncing time depends on the number of records in the file rather than on the actual file size. For example, a 1 MB file with 1000 records takes much longer to sync than a 1 MB file with 10 records.

To work around this, set iSiloX NOT to install the file; instead, have it save the .pdb file directly to your MS via MS Import. This means that every time you run iSiloX on your files, your Clie must be running MS Import before you start the conversion. It is a bit of a hassle, but it beats waiting an eternity for your next HotSync.
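start /min "C:\Program Files\iSilo\iSiloX\iSiloXC.exe" -x "C:\Program Files\iSilo\iSiloX\index.ixl" -u -v

Note: Under cmd.exe on Windows NT/2000/XP and later, start treats the first quoted argument as the window title, so on those systems the command may need an empty title placed first:

start "" /min "C:\Program Files\iSilo\iSiloX\iSiloXC.exe" -x "C:\Program Files\iSilo\iSiloX\index.ixl" -u -v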
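The folder layout described in section 5.2 (one index.html next to one folder per scooped site) can be pictured with the short Python sketch below. It is only an illustration of that layout under stated assumptions: the build_index function is hypothetical, the HTML it writes is a simplification, and the index that Sitescooper itself generates will look different.

```python
import os
import tempfile


def build_index(txt_root):
    """Write index.html in txt_root, linking every .html file in its sub-folders.

    Returns the list of generated link lines.
    """
    links = []
    for site in sorted(os.listdir(txt_root)):
        site_dir = os.path.join(txt_root, site)
        if not os.path.isdir(site_dir):
            continue  # skip plain files such as an existing index.html
        for name in sorted(os.listdir(site_dir)):
            if name.lower().endswith(".html"):
                links.append(f'<li><a href="{site}/{name}">{site}: {name}</a></li>')
    index_path = os.path.join(txt_root, "index.html")
    with open(index_path, "w", encoding="utf-8") as fh:
        fh.write("<html><body><ul>\n" + "\n".join(links) + "\n</ul></body></html>\n")
    return links


if __name__ == "__main__":
    # Demonstration in a throwaway folder standing in for tmp\txt.
    with tempfile.TemporaryDirectory() as root:
        site_dir = os.path.join(root, "Straits_Times_Interactive")
        os.makedirs(site_dir)
        with open(os.path.join(site_dir, "page1.html"), "w", encoding="utf-8") as fh:
            fh.write("<html><body>Story</body></html>")
        for line in build_index(root):
            print(line)
```

Pointing the converter at the single index.html is what lets it pick up every scooped site in one pass.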