Automatic detection of mails moved from/to the spam folder

This script works but is unmaintained. There is a known issue that can damage the automatic learning: it should not drop your mails, but it might generate bad classifications or nuke your dspam learning base.

I'm currently using dspam to filter my mails. However, as I'm using IMAP, spam filtering is done server side. So, to identify false negatives (FN) and false positives (FP), I cannot use the built-in features of my mail clients (I have several); I need to communicate with the server. Until recently, I was using the classic approach: when I got a FN or FP, I redirected the mail (with full headers) to a special address, which sent it to dspam, telling it that it was a misclassification.

The problem with this approach in practice is that to mark a FP/FN I need to both retransmit the mail and move it to the correct folder, which is redundant. Of course, most mail clients can help with that after some configuration, but that's still several operations where one should be enough. Moreover, in the case of a FN, it means sending a spam through SMTP, which can sometimes be a problem.

So, I've made a script which watches the content of the spam folder and detects mails which are added and removed. This way, to mark a FN as spam, I just need to move it to the Spam folder: the script will detect that a mail has been added, and will re-train dspam with the signature of the email. For FP, it's the same thing: I just need to move the mail out of the spam folder, the script will detect that and call dspam with the signature of the moved email.
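To illustrate the "signature of the email" part, here is a minimal sketch of reading the signature from a mail's headers. It is not taken from the attached script; the function name `dspam_signature` is hypothetical, and the header name `X-DSPAM-Signature` is an assumption about how your dspam is configured.

```python
from email import message_from_bytes
from typing import Optional

# Assumption: dspam stores the signature in this header.
SIGNATURE_HEADER = "X-DSPAM-Signature"


def dspam_signature(raw_mail: bytes) -> Optional[str]:
    """Return the dspam signature found in the mail headers, or None."""
    msg = message_from_bytes(raw_mail)
    value = msg.get(SIGNATURE_HEADER)
    return value.strip() if value else None


sample = (
    b"From: someone@example.org\r\n"
    b"Subject: hello\r\n"
    b"X-DSPAM-Result: Innocent\r\n"
    b"X-DSPAM-Signature: 4916ba3b179033708835974\r\n"
    b"\r\n"
    b"Body text\r\n"
)
print(dspam_signature(sample))  # 4916ba3b179033708835974
```

A mail without that header yields None, in which case there is nothing to re-train with.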

The script is a single-file Python script; see the attachment.

It works with Maildir-style mailboxes, dspam and a MySQL database. However, the principle is simple and can easily be adapted. The implementation is currently really dumb and could be enhanced (especially resource-wise, for the regular scan), but it works.

The principle of the script is to scan the directory regularly to look for missing and added mails. The script must also be hooked into the delivery system (procmail in my case) to avoid trying to re-learn a spam that dspam already classified as spam.
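The scan principle can be sketched as a set difference between two snapshots of the folder. This is only an illustration under my own assumptions, not the code of the attached script: the names `maildir_keys` and `diff_scan` are hypothetical, mails are identified here by the stable part of their Maildir file name, and the snapshots live in memory rather than in a database.

```python
import os


def maildir_keys(folder):
    """Return the stable part of each mail file name in a Maildir folder.

    The flags suffix (after ':2,') changes when a mail is read or flagged,
    so it is stripped to get an identifier that survives those changes.
    """
    keys = set()
    for sub in ("cur", "new"):
        path = os.path.join(folder, sub)
        if not os.path.isdir(path):
            continue
        for name in os.listdir(path):
            keys.add(name.split(":2,", 1)[0])
    return keys


def diff_scan(previous, current, pushed):
    """Compare two scans of the spam folder.

    previous: identifiers seen at the last scan
    current:  identifiers seen now
    pushed:   identifiers announced by the delivery system,
              i.e. mails dspam already classified as spam
    """
    added = current - previous - pushed  # candidates for "mark as spam" (FN)
    removed = previous - current         # candidates for "mark as innocent" (FP)
    return added, removed


added, removed = diff_scan({"a", "b", "c"}, {"b", "c", "d", "e"}, {"e"})
print(sorted(added), sorted(removed))  # ['d'] ['a']
```

Note that a plain difference like this cannot tell a mail moved out of the folder from a mail that was simply deleted.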

How to use it:

- First, your dspam must be configured to put the signature in a header, not in the body of emails.
- Download the script to the server, let's say to ~/bin/dspam_auto.py, and don't forget to mark it as executable.
- Edit the beginning of the script to adapt the settings to your configuration:
  - DB_USER, DB_PASS, DB_NAME: access to the MySQL database. You can reuse the DB you're using for dspam.
  - DB_TABLE is the name of the table which will be used to store the script's information. It must be a table that does not exist yet; the default value is usually fine.
  - DSPAM_USER is the name of the dspam user you're using; usually your login name.
  - DSPAM_UID is the uid of the user for the script. It's probably good practice to use the same one as in dspam, but in practice it is independent. You can check the user/uid mapping in the table 'dspam_virtual_uids' of your dspam database.
  - LOG_FILE: where to log all the script runs. It's really useful for debugging or just checking that the script isn't going rogue.
  - The dspam command used to reclassify FN/FP is in the classify function; feel free to adapt it to your installation. E.g., the script currently uses the option --client, which you might not need.
- Check that DRY_RUN is True; this is needed to initialize the script database correctly without polluting the dspam database.
- Initialize the database: ~/bin/dspam_auto.py init
- Add a regular scan in cron (using crontab -e, for example; all on one line):

*/10 * * * * $HOME/bin/dspam_auto.py update $HOME/Maildir/.Spam

This line makes the scan run every 10 minutes, which is probably more than enough (especially as the current version of the script is not really nice to the database :). Note that the first scan will detect all existing spam as FN, so double check that DRY_RUN is True before screwing up your dspam.

- Modify the procmailrc to tell the script when each spam is detected. I have something like this in my .procmailrc:

# Spam filtering:
:0fw
| /usr/bin/dspam --stdout --deliver=spam,innocent --user pierre

# Tell the script for each detected spam
:0 ic
* ^X-DSPAM-Result: spam
| /home/pierre/dspam/dspam_auto.py push

# And deliver spam in the spam folder
:0:
* ^X-DSPAM-Result: spam
.Spam/

Note that the script is slightly racy, as calling the script and delivering the mail are not done atomically. However, as long as you don't run the scan every 10 seconds, it should not matter much, and the script recovers from previous mistakes anyway. The way to avoid the race condition would be for the script to do the delivery itself, but I prefer not to, for reliability reasons: this way, if my script is screwed up, it won't trash mails.

- Configuration is done. As long as you're in dry run mode, you can watch the effect of the script by moving mails in and out of the Spam folder. Typically, moving a spam out then back in (don't forget to wait for the cron scan between operations) will produce this kind of log lines:

INFO 2008-11-09 11:40:09,710 [dryrun] Classify command: /usr/bin/dspam --signature=4916ba3b179033708835974 --class=innocent --source=error --client --user pierre
INFO 2008-11-09 11:50:09,338 [dryrun] Classify command: /usr/bin/dspam --signature=4916ba3b179033708835974 --class=spam --source=error --client --user pierre

- Once you think that you're all set (i.e., you've configured the above and at least one scan has fully completed), you can set DRY_RUN to False and enjoy a simple way to mark FP and FN in IMAP :)
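If you need to adapt the classify step to your installation, the following is a rough sketch of how such a command could be built and how a dry run can be honoured. It is not the code of the attached script: the helper `classify_command` is a hypothetical name, the dspam options are simply those visible in the dry-run log lines above, and the settings are module constants here only for brevity.

```python
import logging
import subprocess

# Assumed settings, mirroring the names used in the configuration steps.
DSPAM_BIN = "/usr/bin/dspam"
DSPAM_USER = "pierre"
DRY_RUN = True


def classify_command(signature, is_spam):
    """Build the dspam command that corrects a misclassification."""
    cls = "spam" if is_spam else "innocent"
    return [
        DSPAM_BIN,
        "--signature=" + signature,
        "--class=" + cls,
        "--source=error",
        "--client",  # drop this option if your installation does not need it
        "--user", DSPAM_USER,
    ]


def classify(signature, is_spam):
    """Log the command in dry-run mode, otherwise run it.

    Returns the exit status of dspam (0 in dry-run mode).
    """
    cmd = classify_command(signature, is_spam)
    if DRY_RUN:
        logging.info("[dryrun] Classify command: %s", " ".join(cmd))
        return 0
    return subprocess.call(cmd)


print(" ".join(classify_command("4916ba3b179033708835974", False)))
```

Passing the command as a list avoids going through a shell, so a strange signature value cannot be interpreted as shell syntax.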