With a recent nightly release, apnscp now includes experimental rspamd spam filtering. rspamd will inevitably replace SpamAssassin: it's not only 3-5x faster than SpamAssassin with short-circuiting enabled, but it also provides excellent milter capabilities that can stop spam before it slips into the mail queue, without needlessly complicating the stack or burning unnecessary CPU cycles. Better yet, rspamd scores across far more dimensions, including fuzzy matches, neural learning, rate-limiting, dynamic reputations, and my favorite: greylisting.

SpamAssassin was the first viable contender among anti-spam firewalls, going back to the early 2000s when it was introduced to me as part of Ensim WEBppliance's solution for a frenetic rise in spam. Its architecture has not changed considerably since; in fact, it still relies heavily on clunky, obsolete regular expressions to discern whether spammers really are using characteristics of spam from 2011 (they aren't). These tests, roughly 2,000 in all, add needless overhead that any spammer worth his salt can work around. Such tests are heuristics-driven, and heuristics are always taught, never learned; this is SpamAssassin's greatest weakness. Its effectiveness is predicated upon catching sloppy, malformed email or looking for previously known expressions, which often results in false positives.

As a pattern can be defeated by simply not following the pattern, SpamAssassin implemented another layer: a naive Bayes algorithm that makes its determination based upon the absolute content of an email.

Naive Bayes in action: calculating the probability of A given B

Making a decision on the absolute content is making a decision without weighing mitigating factors. It is buying a car that appears to have an immaculate engine bay but has electrical tape over the "check engine" light (CEL), and not asking why that tape is there. We see there's no CEL, so it must be good because the engine bay is spotless.

Each factor - likelihood of A given B - is independently evaluated, which leads to a higher rate of false positives (non-spam misclassified as spam) and false negatives (spam misclassified as non-spam). This means that a large volume of marketing emails, for example, that include "discount" or "deal" can be incorrectly classified as spam because previously classified spam also contained those words. Context, therefore, is key.
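To make the independence assumption concrete, here is the textbook naive Bayes formulation for a message made up of words w_1 through w_n. This is a general sketch rather than SpamAssassin's exact scoring code; the symbols S (the message is spam) and w_i (the i-th word) are introduced here purely for illustration.

```latex
P(S \mid w_1, \dots, w_n) \;\propto\; P(S) \prod_{i=1}^{n} P(w_i \mid S)
```

Each P(w_i | S) is estimated from how often w_i appeared in previously learned spam, regardless of which words surround it, so "discount" contributes the same evidence in a phishing email as it does in a legitimate newsletter.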

Conversely, rspamd uses a Hidden Markov model for its Bayes algorithm. Markov models are adaptive systems whose output depends upon their immediately preceding input, which in turn was the output for a previous problem, and so on. Such systems don't look at the absolute content, but instead at relative content and the strength of those relationships:

A Markov model (via http://setosa.io/ev/markov-chains/)

Google's PageRank algorithm: Given site X links to site Y and site Y links to site Z, how should we trust site Z given the importance of X <-> Y?
Medical treatment of unknown pathology: A patient comes in with stomach cramps. Patient is given medication, but relapses into sickness again. Patient visits, then is administered alternative medication based upon previous treatment response.
A savvy car buyer: Car runs fine. Engine bay is spotless, but tape is over CEL. I have never seen tape over a CEL, something must be awry.

You can begin to see how Markov models work so much better for dynamic systems where probabilities aren't known but instead inferred. rspamd implements a Markov model in determining spam/ham that continues to evolve as spam techniques evolve.
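As a rough sketch of what "relative content" means, a first-order Markov chain conditions each token on the one before it instead of scoring tokens in isolation. The notation is illustrative only (w_i is the i-th token, S is the spam class) and is not taken from rspamd's source; its actual classifier may well differ in detail.

```latex
P(w_1, \dots, w_n \mid S) \;=\; P(w_1 \mid S) \prod_{i=2}^{n} P(w_i \mid w_{i-1}, S)
```

Under this model, "discount" following "your invoice" and "discount" following "cheap pills" are scored as different events.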

Neat!

Configuring rspamd

rspamd can operate in a few flavors depending upon the number of servers you have, how much memory you can set aside, and whether you can trust the data fed into the system.

All commands use cpcmd to interact with apnscp's API. All commands assume you're up-to-date with apnscp via upcp. After running the sequence of commands, run upcp -b to run Bootstrapper.

For the less attentive variety, cpcmd config_set system.integrity-check 1 performs the same operation as upcp -b but sends an email digest to the admin email upon completion.

Single-server scanning with local Redis

This is the default mode that unlocks all capabilities including greylisting, conversational whitelisting, fuzzy matches, user settings and neural learning.

cpcmd config_set apnscp.bootstrapper rspamd_enabled true

Single-server scanning with centralized Redis

rspamd scanning will continue to operate on the current server, but all statistics are sent to a centralized database. This ostensibly confers the advantage of speeding up its learning process.

cpcmd config_set apnscp.bootstrapper rspamd_enabled true
cpcmd config_set apnscp.bootstrapper rspamd_redis_server redisserver:port
cpcmd config_set apnscp.bootstrapper rspamd_redis_password redispass
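Before pointing rspamd at a remote Redis instance, it may be worth confirming that the server is reachable with the credentials you intend to use. The sketch below assumes redis-cli is installed on the mail server; check_redis is a hypothetical helper name, not part of apnscp, and the host, port and password are the same placeholders used in the commands above.

```shell
# check_redis HOST PORT PASSWORD
# Hypothetical helper: succeeds only if the Redis server answers PING with PONG.
# stderr is discarded because redis-cli warns when a password is passed via -a.
check_redis() {
    reply=$(redis-cli -h "$1" -p "$2" -a "$3" ping 2>/dev/null)
    [ "$reply" = "PONG" ]
}
```

For example, check_redis redisserver 6379 redispass && echo "Redis reachable" (6379 is the Redis default port; substitute your own).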

Centralized scanning

A server can be designated to scan mail exclusively. Additional steps should be taken on the host machine to open the firewall ports and restrict traffic to trusted networks.

cpcmd config_set apnscp.bootstrapper rspamd_enabled true
cpcmd config_set apnscp.bootstrapper rspamd_worker_socket somehost:someport
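As one way to restrict who may reach the scanner, the sketch below adds a firewalld rich rule on the scanning host that admits a single client address on the worker port. It assumes the host is managed with firewalld directly, which may not match how apnscp manages its firewall; allow_scan_client is a hypothetical helper name.

```shell
# allow_scan_client SOURCE_ADDRESS PORT
# Hypothetical helper: permit one trusted IPv4 client to reach the rspamd
# worker port, then reload firewalld so the permanent rule takes effect.
allow_scan_client() {
    firewall-cmd --permanent \
        --add-rich-rule="rule family=\"ipv4\" source address=\"$1\" port port=\"$2\" protocol=\"tcp\" accept" &&
    firewall-cmd --reload
}
```

A call might look like allow_scan_client 192.0.2.10 11333, where 192.0.2.10 is a documentation address standing in for your mail server and 11333 is rspamd's default normal worker port; use whatever port you set in rspamd_worker_socket.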

Low memory without Redis

Setting has_low_memory will put apnscp into a miserly mode, stripping many auxiliary features including Redis (backend becomes SQLite), neural learning, conversational whitelisting, and greylisting.

cpcmd config_set apnscp.bootstrapper has_low_memory true
cpcmd config_set apnscp.bootstrapper rspamd_enabled true

Training rspamd

By default rspamd piggybacks on SpamAssassin. Depending upon mail volume, it may take a few hours to a few weeks to develop a healthy model. You can jumpstart this by feeding it your existing mail or by using readily available corpuses... corpii... Corp Por?

rspamc learn_ham and rspamc learn_spam will snarf the mailboxes they're fed, learning all messages as ham (non-spam) or spam respectively.
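A minimal sketch of a training run follows. train_rspamd is a hypothetical wrapper, not something apnscp ships; it passes one directory of known-good mail and one of known spam to rspamc, then prints the classifier statistics so you can see the learned counts move.

```shell
# train_rspamd HAM_DIR SPAM_DIR
# Hypothetical helper: learn every message under HAM_DIR as ham and every
# message under SPAM_DIR as spam, then print classifier statistics.
train_rspamd() {
    rspamc learn_ham "$1" &&
    rspamc learn_spam "$2" &&
    rspamc stat
}
```

For a Maildir layout this could be invoked as train_rspamd ~/Mail/cur ~/Mail/.Spam/cur; both paths are placeholders and depend on where your mail actually lives.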

Corpus list

Enron corpus (ham)
Enron spam corpus (spam)

Mailbox method

apnscp supports automatic learning by dragging email into and out of your "Spam" IMAP folder. Mail dragged out is automatically learned as ham. Mail dragged in is learned as spam. By default the Trash folder is not used to designate spam as some users have a tendency to delete read messages; this would greatly pollute its learned data.

You can enable learning mail sent to Trash as spam with the following:

cpcmd config_set apnscp.bootstrapper dovecot_learn_spam_folder '{{ dovecot_imap_root }}Trash'

"{{ ... }}" is used for variable expansion in Bootstrapper and must be included. By default the IMAP prefix is "INBOX.".

Switching to rspamd

rspamd is still quite experimental. While I'm actively using it across my servers, it may not be suitable for more rigorous workloads. Switching exclusively to rspamd instead of SpamAssassin also reduces the Postfix per-delivery delay from 3 seconds to 0 seconds. Disclaimer in mind, you can make the switch by setting spamfilter:

cpcmd config_set apnscp.bootstrapper spamfilter rspamd
# And of course run the Bootstrapper as always
upcp -b

Other tuneables

rspamd offers a host of configuration variables. Just use cpcmd config_set apnscp.bootstrapper to override the value. These are stored in /root/apnscp-vars-runtime.yml and always take precedence over anything else.
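To review which rspamd values you have overridden so far, you can read them back from the runtime file. The sketch below assumes overrides appear as top-level rspamd_* keys (as with rspamd_enabled earlier in this post); show_overrides is a hypothetical helper name.

```shell
# show_overrides [FILE]
# Hypothetical helper: print rspamd-related overrides from the runtime vars
# file, defaulting to the path Bootstrapper writes to.
show_overrides() {
    grep '^rspamd_' "${1:-/root/apnscp-vars-runtime.yml}"
}
```

Running show_overrides with no argument reads /root/apnscp-vars-runtime.yml; pass another path to inspect a copy.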

Future

rspamd is inevitably the future for apnscp. Not only does it interrogate across more criteria, but it also reduces false positives, which are a bane of any sysadmin, business, individual, marketing organization, CMO... you get the idea.

I'm running rspamd in piggyback mode across a cluster of servers to evaluate efficacy. Expect a followup post in a month once sufficient information is collected. Until then, to give an idea of the potential of rspamd:

y-axis: # messages, x-axis: score | Clockwise from top-left: rspamd after Enron corpus training compared against SpamAssassin's prior scoring; rspamd targeting without training; and the same data set after Enron corpus training.

To wit, SpamAssassin has a 4-year head start on its scoring algorithm.