
DSPAM Frequently Asked Questions
FAQ
General Information
Compiling DSPAM
Using DSPAM
Q. What is DSPAM and why should I use it?
A. DSPAM is an intelligent, adaptive spam filter capable of
learning what spam is and isn't based on each user's individual email behavior.
It is designed for both system-wide filtering and third party integration. You
should use DSPAM if you are looking for a scalable, fast, and accurate
spam filter that is capable of adaptive learning. Although it is a spam filter
by design, DSPAM has shown great proficiency in classifying any kind of
document into one of two concepts.
Q. Why should I use an open-source solution?
A. If you're a large business or ISP, you may be asking yourself
why you would want to use an open-source filtering solution over
a commercial solution. Please have a look at this article
on why open-source statistical filters give you a competitive advantage over
not-so-statistical commercial filters.
Q. What MTAs will DSPAM work with?
A. DSPAM can integrate with any MTA (Mail Transfer Agent) capable of
calling dspam either as a local delivery agent or via LMTP. It can also be
integrated via procmail scripts and .forwards. For all other
MTAs, DSPAM can be integrated in front of the MTA as an "appliance" or
by using a POP3/IMAP proxy. I've heard many success stories so far with:
Sendmail, Postfix, Exim, Courier, Communigate Pro, and QMail.
Q. How long does it take to start filtering spam?
A. If you set DSPAM up with global/merged group support, your users
can experience instant filtering out of the box. From a completely empty
corpus, however, we found DSPAM to start filtering its very first spam with
10-20 SPAMs [reported into the system]. DSPAM generally climbs to around 95-98%
accuracy within the first few days to a week or so, depending on how much mail
you receive. It catches a majority of spam with only around 100-200 spam
reported into the system, learning from the other few thousand it should
catch by itself in the meantime.
If you require faster precision results, you might want to start your users off with a seeded dictionary (see the dspam_merge tool), or use a global or
merged dictionary for your users to start with. Many advanced features of
DSPAM, such as Noise Reduction and inoculation, do not
kick in until at least 2,500 innocent messages have been learned. You can use
the included dspam_stats tool to get a good idea of how effective your current
dictionary is.
Q. How is DSPAM with false positives?
A. In real-world scenarios, false positives have ranged anywhere from 0% (none) to 0.10% depending on both the implementation and each user's mail behavior.
Users with relatively predictable mail behavior (such as geeks, dweebs, and
freaks) have generally received very few false positives (less than 1 in
10,000 messages). Most false positives are likely to occur during initial
training. A feature referred to as the 'training buffer' can be
enabled to further water down filtering during training to help prevent false
positives (see the README for more information).
Alternatively, pretraining a few hundred nonspam can also help overcome
the risk of false positives during initial training.
Recent versions of DSPAM are equipped with an automatic whitelisting function,
which whitelists senders from whom the user has received a lot of legitimate
mail. This too helps prevent false positives during training.
Everyone's email is different, however. Your mileage may vary.
Q. What is with the serial number at the bottom of my emails?
A. Since some emails may have to be re-learned as spam or possibly a false positive, the original training data is stored temporarily for
relearning. This is usually necessary, as most mail clients rewrite and
even completely mangle the message when you forward it. Storing this information
server-side ensures that all the data is retrained correctly.
Each email processed includes a serial number to identify the signature. This
is frequently referred to as the DSPAM signature and looks like this:
!DSPAM:3eb2c721141672274659!
DSPAM has a user-level option to embed the signature in the headers instead
of the message body; however, this will require the user to forward all
spam as attachments (a feature not all clients have). Users may also opt to
eliminate the DSPAM signature if they're willing to retrain using the
Web UI. Finally, if you're using strictly IMAP or webmail, then you can
eliminate the signature entirely and configure DSPAM to retrain using the
original message.
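As a rough illustration of how a retraining script might locate the signature in a forwarded message, here is a minimal Python sketch. The regular expression assumes the signature is a run of hexadecimal characters between "!DSPAM:" and "!", as in the example above; the real format may vary between versions. find_signatures is a hypothetical helper, not part of DSPAM.

```python
import re

# Assumption: a signature is a run of hex characters wrapped as !DSPAM:<id>!,
# matching the example shown above. Real signatures may differ by version.
SIGNATURE_RE = re.compile(r"!DSPAM:([0-9a-fA-F]+)!")


def find_signatures(message_text):
    """Return every signature ID found in the message text, in order."""
    return SIGNATURE_RE.findall(message_text)


# A forwarded spam whose quoted body still carries the signature.
forwarded = "Fwd: cheap pills\n> buy now\n> !DSPAM:3eb2c721141672274659!\n"
print(find_signatures(forwarded))
```

Because the lookup only needs the ID, it keeps working even when the mail client has rewritten the rest of the message.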
Q. How should I train DSPAM?
A. Just allow email to come in, and forward the messages that are spam. If you have both an innocent and a spam corpus, you can use the dspam_corpus
tool to feed it into the system. It is NOT a good idea to feed DSPAM a bunch of spam without feeding it a bunch of nonspam, as this could potentially
skew the dictionary and lead to false positives immediately (NOT because
DSPAM requires a balanced corpus, but as the result of the scoring of tokens that appear only in one corpus).
Special safeguards have been put into
place to prevent this under normal spammy email load, but force-feeding DSPAM spam is not recommended. The best advice for training a dictionary is to just act on the email you
receive after DSPAM is set up. If you have a large user base, you may
wish to create a global or merged set of data to provide users with out-of-the-box
filtering. See the README for more information about global and merged groups.
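To see why tokens that appear in only one corpus can skew a dictionary, here is a small Python sketch. It uses the token probability formula published in Paul Graham's "A Plan for Spam" (simplified, without his minimum-occurrence threshold) purely as an illustration; it is an assumption that this resembles what any given DSPAM configuration does, and the counts are made up.

```python
def graham_probability(spam_hits, ham_hits, n_spam, n_ham):
    """Token spam probability in the style of Graham's "A Plan for Spam".

    Illustrative only: this is not DSPAM's own scoring code.
    """
    b = min(1.0, spam_hits / n_spam) if n_spam else 0.0
    g = min(1.0, 2 * ham_hits / n_ham) if n_ham else 0.0  # ham counts doubled
    if b + g == 0:
        return 0.4  # Graham's value for a token never seen before
    return max(0.01, min(0.99, b / (g + b)))


# An ordinary word seen in 40 of 1000 spams, with only 10 nonspam trained
# and the word in none of them: it is scored as almost certainly spam.
skewed = graham_probability(40, 0, 1000, 10)

# The same word once 1000 nonspam (300 containing it) have been learned.
balanced = graham_probability(40, 300, 1000, 1000)

print(skewed, balanced)
```

The word itself did not change; only the amount of nonspam the filter had seen did.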
Q. How is DSPAM different from SpamAssassin?
A. While both share the common goal of eradicating spam, the two
solutions bear very different philosophies.
Cocktail Approach vs. Centralized Adaptive Learning
SpamAssassin is designed with the arsenal (a.k.a. cocktail or toolbox) philosophy and
aggregates the results from a myriad of different spam detection tests
with the hope that at least some of the components should detect an
inbound spam. These different tests
range from heuristic "rules" which identify specific characteristics in spam to
blacklists, and finally to limited Bayesian learning. DSPAM's philosophy is based on
the belief that machine-learning (basic artificial intelligence) can, in and
of itself, solve the spam problem without the need for human-maintained
rules, inaccurate blacklists, or any hodge-podge of solutions for that matter.
DSPAM's one central spam detection function
incorporates advanced, concept-based statistical analysis. This has resulted in levels of
accuracy up to ten times that of a human, with very few false positives.
DSPAM breaks down each email into its colloquial components, analyzes the
historical data for each component, and determines the most interesting
characteristics to judge an email by. While DSPAM supports many pre-filters,
post-filters, and additional layers of analysis, its central function lies
solely in adaptive learning and language analysis. This alone has yielded
levels of accuracy peaking at 99.991%.
We feel that the justification for our philosophy is in the credits. While
the SpamAssassin project requires over 100 individuals to maintain,
DSPAM manages to deliver significantly higher levels of accuracy with only
one primary maintainer and a small pool of patch contributors.
Maintenance Burden
DSPAM's philosophy includes removing unnecessary human maintenance by means of
its learning abilities. Users simply need to forward spam they receive into
the system and DSPAM will automatically learn. There are no rules to update,
no thresholds to set, and very little systems administration after DSPAM's
initial integration. Through the use of various forms of community groups,
even the burden of training can be significantly reduced. Forwarding spam also
gives your users a sense of participation in your anti-spam efforts, reducing
the number of phone calls, email, and complaints you may receive.
SpamAssassin, on the other hand, pushes maintenance to the responsibility of
a central systems administrator and prevents the end-user from participating
in any capacity that significantly affects their filter training. This leaves
many end-users with the feeling of helplessness against false positives and
poor filtering results should their mail deviate from what is considered
"normal".
A systems administrator is required in order to update rulesets, tweak
performance, and so on.
The idea of removing end-user maintenance may be desirable for some very large
implementations, and so DSPAM can also be configured to support this
by allowing the systems administrator to train a global database of contextual
data. DSPAM, however, doesn't require full-time maintenance.
Behavioral Philosophy
One significant difference between the two tools is their philosophy about
behavioral learning. SpamAssassin's primary detection facility has been
designed to use a static set of rules to service all users of the system.
DSPAM's philosophy is that this presents a significant hindrance to accuracy,
because one user's spam is another user's mail. DSPAM is
adaptive to the behavior of each individual user on the system so that it can
custom-tailor its spam detection to that of each individual. To provide
"out-of-the-box" functionality, DSPAM also supports the use of
merged groups, which are global databases merged (at run-time) with
the user's own training data. This allows new users to receive instant,
high-quality spam filtering without losing the ability to train DSPAM for
their personal email behavior.
Newer versions of SpamAssassin support a form of Bayesian learning, but this
doesn't appear to operate on a per-user basis (at least not without
extensive configuration). The heuristic training guides also appear to
hurt the level of accuracy the Bayesian component could deliver if
SpamAssassin were made a pure statistical filter.
Technical Philosophy
DSPAM's technical philosophy is that of high-accuracy and high-performance in
an enterprise environment. For this reason, C was chosen as the language for
DSPAM. DSPAM has been implemented in systems exceeding 350,000 users and
experiences execution times as low as 0.01s (real time) when tuned properly.
The average system of around 100,000 users experiences around 0.06s-0.20s
processing time.
The philosophy behind SpamAssassin centers on ease of development.
SpamAssassin is written in Perl, which is generally a much easier language in
which to code the regular expressions used by SpamAssassin's heuristic rules engine.
As a result, even novice developers can quickly code new rules for SpamAssassin.
It's unfortunately very slow, however, compared to DSPAM and as a result, even
small implementations have been known to use up all resources on the machine.
Since DSPAM doesn't have any heuristic rules, it doesn't require the use of
regular expressions (which are always touted as fast in Perl). DSPAM's
tokenizing algorithm is, as a result, much faster than SpamAssassin's
analysis engine because it does not use regular expressions.
Because Perl is an interpreted language, and because of the extensive (and
unnecessary, in my opinion) pattern matching it performs, SpamAssassin ends up running much slower and using many more
resources than DSPAM, which uses a language compiled and assembled directly
into machine code.
While we believe the philosophies we've chosen for DSPAM are better suited for
the job (as evidenced by DSPAM's long-term accuracy), there are plenty of
other ideas out there that produce acceptable results.
Q. What do I do with spam I receive?
A. DSPAM is designed to 'learn' based on the spam (and nonspam) you receive, so whenever you receive a spam, you can forward it to a special email address configured by
the administrator and DSPAM will automatically analyze it and learn.
Alternatively, users can train through DSPAM's web UI. This is an excellent way to ensure that DSPAM won't be obsolete a year from now, but will continue
to learn the new tricks of spammers.
Q. What is a quarantine box?
A. Each user has a quarantine box which holds messages DSPAM thought were spam. Rather than simply deleting these messages, DSPAM uses the quarantine box to give the user the
ability to identify the occasional false positive and re-learn them as innocent emails. This is a very important step in the learning process that many
other tools don't provide. It is understood that no spam filter will be 100% accurate, and therefore it is important to be able to learn from its mistakes.
DSPAM's quarantine makes it much easier to manage spam than some other
solutions because it sorts the quarantine based on confidence. Therefore, any
likely false positives are going to rise to the top of the quarantine for
easy review.
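The idea of sorting by confidence can be sketched in a few lines of Python. The QuarantinedMessage type and its 0.0-1.0 confidence field are hypothetical stand-ins for whatever score the quarantine actually stores.

```python
from dataclasses import dataclass


@dataclass
class QuarantinedMessage:
    subject: str
    confidence: float  # hypothetical score: how sure the filter is it's spam


def order_for_review(messages):
    """Least confident first, so likely false positives appear at the top."""
    return sorted(messages, key=lambda m: m.confidence)


quarantine = [
    QuarantinedMessage("CHEAP MEDS!!!", 0.99),
    QuarantinedMessage("Re: lunch on Friday?", 0.55),
    QuarantinedMessage("You have won", 0.97),
]

for message in order_for_review(quarantine):
    print(f"{message.confidence:.2f}  {message.subject}")
```

A user skimming only the first few entries sees the borderline messages first.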
Users who would prefer to tag spam may specify this in their preferences.
Systems administrators looking to integrate DSPAM with fancy IMAP folders
can also find support for this.
Q. Does DSPAM support whitelists?
A. DSPAM doesn't have a whitelist manager; rather, whitelisting is an automatic function of DSPAM's Bayesian filtering mechanism. As you receive more emails from
your colleagues, their from addresses and other identifying information (such as signatures) are automatically learned by DSPAM to create an internal whitelist.
On top of this, DSPAM supports an automatic whitelisting feature which
identifies individuals you converse with the most who have never sent spam.
Q. Does DSPAM filter viruses?
A. DSPAM ignores attachments, javascript, and the like, so it does no virus filtering of its own. However, the recent SoBig.F virus was caught by several
implementations based on the message content, and many similar viruses are
caught quite easily based on their message content alone. As of version 3.6,
DSPAM can integrate seamlessly with Clam Antivirus for virus filtering.
Q. How is DSPAM different than every other statistical filter?
A. A very valid question...since Bayesian is Bayesian and Chi-Square is
Chi-Square, why are there so many darn filters out there, and what makes ours
different? DSPAM has a two-fold development focus:
- The first emphasis is
placed on scalability in large-scale environments. DSPAM's largest-scale
implementation has been reported to be over 350,000 mailboxes. DSPAM has been
clocked at running with an execution time as little as 0.01s in real-world
large-scale environments making it extremely fast and lightweight. DSPAM's
design also allows it to be run on several different machines. A spam filter
is useless if it can't run on a large production environment, which is why
DSPAM has been created with enterprise-class performance in mind.
- The second project emphasis is in R&D and attempting to provide better
data for whatever combination algorithm(s) you choose to use. DSPAM supports
several different analysis engines including Graham-Bayesian, Burton-Bayesian,
Fisher-Robinson's Chi-Square, and Markovian Discrimination. All of these
algorithms work as well as they're going to work, and so we've focused on
improving the world around these algorithms to end up with better data to
provide them with. Some of our more advanced algorithms include chained
tokens, advanced deobfuscation techniques, neural networking, and Bayesian
Noise Reduction, which provides a noise reduction algorithm for advanced
filtering.
These two primary focuses make DSPAM
one of the most scalable AND accurate filters available today.
Q. DSPAM just isn't going to meet my needs,
can you recommend some other filters?
A. Absolutely! The following filters are also very good statistical
filters written by some very bright individuals I've gotten to know a bit:
- CRM114 Discriminator - Bill Yerazunis' brain child. A Markovian/SBPH capable filter that is highly accurate
and fairly fast. Bill invented the concept of message inoculation, and co-authored the Internet-Draft with me.
- POPFile - Dr. John Graham-Cumming's statistical filter. Very effective and widely acclaimed.
- Bogofilter - Originally
written by Eric Raymond, now the work of David Relson, Bogofilter uses
the Fisher-Robinson Chi-Square algorithm exclusively.
- SpamProbe - Written by
Brian Burton, this filter implements Brian's Bayesian approach,
multi-word or "Chained" tokens, and can achieve filtering rates up to 99.9%.
- Death2Spam - A commercial product
designed by Richard Jowsey, this solution focuses on high-end, large
scale implementations.
- CipherTrust IronMail - A commercial
email security appliance developed by the company I work for. We have one of
the best reputation systems on the planet, and can filter massively large
volumes of spam at the border.
Q. Does it work with OSX?
A. Yes, and it runs on my Powerbook too.
See the mailing list archives or the "DSPAM on Mac OS X HOWTO" link on the
project's home page.
Q. Does it work with Windows?
A. v3.2 included a Windows build supplement, which
included the necessary Visual C++ project files and portage to compile
the agent and tools under Windows. Nobody wanted to maintain it, however, so
it is no longer included with the distribution. It's probably best to build
it under Cygwin using the general distribution (which builds fine).
Q. Are you ripping off some of Bill Yerazunis' Research?
(This question comes in reference to my recent additions of Markovian
weighting to DSPAM)
A. No, in fact, Bill has given me his full blessing and assistance in
implementing Markovian discrimination in DSPAM. I
wouldn't have even attempted it without such blessing. Bill
and I have collaborated on other research in the past (such as message
inoculation), and he co-authored the chapter on Markovian discrimination
in my book, "Ending Spam". I, for one, love both knowledge and
computer science, and I also thought it would be good to implement
these topics myself since they're covered in my book. I've learned a lot
about how his algorithms work, and am quite impressed. I still think you
should run CRM114 if you are looking for a Markovian classifier, in part
because it contains plenty of Bill's optimizations and
because it's an awesome tool.
But I'm hoping too that Bill's research can benefit people who have a need
for it with the tools and interfaces provided by DSPAM, if CRM114 isn't a good
fit. I think Bill would agree, which is probably why his filter
provides many of the algorithms used in DSPAM (and other filters).
Bill Yerazunis himself adds: "No, it's fine. Algorithms are algorithms, and if
I didn't want other people to use them, then I wouldn't have published them,
or GPLed them. Jonathan has already 'made his bones' in spam fighting; he can
certainly use the theories, the algorithms, and the code. And he's even
putting author credit on it. What more could you want?"
Q. My compiler complains about [some library]
A. Some libraries may not be installed in a standard location where
the compiler looks (by default). So you will need to do one of the following:
- Add the paths to your LD_LIBRARY_PATH
- Add the paths to ld.so.conf (in Linux)
- Copy them to /usr/lib or /lib
- Use the --with-[driver]-includes and --with-[driver]-libraries configure options
Q. What do I do if my platform isn't supported?
A. If your platform supports the POSIX interface, you should be able to compile DSPAM with no or little tweaking. If you are interested in contracting
a port, please email me.
Q. configure complains about my libdb version
A. There are several different versions of libdb on many systems, and
each has a separate db.h header file. Run configure with the option --with-db4-headers=DIR pointing to the correct directory where configure can find your
libdb4 header files. You may also need --with-db4-libraries. If you still
get this error, manually check and make sure that your version of the
includes matches up with your version of the library.
Q. Building on OSX with MySQL, I get "ld: common symbols not allowed with MH_DYLIB output format with the -multi_module option"
A. This is due to a restriction in OSX allowing only one definition of
each symbol within a shared library. The following workaround should fix your
problem (adjust the paths to your MySQL client library accordingly)...
# cd /usr/local/mysql/lib
# mv libmysqlclient.a libmysqlclient.a.original
# mkdir /tmp/mysql
# cd /tmp/mysql
# ar x /usr/local/mysql/lib/libmysqlclient.a.original
# ld -r -d my_error.o
# mv a.out my_error.o
# ld -r -d charset.o
# mv a.out charset.o
# libtool -o /usr/local/mysql/lib/libmysqlclient.a *.o
Q. My dictionaries are getting big (+10MB each)
A. This is a common occurrence while building your dictionaries over
the first 15-60 days. About 70% of the tokens in each dictionary are
unuseful, and will be removed from each user's dictionary at 15, 30, and 60
days depending on just how unuseful they are. If your driver supports it,
be sure you have dspam_purge or the purge.sql scripts configured to run
nightly, and if you can't wait that long here are some
things you can do:
- If using dspam_purge, tweak PURGE values in dspam.conf. If using purge.sql,
tweak them there. The defaults are
very conservative time periods. In the real-world, you can probably set these
much lower. This will remove the unuseful information much quicker.
- The space used is a result of implementing chained tokens. While they
are extremely effective, they do use a fair amount of disk space (especially
in the beginning). You can disable chained tokens, but please read the
white paper first. The cost is obviously
effectiveness; without chained tokens, DSPAM may not be as effective.
If you've turned on SBPH (Sparse Binary Polynomial Hashing), then you've no
business complaining about disk space ;) Turn it off and use Chained Tokens or
single tokens instead.
- If you are using a SQL-Based storage driver, you may also wish to remove
ANY tokens from the database that haven't been hit in 4-6 months. Spam
does change slowly over time periods, and so older data may no longer be
useful.
- If you aren't using a SQL-Based storage driver, consider using one. They
provide much more efficient disk space utilization.
- Probably the most notable change will come from setting your training mode
to TOE (Train-on-Error). Using TOE, messages are only trained when an error
has occurred, which means significantly fewer tokens in the database. This
mode also writes only when a misclassification is being corrected, which means
much less thrashing on your large-scale installs. In some cases, TOE yields
better levels of accuracy. In some, worse. So be warned.
- If you're using the hash driver, you should run cssclean and csscompress
about once a month.
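To get a feel for how much Train-on-Error can cut down on training, here is a deliberately simplified Python model. It only counts how many messages get trained under each mode; it ignores any initial training period and says nothing about how DSPAM stores tokens. The 2% error rate is an assumed figure for illustration.

```python
def count_trained_messages(results, mode):
    """Count messages that trigger training.

    results: list of booleans, True when the filter classified correctly.
    mode: "TEFT" trains on every message; "TOE" trains only on errors.
    """
    if mode == "TEFT":
        return len(results)
    if mode == "TOE":
        return sum(1 for correct in results if not correct)
    raise ValueError(f"unknown training mode: {mode}")


# 1,000 messages, of which 20 (2%) were misclassified.
results = [True] * 980 + [False] * 20

teft_trained = count_trained_messages(results, "TEFT")
toe_trained = count_trained_messages(results, "TOE")
print(teft_trained, toe_trained)
```

Fewer trained messages means fewer new tokens written, which is where the disk space savings come from.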
Q. How long do messages take to process?
A. Depending on what type of storage driver is used and your system's configuration, this may vary greatly.
Processing time hovers typically between 0.01s and 0.07s for most messages
(peaking at about 0.20s). Messages with large attachments
(6-10MB) may take a little longer due to I/O delay (if they are binary
attachments, they are ignored). On some slower systems, or using slower
storage drivers, processing time may take a few seconds. If you need
the absolutely fastest operation, consider the hash driver or MySQL.
Q. How can I set up SpamAssassin-like "out-of-the-box filtering" for my users?
A. DSPAM supports global classification groups and global merged
groups for this purpose.
Global classification groups allow DSPAM to provide a filtering "parent"
relationship with all new users on the system, until they have built their own
useful dictionaries. Merged groups are similar, but also allow the user to
train against the global database; their training data is "merged" with the
global database in real-time, allowing them to customize DSPAM to their own
behavior. In both cases, users who do not wish to ever forward in spam will
have the benefit of being protected by the global dictionary.
See the README for more information about both types of groups.
Q. I have a huge user base. What are some ways to tune DSPAM?
A. First, very cool on the huge userbase thing. Be sure to
shoot me an email and let me
know about it. Large userbases sometimes require a few changes in your
approach to DSPAM. There are a lot of different ways to tune DSPAM to function
well on very large installations, and I'll outline a few here. Feel free to
contact the dspam-users list for more ideas.
- If you're not already using it, try the MySQL or hash drivers.
These are the two fastest and most stable drivers available. Hash is the fastest,
MySQL is pretty fast but also has recovery.
Either of these will help get you going with a scalable backend.
- Consider tuning MySQL. So far, all of the large implementations
I've heard of don't have any problems with DSPAM, but have experienced more
of a bottleneck in MySQL. There are a lot of things you can do to tune MySQL
for this large of an installation. Consider emailing the mysql users lists for
additional ideas. Some of my ideas include:
- Switch to the speed-optimized schema of DSPAM objects (see the
tools.mysql/ directory). The optimizations used in this schema cost a
little more disk space, but greatly speed up reads and writes. If you are
running an older version of DSPAM (prior to 2.10), consider
the speed-optimized version of MySQL tables, which uses fixed-length keys.
- Try running more frequent 'ANALYZE TABLE..' and 'OPTIMIZE TABLE..'
calls. These two functions help MySQL to perform much better over time.
- Consider switching to InnoDB database type (from MyISAM), which uses
row-level locking instead of table-level locking, and may help compensate for
any lock contention issues.
- Version 2.10 supports client compression, which will greatly reduce the
bandwidth used between the MySQL server and the DSPAM agent (if they're on
separate machines). You can pass --enable-client-compression to configure
prior to compiling DSPAM.
- If you haven't already done so, consider running multiple copies of MySQL. Start out with one per machine. As long as the users are local to the machine, there is no reason this approach will not be successful. Once you have distributed your servers, if you still require additional scaling, consider two or three different MySQL databases (or at least schemas) per machine. It is important to make sure that each user is mapped to a specific instance or schema.
- Consider some mysql tweaks. Some have been provided in the README file
in tools.mysql_drv.
- Switch to TOE Mode. DSPAM v2.10 supports TOE (Train-On-Error)
mode, which only performs writes to the database in the event that a
misclassification has occurred (or if a user has fewer than 4000 innocent
messages in corpus). Train-on-error mode should make a significant reduction
in the number of writes (and therefore locks) being performed on your
database, and may actually improve accuracy as TOE has been known to do so.
The default mode of learning is TEFT (Train Everything). This performs a much
more detailed training of incoming messages and can more easily adapt to new
types of email behavior for users, but does use up a significant number of
resources. This is a definite thing to try if you're bottlenecked!
- Shorten Purge Intervals. The standard purge intervals can be
shortened to purge data quicker from the system, leaving less data in the
database. The purge of any stale tokens is particularly important.
- Use DSPAM's Source Address Tracking to Block Spammer IPs at the borders. DSPAM
has the ability to report source addresses of spam via the syslog facilities.
Taking this information and feeding it to a firewall to temporarily block
an address for 24 hours is a great way to reduce the amount of incoming
spam. This will not only conserve server resources, but also conserve
bandwidth. Optionally, you may be interested in using the Statistical Blackhole List, which is an automated and statistically driven blackhole list server/client. Until a public server is available, setting up
a local system would help you greatly to identify spammer IPs.
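If you do split users across several MySQL instances or schemas as suggested above, each user has to land on the same one every time. One possible way to do that, sketched in Python under the assumption that usernames are stable, is to hash the username. The instance names are hypothetical, and DSPAM itself does not ship this helper.

```python
import hashlib


def instance_for_user(username, instances):
    """Map a user to one instance, stably across processes and restarts."""
    digest = hashlib.sha256(username.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(instances)
    return instances[index]


INSTANCES = ["mysql-a", "mysql-b", "mysql-c"]  # hypothetical instance names

for user in ("alice", "bob", "carol"):
    print(user, "->", instance_for_user(user, INSTANCES))
```

Note that changing the number of instances changes the mapping for most users, so with this approach existing training data would have to be migrated when you add a server.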
Q. All my messages are getting delivered to root! or mail! or I'm getting
funny messages about users not matching! *gasp* *panic*
A. This is probably because you skimmed over the 'TRUSTED USERS' section
of the README file. Make sure you have added your MTA user, your MTA shell
user, and your Apache user (for the CGI) to the trusted.users file. Until you
add them, DSPAM will not trust them to set the user using --user, and will force
the user to match their uid.
Q. What happens if another message comes into my quarantine before I hit 'DELETE ALL' ?
A. The Quarantine CGI has a protection to prevent messages from being accidentally deleted which may have come in while you were viewing your quarantine. The filesize of the mailbox is noted when the user goes to view their quarantine, and will fail to 'DELETE ALL' if the mailbox
has since changed in size.
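The size check described above can be sketched as follows. This is a minimal Python illustration of the idea, not the CGI's actual code: view_quarantine and delete_all are hypothetical names, and a real implementation would also need file locking to close the gap between the check and the truncate.

```python
import os
import tempfile


def view_quarantine(mailbox_path):
    """Return the mailbox size noted when the user opens the quarantine."""
    return os.path.getsize(mailbox_path)


def delete_all(mailbox_path, size_when_viewed):
    """Empty the mailbox only if its size is unchanged since it was viewed."""
    if os.path.getsize(mailbox_path) != size_when_viewed:
        return False  # new mail arrived; refuse so nothing unseen is lost
    with open(mailbox_path, "w"):
        pass  # opening for write truncates the file
    return True


# Demonstration with a throwaway file standing in for a quarantine mailbox.
with tempfile.TemporaryDirectory() as tmp:
    mbox = os.path.join(tmp, "user.mbox")
    with open(mbox, "w") as f:
        f.write("From spammer\n\nbuy now\n")
    noted = view_quarantine(mbox)
    with open(mbox, "a") as f:
        f.write("From friend\n\nlunch?\n")  # arrives while the user is viewing
    refused = not delete_all(mbox, noted)
    accepted = delete_all(mbox, view_quarantine(mbox))
    print(refused, accepted)
```

The first delete is refused because the mailbox grew; after the user views it again, the delete goes through.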
Q. What is TWEAK -1?
A. In the CGI, a button labeled "Tweak -1" exists. If you are anal
about keeping accurate web stats as I am, you want to make sure that messages
you forward in that are NOT spam don't get counted against the web stats.
For example, I forward in virus-ridden emails and the occasional completely
blank message - neither of which DSPAM is expected to catch. Clicking "Tweak -1"
for each of these emails I send in corrects the web stats so as not to count
them against DSPAM's accuracy. That's all it is!
Q. I've fed DSPAM thousands of spam, and am only getting marginal accuracy. What's up?
A. Your problem might be that you've fed DSPAM thousands of
spam, but have not fed it enough nonspam for it to learn adequately. It's
typically a bad practice to feed a statistical filter a grossly unbalanced
corpus of mail, and if you're using a version of DSPAM that has a "training
buffer" enabled by default, feeding a ton of spam can also cause it to start
watering down its results until you feed it more ham.
This watering down gets stronger the higher your spam
ratio is, in an attempt to prevent false positives - so the more spam you
feed it, the worse your accuracy will get. There are a few things you
can do to remedy this:
- Turn off the training buffer ("Feature tb=5" in dspam.conf) if it is
turned on, or lower the buffering level.
You'll want to use a value lower than 5, as this is DSPAM's default. A value of 0 will disable this
protection entirely. Find a value that gives you the best spam filtering without
allowing for too many false positives.
- The better solution may be to feed DSPAM enough nonspam to exceed
the training threshold (2500 messages). This will not only disengage the
statistical sedation feature, but will allow other algorithms to kick in,
such as Bayesian Noise Reduction, which only engage after training.
- Try deleting your database and retraining using the dspam_train tool,
instead of dspam_corpus. dspam_corpus isn't really designed for building
highly accurate pretrained databases.
If this doesn't work, or you're showing TI+IC values over 2500 in dspam_stats for your
user, another common problem is incorrect training parameters. When a message
is retrained in DSPAM, be careful not to specify it as a corpusfed
spam, but as an error. Check your commandline arguments, and
make sure you're using --source=error and NOT
--source=corpus. --source=corpus is for messages that have not been processed
by DSPAM. --source=error is for messages that have been processed by DSPAM,
and were erroneously classified.
It's important not to specify corpus training on missed spam, because
DSPAM only learns corpus messages, and doesn't relearn them.
So you'll end up with 1 spam tick mark and 1 innocent tick mark, instead of the
correct result: 1 spam tick mark and 0 innocent tick marks.
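The tick-mark arithmetic above can be reproduced with a toy Python model. It is only a sketch of the counters as described in this answer, with hypothetical function names; it does not reflect how DSPAM stores its totals internally.

```python
def process(totals, verdict):
    """The filter classifies an incoming message and counts it."""
    totals[verdict] += 1


def retrain(totals, correct_class, source):
    """Retrain a message as correct_class with the given source."""
    wrong_class = "innocent" if correct_class == "spam" else "spam"
    if source == "error":
        totals[wrong_class] -= 1  # undo the earlier, mistaken count
    elif source != "corpus":
        raise ValueError(f"unknown source: {source}")
    totals[correct_class] += 1


# A spam slips through as innocent, then the user reports it.
as_error = {"spam": 0, "innocent": 0}
process(as_error, "innocent")
retrain(as_error, "spam", "error")

as_corpus = {"spam": 0, "innocent": 0}
process(as_corpus, "innocent")
retrain(as_corpus, "spam", "corpus")

print(as_error, as_corpus)
```

With the corpus source the mistaken innocent count is never undone, which is exactly the 1 spam and 1 innocent result described above.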
Q. Is libdspam Thread-Safe?
A. 3.2 and higher is thread-safe, however this is largely storage driver
dependent. At the moment, mysql_drv, pgsql_drv, and hash_drv are thread
safe. BDB and SQLite drivers don't permit concurrent reads/writes and so they
will likely never be thread-safe. Oracle may be a good candidate for
multithreading in future versions, but is more complex than the other two SQL-based drivers.
To use DSPAM in a multithreaded environment, you'll need to
create a separate DSPAM context for each thread and use dspam_attach().
See man libdspam for more information.
Alternatively, if you're not concerned about concurrent processing, you should
be able to use libdspam with your multithreaded application and with any old
storage driver by simply using a mutex to control access to libdspam
functions.
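The mutex approach looks roughly like this. libdspam is a C library, so the Python below is only a stand-in for the pattern: FakeLibrary is a hypothetical placeholder for a non-thread-safe library, and no real libdspam calls are shown.

```python
import threading
import time


class FakeLibrary:
    """Hypothetical stand-in for a library that is not thread-safe."""

    def __init__(self):
        self.active = 0      # callers currently inside classify()
        self.max_active = 0  # highest concurrency ever observed

    def classify(self, message):
        self.active += 1
        self.max_active = max(self.max_active, self.active)
        time.sleep(0.001)  # pretend to do some work
        self.active -= 1
        return "spam" if "pills" in message else "innocent"


class SerializedClassifier:
    """Wrap the library so that only one thread may call it at a time."""

    def __init__(self, library):
        self._library = library
        self._lock = threading.Lock()

    def classify(self, message):
        with self._lock:  # the mutex: held for the whole library call
            return self._library.classify(message)


library = FakeLibrary()
shared = SerializedClassifier(library)
threads = [
    threading.Thread(target=shared.classify, args=(f"message {i}",))
    for i in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(library.max_active)
```

The trade-off is the one stated above: messages are processed one at a time, so this gives safety rather than concurrency.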
Q. If I develop problems with accuracy, how can I decrease false positives?
A.
False positives can creep up for a number of reasons. If you receive or have
trained on a very large amount of spam, you may experience false positives
while training dspam. If you neglect your quarantine and fail to retrain a
few false positives, this can snowball into more. Here are some things you
can do to fix situations where accuracy degrades:
- If you don't check your quarantine very often, you might consider using
TOE-mode training. Every other training mode automatically learns each
message on the assumption that, if you didn't correct a classification, DSPAM
was right in the first place. TOE makes no such assumption, which gives you
the ability to neglect your filter without degrading its efficiency too much.
- If you've changed your training mode periodically, you may consider
retraining from an empty corpus. The different training modes function
optimally when you've been using them exclusively.
- If you receive a very large amount of spam, you may consider using
dspam_train to train a small corpus of nonspam. This may initially decrease
your filter's spam-catching efficiency, but in the end should help prevent
false positives.
- If you've experimented with whacky configurations such as Markovian
Discrimination, playing with algorithms or features, you may wish to restore
your configuration back to its original and try retraining. The default
settings in dspam.conf are typically optimal for a good balance between
efficiency and accuracy.
- With that said, disabling the 'graham' algorithm and using the 'burton'
algorithm exclusively has been shown to decrease false positives; however, this
also increases the number of missed spam. You can change the line
"Algorithm graham burton" to "Algorithm burton" in dspam.conf and see if
that helps.
- If you receive very little legitimate mail, you may experience some false
positives based mostly on the time of day or day of week mail has arrived.
You can add "IgnoreHeader Date" to dspam.conf to ignore the date field. This
has been shown in many tests to decrease false positives, but also increases the
number of missed spam.
- If you are using blacklists, other spam filters, or any other form of
training guides, you may be doing more harm than good. If the tool is less
accurate than DSPAM is, then you will be training DSPAM to be just as inaccurate.
- *LOOK* at the X-DSPAM-Factor header and see why it is judging your messages incorrectly. If it looks like there are a lot of tokens that should otherwise be considered nonspam, your retraining may not be working or you may have missed many false positives without realizing it.