
February 28th, 2007

Learn How to Protect Your Family from the Worst of the Web!


-
If you don’t believe Google’s duplicate content filter exists, I have dramatic proof that it does, and that it is very effective.

On July 5, 2005 I published an article entitled “7 Top Ways to Avoid Link Theft” which was picked up and included as content on other websites.

Before the article was released I checked on Google whether any results already existed for the exact phrase “7 Top Ways to Avoid Link Theft” and there were no listings for that term.

Over the next few weeks I ran the same search query on Google to monitor how many results appeared for the title of my article. One week after publication there were 6,760 results listed in Google, a week later there were 14,100, and the count reached a peak of 17,000 results by July 26, 2005.

4 weeks after publication the results in Google had fallen slightly to 16,600.

Almost 6 weeks after publication the results listed in Google had fallen to 44.

In a matter of less than two weeks the number of search results on Google.com for the title of my article had gone from 16,600 to just 44.

In case you’re thinking this is because all these other websites dropped my article and replaced it with other content, I should add that a search on Yahoo.com on the same day still showed 14,300 results for my article.

What’s more, of these 44 results on Google, more than half are listings from the same websites. In other words, some sites have the same article duplicated on different pages of their website.

So Google’s Internet Content Filter does not appear to remove duplicate listings within the preferred websites it chooses to keep in the search results.
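One way to check how many listings come from the same website is to group the result URLs by host name. The following is only a minimal sketch: the URLs are made up for illustration, and the function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

def listings_per_site(result_urls):
    """Count how many search results come from each host name."""
    return Counter(urlsplit(url).hostname for url in result_urls)

def repeated_sites(result_urls):
    """Return only the hosts that appear more than once in the results."""
    counts = listings_per_site(result_urls)
    return {host: n for host, n in counts.items() if n > 1}

# Hypothetical result URLs, for illustration only.
sample_results = [
    "http://articles.example.com/seo/link-theft.html",
    "http://articles.example.com/archive/2005/07/link-theft.html",
    "http://directory.example.net/7-top-ways.html",
    "http://articles.example.com/print/link-theft.html",
]

print(repeated_sites(sample_results))
```

In this made-up sample, three of the four listings belong to a single host, which is the pattern described above.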

On August 28th, 2005, eight weeks after first publication, I distributed the article again to a new list of article sites to repeat the process. After 6 weeks the same article had reached a peak of 5,620 results on Google. Less than 2 weeks later the results had fallen to 217.

For me this was dramatic proof that Google’s Duplicate Internet Content Filter is active and very effective. If you’re wondering whether other major search engines have a duplicate content filter, I can confirm that Yahoo certainly does. The same article, which was once listed on 14,300 sites on Yahoo, has fallen to 344 over the same time period.

From these results it would seem Google takes about 6 to 8 weeks to remove duplicate content using its Duplicate Internet Content Filter.
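To put the size of these drops in perspective, the figures quoted above can be turned into percentages. This is just a quick arithmetic check; the function name is an assumption, and the numbers are the ones reported in this article.

```python
def percent_drop(before, after):
    """Percentage decline from `before` to `after`."""
    return (before - after) / before * 100

# (peak or last high count, count after the drop), as reported in the article.
observations = {
    "Google, first distribution": (16_600, 44),
    "Google, second distribution": (5_620, 217),
    "Yahoo": (14_300, 344),
}

for label, (before, after) in observations.items():
    print(f"{label}: {before:,} -> {after:,} "
          f"({percent_drop(before, after):.1f}% drop)")
```

That works out to a drop of roughly 99.7% for the first Google run, 96.1% for the second, and 97.6% on Yahoo.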

But the remaining question is: how does Google decide which of the more than 16,000 results to keep and which to reject?

I have witnessed situations where my own articles appear in results on other websites, but are not listed in the results for my own website.

So clearly Google does not take into account who the originator and author of the original article was when deciding which sites will remain in its search results.

It also seems to have nothing to do with where Google first finds the article.

I have published some articles on my website for several weeks before releasing them for distribution to other websites.

In that time the Google spiders have visited my site several times and Google has had enough time to work out that the article was first found on my site.

It would be interesting to see if it’s possible to work out what factors Google is using in its Internet Content Filter to decide which results to keep in its listing and which ones to remove. But that’s for another article.

About the Author

Tony Simpson is a Web Designer and Search Engine Optimizer who brings a touch of reality to building a Web Business. A related report on article distribution, “Article Announcer Review - Testing Product Claims”, is at: http://www.webpageaddons.com/stp/announcerclaim

February 27th, 2007


-
A common problem with traditional filters is that they are
a one-size-fits-all solution to SPAM. The rules are fixed
and change only when the anti-spam service sends out
updates.

SPAM changes too quickly to make that method effective.
Additionally, what is SPAM to you may not be to someone else.
That is where Bayesian filters come in.

They are very effective at eliminating SPAM and have
very low false-positive rates for their users.

Bayesian filters are based on Bayesian logic, an approach
to probability named for Thomas Bayes, an eighteenth-century
mathematician.

This type of logic applies to decision making by
determining the probability of a certain event based on the
history of past events.
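Written out for a single word in an email, Bayes' rule looks like
this. The notation is only illustrative and is not taken from any
particular filter: "w" stands for a word and "ham" for legitimate mail.

```latex
P(\text{spam} \mid w) =
  \frac{P(w \mid \text{spam})\, P(\text{spam})}
       {P(w \mid \text{spam})\, P(\text{spam}) + P(w \mid \text{ham})\, P(\text{ham})}
```

The quantities on the right-hand side are estimated from past mail:
how often the word has appeared in SPAM, how often it has appeared in
good mail, and how much of the mail has been SPAM overall.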

Using this as a model seemed a logical step for SPAM
filtering. If you can predict what SPAM will look like now
based on what it has looked like in the past, you are halfway to
the solution.

To finish solving the problem, Bayesian filters were
developed to be dynamic and continue to be effective as the SPAM
changes.

Bayesian filters are content based. They look for
characteristics in each email that you receive and calculate the
probability of it actually being SPAM.

These characteristics are generally words in the content
and the header information that each email contains. They
can also include common SPAM HTML code, word pairs, phrases, and
the location of a phrase in the body of the email.

Typical words in SPAM would be “Free” and “Win”, while
“humility” would probably not appear. The filter begins with a
50% neutral score for the email, and then adds points for SPAM
characteristics.

Likewise, deductions are made for non-SPAM characteristics
present. The total score is calculated and then action is taken
based on its likelihood of being SPAM.

The filter does not assume that all arriving email is
bad, rather that all email is neutral and should be considered
equally.
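The scoring described above can be sketched in a few lines. This is
only a minimal illustration, assuming each word already has an
estimated probability of appearing in SPAM; the probabilities below
are made up, unknown words are treated as neutral (an assumption), and
the words are combined by adding log-odds, which is one common way of
doing it rather than the method of any specific product.

```python
import math

# Hypothetical per-word spam probabilities, for illustration only.
TOKEN_SPAM_PROB = {"free": 0.90, "win": 0.85, "humility": 0.10}

def score(tokens):
    """Return the probability that a message is SPAM.

    The score starts at the neutral 50% (log-odds of zero). Words with
    a probability above 0.5 push it up, words below 0.5 pull it down,
    and unknown words (treated as 0.5) leave it unchanged.
    """
    log_odds = 0.0
    for token in tokens:
        p = TOKEN_SPAM_PROB.get(token.lower(), 0.5)
        log_odds += math.log(p / (1 - p))
    return 1 / (1 + math.exp(-log_odds))

print(score([]))                    # no evidence: stays neutral
print(score(["Free", "win"]))       # SPAM-like words raise the score
print(score(["free", "humility"]))  # the two words cancel out
```

An empty message stays at exactly 50%, and a SPAM word paired with an
equally strong non-SPAM word lands back at neutral.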

Bayesian filters are better than traditional content
scoring filters in that they are trained by you to recognize
your email.

A doctor, for example, might have many emails
legitimately using the word “Viagra”. A traditional content
scoring filter would probably shoot that email to the SPAM
folder, or delete it.

This would result in a high false-positive rate for the
doctor, even though most other users don’t want Viagra emails. A
Bayesian filter, by contrast, will build its list based on the
doctor’s own email use and corrections to incorrectly marked
email.

The initial training period may be a little time consuming,
but once complete offers a tailored solution to SPAM
control for each user.
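As a rough sketch of what this training might look like, the filter
below keeps a count of how often each word has appeared in mail the
user marked as SPAM or as good. The class and method names are
hypothetical, and the smoothing (adding one to each count) and the
equal weighting of SPAM and good mail are simplifying assumptions.

```python
from collections import Counter

class TrainableFilter:
    """Keeps per-user word counts for SPAM and for good mail."""

    def __init__(self):
        self.spam_tokens = Counter()
        self.good_tokens = Counter()
        self.spam_messages = 0
        self.good_messages = 0

    def train(self, tokens, is_spam):
        """Record one message that the user has classified or corrected."""
        counts = self.spam_tokens if is_spam else self.good_tokens
        counts.update(set(tokens))  # count each word once per message
        if is_spam:
            self.spam_messages += 1
        else:
            self.good_messages += 1

    def token_spam_probability(self, token):
        """Estimate how SPAM-like a word is for this user (0.5 is neutral)."""
        spam_rate = (self.spam_tokens[token] + 1) / (self.spam_messages + 2)
        good_rate = (self.good_tokens[token] + 1) / (self.good_messages + 2)
        return spam_rate / (spam_rate + good_rate)

# The doctor from the example above: "viagra" appears in legitimate mail.
doctor = TrainableFilter()
doctor.train(["viagra", "dosage", "patient"], is_spam=False)
doctor.train(["viagra", "prescription", "patient"], is_spam=False)
doctor.train(["viagra", "free", "win"], is_spam=True)

print(doctor.token_spam_probability("viagra"))
print(doctor.token_spam_probability("free"))
```

For this user "viagra" ends up slightly below neutral while "free"
scores well above it, which is the tailoring effect described above.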

In addition to protecting the good email, this makes the
filter difficult for Spammers to trick, as every user’s filter
will have individual requirements.

That being said, Spammers do have a few weapons in their
arsenal to attempt to circumvent Bayesian filters. The easiest
would be to create SPAM that looks like an everyday letter.

This would remove their ability to use typical marketing
techniques and so is not as likely with normal commercial email.
For the purveyors of fraud, however, this would be easier.

Spammers could also weight a message so heavily with
common good words, or distort the bad ones so much, that it is
scored as neutral or lower and gets through.

Once correctly marked as SPAM by you, though, the filter
will adjust and not be fooled again. This automation and
ability of the software to grow as you and SPAM change over time
is key to the significance of these types of filters.

Widespread use of good Bayesian filters would not only
eliminate SPAM on your end, but would also reduce the practice
of Spamming altogether. If Spammers cannot get the mail through,
they are just wasting their time.

About the Author

Debbie Hamstead is the webmaster of http://www.StompingOutSPAM.com,
which offers a comprehensive Quick Start Guide to keeping SPAM out
of your inbox. She also manages http://www.nichesites4profit.com

February 26th, 2007


-
So What Makes a Good Spam Filter Anyway?
By Alan Hearnshaw

Spam Filters. Most of us know we need one. Some of us know we need a better one, but how many stop to think what actually makes a good spam filter in the first place?

This is not just a rhetorical question. It is a question that many users - and many developers - do not ask, and consequently it goes unanswered.

Maybe this could be better answered by defining here the qualities of the perfect spam filter. We’ll call our perfect spam filter the “SpamSplatter 3000”. Here are some of the defining qualities of SpamSplatter 3000:

1. It requires zero interaction from the user.
2. It produces zero false positives (good messages identified as bad) and zero false negatives (bad messages identified as good).
3. It is transparent - that is, you only ever see good messages and never need even be aware that spam exists.

That’s it. Not much of a shopping list, is it?
Of course, SpamSplatter 3000 hasn’t been invented yet (and if it is, I want a piece of the action), but it does give us a frame of reference when looking for the best filter we can find.

Let’s take each point in turn:

It requires zero interaction from the user
There are two kinds of filters that come near to this ideal currently: Bayesian Filters and Community Filters.
Bayesian filters strip messages down to small “word bites”, or tokens, and maintain a database containing lists of good and bad tokens. When a new message is encountered, the filter strips this message down to tokens, compares it to the database, and applies a formula based on the British mathematician Thomas Bayes’s formula for probability calculation.
Over time, the Bayesian filter learns the characteristics of spam messages.
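
As a rough idea of what “stripping a message down to tokens” and comparing it to the database might involve, here is a small sketch. The splitting rule, the function names and the sample word lists are all assumptions made for illustration; real filters differ in how they tokenise.

```python
import re

# Assumed splitting rule: runs of letters, digits, "$", "'" and "-".
TOKEN_PATTERN = re.compile(r"[a-z0-9$'-]+")

def tokenize(message):
    """Lower-case a message and split it into word-like tokens."""
    return TOKEN_PATTERN.findall(message.lower())

def lookup(tokens, good_tokens, bad_tokens):
    """Sort a message's tokens into known-good, known-bad and unseen."""
    seen = set(tokens)
    return {
        "good": sorted(seen & good_tokens),
        "bad": sorted(seen & bad_tokens),
        "unseen": sorted(seen - good_tokens - bad_tokens),
    }

# Hypothetical token database, for illustration only.
good = {"holiday", "today"}
bad = {"win", "free"}

tokens = tokenize("Win a FREE holiday today!")
print(tokens)
print(lookup(tokens, good, bad))
```

The good and bad lists found this way are what the probability formula is then applied to.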

Community Filters simply work on a voting system whereby every user that receives a spam message votes it as spam. This information is stored on a central server and when enough votes are received the message is banned from all users in the community.
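
The voting idea can be sketched as follows. This is a simplified illustration, not how any particular community filter works: the class name, the threshold of three votes and the use of a hash of the message text as its identifier are all assumptions.

```python
import hashlib
from collections import defaultdict

class CommunityFilter:
    """Central vote store: a message is banned once enough users report it."""

    def __init__(self, votes_needed=3):
        self.votes_needed = votes_needed
        self.votes = defaultdict(set)  # message fingerprint -> voter ids

    @staticmethod
    def fingerprint(message):
        """Identify a message by a hash of its whitespace-normalised text."""
        normalised = " ".join(message.lower().split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def report_spam(self, user_id, message):
        """Record one user's vote; repeat votes by the same user count once."""
        self.votes[self.fingerprint(message)].add(user_id)

    def is_banned(self, message):
        """True once the number of distinct voters reaches the threshold."""
        voters = self.votes.get(self.fingerprint(message), ())
        return len(voters) >= self.votes_needed

community = CommunityFilter(votes_needed=3)
spam = "WIN a free  holiday now"
community.report_spam("alice", spam)
community.report_spam("alice", spam)   # duplicate vote, ignored
community.report_spam("bob", spam)
print(community.is_banned(spam))       # two distinct voters so far
community.report_spam("carol", spam)
print(community.is_banned(spam))       # third distinct voter
```

Note that the first few recipients still have to see the message in order to vote on it.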

As can be seen, the user interaction with these types of filters is mainly limited to a two-button operation - correcting wrongly identified messages - and the more accurate the filter, the less those buttons are used.

OK, so that’s pretty good. Not exactly zero interaction, but if the filter is accurate enough, then it should be pretty near. That brings us to point two:

It produces zero false positives or negatives
This is the area in which most spam filter development is concentrating and things are getting pretty good nowadays. It is not at all unusual to see an efficient modern filter achieve accuracy of 96% or better. It is, of course, far better to have a false negative than a false positive if you are ever going to tear yourself away from the killed mail folder!
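
To make these terms concrete, here is a small sketch of how accuracy and the two error rates could be worked out from a filter's results. The function name and the sample counts are invented for illustration; they are not measurements of any real filter.

```python
def filter_metrics(true_pos, false_pos, true_neg, false_neg):
    """Summarise a filter's results.

    true_pos  - spam correctly identified as spam
    false_pos - good messages wrongly identified as spam
    true_neg  - good messages correctly let through
    false_neg - spam wrongly let through
    """
    total = true_pos + false_pos + true_neg + false_neg
    return {
        "accuracy": (true_pos + true_neg) / total,
        "false_positive_rate": false_pos / (false_pos + true_neg),
        "false_negative_rate": false_neg / (false_neg + true_pos),
    }

# Hypothetical mailbox: 600 spam messages and 400 good ones.
metrics = filter_metrics(true_pos=570, false_pos=2, true_neg=398, false_neg=30)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

A single accuracy figure hides the split between the two kinds of error, which is why it helps to look at them separately.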

Of course, by definition, community filters cannot reach 100% accuracy as someone has to be getting the spam to be voting it as such!
Theoretically, a Bayesian filter may be able to eventually get quite close to 100% accuracy, so at least there is hope there.
Content based filters (those that look for certain words, phrases or other indicators in a message to identify it as spam), will almost certainly not get much higher accuracy figures than the best of them can achieve today. Adapting to changing spam requires new filters to be created on an ongoing basis.

And finally, we come to the holy grail of spam filtering:

It is transparent
Strangely enough, not enough work seems to be done in trying to achieve this goal. Some of the best filters on the market today identify spam with impressive accuracy and then simply place the messages in a killed mail folder for your later perusal.
Now, forgive me if I’m missing something here, but isn’t the point to save you having to wade through the junk mail? Isn’t that what you bought the filter for? With the “SpamSplatter 3000”, you don’t need to do that.

As we haven’t achieved 100% accuracy yet (and probably never will), the only way to free us from checking the killed mail folder is a challenge/response system. This is where a message is automatically sent back to the sender requiring them to take some action for their message to actually be delivered.

Some systems tend to go overboard with the challenge/response system. These systems - often called “Whitelist” systems - block messages from anyone that isn’t in the user’s friends list. Guaranteed 100% effective, but too drastic a measure for most users.

Now, it seems that the most intelligent use of this system would be to send challenges only for messages that were flagged as “questionable”. Good messages can be delivered, definite spam can be deleted and questionable ones would earn themselves a challenge message.
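
That three-way split could be sketched like this, assuming the filter gives each message a spam score between 0 and 1. The function name and the two cut-off values are hypothetical and would need tuning in practice.

```python
def triage(spam_score, deliver_below=0.2, delete_above=0.9):
    """Decide what to do with a message based on its spam score (0 to 1).

    The thresholds are illustrative assumptions, not recommended values.
    """
    if spam_score < deliver_below:
        return "deliver"      # good message
    if spam_score > delete_above:
        return "delete"       # definite spam
    return "challenge"        # questionable: ask the sender to confirm

for score in (0.05, 0.55, 0.97):
    print(score, triage(score))
```

Only the messages in the middle band ever trigger a challenge, so most senders would never see one.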

So, to sum up, let’s rewrite the qualities of our perfect filter and get a shopping list of what to look for while we wait for the SpamSplatter 3000 to arrive:

1. Simple, minimal setup and maintenance.
2. Extremely low rate of false positives and as few false negatives as possible.
3. A transparent fail-safe mechanism whereby the victims of those false positives can force the message through to you.

It’s simple really. Now, who’s going to build me this “SpamSplatter 3000”?

About the Author

Alan Hearnshaw is a computer programmer and the owner of http://www.WhichSpamFilter.com, a site which provides weekly in-depth spam filter reviews, user help and guidance and a community forum.
alan@whichspamfilter.com
