Growth of Gmail Spam Filters | Email Deliverability Perspective
When it comes to email, Gmail is on top, whether we look at Email deliverability, mailbox providers or spam filters. Since its inception, Gmail has been trying to keep spam away from its users.
The story of how Gmail's spam filters evolved, and how Gmail mastered spam filtering over the years, is a really interesting one.
I was recently watching the Harry Potter movie series. Of all the professors at Hogwarts teaching magic, I was always interested to know who would be the next “Defence against the Dark Arts” teacher. It was a much-maligned professor position. Everyone has a favourite (mine was Remus Lupin because he actually had the skills to teach!). This professor should obviously have the upper hand in skills against the Death Eaters, so as to stop them from causing destruction and killing people. This got me thinking about mailbox providers and how they act as the defense against the spammers, phishers and other cyber-criminals trying to stir up trouble by penetrating their defenses.
According to a survey done by Statista, active Email accounts are expected to hit 5.6 billion this year, and Talos Intelligence says spam accounted for 85% of Email traffic as of December 2019.
Gmail is one of the largest mailbox providers with over 1 billion active accounts. The amount of spam they get daily is monstrous. The spam filters are extremely efficient and keep the users’ mailboxes clean and usable.
In this post, we see why we have so much spam floating around and how Gmail is leading the war with spammers efficiently. In the end, we also try to touch upon best practices to maintain Gmail Deliverability.
Gmail currently claims to keep spam away from inboxes with 99.9% accuracy.
That’s “ridikulously” accurate!
And even within the world of content-based filtering, I think it will be a good thing if there are many different kinds of software being used simultaneously. The more different filters there are, the harder it will be for spammers to tune spams to get through them.
~ Paul Graham ( A Plan for Spam, Aug 2002)
The Story Of Spam Filters – From the Beginning of Time
The first Email spam message was sent by Gary Thuerk in 1978 to 600 people via ARPANET; he was reprimanded for it and asked not to do it again.
In 1994, the first large-scale spam was distributed across USENET: “Global Alert for All: Jesus is Coming Soon” was cross-posted to every newsgroup.
Spam rose from as little as 6% of global Email traffic in 1998 to 40% in 2002, and it was increasingly becoming a problem. According to Talos, with the explosion of email marketing, this number stood at 84% for 2019. Thus, the spam epidemic needed to be taken seriously by the major mailbox providers.
In 2002, Paul Graham wrote a seminal essay, “A Plan for Spam”. In it, he advocated Naive Bayes filtering, a statistical method that proved effective for Email spam filtering. This led major ISPs to adopt the method for their anti-spam filters.
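To make the idea concrete, here is a minimal sketch of a Naive Bayes spam scorer in Python. It is not Graham's exact algorithm (his essay uses its own token weighting), and it is certainly not what any mailbox provider runs in production. It is the textbook multinomial version with add-one smoothing, and the tiny training set and all function names are made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def train(messages):
    """messages: list of (text, label) pairs, where label is 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in messages:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, doc_counts


def spam_probability(text, word_counts, doc_counts):
    """Return P(spam | text) under the naive independence assumption."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total_docs = sum(doc_counts.values())
    log_score = {}
    for label in ("spam", "ham"):
        # Start from the log prior of the class.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Add-one (Laplace) smoothing so unseen words do not zero the product.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        log_score[label] = score
    # Turn the two log scores into a probability of spam.
    return 1.0 / (1.0 + math.exp(log_score["ham"] - log_score["spam"]))


# Hypothetical toy training data, for illustration only.
TRAINING = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]

word_counts, doc_counts = train(TRAINING)
```

With this toy data, `spam_probability("free money", word_counts, doc_counts)` comes out above 0.5 and `spam_probability("monday meeting", word_counts, doc_counts)` comes out below it. A real filter would need far more training data and a smarter tokenizer.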
During this time mailbox providers were keeping out spam with IP/Domain based filters, keyword-based filters, etc. When this seminal work on Bayesian filtering came out, it changed the perception of Email filtering for a lot of mailbox providers. Over time, purely statistical methods were in turn left behind, to be replaced by Artificial Intelligence and Machine Learning.
From the mid-2000s onward, the anti-abuse and anti-spam community took strict measures to set global standards for Email authentication. This gave birth to SPF, DKIM, and later DMARC as authentication mechanisms for email sent from one server to another. These standards came in to add more authenticity to legitimate email.
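SPF, DKIM and DMARC policies are all published as DNS TXT records made up of tags. As a small illustration, the sketch below splits a DMARC record into its tags. The record is a made-up example for example.com, `parse_dmarc` is a hypothetical helper name, and the code does no DNS lookup and no validation beyond checking the version tag.

```python
def parse_dmarc(record):
    """Split a DMARC TXT record such as 'v=DMARC1; p=reject' into a dict of tags."""
    tags = {}
    for part in record.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing semicolon
        key, _, value = part.partition("=")
        tags[key.strip()] = value.strip()
    if tags.get("v") != "DMARC1":
        raise ValueError("not a DMARC record")
    return tags


# Hypothetical record, as it might be published at _dmarc.example.com.
EXAMPLE_RECORD = "v=DMARC1; p=quarantine; rua=mailto:dmarc-reports@example.com; pct=100"

policy = parse_dmarc(EXAMPLE_RECORD)
```

Here `policy["p"]` holds the policy the domain owner asks receivers to apply to mail that fails authentication, and `policy["rua"]` holds the address for aggregate reports.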
Advanced technologies and robust anti-spam systems were hence developed with the rise of Artificial Intelligence and Neural Networks. Gmail has excelled in using these technologies which we shall be deciphering later in this post.
Early Anti Spam filters and their Evolution
Gmail was launched in 2004, with much exclusivity. In the first few days, filtering spam was tough for Gmail. But Gmail quickly turned around to become superior to most other mailbox providers of the time, and user mailboxes were relatively clean.
Back in the day, image spam was rampant, and Gmail built advanced techniques like Optical Character Recognition to filter it out, which was way ahead of the curve.
At the core, Gmail built its filters around user engagement. If sufficient users mark an email as spam, the feedback is captured and fed into a system that filters similar emails from other users’ mailboxes.
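A feedback loop of this kind can be sketched in a few lines. This is only an illustration of the idea, not Gmail's implementation: the threshold, the class name and the use of an exact content hash as the notion of "similar email" are all assumptions made for the example.

```python
import hashlib

REPORT_THRESHOLD = 3  # hypothetical: distinct reporters needed before we filter


class FeedbackFilter:
    def __init__(self, threshold=REPORT_THRESHOLD):
        self.threshold = threshold
        self.reporters = {}  # fingerprint -> set of user ids who reported it

    @staticmethod
    def fingerprint(body):
        # Normalize case and whitespace so trivial variations hash the same.
        normalized = " ".join(body.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def report_spam(self, user_id, body):
        """Record that a user clicked 'Report spam' on this message."""
        self.reporters.setdefault(self.fingerprint(body), set()).add(user_id)

    def is_spam(self, body):
        """Filter the message for everyone once enough distinct users reported it."""
        reporters = self.reporters.get(self.fingerprint(body), set())
        return len(reporters) >= self.threshold
```

Counting distinct users rather than raw clicks means one user reporting the same message many times cannot trigger the filter alone. A real system would use fuzzy similarity instead of an exact hash.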
Even today, the core essence has not changed. Most of the filters that Gmail builds are centered around user engagement. The point Gmail always likes to make is that users should receive in their inbox only those emails which they expect and like to engage with.
What has improved is the efficiency of spam filters. Rule-based filters were upgraded to linear Machine Learning classifiers, deep learning algorithms, etc.
The war between spammers and mailbox providers is a never-ending one. As the anti-spam algorithms were upgraded, spammers upgraded the sophistication of their attacks too. Large-scale attacks transformed into guerrilla wars fought with precision and damage. To catch these, the spam filters have to be extremely vigilant, and thanks to Gmail’s team, which is always at work, we’re safe.
The Need for Gmail to Apply a Robust Anti-Spam Filtering System
The following are some of the reasons why Gmail had to take stock of the escalating situation of spam and invest in thorough research for anti-abuse and anti-spam purposes:
User Expectations Have Increased
Users now expect their mailbox providers to stay on top of issues related to abusers, spammers and other cyber-criminals propagating malicious viruses, worms and trojans via Email. The expectation of a clean mailbox has increased: as the technology has advanced, the security system designed to ward off attacks needs to be on par with the product.
Attacker Sophistication Against Spam Filters
As the anti-spam systems have evolved over time, so have the attackers and cyber fraudsters. Gmail has reported new attacks in which the level of expertise used to bypass the smart filters has increased.
In today's online world, users are likely to have their banking transactions and other personal details in their Email; hence an attack on a user’s inbox could have devastating consequences.
A New Wave Of Everyday Spam Attacks
Artificial Intelligence and Machine Learning not only help to detect current vulnerabilities in the system and control attacks, but also prevent future ones by learning patterns that flag anomalies. This is what makes the system intelligent enough to prevent spam.
Now that we know the reasons for Gmail to build the system, we can deep-dive into how it works.
Gmail filtering mechanism
As of 2019, Gmail has worked extensively on deep learning, using tools like TensorFlow and AI/ML classifiers, to build its robust anti-spam system. This system is able to detect spam with up to 99.9% accuracy and filter it before it can cause any inconvenience to Gmail's 1 billion users.
The remaining 0.1% comes down to novel, unknown attacks, whose anomalies can bypass any security system. Nothing is perfect, and neither are Gmail's spam filters.
So let’s dive into how Gmail's spam filters work at present:
The chart above shows the percentage-wise breakdown of the Gmail anti-spam technologies which are used to filter out huge amounts of spam every day and prevent well-coordinated spam and cyber attacks on the platform.
As we can see, AI/ML-based classifiers account for a large share of Email filtering in the current Gmail anti-spam mechanism.
A classifier sorts objects into the classes where they belong by finding similar patterns among the objects of each class. In this case, the classifiers sort spam occurrences into classes based on their individual characteristics. AI helps to generalize from the data to identify abstract concepts of spam. It also enables “temporal extrapolation”, which helps prevent future spamming based on evidence of spam in the past.
AI/ML can help recognize all the signals for spam and make the best decision possible in the circumstances. Here Deep Learning also helps to increase the amount of data that can be considered when making a decision.
More Data = Better Defense.
Gmail prevents 3.5% of all advanced phishing and spam attacks by predicting trends!
The classifier's decision is explained in terms of similarity to behavior seen so far, to ascertain whether an attack is known or unknown.
Gmail uses specialized models for detecting specific abuse behavior, but building these models individually is extremely challenging.
Challenges of Using AI/ML Classifiers
The dataset under consideration needs to be improved every day to prevent new attacks from succeeding. New models for phishing and spam emails have to be trained fast enough to keep up as attacks evolve over time. The models also have to be highly generalized, with enough capacity and training data.
Despite AI-based automated algorithms, strict monitoring needs to be in place, as according to Gmail 97% of attachments get scanned for spam every day.
Spammers are intelligent enough to avoid detection and hide their activity. Hence Gmail uses spam trap addresses, or honeypots, to get better clarity on such illicit spammers, which helps to shut them down.
In the end, spam is a personal concept: what seems like spam to me may not be spam for someone else with different interests. Hence, it is culture- and context-dependent.
To overcome this hurdle, each AI model has to capture the particular context of the product and content it deals with.
Thus Gmail creates personalized models based on a user’s tolerance of spam. They let users decide what constitutes spam for them and offer a way to control what is blocked.
E.g., Gmail offers the Unsubscribe as well as the Block feature; once a user clicks Block on an email, further emails from that sender land in spam automatically.
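One simple way to picture this personalization is a shared spam score combined with per-user settings. The sketch below is an assumption-laden illustration, not Gmail's design: the class, the default threshold of 0.5 and the idea that tolerance is a single number per user are all invented for the example.

```python
DEFAULT_THRESHOLD = 0.5  # hypothetical global cut-off on a 0..1 spam score


class PersonalFilter:
    def __init__(self):
        self.thresholds = {}  # user -> personal spam-score cut-off
        self.blocked = {}     # user -> set of senders the user blocked

    def set_tolerance(self, user, threshold):
        """A higher threshold means the user tolerates more borderline mail."""
        self.thresholds[user] = threshold

    def block(self, user, sender):
        """Mimics a Block button: future mail from this sender goes to spam."""
        self.blocked.setdefault(user, set()).add(sender)

    def folder(self, user, sender, spam_score):
        if sender in self.blocked.get(user, set()):
            return "spam"
        threshold = self.thresholds.get(user, DEFAULT_THRESHOLD)
        return "spam" if spam_score >= threshold else "inbox"
```

With this sketch, the same newsletter scored at 0.6 lands in spam for a user on the default threshold but in the inbox for a user who set their tolerance to 0.8, and a blocked sender always lands in spam regardless of score.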
Lack Of Abuse Features
AI/ML classifiers work better on image and text classification. But if the content has a different layout, other factors come into play:
IP addresses can be used instead to screen the content delivered from them, and these addresses can be placed on a watch list.
Modeling Temporal Behavior:
The sequences of features that led to spamming can be analyzed together. E.g., a thief who always leaves behind a similar trail of events before committing a robbery can be studied, and similar patterns can be derived to prevent robberies in future.
Spammers will always show some eccentric behavior in their settings, such as certain rate limits, content or targeted datasets, which the classifiers can pick up as outliers by modeling that behavior.
The Gmail classifiers can reduce false negatives to less than 0.01%, and good emails that Gmail classified as spam (false positives) were less than 0.005%.
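To be clear about what these two percentages measure: a false negative is spam that slips into the inbox, and a false positive is a good email that gets classified as spam. The short sketch below computes both rates from raw counts. The counts are invented purely so the arithmetic lands on the figures quoted above; they are not Gmail data.

```python
def error_rates(spam_caught, spam_missed, ham_delivered, ham_flagged):
    """Return (false_negative_rate, false_positive_rate) as fractions."""
    # Share of all spam that reached the inbox.
    false_negative_rate = spam_missed / (spam_caught + spam_missed)
    # Share of all good mail that was wrongly sent to spam.
    false_positive_rate = ham_flagged / (ham_delivered + ham_flagged)
    return false_negative_rate, false_positive_rate


# Hypothetical counts: 10 of 100,000 spam emails missed,
# 10 of 200,000 good emails wrongly flagged.
fnr, fpr = error_rates(
    spam_caught=99_990, spam_missed=10,
    ham_delivered=199_990, ham_flagged=10,
)
print(f"false negatives: {fnr:.3%}, false positives: {fpr:.3%}")
```

Note that the two rates use different denominators, which is why a filter can look excellent on one and still be painful on the other.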
Domain reputation has gained prominence since Google Postmaster Tools became publicly available to senders.
If your domain reputation is High, it is an indication from Google that your mailing pattern is on the right track and you are sending relevant content to users who actually wish to receive it.
But if your reputation is Low or Bad, there is no way your entire mailing list will reach the inbox; Google clearly states that you will have to improve your reputation to improve your Email Deliverability.
Deep Learning-Based Spam Filters
Deep learning is a subset of Machine Learning whose algorithms are inspired by the structure and function of the brain. These algorithms are called artificial neural networks, and the word “deep” refers to the layered structure of the networks.
Google uses fast supercomputers and large amounts of data to train these networks to think and make smart decisions on their own. This feature is applied across a range of products including Gmail.
Google uses deep learning to analyze historical data and find patterns in that data to make the networks learn and predict any future occurrences of these patterns. In this case, Gmail has used deep learning to analyze patterns of spam, phishing, spoofing or just fraud mails and uses these patterns to predict the next occurrence. As soon as their ultra-fast computing detects a similar pattern for a possible spam attack, Google shuts it down by either rate-limiting or just blocking the mails from entering their servers.
This technology is also used to provide smart reply suggestions and instant support for different languages.
This is how the new artificial brain of Google works!
TensorFlow-Based Spam Filters
TensorFlow is used by Gmail to make the process of Machine Learning easier. It allows Gmail to find hidden content embedded in an image, find spammy images or detect low volumes of spam within large legitimate traffic. This technology currently helps Gmail filter out 100 million spam messages each day! That is quite effective, as it is automated and requires little to no human intervention.
Gmail is currently employing this technology for malware and phishing detection as well.
How does Gmail Deal with Unknown Spam Attacks?
In the Harry Potter series, there were unexpected attacks by Death Eaters on Hogwarts School. There was no escaping the unpredictability of the attacks like Death Eaters sneaking inside the castle via hidden cupboards. In such cases, Dumbledore’s Army had to just deal with the damage and try to prevent it happening the next time!
Similarly, as Email spammers get more sophisticated by the day, it is a possibility that an unknown spam attack can occur and even Gmail might not have the know-how to deal with it.
The anti-spam filters are fairly accurate but security, after all, is about learning from new attacks and putting out the fires.
Certain actions can then be taken to restrict the damage caused by adversarial attacks:
Rate Limit Probing
Gmail can simply rate-limit suspicious emails that seem malicious and hold back delivery until it can analyze and determine the issue.
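Rate limiting can be done in many ways. One common approach is a sliding window per sender, sketched below. This is a generic technique and not a description of how Gmail does it; the class name and the limits are assumptions, and the current time is passed in explicitly to keep the example deterministic.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    def __init__(self, max_messages, window_seconds):
        self.max_messages = max_messages
        self.window_seconds = window_seconds
        self.history = defaultdict(deque)  # sender -> timestamps of accepted mail

    def allow(self, sender, now):
        """Return True to accept the message, False to defer it."""
        timestamps = self.history[sender]
        # Forget messages that have fallen out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_messages:
            return False  # over the limit: hold the message back for now
        timestamps.append(now)
        return True
```

For example, with `SlidingWindowLimiter(max_messages=2, window_seconds=60)`, a sender's third message within a minute is deferred, and the sender is accepted again once the older messages age out of the window.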
Use Ensemble Learning
The various models of machine learning and deep learning can be combined to increase the robustness of the filters in view of an unknown attack.
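As a rough illustration of combining models, the sketch below averages the scores of three deliberately naive detectors. The three toy models and their rules are invented for the example; a real ensemble would combine trained classifiers, but the combining step looks much the same.

```python
def keyword_model(message):
    """Toy model: flags one well-known spammy phrase."""
    return 1.0 if "free money" in message.lower() else 0.0


def link_model(message):
    """Toy model: the more links, the more suspicious (capped at 1.0)."""
    return min(message.lower().count("http") / 3, 1.0)


def shouting_model(message):
    """Toy model: share of letters written in upper case."""
    letters = [c for c in message if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isupper() for c in letters) / len(letters)


def ensemble_score(message, models, weights=None):
    """Weighted average of the individual model scores, each in 0..1."""
    if weights is None:
        weights = [1.0] * len(models)
    weighted = sum(w * model(message) for model, w in zip(models, weights))
    return weighted / sum(weights)


MODELS = [keyword_model, link_model, shouting_model]
```

The point of the ensemble is that an attacker who learns to fool one model (say, by avoiding trigger phrases) still has to get past the others.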
Use Of Anomaly Detection
An unknown attack will show some anomalous behavior, which can help recognize it as spam. This anomaly detection is an important weapon against such unpredictable attacks.
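One of the simplest forms of anomaly detection is a z-score check: compare a sender's current behavior with its own history and flag large deviations. The sketch below applies this to hourly sending volume. The function name, the threshold of 3 standard deviations and the sample numbers are assumptions for illustration, and real systems look at many more signals than volume.

```python
from statistics import mean, stdev


def is_anomalous(history, current, z_threshold=3.0):
    """Flag `current` if it is far from the sender's historical average."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current != mu  # perfectly steady sender: any change stands out
    return abs(current - mu) / sigma > z_threshold


# Hypothetical emails-per-hour history for one sender.
HOURLY_VOLUME = [100, 110, 95, 105, 90]
```

With this history, a volume of 104 emails in the next hour looks normal, while a sudden burst of 5000 is flagged.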
Black Swan Process
Have a central unit that stops Emails found to be malicious and blocks them.
The procedure to do so should be well defined, allowing a central shut-down so that mails are not delivered to user inboxes if they contain malware or are fishy in nature.
Future Trends for Email Deliverability
With the above points in view, we know how Gmail will filter your Emails if they trigger the anti-spam filters. Hence, if you are a good sender and wish to land your Emails and Newsletters in your customers' inboxes, below are some trends to follow to ensure good Deliverability:
Email Deliverability will depend more on sending hyper-personalized, relevant content to your audience, thus getting good engagement for your email campaigns.
Interactivity through AMP in Email will boost the time users spend in the Email client and help them engage dynamically with brands.
Marketing Emails will have to cater to mobile devices, as more and more emails are being opened on mobile phones with increasing connectivity.
Thus, we have seen the sophisticated technologies Gmail uses for its anti-spam filters. But we can also see that keeping spam attacks at bay can be a fire-fighting job for Google, which means that it is sometimes reactive and there is a chance for false positives to occur.
So the next time your Email Deliverability goes for a toss even though you are a good sender, check the data points mentioned above, primarily content, dataset targeted and reputation. Follow the evolving trends for good Deliverability to get your Emails inboxed and make your Email Program a success.