
Disclaimer: The following views are my own and not the official views of Google, G Suite, or YouTube. This post is not meant to be theoretically rigorous; it is only meant to trigger thought experiments.
I’ve encountered a number of articles online on spam and abuse detection; a quick search for “spam detection” gives us a number of articles [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12]. They all start with labeled datasets and go on to train some machine learning model (usually Naive Bayes) to predict whether a given piece of text is spam. Spam detection is framed as an ML classification problem, dealing mainly with textual data. In this post, I aim to show how it can get a lot more complicated.
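For concreteness, here is a minimal sketch of the kind of classifier those articles build: a Naive Bayes model over bag-of-words features, trained with scikit-learn on a toy, made-up dataset (the messages and labels below are purely illustrative).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled dataset: 1 = spam, 0 = not spam.
messages = [
    "get cheap via-gra now",       # spam
    "limited offer, click here",   # spam
    "lunch tomorrow at noon?",     # not spam
    "here are the meeting notes",  # not spam
]
labels = [1, 1, 0, 0]

# Bag-of-words features fed into a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)

print(model.predict(["free via-gra offer"]))  # likely [1]
```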
Labeled datasets
All of these articles start with labeled datasets with well-defined spam and not-spam labels. Some talk about using public open-source datasets, while others omit the discussion entirely. There are several questions one can ask about how well those techniques work. How was the dataset labeled? Would the spammer continue to use similar language and choice of words over time? Would all future spammers have the same objective as those in the dataset? Could the spammer get access to the dataset and learn from it if it is public?

Dynamic systems
Spam detection is a service, not a product. Most corporations have full-time teams countering spam because the nature of spam keeps evolving. A classifier built on training data from several years ago will likely be of no use, as spam will have changed in appearance, origin, and objective over time. It is critical to have a continuous process of tracking and labeling spam and adapting to counter it. How can this be done? Who provides the ground truth?

Human reviewers
Labeling spam could be a human operation, but that can be expensive depending on where the reviewers operate. A two-member team at a $12/hour minimum wage costs roughly $50k/year in wages alone, and at least $80k/year once overhead is included. Also, given the evolving nature of spam, there are additional challenges around how these reviewers would be trained. Spam is also contextual, and its perception varies across cultures: it could be hard for a reviewer in the US to tell whether some content would be perceived as spammy by, say, a Japanese user. Privacy constraints while dealing with sensitive data only make matters worse, and additional techniques would be needed to anonymize data.

User-generated Labels
One could alternatively rely on the end-users of a product to provide the labels, since the objective is to protect those users. Why pay when you can get it for free? But if such a simple strategy is employed and the product is accessible to the spammer, the spammer can also mislabel spam, for example by generating thousands of not-spam labels on text containing a phrase like “get via-gra”. Requiring proof of work and/or authentication for such actions could throttle bad labeling. Additionally, throttling the number of labels per account and throttling account creation could contain it significantly, and labeling could be further limited to aged accounts. Does this solve all problems?
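As a rough sketch of that throttling, consider accepting a user-provided label only if the account is authenticated, old enough, and under its daily label quota. The thresholds and field names below are illustrative, not a real system.

```python
from collections import defaultdict
from datetime import datetime, timedelta

MIN_ACCOUNT_AGE_DAYS = 30   # only aged accounts may label
MAX_LABELS_PER_DAY = 20     # per-account throttle

labels_today = defaultdict(int)  # account_id -> labels submitted today

def accept_label(account_id, account_created_at, is_authenticated, now=None):
    """Return True if this account's label should be accepted into the corpus."""
    now = now or datetime.utcnow()
    if not is_authenticated:
        return False  # require authentication before counting the label
    if now - account_created_at < timedelta(days=MIN_ACCOUNT_AGE_DAYS):
        return False  # account is too young to be trusted
    if labels_today[account_id] >= MAX_LABELS_PER_DAY:
        return False  # daily quota exhausted
    labels_today[account_id] += 1
    return True
```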
Weaponization
The above strategy does solve the problem of bad negative labels; however, the spammer can also generate content with legitimate phrases and target trusted, aged accounts, forcing them to apply the spam label. This brings down the goodness of certain phrases, and depending on the algorithm that trains on these labels, it could, and likely will, create collateral damage (false positives). In a replay attack, for example, a spammer can maliciously “replay” a valid message that was generated with legitimate intent in a different context. The spammer can turn our dynamic system against us. The spammer would not necessarily gain anything directly from this, but that does not guarantee a spammer will not do it. An entity dealing with user trust has a lot more at stake than a spammer, who can use this “weapon” in the scenario where it does the most damage, say during a major event. The term adversary or attacker is therefore better suited than spammer for the antagonist here.
Early rejection
One approach to the problem above is early rejection. If a data point is blatantly bad, we can reject it early and exclude it from our training corpus entirely. This prevents attackers from controlling dataset quality, either by significantly skewing distributions or by forcing users to label good feature values as bad. It separates privileges: attackers can only affect the dataset by small amounts, and accounts only get to label features where we are not confident. This prevents poisoning of the data.
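A minimal sketch of such an early-rejection filter, assuming a hypothetical is_blatantly_bad predicate; the known-bad ASN set and submission-rate threshold are illustrative signals, not real reputation data.

```python
KNOWN_BAD_ASNS = {64512, 64513}  # placeholder ASNs from the private-use range

def is_blatantly_bad(example):
    """Hypothetical predicate for data points we refuse to learn from at all."""
    return (
        example.get("asn") in KNOWN_BAD_ASNS
        or example.get("submissions_last_hour", 0) > 1000
    )

def build_training_corpus(candidate_examples):
    """Keep only examples that pass the early-rejection filter."""
    corpus = []
    for example in candidate_examples:
        if is_blatantly_bad(example):
            continue  # reject early; never let it influence training
        corpus.append(example)
    return corpus
```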

Cause vs symptoms
The provided text is controlled by the attacker and can be arbitrary. We often think of spam as an ad for via-gra, but it is more likely a promotion of some arbitrary product the user is not interested in. A deeper reason the problems above happen in the first place is that we try to treat the symptoms of the spam problem (the text) rather than its causes (a loose analogy): the occurrence of a word in a spammy context led us to associate spamminess with the word. Unless the use of particular phrases is critical to the spammer, they can easily return with a different ad using arbitrary inputs, even if we successfully learn to classify a given piece of text. The main way to treat the cause of the problem is to choose features that the attacker cannot easily control: features where we can raise the bar against generating arbitrary feature values or adversarial inputs.

Barrier to entry
Giving everyone a voice does not work well in practice. By requiring some kind of proof of identification with an associated non-zero cost (resource_cost), we can set a minimum bar for entry. Examples of ids and their associated costs:
- Authenticated accounts with SMS verification have the cost of a phone number; password knowledge acts as proof of account ownership.
- IP addresses have hosting costs; a completed TCP handshake makes spoofing the source address impractical, so the IP somewhat acts as a proof of identification.
- Authenticating domains have registration costs; DKIM and SPF are some ways to validate that inputs originate from certain domains.
Rather than (or in addition to) learning to classify certain text as bad, the spam classifier could instead use features that are hard for the attacker to control. The attacker cannot generate arbitrary feature values since a proof of id is required. The detector could simply block input originating from ids that are historically spammy, or raise alarms on suspicious ones. The features can be granular: in certain countries (identified by a country-code prefix), phone numbers are cheaper to acquire, so the country code could be a feature; for certain TLDs (.tk .ml .ga .cf), domains are free of cost; and certain cloud providers (identified by ASNs) allow easy, programmatic rotation of IP addresses. The domain provider or the IP’s ASN can be used as features.
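A rough sketch of such attacker-resistant features, with placeholder lookup tables: the phone prefixes and ASNs below are made-up placeholders, and only the free-TLD list comes from the examples above.

```python
FREE_TLDS = {"tk", "ml", "ga", "cf"}   # TLDs where domains cost nothing
CHEAP_SMS_PREFIXES = {"+999"}          # placeholder; fill in prefixes where numbers are cheap
EASY_ROTATION_ASNS = {64512, 64513}    # placeholder ASNs for providers with easy IP rotation

def extract_identity_features(record):
    """Map a signup/message record to features the attacker cannot freely choose."""
    return {
        "cheap_phone_prefix": any(record["phone"].startswith(p) for p in CHEAP_SMS_PREFIXES),
        "free_tld": record["domain"].rsplit(".", 1)[-1] in FREE_TLDS,
        "easy_rotation_asn": record["asn"] in EASY_ROTATION_ASNS,
    }

# These booleans can be fed to the classifier alongside (or instead of) text features.
```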
Prevention (barrier to entry) is usually better than cure (detection/catch-up) except when you’re trying to justify the impact for a performance review.

Trust gaming
The way the attacker can now bypass the spam classifier is by spoofing or hijacking identification. This requires an exploit, incurring a technology_cost (as in the OAuth issue that affected Google in May 2017, where emails came from legitimate users and pointed to a legitimate website). The other way to bypass it is by slowly gaining trust through legitimate content. After the breach, in either scenario, the attacker has a short window of time (reaction_time) during which they can generate a finite amount of spam (assuming appropriate throttling at spam_rate) before the dynamic system learns to detect it again. Assuming the spammer gets paid a certain amount per view (cost_per_view), the attacker nets a positive gain from the incident if
reaction_time x spam_rate x cost_per_view > resource_cost + technology_cost
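Plugging made-up numbers into this inequality shows the levers on each side; none of the values below are measurements, and each spam message is assumed to yield one paid view.

```python
reaction_time_hours = 2.0     # window before the system re-learns to detect
spam_rate_per_hour = 10_000   # throttled messages per hour during the window
cost_per_view = 0.0005        # attacker revenue per delivered view, in dollars
resource_cost = 5.0           # e.g. a phone-verified account, in dollars
technology_cost = 50.0        # e.g. an exploit or trust-building effort, in dollars

attacker_gain = reaction_time_hours * spam_rate_per_hour * cost_per_view
attacker_cost = resource_cost + technology_cost

print(attacker_gain, attacker_cost)                 # 10.0 vs 55.0
print("attack profitable:", attacker_gain > attacker_cost)  # False
```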
The economics
Assuming a rational adversary, the attacker only spams if the net gain is positive; otherwise, they go bankrupt over time. The goal of the overall system is to ensure a net negative gain, either by raising the resource cost (for example, making it harder to create accounts in bulk), by requiring multiple kinds of resources (IPs and accounts), or by lowering the reaction time and potentially cleaning up any missed spam. This keeps the mom-and-pop spammers and script kiddies in check. As the arms race evolves, it leaves behind a small number of professionals. Their resource acquisition rate is usually limited, and for a fixed demand, the cost_per_view depends on how hard we make it for the spammer to spam. The cost per spam view can thus act as a metric of success (the higher its value on the black market, the better our performance).
Conclusion: In addition to learning logistic regression and deep learning over text, consider acquiring and applying some domain knowledge [1] [2] [3] [4] [5] [6] [7], and use it to ensure attackers cannot easily control the quality of the dataset, the labels, and, most importantly, the feature values. Engineer the right kind of features: ones that are hard to spoof and uneconomical to control. Non-text features can counter spam generated by even an advanced AI text-generation program. All of the above mainly applies to large-scale attacks, not to targeted or smaller-scale attacks (which have a high cost per view) or non-spam attacks (hate speech, phishing, or social engineering) where the attackers are constrained in the text they use.
Please comment and share if you liked this.