Spam detection – does the Naive approach work?

Disclaimer: The following views are my own and not the official views of Google, G Suite, or YouTube. It is also not meant to be theoretically rigorous but just meant to trigger thought experiments.

I’ve encountered a number of articles online on spam and abuse detection, a quick search for “spam detection” gives us a number of articles [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12]. They all start with labeled datasets and go on to train some Machine Learning model (usually Naive Bayes) to predict if a given piece of text is spam or not. Spam detection is projected as an ML classification problem, mainly dealing with textual data. In this post, I aim to show how it can get a lot more complicated.

Labeled datasets
All of these articles start with labeled datasets with well-defined spam and not-spam labels. Some talk about using public open-source datasets while others omit discussion entirely. There are several questions one can ask around how well those techniques work. How was the dataset labeled? Would the spammer continue to use similar language and a choice of words over time? Would all spammers in the future have a similar objective as in the dataset? Could the spammer get access and learn from the dataset if it is public?

Spam continually changes its appearance to bypass filters

Dynamic systems
Spam detection is a service, not a product. Most corporations have full-time teams to counter spam as the nature of spam keeps evolving. A classifier built using training data from several years ago will likely be of no use as the spam would have changed in appearance, origin, and objective over time. It is critical to have a continuous process of tracking and labeling spam and adapting to counter it. How can this be done? Who provides the ground truth?

Spam is hard for human reviewers to label accurately universally

Human reviewers
Labeling of spam could be a human operation but that would be expensive depending on where they operate. A 2 member team with 12$ minimum wages would cost a company at least 80k$/year. Also, given the evolving nature of spam, additional challenges around how these humans would be trained. Spam is also contextual and the perception varies across cultures. It could be hard for a reviewer in the US to tell if some content would be perceived as spammy or not by say a Japanese user. Privacy constraints while dealing with sensitive data only make the matter worse and additional techniques would be needed to anonymize data.

If user-generated labels are blindly trusted, spammers can control bots or affiliated accounts to mislabel data

User-generated Labels
One could alternatively rely on the end-users of a product to provide the labels since the objective is to protect the users. Why pay when you can get it for free? If such a simple strategy is employed and the product is accessible by the spammer, the spammer can also mislabel spam by say generating thousands of not-spam labels on text containing a phrase, say “get via-gra”. Requiring proof of work and/or requiring authentication for such actions could throttle bad labeling. Additionally throttling the number of labels per account and throttling account creation could significantly contain it. Labeling could be further limited to aged accounts. Does this solve all problems?

The above strategy does solve the problem of bad negative labels however, the spammer can also generate content with legitimate phrases and target trusted aged accounts, forcing them to apply the spam label. This would bring down the goodness of certain phrases and depending on the kind of algorithm that trains on these labels, this could and will likely create collateral damage (false positives). In a replay attack, for example, a spammer can “replay” a valid message with legitimate intent generated in a different context, maliciously. The spammer can turn our dynamic system against us. The spammer would not necessarily gain anything directly with this but that does not guarantee a spammer will not do this. An entity dealing with user trust has a lot more at stake than a spammer who can use this “weapon” in a scenario where it could damage the most, say during a major event. The term adversary or attacker is, therefore, better suited than spammer for the antagonist here.

Early rejection
One approach to solving the problem above would be early rejection. If a data point is blatantly bad, we can reject it early and exclude it from our training corpus entirely. This means we disallow attackers from controlling the dataset quality, either by skewing distributions significantly or by forcing users to label good feature values as bad. This separates privilege and only allows attackers to affect datasets by small amounts and also accounts to only label features where we’re not confident. This prevents poisoning of data.

Wiping a wet floor dry does not stop the leaking pipe. The root causes of the problem are not addressed by learning to identify a particular kind of spammy text.

Cause vs symptoms
The provided text is controlled by the attacker and can be arbitrary. We often think of spam to be an ad for via-gra but it is more likely a promotion of an arbitrary product that the user is disinterested in. A deeper reason why problems like above happened in the first place is as we tried to treat symptoms in the form of text rather than causes of the spam problem (a loose analogy). The occurrence of a word in a spammy context, associated spamminess to the word. Unless the use of particular phrases of text is critical, the spammer can easily return with a different ad using arbitrary inputs, even if we successfully learn to classify a piece of text. The main way to treat the cause of the problem is by choosing features that cannot be easily controlled by the attacker. Choose features where we can raise the bar for an attacker to generate arbitrary feature values or adversarial inputs.

By requiring attackers to spend some resources to secure a proof of identity to enter the playground, we can make it non-trivial to control feature values and we can burn attacker resources for every campaign

Barrier to entry
Giving everyone a voice does not work well in practice. By requiring some kind of proof of identification which has an associated non-zero cost (resource_cost), we could have a minimum bar for entry. Examples of ids and associated costs-

  • Authenticated accounts with SMS verification have the cost of a phone number – password knowledge acts as proof of account ownership.
  • IP addresses have hosting costs – TCP does not allow spoofing of IP addresses so it somewhat acts like a proof of identification.
  • Authenticating domains have registration costs – DKIM and SPF are some ways that allow validating the origin of inputs to certain domains

Rather than (in addition to) learning to classify certain text as bad, the spam classifier could use features hard for the attackers to control instead. The attacker cannot generate arbitrary feature values since a proof of id is required. The detector could simply block input originating for ids that are historically spammy or raise alarm for spammy ones. The features can be granular. For example, in certain countries (a given country code prefix), phone numbers are cheaper to secure and the country-code could be a feature. For certain TLDs (.ga .tk .ml .ga .cf), the domains are free of cost, and certain cloud providers (identified by ASNs), allow easy and programmatic rotation of IP addresses. The domain provider or the IP ASN can be used as features.

Prevention (barrier to entry) is usually better than cure (detection/catch-up) except when you’re trying to justify the impact for a performance review.

Spammers are forced to use hijacked accounts, IPs, and websites to be resource-efficient if a minimum cost of identification is set

Trust gaming
The way the attacker could now bypass the spam classifier is either by spoofing or hijacking identification. This would require using an exploit by incurring a technology_cost (like in the case of the OAuth issue that affected Google in May 2017 where emails came from legitimate users pointing to a legitimate website). The other way to bypass would be by slowly gaining trust by sending legitimate content. After the breach in either of the above scenarios, the attacker has a short duration of time (reaction_time) during which they can generate a finite amount of spam (assuming appropriate throttling at spam_rate) before the dynamic system learns to detect again. Assuming the spammer gets paid a certain amount per view (cost_per_view), the attacker would gain a net positive with the incident if

reaction_time x spam_rate x cost_per_view > resource_cost + technology_cost

The economics
Assuming a rational adversary, the attacker would only spam if the net gain is positive or they go bankrupt over time. The goal of the overall system would be to ensure a net negative gain, either by raising the resource cost (like making it harder to create accounts in bulk) or requiring multiple kinds of resources (IPs and accounts), or by lowering the reaction time and potentially cleaning up of any missed spam. This keeps the mom-and-pop spammers/script-kiddies in check. As the arms race evolves, it leaves behind a small number of professionals. The resource acquisition rate for them is usually limited and for a fixed demand, the cost_per_view would depend on how hard we make it for the spammer to spam. The cost per spam view can act as a metric for success (the higher the value on the black market, the better our performance).

Conclusion: In addition to learning logistic regression and deep learning over text, consider acquiring and applying some domain knowledge [1] [2] [3] [4] [5] [6] [7] and use it to ensure the attackers cannot easily control the quality of the dataset, the labels, and most importantly the feature values. Engineer the right kind of features that are hard to spoof and uneconomical to control. Non-text features can easily counter spam generated by even an advanced AI text generation program. All of the above mainly apply for large scale attacks and not targeted/smaller-scale attacks (which are high cost per view) or non-spam attacks (hate-speech or phishing or social engineering) where the attackers are constrained on the text used.

Please comment and share if you liked this.

Rhyming words using python

Here is some simple python code that uses NLTK libraries to find rhyming words of a given level.

import nltk

def rhyme(inp, level):
entries = nltk.corpus.cmudict.entries()
syllables = [(word, syl) for word, syl in entries if word == inp]
rhymes = []
for (word, syllable) in syllables:
rhymes += [word for word, pron in entries if pron[-level:] == syllable[-level:]]
return set(rhymes)

print “word?”
word = raw_input()
print “level?”
level = input()
print rhyme(word, level)

Wikipedia Offline

Most of what my work online either involves checking mail or browsing forums for getting answers or reading Wikipedia for getting information or social networking. With LAN cuts introduced in the IITs, it is difficult for a student to access information after 12:10 unless they breakout somehow. In an earlier post, I had explained with references to my code, on how to download parts of Wikipedia, I thought it would be helpful to download the whole of Wikipedia on to your computer. In this post I will show you how Wikipedia / stack-overflow / gmail can be download for offline use.



  • LAMP (Linux, Apache, MySQL, PHP)
  • Around 30 GB of space in primary partition 30 GB of space for storage. In my case the root partition
  • 7 GB of free Internet download
  • 3 days of free time

Wikipedia dumps can be downloaded from the Wikipedia site in XML format compressed in .7zip. This is around 6 GBs when compressed and expands to around 25GB of XML pages. It doesn’t include any images. This page shows how one can extract text articles from articles and construct corpuses from the same. Apart from this, a static HTML dump can also be downloaded from Wikipedia page (wikipedia-en-html.tar.7z) and this version has images in it. The compressed version is at 15 GB and it expands to over 200 GB because of all the images.

The Static HTML dump can simply be extracted to get all the HTML files and the required HTML file can be opened to view the required content. In case you download the XML dump, there is more – you have to extract the articles and create your customized offline Wikipedia.with the following steps.

  1. Download the latest mediawiki and install it on your Linux/Windows machine using LAMP/WAMP/XAMPP. Mediawiki is the software that renders Wikipedia articles using the data stored in MySQL.
  2. Mediawiki needs a few extensions which have been installed in Wikipedia.Once we have mediawiki installed say /var/www/wiki/, download each of them and install by extracting these extensions in the /var/www/wiki/extensions directory.
    The following extensions have to be installed – CategoryTree, CharInsert, Cite, ImageMap, InputBox, ParserFunctions (very important), Poem, randomSelection, SyntaxHighlight-GeSHi, timeline, wikihero which can all be found in the Mediawiki extensions download page by following the instructions. In addition you can install any template to make your wiki look like whatever you want. Now your own Wiki is ready for you to use, you can add your own articles but what we want now is to copy the original Wikipedia articles to our Wiki.
  3. It is easy to import all the data once and then construct an index for the data in MySQL than to update the index each time an article is added. Open MySQL and your database, the tables that are used in the import are text, page and revs. You can delete all the indexes on that page and create it again in the 5th step to speed up the process.
  4. Now that we have our XML database, we need to import it into the MySQL database. You can find the instructions here. In short, a summary of the instructions found on that page, the ONLY WAY you can get Wikipedia really fast on your computer is to use mwdumpertool to import into the database. The inbuilt tool in mediawiki won’t work fast and may run for several days. The following command can be used to import the dump into the database within an hour.
    java -jar mwdumper.jar --format=sql:1.5 <dump_name> | mysql -u <username> -p <databasename>
  5.  Recreate the indexes on the tables ‘page’, ‘revs’ and ‘text’ and you are done.

You can comment if you want to try the same or if you run into any problems while trying.



  • LAMP (Linux, Apache, MySQL, PHP)
  • Around 15 GB of space in the primary partition and 15 GB of storage. In my case the root partition
  • 4 GB of free Internet download has several stackoverflow zip files available for direct download. Alternatively, stack-overflow dumps can be downloaded using a torrent. A torrent download can be converted into an FTP download using Once you have the dumps you can unpack them to get huge XML files for several stack sites. Stack-Overflow is one of the stack sites, the 7zip file is broken into 4 parts and have to be combined using a command (cat a.xml b.xml c.xml d.xml > full.xml) Once combined and extracted, we can see 6 xml files for each site (badges, comments, postHistory, posts, users, votes, ) Among these, comments, posts and votes may seem useful for offline usage of the forum. A main post may consist of several reply posts and each such post may have follow-up comments. Votes are used to rate an answer and they can be used as signals while you browse through questions. Follow the following steps to import the data into the database and use the UI to browse posts offline.

  • Download Stack sites
  • Create a database StackOverflow with the schema using the description here. (comments, posts and votes tables are enough)
  • Use the code to import the data to the database. (Suitably modify the variables serveraddress, port, username, password, databasename, rowspercommit, filePath and site in the code)
  • Run the code on Stack Mathematics to import the mathematics site. For bigger sites, it may take much more time and a lot of optimizations are needed along with a lot of disk space in the primary partition where the MySQL stores its databases.
  • Use the UI php files to view a post given the post number along with the comments and replies.
  • TODO: Additionally we can add a search engine that searches the table ‘posts’ for queries and returns post numbers which match the same.
Gmail offline
  • Windows / Mac prefered
  • Firefox prefered
  • 20 minutes for setup
  • 1 hour for download
Gmail allows offline usage of mails, chats, calendar data and contacts. You can follow the following simple series of steps to get gmail on your computer.
  • Install Google gears for firefox
    • You can install google gears from the site
    • If you are on Linux, you can install gears package. [sudo apt-get install xul-ext-gears]
    • Note: Gears works well in Windows, may fail on Linux
  • Login to gmail
  • Create new label “offline-download”
  • Create a filter {[subject contains: “Chat with”] or [from: <user-name>] -> add label “offline-download” to selectively download your conversations.
  • Enable offline Gmail in settings, and allow download “offline-downloa” for 5 years. You can select the period of time as well.
  • Start, it will end in around an hour and you will have your mails on your computer in an hour.
Offline gmail creates a database called [emailID] in your computer. The gears site gives you the location. You can find some information about offline GMail here.
If you want a custom interface for your mails / chats etc, you can create one which queries the SQLITE database mentioned above to present the content however you want. The software diarymaker can be used to read your chat data with plots of frequencies with time and rank your friends based on the interactivity. It works on Linux and uses the Qt platform. I will add a post on it soon.
Feel free to comment on any issue, if you have an idea for downloading any other kind of data on to your computer for offline usage, please let us know with a comment.
Update: Now you can download stackoverflow directly. (Courtesy: Sagar Borkar)

Wikipedia Mining

Wikipedia has several million articles as of today and it is an excellent source for both structured and unstructured data. The Egnlish Wikipedia gets around 30K requests per second to its 3.5 million articles which is contributed by over 150 million active users. This post tries to  bring out some of my experiences in mining Wikipedia.

Infobox example
Infobox in a Wikipedia article

Wikipedia has infoboxes that are sources of structured information. The screenshot attached is an example of an infobox. One can create a quizzing application using the structured infobox data in Wikipedia. The whole of Wikipedia is huge. It is difficult to download and process the whole of it. A small section for example, cricket, can be selected and all the articles from under the category can be downloaded by recursively crawling all the sub-categories and articles in the Wikipedia article. There are several libraries that can be used for the same. In python, a library called beautiful soup can achieve the same.

The code here prints a category tree. A category page like this has a list of sub-categories, expandable bullets and a list of pages. The function printTree is called on each article. It lists and downloads the ’emptyBullets’ (leaf-categories) and ‘fullBullets’ (expandable-categories) and recursively calls the function for the sub-category. It avoids re-downloading duplicate pages. The getBullets function uses the Beautiful soup library to get all ‘<a>’ tags in the HTML article which have tags ‘CategoryTreeEmptyBullet’ and ‘CategoryTreeBullet’, the other functions seem simple enough to comprehend. The outcome of this function is the Category tree in Wikipedia. The function ‘download’ which uses wget can be replaced with a better download function that probably uses curl or some python HTTP library.

I have attempted to create a simple rule based ‘Named Entity Recognition’ program to classify these articles into people / organizations / tours and so on. This python file does the same. Here is a named-entity tagged version of the category.

Using the list of categories, we can get the list of articles in each category using this file, and the articles can be downloaded using this file. The code is pretty simple and I am mentioning these files here for the sake of completion. The body(article content) can be extracted using this file.

This file takes in an article and extracts the information in the infobox and outputs them in a structured format. I have written custom processors for processing infoboxes of people and infoboxes of people and matches and so on.

Here are examples of final outputs. (Article body of Sachin Tendulkar, Infobox extracted from article Sachin Tendulkar, an infobox on the Indian tour of England in 2002.)

These outputs are in my own structure but they can be standardized to some format for example XML/JSON/RDF. A script can be used to do the same.


In another 6 days, my summer internship program at Google will begin. I have been waiting for over 7 months for this to happen. Google booked a flight for me to get to Bangalore and a cab to take me home from the airport. I am expecting the best from Google. So how did I get here? Since this is my tech blog and not the philosophical one, this post is about how I got my internship at Google.

I faced 3 rounds of telephonic interviews involving questions on algorithms and problem solving in a span of 3 weeks. This was my second interview. The first one by Microsoft did not go that well. I RG-ed myself in IITM lingo (caused my own failure). Learning from my mistakes, I was well prepared for the second one. One thing that I learnt from my mistakes was that I did not speak much about myself. I thought of everything that I had done and planned on what to speak about myself with the interviewer after that. I had been SPOJing out of interest and addiction. That did help me a lot.

My first interview lasted around 50 minutes. I spent the first 15 minutes introducing myself. Then started the programming round where I had to write a piece of code for the atof function. Although the task seems simple, converting an error free string to a floating point number, the actual work involved writing error free code and tracing the code like a machine and analysing the function to find the exact number of multiply-divide operations. My initial code took 2*N steps, I had to reduce it to 1*N multiplications using simple optimisation principles. The next question involved mapping an arbitrarily sized string to a number. That is quite trivial as its just conversion of bases. The final question involved generation of all possible subsets for a given set. A binary tree with recursion or a bit-vector iteration from 0 to 2^N are simple ways of generating all possible subsets of a set of size N. The interviewers were happy and I was selected for the next round scheduled after a week and a half.

The difficulty level of the questions increased with each round. This is clear considering the simplicity of the questions in the first round. The second round was interesting. The algorithmic question asked me to find out the total number of ways in which 2*N number of people sitting around a round table can shake hands so that no hands cross. I wasn’t aware about this problem and it turned out to have some interesting results as I discovered during the problem solving session. I quickly suggested a recursive approach where we divide the table, recursively calculate the value for the two halves, multiply them and sum it over all possible divisions of the table. The image in this link should make it clear. This was done in the first few minutes and I suggested a pseudo-code for the same. The interviewer then asked me to make it more efficient and gave me a hint asking me the number of times the value F(10) was calculated while calculating f(20). That suggested I had to use dynamic programming and memoisation. I suggested an alternate version for the same where you maintain an array ‘F’ and remember the outputs of the function ‘f’. I further suggested an iterative alternative for the recursion where f(n) was calculated after all f(i), i < n are calculated. The interviewer gave the last hint through which I realised that I was calculating nothing but Catalan numbers (A simple recurrence relation). The rest of the interview involved some simple questions on C++ language right out of the text book. (Private constructors, static functions, object method invocation after destruction …) The second interview was over in just half an hour.

The final interview was scheduled a couple of days after the second. The first question was a graph theoretical question where I had to find the maximum number of edges in an DAG. The right side implication is just a proof by example and the left side implication was proving that only nC2 pairs of unordered pairs (a, b) can be constructed. The next question was the first one where I actually wrote code. I had to write a program to find the number of 1’s (S(n)) in the series (0, 1, 10, 11, 100, 101 … B(n)) where B(n) is the binary representation of n. Eg S(5) is the number of 1’s in (0, 1, 10, 11, 100, 101). S(5) = 7.

I  remembered something that I learnt in my combinatorics class which suggested S(2^k – 1) is easy to calculate. S(2^k – 1) = k * 2^(k-1) is a very easy formula to derive using summation and I did that in the first 5 minutes.

000  001
010  011
100  101
110  111

S(7) = 3*(2^3) = 12

I had to write code which worked for any n. This involved extracting the first bit of the number and processing the rest of it. I had a vague recurrence relation in my mind and I spent over half an hour trying to ‘write’ the code which worked for all end cases. There was a small bug in my code and I could not fix it in time and the interviewer moved on to the next question. I could have typed out the code in my computer and executed and tested instead of manually tracing the program with code written on a piece of paper.

The last question involved suggesting a strategy for Google to suggest alternate search results in case the user makes an error. The “Did you mean …” part in a google page like here. I suggested naive strategies like flipping the characters in the query to see if better search results are obtained. The idea is clearly not scalable because of the sheer number of ways the user might have made the errors. I further went on to suggest using the keyboard key-placement and restricting the character flipping to just the neighbouring keys. This again wasn’t good enough. The interviewer wanted a better approach. I though for a while and suggested a directed graph structure for queries where there are directed edges from one string to another if there is a high probability of the suggestion. The edge-weights could be proportional to the probability of error. The closest few neighbours of a given node would list the most likely queries. A simple BFS could reveal the closest neighbours. The problem now was in the assignment of weights to the edges for which I had to suggest an algorithm. I thought on the lines of people querying for something, not being happy with the results, changing the query immediately to get a good response. The edge between query1 and query2 could be inversely proportional to the number of times the same happens. The interviewer was pleased with the approach. I then requested him to give me a few minutes to complete the program. I fixed the bug and submitted the code and it worked great in logarithmic time. (Linear on the number of bits in ‘n’)

In a couple of weeks, the HR called me up and asked me to send my certificates and informed me that I had been selected for internship at Google. 😀

Update: has a collection of images that describe the environment in Google. Looks awesome!