Email Spam Filtering : A python implementation with scikit-learn

Date: January 23, 2017 | Author: Abhijeet Kumar

Text mining (deriving information from text) is a broad field that has gained popularity as huge volumes of text data are generated. Machine learning models have been used to automate many applications, such as sentiment analysis, document classification, topic classification, text summarization and machine translation.

Spam filtering is a beginner’s example of a document classification task: classifying an email as spam or non-spam (a.k.a. ham). The spam box in your Gmail account is the best-known example of this. So let’s get started building a spam filter on a publicly available mail corpus. I have extracted an equal number of spam and non-spam emails from the Ling-spam corpus. The extracted subset on which we will be working can be downloaded from here.

We will walk through the following steps to build this application :

1. Preparing the text data.
2. Creating word dictionary.
3. Feature extraction process
4. Training the classifier

Finally, we will check the results on the test set of the extracted subset.

1. Preparing the text data.

The data set used here is split into a training set of 702 mails and a test set of 260 mails, each divided equally between spam and ham. You can easily recognize the spam mails because their filenames contain *spmsg*.

In any text mining problem, text cleaning is the first step: we remove from the document those words which may not contribute to the information we want to extract. Emails may contain a lot of undesirable tokens such as punctuation marks, stop words and digits, which may not be helpful in detecting spam. The emails in the Ling-spam corpus have already been preprocessed in the following ways:
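Because spam files carry spmsg in their names, the labels can also be derived from the filenames rather than from the position of each file in the folder. A minimal sketch, assuming that naming convention holds for every file; the helper name labels_from_filenames and the sample filenames are hypothetical:

```python
def labels_from_filenames(filenames):
    """Return 1 for spam (name contains 'spmsg') and 0 for ham."""
    return [1 if "spmsg" in name else 0 for name in filenames]


# Hypothetical filenames, for illustration only
sample_files = ["3-1msg1.txt", "spmsga1.txt", "5-1298msg2.txt", "spmsgb17.txt"]
labels = labels_from_filenames(sample_files)
print(labels)  # [0, 1, 0, 1]
```

Labels built this way stay correct even if the files are listed in a different order.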

a) Removal of stop words – Stop words like “and”, “the”, “of”, etc are very common in all English sentences and are not very meaningful in deciding spam or legitimate status, so these words have been removed from the emails.

b) Lemmatization – It is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. For example, “include”, “includes,” and “included” would all be represented as “include”. Unlike stemming (another common text mining technique, which simply strips word endings without regard to meaning), lemmatization takes the word’s context and part of speech into account.

We still need to remove non-words such as punctuation marks and special characters from the mail documents. There are several ways to do this. Here, we will remove them after creating the dictionary, which is convenient because each such token then needs to be removed only once. So cheers! For now you don’t need to do anything.
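To make the two preprocessing steps concrete, here is a toy sketch in plain Python. The stop-word set and the lemma lookup table are tiny illustrative assumptions; this is not how the Ling-spam corpus was actually processed, and a real pipeline would use a full lemmatizer.

```python
# Toy stop-word list and lemma table; both are illustrative assumptions,
# far smaller than what a real preprocessing pipeline would use.
STOP_WORDS = {"and", "the", "of", "a", "is", "in"}
LEMMAS = {"includes": "include", "included": "include", "prices": "price"}


def preprocess(text):
    """Lowercase, drop stop words, and map inflected forms to a base form."""
    tokens = text.lower().split()
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in kept]


print(preprocess("The offer includes a list of prices"))
# ['offer', 'include', 'list', 'price']
```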

2. Creating word dictionary.

A sample email in the data-set looks like this:

Subject: posting

hi , ' m work phonetics project modern irish ' m hard source . anyone recommend book article english ? ' , specifically interest palatal ( slender ) consonant , work helpful too . thank ! laurel sutton ( sutton @ garnet . berkeley . edu

It can be seen that the first line of the mail is the subject and the third line contains the body of the email. We will perform text analytics only on the body to detect spam. As a first step, we need to create a dictionary of words and their frequencies. For this task, the training set of 702 mails is used. The following Python function creates the dictionary.

import os
from collections import Counter

def make_Dictionary(train_dir):
    emails = [os.path.join(train_dir, f) for f in os.listdir(train_dir)]
    all_words = []
    for mail in emails:
        with open(mail) as m:
            for i, line in enumerate(m):
                if i == 2:  # Body of email is only the 3rd line of the text file
                    all_words += line.split()

    dictionary = Counter(all_words)  # maps each word to its number of occurrences
    # Paste code for non-word removal here (code snippet is given below)
    return dictionary

Once the dictionary is created, we can add the few lines of code below to the above function to remove the non-words discussed in step 1. I have also removed single-character entries, which are irrelevant here. Do not forget to insert this code into the function make_Dictionary(train_dir).

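list_to_remove = list(dictionary.keys())  # copy the keys so entries can be deleted while looping
for item in list_to_remove:
    if not item.isalpha():   # drop punctuation, digits and mixed tokens
        del dictionary[item]
    elif len(item) == 1:     # drop single characters
        del dictionary[item]
dictionary = dictionary.most_common(3000)  # list of (word, count) pairs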

The dictionary can be inspected with print(dictionary). Some word counts may look absurdly high, but don’t worry: it’s just a dictionary, and you can always improve it later. If you are following this post with the provided data set, make sure your dictionary has entries like those below among its most frequent words. Here I have kept the 3000 most frequent words in the dictionary.

[('order', 1414), ('address', 1293), ('report', 1216), ('mail', 1127), ('send', 1079), ('language', 1072), ('email', 1051), ('program', 1001), ('our', 987), ('list', 935), ('one', 917), ('name', 878), ('receive', 826), ('money', 788), ('free', 762), ...]
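The dictionary-building steps of this section can be tried on a few in-memory strings before running them on the corpus. This is a minimal sketch, not the function used in the post: build_dictionary and the sample bodies are hypothetical, and it takes strings rather than reading the third line of each file.

```python
from collections import Counter


def build_dictionary(bodies, size=3000):
    """Count words over all email bodies and keep the most frequent clean ones."""
    counts = Counter()
    for body in bodies:
        counts.update(body.split())
    # Iterate over a copy of the keys so entries can be deleted safely
    for word in list(counts.keys()):
        if not word.isalpha() or len(word) == 1:
            del counts[word]
    return counts.most_common(size)


# Made-up email bodies, for illustration only
bodies = [
    "free money order now !",
    "order free report 123",
    "language conference report a",
]
print(build_dictionary(bodies, size=3))
```

The punctuation mark, the number and the single letter are dropped, and the three words that occur twice ("free", "order", "report") are the ones kept.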

3. Feature extraction process.

Once the dictionary is ready, we can extract a word count vector (our feature here) of 3000 dimensions for each email in the training set. Each word count vector contains the frequencies of the 3000 dictionary words in that training file. As you might have guessed, most of them will be zero. Let us take a smaller example. Suppose we have 500 words in our dictionary, so each word count vector has 500 entries. If the text in a training file was “Get the work done, work done”, it will be encoded as [0,0,0,0,0,…,0,0,2,0,0,0,…,0,0,1,0,0,…,0,0,1,0,0,…,2,0,0,0,0,0]. Here, the non-zero word counts sit at the 296th, 359th, 415th and 495th indices of the 500-length vector, and the rest are zero.

The Python code below generates a feature matrix whose rows correspond to the 702 files of the training set and whose columns correspond to the 3000 words of the dictionary. The value at index (i, j) is the number of occurrences of the j-th dictionary word in the i-th file.
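The encoding can be reproduced on a small scale. The sketch below assumes a toy six-word dictionary in place of the 500-word one, and count_vector is a hypothetical helper name:

```python
def count_vector(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    words = text.lower().replace(",", " ").split()
    return [words.count(term) for term in vocabulary]


# A toy 6-word dictionary (an assumption for illustration)
vocabulary = ["done", "free", "get", "money", "the", "work"]
print(count_vector("Get the work done, work done", vocabulary))
# [2, 0, 1, 0, 1, 2]
```

Words that are in the dictionary but not in the text ("free", "money") get a zero, which is why real vectors over 3000 words are mostly zeros.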

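def extract_features(mail_dir):
    # sorted() keeps the row order deterministic so rows line up with the label
    # arrays (this assumes ham filenames sort before the spmsg* spam filenames)
    files = [os.path.join(mail_dir, fi) for fi in sorted(os.listdir(mail_dir))]
    features_matrix = np.zeros((len(files), 3000))
    # Map each dictionary word to its column index; dictionary is the global
    # list of (word, count) pairs returned by make_Dictionary
    word_index = {d[0]: j for j, d in enumerate(dictionary)}
    for docID, fil in enumerate(files):
        with open(fil) as fi:
            for i, line in enumerate(fi):
                if i == 2:  # Body of email is only the 3rd line of the text file
                    words = line.split()
                    for word in set(words):
                        if word in word_index:
                            features_matrix[docID, word_index[word]] = words.count(word)
    return features_matrix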

4. Training the classifiers.

Here, I will be using the scikit-learn ML library for training classifiers. It is an open-source Python ML library that comes bundled with the third-party Anaconda distribution or can be installed separately by following this. Once installed, we only need to import it in our program.

I have trained two models here, namely a Naive Bayes classifier and a Support Vector Machine (SVM). The Naive Bayes classifier is a conventional and very popular method for document classification. It is a supervised probabilistic classifier based on Bayes’ theorem that assumes the features are independent of one another given the class. SVMs are supervised binary classifiers that are very effective when you have a large number of features. The goal of an SVM is to find the hyperplane that separates the two classes with the largest margin; the training points closest to that boundary are called the support vectors. The decision function that predicts the class of the test data depends only on these support vectors and, for non-linear SVMs, makes use of the kernel trick.

Once the classifiers are trained, we can check the performance of the models on the test set. We extract a word count vector for each mail in the test set and predict its class (ham or spam) with the trained NB classifier and SVM model. Below is the full code for the spam filtering application. You have to include the two functions defined earlier in step 2 and step 3.
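To show what the Naive Bayes model computes, here is a from-scratch sketch of the multinomial variant on word counts with Laplace smoothing. It illustrates the idea only and is not scikit-learn's implementation; train_nb, predict_nb and the toy data are assumptions made for the example.

```python
import math


def train_nb(X, y, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log word probabilities per class."""
    n_features = len(X[0])
    model = {}
    for label in sorted(set(y)):
        rows = [row for row, target in zip(X, y) if target == label]
        totals = [sum(col) for col in zip(*rows)]  # word counts summed per class
        denom = sum(totals) + alpha * n_features
        model[label] = {
            "log_prior": math.log(len(rows) / len(X)),
            "log_prob": [math.log((t + alpha) / denom) for t in totals],
        }
    return model


def predict_nb(model, x):
    """Pick the class with the highest log posterior for one count vector."""
    scores = {
        label: p["log_prior"] + sum(c * lp for c, lp in zip(x, p["log_prob"]))
        for label, p in model.items()
    }
    return max(scores, key=scores.get)


# Columns: counts of "free", "money", "language" (toy data, 0 = ham, 1 = spam)
X = [[0, 0, 3], [0, 1, 2], [2, 1, 0], [3, 2, 0]]
y = [0, 0, 1, 1]
model = train_nb(X, y)
print(predict_nb(model, [2, 1, 0]))  # 1 (spam)
print(predict_nb(model, [0, 0, 2]))  # 0 (ham)
```

Each word contributes its count times the log probability of that word in the class, which is where the independence assumption enters: the words are scored separately and simply added up.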

import os
import numpy as np
from collections import Counter
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.svm import SVC, NuSVC, LinearSVC
from sklearn.metrics import confusion_matrix 
# Create a dictionary of words with its frequency

train_dir = 'train-mails'
dictionary = make_Dictionary(train_dir)

# Prepare feature vectors per training mail and its labels

train_labels = np.zeros(702)
train_labels[351:] = 1  # second half (files 351 to 701) is spam
train_matrix = extract_features(train_dir)

# Training SVM and Naive bayes classifier

model1 = MultinomialNB()
model2 = LinearSVC()
model1.fit(train_matrix,train_labels)
model2.fit(train_matrix,train_labels)

# Test the unseen mails for Spam
test_dir = 'test-mails'
test_matrix = extract_features(test_dir)
test_labels = np.zeros(260)
test_labels[130:260] = 1
result1 = model1.predict(test_matrix)
result2 = model2.predict(test_matrix)
print(confusion_matrix(test_labels, result1))
print(confusion_matrix(test_labels, result2))
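Before running the full program on the corpus, the fit and predict calls can be checked on a tiny hand-made count matrix. This is a sketch on toy data, assuming scikit-learn and NumPy are installed; the four column meanings and all numbers are made up for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Columns: counts of "free", "money", "language", "linguistics" (toy data)
train_matrix = np.array([
    [0, 0, 3, 2],
    [0, 1, 2, 1],
    [1, 0, 4, 3],
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [4, 1, 1, 0],
])
train_labels = np.array([0, 0, 0, 1, 1, 1])  # 0 = ham, 1 = spam

nb = MultinomialNB().fit(train_matrix, train_labels)
svm = LinearSVC(random_state=0).fit(train_matrix, train_labels)

# First row looks like spam, second row looks like ham
test_matrix = np.array([[3, 1, 0, 0], [0, 0, 2, 2]])
print(nb.predict(test_matrix))
print(svm.predict(test_matrix))
```

Both classifiers expose the same fit/predict interface, which is why the full program can swap one for the other without touching the feature extraction.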

Checking Performance

The test set contains 130 spam emails and 130 non-spam emails. If you have come this far, you should find the results below. I have shown the confusion matrix of the test set for both models. Rows are the actual classes and columns are the predicted classes. The diagonal elements represent the correctly identified mails (true identifications), whereas the off-diagonal elements represent wrong classifications (false identifications).

Multinomial NB	Ham	Spam
Ham	129	1
Spam	9	121

SVM(Linear)	Ham	Spam
Ham	126	4
Spam	6	124

Both models had similar performance on the test set, except that the SVM’s false identifications are slightly more balanced. I must remind you that the test data was used neither in creating the dictionary nor in training.
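The two confusion matrices can be turned into summary numbers with a few lines of arithmetic. The sketch below uses the counts from the tables above; summarize is a hypothetical helper, and it assumes the [ham, spam] row and column order that confusion_matrix produces for labels 0 and 1.

```python
def summarize(matrix):
    """Compute accuracy plus spam precision and recall from a 2x2 confusion matrix.

    Rows are actual classes and columns are predicted classes, in the order
    [ham, spam].
    """
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "spam_precision": tp / (tp + fp),
        "spam_recall": tp / (tp + fn),
    }


nb_matrix = [[129, 1], [9, 121]]   # Multinomial NB table above
svm_matrix = [[126, 4], [6, 124]]  # Linear SVM table above

for name, matrix in [("NB", nb_matrix), ("SVM", svm_matrix)]:
    stats = summarize(matrix)
    print(name, {key: round(value, 4) for key, value in stats.items()})
```

Both models get 250 of the 260 test mails right (about 96.2% accuracy). Naive Bayes has the higher spam precision (121 of 122 mails flagged as spam really are spam), while the SVM has the higher spam recall (124 of 130 spam mails caught versus 121).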

Task for you

Download the pre-processed form of the Enron-spam corpus. The corpus contains 33716 emails in 6 directories. Each of the 6 directories contains ‘ham’ and ‘spam’ folders. The total numbers of non-spam and spam emails are 16545 and 17171 respectively.

Follow the same steps described in this blog post and check how they perform with the Support Vector Machine and Multinomial Naive Bayes models. As the directory structure of this corpus is different from that of the Ling-spam subset used in the blog post, you may have to either reorganize it or modify the make_Dictionary(dir) and extract_features(dir) functions.

I divided the Enron-spam corpus into a training set and a test set in a 60:40 split. After performing the same steps as in this post, I got the following results on the 13487 test set emails. We can see that the SVM performed slightly better than the Naive Bayes classifier in detecting spam emails correctly.
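One way to cope with the different directory layout is to walk the corpus once and collect every file together with a label taken from its parent folder. This is a sketch under the assumption that every email sits directly inside a folder named ham or spam; collect_labeled_files and the folder name in the usage comment are hypothetical.

```python
import os


def collect_labeled_files(root_dir):
    """Return (path, label) pairs: label 1 for files under a 'spam' folder,
    0 for files under a 'ham' folder. Other folders are skipped."""
    labeled = []
    for folder, _subdirs, filenames in os.walk(root_dir):
        kind = os.path.basename(folder)
        if kind not in ("ham", "spam"):
            continue
        for name in sorted(filenames):
            label = 1 if kind == "spam" else 0
            labeled.append((os.path.join(folder, name), label))
    return labeled


# Usage (hypothetical folder name for the unpacked corpus):
# labeled = collect_labeled_files("enron-corpus")
```

The resulting list of paths and labels can then drive both the dictionary creation and the feature extraction, instead of relying on os.listdir over a single flat folder.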

Multinomial NB	Ham	Spam
Ham	6445	225
Spam	137	6680

SVM(Linear)	Ham	Spam
Ham	6490	180
Spam	109	6708
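A 60:40 split like the one described above can be sketched as follows. The shuffling, the fixed seed and the helper name split_dataset are assumptions made for the example; the post does not say how the split was drawn.

```python
import random


def split_dataset(items, train_fraction=0.6, seed=42):
    """Shuffle a copy of items and split it into training and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for (path, label) pairs; the real corpus has 33716 emails
items = [("mail_%d.txt" % i, i % 2) for i in range(10)]
train, test = split_dataset(items)
print(len(train), len(test))  # 6 4
```

With 33716 items this cut puts 20229 emails in the training set and leaves 13487 for testing, which matches the test-set size quoted above.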

Final Thoughts

I hope the tutorial was easy to follow, as I have tried to keep it short and simple. Beginners who are interested in text analytics can start with this application.

You might be wondering about the mathematical techniques behind the models used here, Naive Bayes and SVM. SVM is a mathematically complex model, whereas Naive Bayes is relatively easy to understand. You are encouraged to study these models from online sources. Apart from that, there are many experiments you can run to find the effect of various parameters, such as

a) Amount of training data
b) Dictionary size
c) Variants of the ML techniques used (GaussianNB, BernoulliNB, SVC)
d) Fine tuning of parameters of SVM models
e) Improving the dictionary by eliminating insignificant words (maybe manually)
f) Some other feature (look for tf-idf)
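As a starting point for item f), here is a sketch of tf-idf weighting in plain Python. It uses one common textbook variant, term frequency times log(N / document frequency); libraries such as scikit-learn's TfidfVectorizer use a smoothed formula by default, so their numbers will differ. The function name tf_idf and the sample documents are made up for illustration.

```python
import math


def tf_idf(documents):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    n_docs = len(documents)
    tokenized = [doc.split() for doc in documents]
    # Document frequency: in how many documents each word appears
    doc_freq = {}
    for tokens in tokenized:
        for word in set(tokens):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    weights = []
    for tokens in tokenized:
        row = {}
        for word in set(tokens):
            tf = tokens.count(word) / len(tokens)
            row[word] = tf * math.log(n_docs / doc_freq[word])
        weights.append(row)
    return weights


docs = ["free money free order", "language order report", "language conference order"]
weights = tf_idf(docs)
print(weights[0]["order"])  # 0.0, because "order" occurs in every document
print(weights[0]["free"] > weights[0]["money"])  # True
```

Compared with raw counts, this weighting pushes down words that appear in almost every mail and keeps the weight of words that are frequent in only a few.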

I will write up the mathematical explanation of these models in another blog post some other time.

You can get the full Python implementation for both corpora from the GitHub link here.

If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it can reach out to the readers who can actually gain from this. Please feel free to discuss anything regarding the post. I would love to hear feedback from you.

Happy machine learning 🙂

151 thoughts on “Email Spam Filtering : A python implementation with scikit-learn”

praveen kumar karn says:

September 24, 2018 at 5:21 pm

IF you can provide me code in lingspam.py .In that sense if i enter any sentence then it must identify that as spam or non-spam.
For example if i enter sentence like”hey you have won free tickets for football match so contact me 9875789067″ then it display as spam or ham………so plz provide me code for this criteria in linspam.py

1. Tristan El Jed says:
  
  November 22, 2018 at 8:26 am
  
  I made a UI here using Flask:
  https://github.com/tristaneljed/Spamector
  My work is based on this tutorial but I added more classifiers and examples. Hope you’ll enjoy it. I have a video demo if you want the link.
  
  1. Maribel says:
    
    May 5, 2019 at 7:08 pm
    
    Hello, I would like to see the link to your video; if you could send it to me I would be very grateful.
    
Bishon says:

September 25, 2018 at 2:44 pm

What are the keywords that make a email Spam? What is the logic behind this algorithm.Can you please explain in brief.I didn’t get the logic.

Randheer Reddy says:

November 9, 2018 at 5:22 am

Hi Abhijeeth,

I liked your content very much and this is my first project in python and Im finding difficulty in debugging this error .Please help in fixing this .

---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
in ()
10 if d[0] == word:
11 wordID = i
---> 12 features_matrix[docID,wordID]=words.count(word)
13 docID = docID + 1

IndexError: index 700 is out of bounds for axis 0 with size 700

Prakhar Shreshtha says:

November 21, 2018 at 3:12 pm

Thanx for ur post.Can u pls tell how do we predict for a random email,means which is not in this dataset,if it is spam or not?

1. Abhijeet Kumar says:
  
  November 22, 2018 at 7:26 am
  
  Hi Prakhar,
  
  For any given email, it is easy to predict whether it is spam or not.
  1. Read the text of the email.
  2. Extract the word count vector from the text (as done in the program here, using the dictionary built from the training data).
  3. Predict the class (spam or non-spam) with the already trained model.
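The vectorizing step for a single new email can be sketched like this. email_to_vector is a hypothetical helper, and the three-word dictionary stands in for the 3000-word list built from the training mails:

```python
def email_to_vector(text, dictionary):
    """Turn one email body into a count vector aligned with the dictionary.

    dictionary is a list of (word, count) pairs as returned by
    Counter.most_common, so position j corresponds to the j-th word.
    """
    words = text.split()
    return [words.count(entry[0]) for entry in dictionary]


# Toy dictionary standing in for the 3000-word one built from training mail
dictionary = [("free", 762), ("money", 788), ("language", 1072)]
vector = email_to_vector("win free money free tickets", dictionary)
print(vector)  # [2, 1, 0]
# With a fitted model from the post: model1.predict([vector])
```

The vector must be built with the same dictionary, in the same order, that was used to train the model; otherwise the columns no longer mean the same words.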
  
Hassan says:

November 22, 2018 at 5:05 pm

sir can you please explain what data is in the “features_matrix” and “train_labels”

1. Abhijeet Kumar says:
  
  November 26, 2018 at 1:44 am
  
  Hi Hassan,
  
  train_labels – There are 702 emails. ‘train_labels’ labels each one 0 if it is ham and 1 if it is spam. The first half is labelled 0 and the other half is labelled 1. Labels are needed to apply a supervised classification model.
  
  features_matrix – It is a matrix whose rows are the email files and whose columns are the words in the dictionary. Each row holds the counts of the dictionary words occurring in that email file. The dimension of features_matrix is (number of emails) × (dictionary size).
  
  Thanks.
  
aditi says:

November 29, 2018 at 11:45 am

hiii… I am not able to run this code at all. I don’t know what’s the problem. but can you tell me all the necessary changed I should do for python 3?

1. Abhijeet Kumar says:
  
  December 5, 2018 at 10:44 am
  
  I hope you have found out by now. You can check the comments above; similar issues have been raised by others in the comment section.
  
praveen kumar karn says:

December 22, 2018 at 1:20 pm

How can we classify mail into spam,non-spam,social,promotion??plz give me idea about that

1. Abhijeet Kumar says:
  
  December 25, 2018 at 11:26 am
  
  It can be done in similar way using multi-class classification.
  Instead of 2 classes (0,1), you can label 4 classes (0,1,2,3).
  
  1. praveen kumar karn says:
    
    January 4, 2019 at 1:08 pm
    
    Many Many thank u to ur response & help from yours ..i have completed this project of master degree (3rd sem). But now i want to upgrade this topics as my master degree on communication& knowledge engineering thesis on ” Comparative & performance Analysis of Spam mail identification using LSTM & SVM” which main objective is:
    -To Analyze the e-mail and classify it into spam ,non-spam,Social,promotion using SVM and LSTM.
    -To compare the result using different dataset & choose best method.
    So i want ur help if you can provide me different dataset above 10,000 & code on python platform.
    
    1. marcadoris says:
      
      May 18, 2019 at 3:19 pm
      
      Hello, excuse me for bothering you, maybe you can help me with your code to be able to guide me and be able to complete my project, I thank you for being attentive
      
  2. praveen kumar karn says:
    
    January 4, 2019 at 1:10 pm
    
    Many Many thank u for ur response & help .Without ur help it couldn’t be possible. i have completed this project of master degree (3rd sem). But now i want to upgrade this topics as my master degree on communication& knowledge engineering thesis on ” Comparative & performance Analysis of Spam mail identification using LSTM & SVM” which main objective is:
    -To Analyze the e-mail and classify it into spam ,non-spam,Social,promotion using SVM and LSTM.
    -To compare the result using different dataset & choose best method.
    So i want ur help if you can provide me different dataset above 10,000 & code on python platform.
    I am waiting for your response very soon…plz plz help me
    
    1. Abhijeet Kumar says:
      
      January 5, 2019 at 6:48 am
      
      Glad to know that you completed your project.
      
      Regarding dataset have a look at https://www.quora.com/What-are-some-good-email-based-data-sets-for-testing-spam-classification-algorithms
      You may want to combine data sets to make a larger one.
      The Enron-spam dataset alone can be good enough for an LSTM model, though an LSTM may show better results when trained on a larger dataset.
      
      For LSTM implementations of text classification, you can check Kaggle or internet blogs.
      On my blog, https://appliedmachinelearning.blog/2017/12/21/predict-the-happiness-on-tripadvisor-reviews-using-dense-neural-network-with-keras-hackerearth-challenge/ can help you get started.
      
  3. praveen kumar karn says:
    
    March 22, 2019 at 2:21 pm
    
    IS it possible with this dataset of enron & lingspam ? plz..plz reply me soon
    
Yashaswi Kandimalla (@YashaswiKandim1) says:

February 21, 2019 at 4:42 am

I amnot able to access the ling-spam corpus files.I am getting permission denied in error while executing program in python.can you help me in solving this issue?

1. Abhijeet Kumar says:
  
  February 21, 2019 at 12:52 pm
  
  Can you explain more?
  Which OS are you using?
  Are you executing from the command prompt? Make sure you have the proper rights, or that you have opened cmd with administrator rights.
  
praveen kumar karn says:

February 25, 2019 at 1:03 pm

I am getting problem like below in enron-spamfilter file…..how to solve it.plz help me

File "C:\Users\HP\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
execfile(filename, namespace)

File "C:\Users\HP\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)

File “D:/project/Mail-Spam-Filtering-master/enron-spamfilter.py”, line 72, in
dictionary = make_Dictionary(dir)

File “D:/project/Mail-Spam-Filtering-master/enron-spamfilter.py”, line 32, in make_Dictionary
for item in list_to_remove:

RuntimeError: dictionary changed size during iteration

1. PC says:
  
  April 25, 2019 at 11:37 am
  
  facing same issue
  
2. Ahriev says:
  
  April 28, 2019 at 5:01 pm
  
  It’s because he is using Python 2.x and we are using Python 3.x.
  The problem comes from the use of “keys” in list_to_remove = dictionary.keys()… As for how to fix it, right now I don’t know either 😦
  
  1. Abhijeet Kumar says:
    
    April 28, 2019 at 5:09 pm
    
    Hi readers,
    This is happening because it’s an old post from when I used to code in Python 2. Just convert it to Python 3.
    For the above problem just try this.
    
    list_to_remove = list(dictionary.keys())
    
    Maybe I will migrate the whole code to Python 3 soon.
    Thanks.
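The error and the fix can be reproduced in isolation under Python 3. This is a small sketch on a made-up word list, not the code from the repository:

```python
from collections import Counter

counts = Counter("free 123 money ! a order".split())

# In Python 3, deleting while looping over counts.keys() directly raises
# "RuntimeError: dictionary changed size during iteration".
try:
    for word in counts.keys():
        if not word.isalpha():
            del counts[word]
except RuntimeError as error:
    print("failed:", error)

# Looping over a list copy of the keys avoids the error
for word in list(counts.keys()):
    if not word.isalpha() or len(word) == 1:
        del counts[word]
print(sorted(counts))  # ['free', 'money', 'order']
```

In Python 2, keys() returned a separate list, so the original loop worked; in Python 3 it returns a live view of the dictionary, which is why the explicit list(...) copy is needed.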
    
    1. Ahriev says:
      
      April 28, 2019 at 7:22 pm
      
      Yep… it works well now… thank you… for Python 3 just use () in print and the program runs fine
      
    2. Ahriev says:
      
      April 28, 2019 at 7:56 pm
      
      And… open(mail) -> open(mail, encoding="Latin-1") (for the Enron corpus)
      But I am still running the program because the Enron dataset is quite big… I will update again soon if any other changes are needed. Thanks…
      
praveen kumar karn says:

March 19, 2019 at 3:42 pm

can u provide me detail about enron.py as lingsapm.py…..plz..reply soon if possible provide me also

Ahriev says:

April 30, 2019 at 3:45 am

Hi, I just realized that for the Enron dataset, the stop word removal and lemmatization of step 1 (Preparing the text data) do not seem to have been applied. I checked the files in the dataset and all the stop words are still there, and the first word in the dictionary is “the”. So it’s quite different from the Ling-spam corpus used in this article…

Nilutpol Kashyap says:

June 21, 2019 at 7:03 pm

Traceback (most recent call last):

File “”, line 8, in
train_matrix = extract_features(train_dir)

File “”, line 15, in extract_features
features_matrix[docID,wordID] = words.count(word)

IndexError: index 15049 is out of bounds for axis 1 with size 15000

why am i getting the above error while training the classifiers?
please help me with that.
thank you!!!!!!!!!!!!!
