Building a Naive Bayes Classifier from Scratch
- harishpabolu777
- Apr 15, 2023
- 4 min read
Updated: Apr 16, 2023

Source: https://insightimi.wordpress.com/2020/04/04/naive-bayes-classifier-from-scratch-with-hands-on-examples-in-r/
CODE:
Naive Bayes:
The Naive Bayes algorithm is based on Bayes' theorem and is used for classification tasks such as spam filtering, sentiment analysis, and recommendation systems.
In this project, we will build a Naive Bayes classifier from scratch, without using a ready-made Naive Bayes library.
Naive Bayes is an algorithm used especially for text classification, image classification, and similar tasks. For text, it is one of the most effective simple algorithms.
For detailed information, see the source linked above. Let's get a basic understanding of Naive Bayes and its implementation.
Formula used in Naive Bayes:

P(A | B) = P(B | A) * P(A) / P(B)
Using Bayes' theorem, we can find the probability that event A happens given that condition B holds.
B is the evidence and A is the hypothesis.
Based on these probabilities, we can find the class to which a text belongs.
There is one key assumption that needs to be considered:
1. The features used by Naive Bayes are assumed to be independent of each other given the class.
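To make the formula concrete, here is a toy worked example in Python with made-up numbers (the priors and likelihoods below are invented purely for illustration, not taken from our dataset):

```python
# Toy Bayes' theorem example:
# P(fresh | "great") = P("great" | fresh) * P(fresh) / P("great")
p_fresh = 0.6                      # prior: assumed fraction of fresh reviews
p_rotten = 1 - p_fresh             # prior: fraction of rotten reviews
p_great_given_fresh = 0.2          # assumed likelihood of "great" in fresh reviews
p_great_given_rotten = 0.05        # assumed likelihood of "great" in rotten reviews

# evidence: total probability of seeing the word "great" in any review
p_great = p_great_given_fresh * p_fresh + p_great_given_rotten * p_rotten

p_fresh_given_great = p_great_given_fresh * p_fresh / p_great
print(round(p_fresh_given_great, 3))  # 0.857
```

So under these made-up numbers, seeing the word "great" pushes the probability that a review is fresh from 0.6 up to about 0.86.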
Advantages:
Naive Bayes requires a very small amount of training data.
Predictions are quick compared to other classifiers.
It is very easy to implement.
It works on discrete data and also on continuous data.
Disadvantages:
If your test dataset has a categorical feature value that wasn't present in the training dataset, the Naive Bayes model will assign it zero probability and won't be able to make a prediction for it (the "zero-frequency" problem).
It assumes that all the features are independent. While that might sound great in theory, in real life you'll hardly find a set of truly independent features.
About The Data:
For this project, we are using a dataset from Kaggle.
The data contains movie reviews from the Rotten Tomatoes website. Based on the review text, our aim is to classify each review as fresh or rotten.
The data is split into three parts: training data, dev data, and testing data.
Training data :
For this, we take reviews 1-80000 as training data for training the classifier.

Testing Data:
For the testing data, our aim is to find the category to which each review belongs, fresh or rotten, based on the text of the review.
Dev Data:
Here we hold out 20% of the given data for validation purposes. This data is called the dev data, and it is used to measure the accuracy of our classifier. A sketch of the split follows below.
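The original code cells were not preserved in this post, so here is a minimal sketch of the split, assuming the Kaggle reviews live in a CSV file; the file name is hypothetical and should be adjusted to the actual download:

```python
import pandas as pd

# Hypothetical file name; adjust to the actual Kaggle CSV.
df = pd.read_csv("rotten_tomatoes_reviews.csv")

train = df.iloc[:80000]        # reviews 1-80000 train the classifier
dev_size = int(0.2 * len(df))  # 20% held out as the dev (validation) set
dev = df.iloc[80000:80000 + dev_size]
test = df.iloc[80000 + dev_size:]
print(len(train), len(dev), len(test))
```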
Building Vocabulary:
Here we use NLTK for tokenizing, i.e., splitting each review into words.

We omit the rare words whose occurrence count is less than 5, remove the stop words, shape the data as required for our classifier, and print the reverse indexes.
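A minimal sketch of this step, assuming `train_texts` is the list of training review strings from the split above (the name is mine, not from the original code):

```python
import nltk
from collections import Counter
from nltk.corpus import stopwords

nltk.download("punkt")      # tokenizer models
nltk.download("stopwords")  # stop-word list

stop_words = set(stopwords.words("english"))

def tokenize(text):
    """Lowercase, tokenize, and drop stop words / non-alphabetic tokens."""
    return [w for w in nltk.word_tokenize(text.lower())
            if w.isalpha() and w not in stop_words]

# train_texts: assumed list of raw training review strings
counts = Counter(w for text in train_texts for w in tokenize(text))

# keep only words that occur at least 5 times
vocab = {w for w, c in counts.items() if c >= 5}

# reverse index: word -> integer id
word_to_index = {w: i for i, w in enumerate(sorted(vocab))}
```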

Calculating the Probability of Occurrence of the Word "the":
Next, we calculate the probability of the word "the" in the training data. The same approach can be used for finding the probability of any word. Here the probability of the word "the" comes out to about 63.6%.
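One plausible reading of that 63.6% figure is the document frequency, i.e., the fraction of training reviews that contain the word. A sketch under that assumption, reusing `train_texts` and NLTK from above:

```python
def word_probability(word, texts):
    """Fraction of reviews that contain the given word (document frequency)."""
    hits = sum(1 for text in texts
               if word in nltk.word_tokenize(text.lower()))
    return hits / len(texts)

p_the = word_probability("the", train_texts)
print(f"P('the') = {p_the:.3f}")  # the post reports roughly 0.636
```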

Calculating the Probability of the Word "the" in Positive Statements:
Next, we find a conditional probability: the probability of the word "the" given that the review is positive (fresh). This P("the" | positive) estimate is exactly the kind of likelihood our Naive Bayes classifier is built on.
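A sketch of the conditional version, assuming `train_labels` holds a "fresh"/"rotten" label for each training review (again, the names are mine):

```python
def conditional_word_probability(word, texts, labels, target_label):
    """P(word | class): fraction of reviews of that class containing the word."""
    class_texts = [t for t, y in zip(texts, labels) if y == target_label]
    hits = sum(1 for text in class_texts
               if word in nltk.word_tokenize(text.lower()))
    return hits / len(class_texts)

p_the_given_fresh = conditional_word_probability(
    "the", train_texts, train_labels, "fresh")
print(f"P('the' | fresh) = {p_the_given_fresh:.3f}")
```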

Calculating the Accuracy on the Dev Dataset:
For our project we divided the data into three parts, one of which is the dev data, also called validation data. As the name itself states, this data is used for validating our classifier, so we use the dev data to calculate the accuracy of our classifier.
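A small helper for this evaluation; `dev_texts`/`dev_labels` are the assumed held-out validation arrays, and `classifier` stands in for whatever prediction function the later sections build:

```python
def accuracy(classifier, texts, labels):
    """Fraction of reviews the classifier labels correctly."""
    return sum(classifier(t) == y for t, y in zip(texts, labels)) / len(labels)

# usage once a predict function exists (see the final section):
# print(accuracy(predict, dev_texts, dev_labels))
```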

Laplace Smoothing:
Some words are present in the testing set but not in the training set. To handle such missing words, we use Laplace smoothing when computing the conditional probabilities: P(w | c) = (count(w, c) + α) / (total words in c + α·|V|), where α is the smoothing value and |V| is the vocabulary size.

We counted the number of positive and negative reviews in the training set and the total words per class; we did not touch the testing data, because only the training data may be used for fitting. We created a dictionary mapping the words to indexes, and with the Laplace formula we calculated the probability of each word given each class, positive or negative. We tried smoothing values of 0.5, 1, and 10 on the dev dataset and got the best results with a smoothing value of 1.
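A minimal sketch of this training step, continuing from the `tokenize` function and `train_texts`/`train_labels` assumed above; it works in log space to avoid numeric underflow:

```python
import math
from collections import Counter

def train_naive_bayes(texts, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    classes = set(labels)
    priors, word_counts, totals = {}, {}, {}
    for c in classes:
        class_texts = [t for t, y in zip(texts, labels) if y == c]
        priors[c] = math.log(len(class_texts) / len(texts))
        counts = Counter(w for t in class_texts for w in tokenize(t))
        word_counts[c] = counts
        totals[c] = sum(counts.values())
    vocab_size = len({w for c in classes for w in word_counts[c]})

    def log_likelihood(word, c):
        # Laplace smoothing: an unseen word still gets a small probability
        return math.log((word_counts[c][word] + alpha) /
                        (totals[c] + alpha * vocab_size))

    return priors, log_likelihood
```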

Finding the Top 10 Words That Predict the Class of a Review:
We have to derive the top 10 words that predict each class, ranked by P(class | word).
To derive them, we use the Naive Bayes classifier that we constructed and trained on the given dataset.
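One way to rank words by P(class | word) is to apply Bayes' rule over the two classes, reusing the `priors`, `log_likelihood`, and `vocab` objects sketched above:

```python
def top_words_for_class(c, priors, log_likelihood, vocab, k=10):
    """Rank words by P(class | word) via Bayes' rule over all classes."""
    scores = {}
    for w in vocab:
        # log P(class) + log P(word | class) for each class, then normalize
        joint = {cls: priors[cls] + log_likelihood(w, cls) for cls in priors}
        total = math.log(sum(math.exp(v) for v in joint.values()))
        scores[w] = joint[c] - total   # log P(class | word)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# print(top_words_for_class("fresh", priors, log_likelihood, vocab))
```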

Accuracy of the Classifier:
To find the optimal parameters, we performed smoothing with different values and found that a smoothing hyperparameter of 1 gives the best results, so we use a smoothing value of 1. The final accuracy of the Naive Bayes classifier is 52.61%.
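Tying the sketches together, a prediction function picks the class with the highest log posterior; `test_texts`/`test_labels` are the assumed held-out test arrays, and `accuracy` is the helper defined earlier:

```python
priors, log_likelihood = train_naive_bayes(train_texts, train_labels, alpha=1.0)

def predict(text):
    """Pick the class with the highest log posterior for the review text."""
    scores = {c: priors[c] + sum(log_likelihood(w, c) for w in tokenize(text))
              for c in priors}
    return max(scores, key=scores.get)

print(f"accuracy = {accuracy(predict, test_texts, test_labels):.2%}")
```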

Challenges:
I encountered an attribute error while removing the stop words.
There were a lot of errors in tokenizing and lemmatizing; the Colab cell suggested downloading the required NLTK data.
I tried a lot of smoothing values to find an optimal parameter for getting the highest accuracy.
Contributions:
Divided the entire dataset into train, test, and dev datasets.
Got to know how to implement text classification using Naive Bayes.
Tried to improve the accuracy of the model by tuning the smoothing value, even though it gives only a slight increase in accuracy.