Naive Bayes Project Tutorial

Difficulty: beginner
Average duration: 2 hrs
Technologies: Data Science, Machine Learning


Understand a new dataset.
Process it by applying exploratory data analysis (EDA).
Model the data using Naive Bayes.
Analyze the results and optimize the model.

🌱 How to start this project

Follow the instructions below:

Create a new repository based on the machine learning project template.
Open the newly created repository in a Codespace using the Codespace button extension.
Once VS Code has finished opening in the Codespace, start your project by following the instructions below.

🚛 How to deliver this project

Once you have finished solving the exercises, be sure to commit your changes, push them to your repository, and go to 4Geeks.com to upload the repository link.

📝 Instructions

Sentiment analysis

Naive Bayes models are very useful for analyzing sentiment, classifying texts into topics, and making recommendations, because the characteristics of these problems fit the theoretical and methodological assumptions of the model well.

In this project you will practice with a dataset to create a review classifier for the Google Play store.

Step 1: Loading the dataset

The dataset can be found in this project folder under the name playstore_reviews.csv. You can load it into the code directly from the link:

https://raw.githubusercontent.com/4GeeksAcademy/naive-bayes-project-tutorial/main/playstore_reviews.csv

Or download it and add it by hand in your repository. In this dataset, you will find the following variables:

package_name. Name of the mobile application (categorical)
review. Comment about the mobile application (categorical)
polarity. Class variable (0 or 1), where 0 is a negative comment and 1 is a positive one (numeric)
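As a minimal sketch of the loading step, assuming pandas is available in the Codespace: the helper name `load_reviews` and the in-memory sample rows are illustrative assumptions, not part of the project template. The sample only lets the sketch run without network access; the real data would come from the URL above.

```python
import io

import pandas as pd

URL = (
    "https://raw.githubusercontent.com/4GeeksAcademy/"
    "naive-bayes-project-tutorial/main/playstore_reviews.csv"
)


def load_reviews(source):
    """Read the reviews CSV from a URL, a local path, or a file-like object.

    Hypothetical helper: it only checks that the three documented columns exist.
    """
    df = pd.read_csv(source)
    expected = {"package_name", "review", "polarity"}
    missing = expected - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return df


# Made-up rows with the same columns, so the example runs offline.
sample_csv = io.StringIO(
    "package_name,review,polarity\n"
    "com.example.app, Great app ,1\n"
    "com.example.app,Keeps crashing,0\n"
)
sample_df = load_reviews(sample_csv)
print(sample_df.shape)

# For the real dataset: df = load_reviews(URL)
```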

Step 2: Study of variables and their content

In this case, we have only 3 variables: 2 predictors and a dichotomous label. Of the two predictors, we are really only interested in the comment, since whether a comment is classified as positive or negative depends on its content, not on the application it was written about. Therefore, the package_name variable should be removed.
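Removing the column could look like the sketch below. The toy DataFrame uses made-up rows and stands in for the loaded dataset.

```python
import pandas as pd

# Made-up rows with the same three columns as playstore_reviews.csv.
df = pd.DataFrame(
    {
        "package_name": ["com.example.a", "com.example.b"],
        "review": ["  Love it ", "Terrible UPDATE"],
        "polarity": [1, 0],
    }
)

# Keep only the text predictor and the label.
df = df.drop(columns=["package_name"])
print(df.columns.tolist())
```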

When we work with text, as in this case, a conventional EDA does not make sense. The process is different because the only variable we are interested in is the one that contains the text. In other cases, where the text is part of a more complex dataset with numeric predictor variables and a different prediction objective, applying an EDA does make sense.

However, we cannot work with plain text; it must first be processed. This process consists of several steps:

Removing leading and trailing spaces and converting the text to lowercase:

df["column"] = df["column"].str.strip().str.lower()

Dividing the dataset into train and test sets: X_train, X_test, y_train, y_test.
Transforming the text into a word count matrix. This is a way to obtain numerical features from the text. To do this, we fit the transformer on the training set and then apply it to the test set:

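from sklearn.feature_extraction.text import CountVectorizer

# Fit the vocabulary on the training text only, then reuse it for the test text
vec_model = CountVectorizer(stop_words = "english")
X_train = vec_model.fit_transform(X_train).toarray()
X_test = vec_model.transform(X_test).toarray()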

Once these steps are finished, the predictors are ready to train the model.
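The three steps above can be sketched end to end as follows. The reviews are made-up toy data, and the `test_size`, `random_state` and `stratify` values are assumptions chosen for illustration rather than settings taken from the project.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset (made-up reviews).
df = pd.DataFrame(
    {
        "review": [
            " Great app ", "Keeps CRASHING", "love it", "awful update",
            "Works great", "crashing again", "Really love this", "Awful, awful app",
        ],
        "polarity": [1, 0, 1, 0, 1, 0, 1, 0],
    }
)

# 1. Strip surrounding spaces and lowercase the text.
df["review"] = df["review"].str.strip().str.lower()

# 2. Split into train and test (stratified so both classes appear in each part).
X_train_text, X_test_text, y_train, y_test = train_test_split(
    df["review"],
    df["polarity"],
    test_size=0.25,
    random_state=42,
    stratify=df["polarity"],
)

# 3. Fit the vocabulary on the training text only, then transform both parts.
vec_model = CountVectorizer(stop_words="english")
X_train = vec_model.fit_transform(X_train_text).toarray()
X_test = vec_model.transform(X_test_text).toarray()

print(X_train.shape, X_test.shape)
```

Fitting the vectorizer only on the training text keeps test-set words from leaking into the vocabulary; words that appear only in the test set are simply ignored by `transform`.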

Step 3: Build a Naive Bayes model

Start solving the problem by implementing a model. You will have to choose which of the three implementations to use (GaussianNB, MultinomialNB or BernoulliNB), according to what we have studied in the module. Then train the other two implementations as well and confirm whether the one you chose is the right one.
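One way to run that comparison is a simple loop over the three classes. This is a sketch on made-up toy reviews, with accuracy as an assumed metric, so the scores it prints say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Made-up toy data standing in for the train/test split from Step 2.
train_text = ["great app", "love it", "works great", "keeps crashing", "awful update", "crashing again"]
y_train = [1, 1, 1, 0, 0, 0]
test_text = ["really great", "awful crashing"]
y_test = [1, 0]

vec_model = CountVectorizer(stop_words="english")
# GaussianNB needs a dense array, hence .toarray().
X_train = vec_model.fit_transform(train_text).toarray()
X_test = vec_model.transform(test_text).toarray()

scores = {}
for model in (GaussianNB(), MultinomialNB(), BernoulliNB()):
    model.fit(X_train, y_train)
    scores[type(model).__name__] = accuracy_score(y_test, model.predict(X_test))

print(scores)
```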

Step 4: Optimize the previous model

After training the model in its three implementations, choose the best option and try to optimize its results with a random forest, if possible.
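One possible reading of this step is to train a random forest on the same word count features and check whether it improves on the chosen Naive Bayes model. The sketch below assumes MultinomialNB was the chosen implementation and uses made-up toy data; `n_estimators` and `random_state` are illustrative values, not tuned ones.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Made-up toy data standing in for the real train/test split.
train_text = ["great app", "love it", "works great", "keeps crashing", "awful update", "crashing again"]
y_train = [1, 1, 1, 0, 0, 0]
test_text = ["really great", "awful crashing"]
y_test = [1, 0]

vec_model = CountVectorizer(stop_words="english")
X_train = vec_model.fit_transform(train_text)
X_test = vec_model.transform(test_text)

# Baseline: the Naive Bayes model assumed to have been chosen in Step 3.
nb_model = MultinomialNB().fit(X_train, y_train)

# Candidate: a random forest trained on the same features.
rf_model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

results = {
    "MultinomialNB": accuracy_score(y_test, nb_model.predict(X_test)),
    "RandomForestClassifier": accuracy_score(y_test, rf_model.predict(X_test)),
}
print(results)
```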

Step 5: Save the model

Store the model in the appropriate folder.
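A minimal sketch of saving and reloading a fitted model with `pickle` from the standard library. The `models` folder name and the file name are placeholders, since the appropriate folder depends on the repository layout, and a temporary directory is used so the example leaves nothing behind.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.naive_bayes import MultinomialNB

# Tiny fitted model on made-up count features, standing in for the real one.
model = MultinomialNB().fit([[2, 0], [0, 3]], [1, 0])

# Placeholder location: replace with the models folder of your repository.
models_dir = Path(tempfile.mkdtemp()) / "models"
models_dir.mkdir(parents=True, exist_ok=True)
model_path = models_dir / "naive_bayes_reviews.pkl"

with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Reload and check that the restored model predicts like the original.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

print(restored.predict([[1, 0], [0, 1]]))
```

The fitted CountVectorizer is needed to transform new reviews before prediction, so it may be worth saving it alongside the model in the same way.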

Step 6: Explore other alternatives

Which of the other models we have studied could you use to try to improve on the results of Naive Bayes? Justify your choice and train the model.
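As one illustration, and assuming logistic regression is among the models already covered, an alternative could be trained on the same word count features as sketched below. The data is made up and `max_iter` is an illustrative setting.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Made-up toy data standing in for the real train/test split.
train_text = ["great app", "love it", "works great", "keeps crashing", "awful update", "crashing again"]
y_train = [1, 1, 1, 0, 0, 0]
test_text = ["really great", "awful crashing"]
y_test = [1, 0]

vec_model = CountVectorizer(stop_words="english")
X_train = vec_model.fit_transform(train_text)
X_test = vec_model.transform(test_text)

# Assumed alternative model; any classifier with fit/predict could be swapped in.
alt_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = alt_model.predict(X_test)
alt_accuracy = accuracy_score(y_test, y_pred)
print(alt_accuracy)
```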

Note: We have also included sample solutions in ./solution.ipynb. We strongly suggest you use them only if you have been stuck for more than 30 minutes, or if you have already finished and want to compare them with your approach.
