Self-paced

Explore our extensive collection of courses designed to help you master various subjects and skills. Whether you're a beginner or an advanced learner, there's something here for everyone.

Bootcamp

Learn live

Join us for our free workshops, webinars, and other events to learn more about our programs and get started on your journey to becoming a developer.

Upcoming live events

Learning library

For all the self-taught geeks out there, here is our content library with most of the learning materials we have produced throughout the years.

It makes sense to start learning by reading and watching videos about fundamentals and how things work.

Full-Stack Software Developer - 16w

Data Science and Machine Learning - 16 wks

Search from all Lessons


LoginGet Started

Register to 4Geeks

← Back to Projects

Naive Bayes Project Tutorial

Difficulty

  • beginner

Average duration

2 hrs

Technologies

Difficulty

  • beginner

Average duration

2 hrs

Weekly Coding Challenge

Every week, we pick a real-life project to build your portfolio and get ready for a job. All projects are built with ChatGPT as co-pilot!

Start the Challenge

Podcast: Code Sets You Free

A tech-culture podcast where you learn to fight the enemies that blocks your way to become a successful professional in tech.

Listen the podcast
  • Understand a new dataset.
  • Process it by applying exploratory data analysis (EDA).
  • Model the data using Naive Bayes.
  • Analyze the results and optimize the model.

🌱 How to start this project

Follow the instructions below:

  1. Create a new repository based on machine learning project by clicking here.
  2. Open the newly created repository in Codespace using the Codespace button extension.
  3. Once the Codespace VSCode has finished opening, start your project by following the instructions below.

πŸš› How to deliver this project

Once you have finished solving the exercises, be sure to commit your changes, push them to your repository, and go to 4Geeks.com to upload the repository link.

πŸ“ Instructions

Sentiment analysis

Naive Bayes models are very useful when we want to analyze sentiment, classify texts into topics or recommendations, as the characteristics of these challenges meet the theoretical and methodological assumptions of the model very well.

In this project you will practice with a dataset to create a review classifier for the Google Play store.

Step 1: Loading the dataset

The dataset can be found in this project folder under the name playstore_reviews.csv. You can load it into the code directly from the link:

1https://raw.githubusercontent.com/4GeeksAcademy/naive-bayes-project-tutorial/main/playstore_reviews.csv

Or download it and add it by hand in your repository. In this dataset, you will find the following variables:

  • package_name. Name of the mobile application (categorical)
  • review. Comment about the mobile application (categorical)
  • polarity. Class variable (0 or 1), being 0 a negative comment and 1, positive (numeric)

Step 2: Study of variables and their content

In this case, we have only 3 variables: 2 predictors and a dichotomous label. Of the two predictors, we are really only interested in the comment part, since the fact of classifying a comment as positive or negative will depend on its content, not on the application from which it was written. Therefore, the package_name variable should be removed.

When we work with text, as in this case, it does not make sense to do an EDA, the process is different, since the only variable we are interested in is the one that contains the text. In other cases where the text is part of a complex set with other numeric predictor variables and the prediction objective is different, then it makes sense to apply an EDA.

However, we cannot work with plain text; it must first be processed. This process consists of several steps:

  1. Removing spaces and converting the text to lowercase:
1df["column"] = df["column"].str.strip().str.lower()
  1. Divide the dataset into train and test: X_train, X_test, y_train, y_test.
  2. Transform the text into a word count matrix. This is a way to obtain numerical features from the text. For this, we use the training set to train the transformer and apply it in test:
1vec_model = CountVectorizer(stop_words = "english") 2X_train = vec_model.fit_transform(X_train).toarray() 3X_test = vec_model.transform(X_test).toarray()

Once we have finished we will have the predictors ready to train the model.

Step 3: Build a naive bayes model

Start solving the problem by implementing a model, from which you will have to choose which of the three implementations to use: GaussianNB, MultinomialNB or BernoulliNB, according to what we have studied in the module. Try now to train it with the two other implementations and confirm if the model you have chosen is the right one.

Step 4: Optimize the previous model

After training the model in its three implementations, choose the best option and try to optimize its results with a random forest, if possible.

Step 5: Save the model

Store the model in the appropriate folder.

Step 6: Explore other alternatives

Which other models of the ones we have studied could you use to try to overcome the results of a Naive Bayes? Argue this and train the model.

Note: We also incorporated the solution samples on ./solution.ipynb that we strongly suggest you only use if you are stuck for more than 30 min or if you have already finished and want to compare it with your approach.

Sign up and get access to solution files and videos

We will use it to give you access to your account.
Already have an account? Login here.

Difficulty

  • beginner

Average duration

2 hrs

Difficulty

  • beginner

Average duration

2 hrs

Difficulty

  • beginner

Average duration

2 hrs

Difficulty

  • beginner

Average duration

2 hrs

Sign up and get access to solution files and videos

We will use it to give you access to your account.
Already have an account? Login here.

Difficulty

  • beginner

Average duration

2 hrs

Difficulty

  • beginner

Average duration

2 hrs

Weekly Coding Challenge

Every week, we pick a real-life project to build your portfolio and get ready for a job. All projects are built with ChatGPT as co-pilot!

Start the Challenge

Podcast: Code Sets You Free

A tech-culture podcast where you learn to fight the enemies that blocks your way to become a successful professional in tech.

Listen the podcast