Tiny Language Classifier

A small language identifier built on a Naive Bayes classifier.

View on GitHub

Overview

The Tiny Language Classifier project grew out of my curiosity about how well Naive Bayes classifiers handle text classification. Language identification is a fundamental problem in natural language processing, and I wanted to see how far a simple model like Naive Bayes could go on it.

This is nothing BIG, but I feel like using overly complex models for simple tasks is a common pitfall in machine learning these days.

Project Details

The model was trained and tested on a language identification dataset available on Hugging Face. The text is vectorized and then fed into a Multinomial Naive Bayes classifier (Scikit-Learn's MultinomialNB).
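The vectorize-then-classify step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the tiny in-memory corpus stands in for the Hugging Face dataset, and the choice of character n-gram counts (CountVectorizer with analyzer="char_wb") is an assumption, since the vectorizer used in the project is not specified here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the real Hugging Face dataset.
train_texts = [
    "the weather is nice today",
    "where is the train station",
    "i would like a cup of coffee",
    "le temps est beau aujourd'hui",
    "où est la gare s'il vous plaît",
    "je voudrais une tasse de café",
    "das wetter ist heute schön",
    "wo ist der bahnhof bitte",
    "ich möchte eine tasse kaffee",
]
train_labels = ["en"] * 3 + ["fr"] * 3 + ["de"] * 3

# Assumed feature choice: counts of character 1- to 3-grams within word
# boundaries. MultinomialNB expects non-negative count-like features,
# which CountVectorizer provides.
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    MultinomialNB(),
)
model.fit(train_texts, train_labels)

# Predict the language of a few unseen sentences.
samples = ["the station is near", "la gare est belle", "der bahnhof ist schön"]
predictions = model.predict(samples)
for text, label in zip(samples, predictions):
    print(f"{label}: {text}")
```

Wrapping both steps in one pipeline keeps the vectorizer's vocabulary tied to the training data, so the same transformation is applied automatically at prediction time. With a real dataset, the texts and labels would come from the train and test splits rather than from hard-coded lists.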

Despite its simplicity, the model achieved over 90% accuracy on the test set.

Technologies Used

  • Python
  • Scikit-Learn
  • NumPy
  • Pandas
  • Hugging Face (datasets)