Automatic Dataset Balancing For Classification Tasks

The dataset used for fine-tuning a model plays a vital role in achieving accurate classification results. In scenarios where the dataset is imbalanced, meaning that the number of samples in each class varies significantly, the fine-tuned model may become biased towards the majority class, leading to suboptimal performance for the minority classes.

To address this problem, Texti.ai automatically balances your dataset during the fine-tuning process for classification tasks. The model is then trained on a representative set containing an equal number of samples from each class, improving its ability to classify instances accurately across all classes.

Consider an example dataset of 8,001 prompt–completion pairs, where each prompt is assigned to one of three classes: Positive, Negative, or Neutral.

The original distribution of instances is as follows:

  • Negative: 2674 instances

  • Positive: 2727 instances

  • Neutral: 2600 instances

In this scenario, the "Neutral" class has the lowest number of instances, specifically 2600 samples. Consequently, for the fine-tuning process, the final dataset (training & validation) will be adjusted to include a total of 7800 instances, with an equal distribution of 2600 samples from each class.
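Texti.ai performs this balancing for you, but conceptually the adjustment amounts to downsampling every class to the size of the smallest one. The sketch below illustrates the idea with a hypothetical `balance_dataset` helper (not Texti.ai's actual implementation), assuming each example is a prompt–completion dict whose completion holds the class label:

```python
import random
from collections import defaultdict

def balance_dataset(examples, seed=42):
    """Downsample every class to the size of the smallest class.

    `examples` is a list of {"prompt": ..., "completion": ...} dicts,
    where the completion field holds the class label.
    """
    # Group examples by their class label.
    by_class = defaultdict(list)
    for ex in examples:
        by_class[ex["completion"]].append(ex)

    # The minority class sets the per-class sample count.
    min_count = min(len(group) for group in by_class.values())

    # Randomly sample min_count examples from each class, then shuffle.
    rng = random.Random(seed)
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, min_count))
    rng.shuffle(balanced)
    return balanced

# Mirror the distribution from the example above: 2674 / 2727 / 2600.
data = (
    [{"prompt": f"n{i}", "completion": "Negative"} for i in range(2674)]
    + [{"prompt": f"p{i}", "completion": "Positive"} for i in range(2727)]
    + [{"prompt": f"u{i}", "completion": "Neutral"} for i in range(2600)]
)
balanced = balance_dataset(data)
print(len(balanced))  # 7800, i.e. 2600 samples from each class
```

Downsampling discards some majority-class examples, which is the trade-off this mechanism accepts in exchange for an unbiased class distribution in the final training and validation set.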
