Automatic Dataset Balancing For Classification Tasks

The dataset used for fine-tuning a model plays a vital role in achieving accurate classification results. In scenarios where the dataset is imbalanced, meaning that the number of samples in each class varies significantly, the fine-tuned model may become biased towards the majority class, leading to suboptimal performance for the minority classes.

To address this problem, Texti.ai automatically balances your dataset during the fine-tuning process for classification tasks. The model is then trained on a representative set containing an equal number of samples from each class, improving its ability to classify instances accurately across all classes.

Let's walk through an example using this dataset. It consists of 8,001 prompt-completion pairs, with each prompt assigned to one of three classes: Positive, Negative, or Neutral.

The original distribution of instances is as follows:

  • Negative: 2,674 instances

  • Positive: 2,727 instances

  • Neutral: 2,600 instances

In this dataset, the "Neutral" class has the fewest instances: 2,600. For the fine-tuning process, the final dataset (training & validation) is therefore reduced to 7,800 instances, with 2,600 samples drawn from each class.
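The balancing step can be approximated with a short script. The sketch below is illustrative only and does not reflect Texti.ai's internal implementation: it assumes the dataset is a JSONL file of prompt-completion records in which the completion holds the class label, and the file name and function names are hypothetical.

```python
import json
import random
from collections import defaultdict

def balance_dataset(records, label_key="completion", seed=42):
    """Downsample every class to the size of the smallest class."""
    by_class = defaultdict(list)
    for record in records:
        by_class[record[label_key]].append(record)

    # The minority class sets the per-class sample count.
    min_count = min(len(items) for items in by_class.values())

    random.seed(seed)
    balanced = []
    for items in by_class.values():
        balanced.extend(random.sample(items, min_count))
    random.shuffle(balanced)
    return balanced

# Hypothetical file name, for illustration only.
with open("sentiment_dataset.jsonl") as f:
    records = [json.loads(line) for line in f]

balanced = balance_dataset(records)
print(len(balanced))  # 3 classes x 2,600 samples = 7,800 instances
```

Applied to the distribution above, this kind of downsampling keeps 2,600 examples per class, for 7,800 instances in total, which matches the balanced dataset size described in this example.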
