Automatic Dataset Balancing For Classification Tasks
The dataset used for fine-tuning a model plays a vital role in achieving accurate classification results. In scenarios where the dataset is imbalanced, meaning that the number of samples in each class varies significantly, the fine-tuned model may become biased towards the majority class, leading to suboptimal performance for the minority classes.
To address this problem, Texti.ai automatically balances your dataset when you fine-tune it for classification tasks. By doing so, the model can be trained on a representative set of data that contains an equal proportion of samples from each class, enhancing its ability to classify instances accurately across all classes.
Let's walk through an example. The dataset consists of 8,001 prompt–completion pairs, with each prompt assigned to one of three classes: Positive, Negative, or Neutral.
The original distribution of instances is as follows:
Negative: 2,674 instances
Positive: 2,727 instances
Neutral: 2,600 instances
In this scenario, the Neutral class has the fewest instances: 2,600. Consequently, for fine-tuning, the final dataset (training & validation) is downsampled to 2,600 samples per class, for a total of 7,800 instances.
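Texti.ai performs this balancing for you automatically. To illustrate the underlying idea, here is a minimal Python sketch of downsampling each class to the size of the smallest one. The function name, record layout, and placeholder prompts are illustrative assumptions, not part of the Texti.ai API:

```python
import random
from collections import defaultdict

# Illustrative records: (prompt, class label) pairs mirroring the example counts above.
# In practice these would be the prompt-completion pairs from your fine-tuning dataset.
dataset = (
    [("some prompt", "Negative")] * 2674
    + [("some prompt", "Positive")] * 2727
    + [("some prompt", "Neutral")] * 2600
)

def balance_by_downsampling(records, seed=42):
    """Randomly downsample every class to the size of the smallest class."""
    by_class = defaultdict(list)
    for prompt, label in records:
        by_class[label].append((prompt, label))

    # The minority class sets the per-class sample count (2,600 in this example).
    minority_size = min(len(items) for items in by_class.values())

    rng = random.Random(seed)
    balanced = []
    for items in by_class.values():
        balanced.extend(rng.sample(items, minority_size))
    rng.shuffle(balanced)
    return balanced

balanced = balance_by_downsampling(dataset)
print(len(balanced))  # 7800 -> 2,600 instances per class
```

Downsampling is shown here because it matches the counts in the example (7,800 total instances); it trades away some majority-class data in exchange for a class distribution the model cannot exploit as a shortcut.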