Actes des 30es rencontres de la Société Francophone de Classification

« In NLP, handling underrepresented topics is challenging, particularly in unsupervised tasks where clustering may fail to capture minority topics effectively. To address this, we propose an unsupervised data augmentation method that combines Gaussian Mixture Models (GMMs) and Large Language Models (LLMs). GMMs identify underrepresented clusters, while LLMs generate synthetic documents to enrich them. Experiments on various imbalanced text datasets show that our approach maintains clustering performance and often improves interpretability, providing a robust and scalable solution for enhancing data representation in unsupervised NLP. (…) »

Source: hal.science. Pascal Préa. Actes des 30es rencontres de la Société Francophone de Classification. Plate-Forme Intelligence Artificielle, Jul 2025, Dijon, France. Association Française pour l'Intelligence Artificielle, 2025. ⟨hal-05189785v2⟩
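
The two-stage idea in the abstract above (find underrepresented clusters, then enrich them with generated documents) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes cluster labels have already been obtained elsewhere, for instance from a fitted Gaussian mixture model. The size criterion (`ratio` times the mean cluster size), the function names, and the `generate` callable that stands in for an LLM are all hypothetical choices made for this illustration.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple


def find_minority_clusters(labels: Sequence[int], ratio: float = 0.5) -> Dict[int, int]:
    """Map each underrepresented cluster to the number of documents it lacks.

    Assumption (not from the source): a cluster is underrepresented when its
    size is below `ratio` times the mean cluster size.
    """
    counts = Counter(labels)
    target = ratio * sum(counts.values()) / len(counts)
    return {c: math.ceil(target) - n for c, n in counts.items() if n < target}


def echo_generator(examples: List[str], n: int) -> List[str]:
    """Placeholder for an LLM call: returns n marked copies of the examples."""
    return [f"synthetic variant of: {examples[i % len(examples)]}" for i in range(n)]


def augment(
    docs: Sequence[str],
    labels: Sequence[int],
    generate: Callable[[List[str], int], List[str]],
    ratio: float = 0.5,
) -> Tuple[List[str], List[int]]:
    """Append generated documents to every underrepresented cluster."""
    new_docs, new_labels = list(docs), list(labels)
    for cluster, missing in sorted(find_minority_clusters(labels, ratio).items()):
        # Documents already in the cluster serve as examples for the generator.
        examples = [d for d, l in zip(docs, labels) if l == cluster]
        synthetic = generate(examples, missing)[:missing]
        new_docs.extend(synthetic)
        new_labels.extend([cluster] * len(synthetic))
    return new_docs, new_labels


if __name__ == "__main__":
    docs = [f"doc{i}" for i in range(18)]
    labels = [0] * 12 + [1] * 1 + [2] * 5  # cluster 1 is the minority
    out_docs, out_labels = augment(docs, labels, echo_generator)
    print(Counter(out_labels))
```

In practice `echo_generator` would be replaced by a call to a language model prompted with the cluster's example documents, and the labels would come from the mixture model's hard assignments.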
