Published In
2025 IEEE International Conference on Data Mining (ICDM)
Document Type
Pre-Print
Publication Date
2025
Abstract
Feature generation can significantly enhance learning outcomes, particularly for tasks with limited data. An effective way to improve feature generation is to expand the current feature space by using existing features and enriching the informational content. However, generating new, interpretable features usually requires domain-specific knowledge on top of the existing features. In this paper, we introduce a Retrieval-Augmented Feature Generation method, RAFG, to generate useful and explainable features specific to domain classification tasks. To increase the interpretability of the generated features, we conduct knowledge retrieval among the existing features in the domain to identify potential feature associations. These associations are expected to help generate useful features. Moreover, we develop a framework based on large language models (LLMs) for feature generation with reasoning to verify the quality of the features during their generation process. Experiments across several datasets in medical, economic, and geographic domains show that our RAFG method can produce high-quality, meaningful features and significantly improve classification performance compared with baseline methods.
Rights
Copyright (c) 2026 The Authors
This work is licensed under a Creative Commons Attribution 4.0 International License.
DOI
10.1109/ICDM65498.2025.00102
Persistent Identifier
https://archives.pdx.edu/ds/psu/44562
Publisher
IEEE
Citation Details
Published as: Zhang, X., Zhang, J., Mo, F., Chandra, D. K., Chen, Y.-Z., Xie, F., & Liu, K. (2025). Retrieval-Augmented Feature Generation for Domain-Specific Classification. 2025 IEEE International Conference on Data Mining (ICDM), 943–952. https://doi.org/10.1109/icdm65498.2025.00102

Description
This is the author’s version of a work that was accepted for publication. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms, may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published as: Zhang, X., Zhang, J., Mo, F., Chandra, D. K., Chen, Y.-Z., Xie, F., & Liu, K. (2025). Retrieval-Augmented Feature Generation for Domain-Specific Classification. 2025 IEEE International Conference on Data Mining (ICDM), 943–952. https://doi.org/10.1109/icdm65498.2025.00102