EVALUATION OF THE INFORMATIVENESS OF DISCRETE FEATURES OF A TRAINING SAMPLE IN CLASSIFICATION PROBLEMS
DOI:
https://doi.org/10.15588/1607-3274-2026-2-10
Keywords:
information content of features, discrete data, entropy, data mining, classification, feature selection, feature interaction, web service
Abstract
Context. Modern data mining methods are widely used to build classification and predictive models. However, for discrete data, the problem of quantitatively assessing the informativeness of features, which determines the accuracy and stability of classifiers, remains open. The lack of a universal approach to measuring the contribution of both individual features and groups of features to the classification result complicates automated feature selection and model optimization.
Objective. The aim of this paper is to develop and theoretically substantiate a method for assessing the informativeness of both individual features and arbitrary groups of discrete features, based on the relationship between the statistical characteristics of features and measures of class distinguishability.
Method. An approach is proposed that evaluates the informativeness of both individual discrete features and arbitrary groups of such features with respect to a function characterizing the target variable. The method is based on the results of Kendall and Stuart in the field of nonparametric statistics. For practical application, an algorithm for calculating the informativeness of both individual features and groups of features is introduced, suitable for implementation in data analysis software systems.
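The abstract does not give the authors' formulas, so the following is only a minimal sketch of one classical contingency-table measure of association, Cramér's V, applied to a group of discrete features by treating their joint value as a single composite feature. It is a stand-in illustration, not the method of the paper; the function name `cramers_v` and the toy data are assumptions introduced here.

```python
from collections import Counter
from math import sqrt


def cramers_v(rows, labels, group):
    """Cramér's V between the joint value of the features in `group` and the class label.

    rows   : list of tuples of discrete feature values (hypothetical layout)
    labels : list of class labels, one per row
    group  : tuple of column indices forming the feature group
    """
    # Treat the tuple of values of the grouped features as one composite feature.
    keys = [tuple(row[i] for i in group) for row in rows]
    n = len(labels)
    joint = Counter(zip(keys, labels))
    key_totals = Counter(keys)
    label_totals = Counter(labels)

    # Pearson chi-square statistic over the (composite value x class) table.
    chi2 = 0.0
    for k, nk in key_totals.items():
        for c, nc in label_totals.items():
            expected = nk * nc / n
            observed = joint.get((k, c), 0)
            chi2 += (observed - expected) ** 2 / expected

    # Normalise to [0, 1]; a table with one row or one column carries no association.
    dof_min = min(len(key_totals), len(label_totals)) - 1
    if dof_min == 0:
        return 0.0
    return sqrt(chi2 / (n * dof_min))


# Toy sample (assumed for illustration): the class is the XOR of two binary features.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [a ^ b for a, b in rows]

v_single = cramers_v(rows, labels, (0,))   # first feature alone
v_pair = cramers_v(rows, labels, (0, 1))   # both features as a group
```

On this toy sample the single feature scores 0 and the pair scores 1, which is the kind of contrast a group-level measure is meant to expose.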
Results. It is demonstrated that the developed method provides a formal, quantitative evaluation of the contribution of both individual features and arbitrary groups of features to the classification process without prior assumptions about the model type. It also reveals hidden dependencies between features, that is, feature interactions, which cannot be detected by assessing features individually. The resulting expressions provide a basis for automating feature selection when working with discrete data and improve the analytical value of the method.
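To make the notion of feature interaction concrete, here is a small sketch that uses Shannon mutual information as a generic informativeness score (entropy is among the listed keywords, but the abstract does not say this is the authors' measure). The helper names `entropy`, `mutual_information`, and `interaction_gain` are hypothetical, as is the toy data.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(rows, labels, group):
    """I(group; class) = H(group) + H(class) - H(group, class)."""
    keys = [tuple(row[i] for i in group) for row in rows]
    return entropy(keys) + entropy(labels) - entropy(list(zip(keys, labels)))


def interaction_gain(rows, labels, group):
    """Information of the group beyond the sum of its members' individual scores."""
    individual = sum(mutual_information(rows, labels, (i,)) for i in group)
    return mutual_information(rows, labels, group) - individual


# Toy sample (assumed): the class is the XOR of two binary features,
# so neither feature is informative alone but the pair determines the class.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [a ^ b for a, b in rows]

mi_first = mutual_information(rows, labels, (0,))
mi_second = mutual_information(rows, labels, (1,))
mi_pair = mutual_information(rows, labels, (0, 1))
gain = interaction_gain(rows, labels, (0, 1))
```

Here each feature alone carries 0 bits about the class, while the pair carries 1 bit, so the whole score of the group comes from the interaction. A ranking based only on individual scores would discard both features.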
Conclusions. The proposed approach unifies the procedure for assessing the informativeness of both individual discrete features and arbitrary groups of features in data mining systems. It provides a formal link between the statistical characteristics of the data and the quality of classification, which contributes to increased accuracy, robustness, and interpretability of models.
References
Guyon I., Elisseeff A. An introduction to variable and feature selection. Journal of Machine Learning Research, 2003, Vol. 3, pp. 1157–1182.
Theng D., Bhoyar K. K. Feature selection techniques for machine learning: a survey of more than two decades of research [Electronic resource]. Knowledge and Information Systems, 2023, Vol. 66, No. 3, pp. 1575–1637. DOI: 10.1007/s10115-023-02010-5.
Sammut C., Webb G. Encyclopedia of Machine Learning. New York, Springer Science+Business Media, 2010, 1059 p.
Pearson K. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling [Electronic resource]. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 1900, Vol. 50, No. 302, pp. 157–175. DOI: 10.1080/14786440009463897.
Hastie T., Tibshirani R., Friedman J. The Elements of Statistical Learning. 2nd ed. Berlin, Springer, 2009, 764 p.
Quinlan J. Induction of decision trees. Machine Learning, 1986, Vol. 1, No. 1, pp. 81–106.
Shannon C. E., Weaver W. The Mathematical Theory of Communication. Urbana, University of Illinois Press, 1949, 117 p.
Kira K., Rendell L. A practical approach to feature selection. Proceedings of the International Conference on Machine Learning, ed. by D. Sleeman, P. Edwards. Aberdeen, 1992, pp. 368–377.
Gretton A. et al. Kernel methods for measuring independence. Journal of Machine Learning Research, 2005, Vol. 6, pp. 2075–2129.
Kursa M. B., Rudnicki W. R. Feature selection with the Boruta package. Journal of Statistical Software, 2010, Vol. 36, No. 11, pp. 1–13.
Breiman L. et al. Classification and Regression Trees. Boca Raton, CRC Press, 1984, 366 p.
Spiegel M. Statistics, 4th ed. New York, McGraw-Hill, 2009, (Schaum’s Outline Series), 577 p.
Agresti A. Introduction to Categorical Data Analysis. New York, John Wiley & Sons, 2007, 400 p.
Adasovsky B. I. Method for calculating the information content of multimodal features. Doklady Akademii Nauk SSSR, 1978, Vol. 239, No. 2, pp. 286–289.
Kullback S. Information Theory and Statistics. New York, John Wiley & Sons, 1959, 399 p.
Kendall M., Stuart A.; trans. from Engl. by L. I. Galchuk, A. T. Terehin. Statistical Inferences and Relationships. Moscow, Nauka, 1973, 900 p.
Rokach L., Maimon O. Data Mining Using Decision Trees: Theory and Application. Singapore, World Scientific Publishing, 2008, 267 p.
Vasilenko Yu. A., Shevchenko G. Ya. Analytical method for finding tests. Avtomatika, 1979, No. 2, pp. 3–8.
Jolliffe I. T. Principal Component Analysis. New York, Springer, 2002, 487 p.
Kim D. O., Mueller C. W., Klekka W. R.; trans. from Engl. by A. M. Hotinskiy, S. B. Korolev. Factor, Discriminant, and Cluster Analysis. Moscow, Finance and Statistics, 1989, 215 p.
UC Irvine Machine Learning Repository. Welcome to the UC Irvine Machine Learning Repository [Electronic resource]. UC Irvine Machine Learning Repository. Mode of access: https://archive.ics.uci.edu/ (date of access: 03.12.2025). Title from screen.
License
Copyright (c) 2026 G. Ya. Shevchenko, S. M. Gerasimenko

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.