EVALUATION OF THE INFORMATIVENESS OF DISCRETE FEATURES OF A TRAINING SAMPLE IN CLASSIFICATION PROBLEMS

G. Ya. Shevchenko; S. M. Gerasimenko

doi:10.15588/1607-3274-2026-2-10

Authors

G. Ya. Shevchenko, Noosphere Association, Dnipro, Ukraine
S. M. Gerasimenko, Noosphere Association, Dnipro, Ukraine

Keywords:

information content of features, discrete data, entropy, data mining, classification, feature selection, feature interaction, web service

Abstract

Context. Modern data mining methods are widely used to build classification and predictive models. For discrete data, however, the problem of quantitatively assessing the informativeness of features, which determines the accuracy and stability of classifiers, remains open. The lack of a universal approach to measuring the contribution of both individual features and groups of features to the classification result complicates automated feature selection and model optimization.
Objective. The aim of this paper is to develop and theoretically substantiate a method for assessing the informativeness of both individual features and arbitrary groups of discrete features, based on the relationship between the statistical characteristics of features and measures of class distinguishability.
Method. An approach is proposed that relates the informativeness of both individual discrete features and arbitrary groups of such features to a function characterizing the target variable. The method builds on the results of Kendall and Stewart in nonparametric statistics. For practical application, an algorithm for calculating the informativeness of both individual features and groups of features is introduced, suitable for implementation in data analysis software systems.
Results. It is demonstrated that the developed method enables formal, quantitative evaluation of the contribution of both individual features and arbitrary groups of features to the classification process without prior assumptions about the model type. It also reveals hidden dependencies between features, i.e., feature interactions, which cannot be detected by assessing features individually. The resulting expressions provide a basis for automated, meaningful feature selection on discrete data and improve the analytical value of the method.
Conclusions. The proposed approach unifies the procedure for assessing the informativeness of both individual discrete features and arbitrary groups of features in data mining systems. It provides a formal link between the statistical characteristics of the data and the quality of the classification, which contributes to increased accuracy, robustness, and interpretability of models.
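
The abstract does not give the authors' formulas, so the following is only an illustrative sketch of why group-level assessment can reveal feature interactions that individual assessment misses. It assumes an entropy-based measure (the mutual information between a group of discrete features and the class label), which is a standard stand-in and not necessarily the measure developed in the paper. The function names `entropy` and `informativeness` and the toy XOR sample are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of hashable discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def informativeness(rows, feature_idx, target):
    """Mutual information I(X_S; Y) in bits between the group of discrete
    features selected by feature_idx and the class label.

    Assumption: this is a generic entropy-based measure used for illustration,
    not the expression derived in the paper.
    """
    # Treat the selected group of features as a single compound discrete feature.
    keys = [tuple(row[i] for i in feature_idx) for row in rows]
    joint = list(zip(keys, target))
    return entropy(keys) + entropy(target) - entropy(joint)


# Toy training sample (hypothetical): the class is the XOR of two binary features.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
target = [0, 1, 1, 0]

single_0 = informativeness(rows, [0], target)    # 0.0 bits: feature 0 alone says nothing
single_1 = informativeness(rows, [1], target)    # 0.0 bits: feature 1 alone says nothing
pair = informativeness(rows, [0, 1], target)     # 1.0 bit: the pair determines the class

print(single_0, single_1, pair)
```

In this toy sample each feature is uninformative on its own, while the pair determines the class completely, which is the kind of interaction that individual assessments cannot detect.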

Author Biographies

G. Ya. Shevchenko, Noosphere Association, Dnipro

PhD, Head of Department

S. M. Gerasimenko, Noosphere Association, Dnipro

Researcher

References

Guyon I., Elisseeff A. An introduction to variable and feature selection. Journal of Machine Learning Research, 2003, Vol. 3, pp. 1157–1182.

Theng D., Bhoyar K. K. Feature selection techniques for machine learning: a survey of more than two decades of research [Electronic resource]. Knowledge and Information Systems, 2023, Vol. 66, No. 3, pp. 1575–1637. DOI: 10.1007/s10115-023-02010-5.

Sammut C., Webb G. Encyclopedia of Machine Learning. New York, Springer Science+Business Media, 2010, 1059 p.

Pearson K. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling [Electronic resource]. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 1900, Vol. 50, No. 302, pp. 157–175. DOI: 10.1080/14786440009463897.

Hastie T., Tibshirani R., Friedman J. The Elements of Statistical Learning. 2nd ed. Berlin, Springer, 2009, 764 p.

Quinlan J. Induction of decision trees. Machine Learning, 1986, Vol. 1, No. 1, pp. 81–106.

Shannon C. E., Weaver W. The Mathematical Theory of Communication. Urbana, University of Illinois Press, 1949, 117 p.

Kira K., Rendell L. A practical approach to feature selection. Proceedings of the International Conference on Machine Learning, ed. by D. Sleeman, P. Edwards. Aberdeen, 1992, pp. 368–377.

Gretton A. et al. Kernel methods for measuring independence. Journal of Machine Learning Research, 2005, Vol. 6, pp. 2075–2129.

Kursa M. B., Rudnicki W. R. Feature selection with the Boruta package. Journal of Statistical Software, 2010, Vol. 36, No. 11, pp. 1–13.

Breiman L. et al. Classification and Regression Trees. Boca Raton, CRC Press, 1984, 366 p.

Spiegel M. Statistics, 4th ed. New York, McGraw-Hill, 2009, (Schaum’s Outline Series), 577 p.

Agresti A. Introduction to Categorical Data Analysis. New York, John Wiley & Sons, 2007, 400 p.

Adasovsky B. I. Method for calculating the information content of multimodal features. Doklady Akademii Nauk SSSR, 1978, Vol. 239, No. 2, pp. 286–289.

Kullback S. Information Theory and Statistics. New York, John Wiley & Sons, 1959, 399 p.

Kendall M., Stewart A.; trans. from Engl. by L. I. Galchuk, A. T. Terehin. Statistical Inferences and Relationships. Moscow, Nauka, 1973, 900 p.

Rokach L., Maimon O. Data Mining Using Decision Trees: Theory and Application. Singapore, World Scientific Publishing, 2008, 267 p.

Vasilenko Yu. A., Shevchenko G. Ya. Analytical method for finding tests. Avtomatika, 1979, No. 2, pp. 3–8.

Jolliffe I. T. Principal Component Analysis. New York, Springer, 2002, 487 p.

Kim D. O., Mueller C. W., Klekka W. R.; trans. from Engl. by A. M. Hotinskiy, S. B. Korolev. Factor, Discriminant, and Cluster Analysis. Moscow, Finance and Statistics, 1989, 215 p.

UC Irvine Machine Learning Repository. Welcome to the UC Irvine Machine Learning Repository [Electronic resource]. UC Irvine Machine Learning Repository. Mode of access: https://archive.ics.uci.edu/ (date of access: 03.12.2025). Title from screen.
