METHODS FOR DETERMINING SIMILARITY OF CATEGORICAL ORDERED DATA

Authors

  • N. E. Kondruk, Uzhhorod National University, Uzhhorod, Ukraine

DOI:

https://doi.org/10.15588/1607-3274-2023-2-4

Keywords:

distance metric, similarity measure, categorical data similarity, ordered data

Abstract

Context. The development of effective distance metrics and similarity measures for categorical features is an important task in data analysis, machine learning, and decision theory, since a significant portion of object properties is described by non-numerical values. The relationship between categorical features is often more complex than a simple comparison for equality or inequality. Such attributes can be similar to varying degrees, and constructing an effective model requires taking this similarity into account when calculating distances or similarity measures.

Objective. The aim of the study is to improve the efficiency of solving practical data analysis problems by developing mathematical tools for determining the similarity of objects based on categorical ordered features.

Method. A distance based on the weighted Manhattan distance and a similarity measure are proposed for determining the similarity of objects described by categorical ordinal features (i.e., features on whose value set a linear order with preference scales can be specified with regard to the problem domain). It is proven that the distance formula satisfies the axioms of non-negativity, symmetry, the triangle inequality, and an upper bound, and is therefore a distance metric in the space of ranked categorical features. It is also proven that the similarity measure presented in the study satisfies the axioms of boundedness, symmetry, and maximum and minimum similarity, and is described by a decreasing function.
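The abstract does not give the formulas themselves, so the following is only a minimal sketch of the general idea, not the author's definition. It assumes that each ordered scale is mapped to equally spaced positions in [0, 1], that the distance is a weighted Manhattan distance normalized by the sum of the weights, and that similarity is taken as one minus the distance. The names rank_positions, ordinal_distance, and ordinal_similarity, as well as the example scales and weights, are hypothetical.

```python
from typing import Dict, Sequence


def rank_positions(scale: Sequence[str]) -> Dict[str, float]:
    """Map each category of an ordered scale to a position in [0, 1].

    Assumption: categories are equally spaced; a problem-specific
    preference scale could assign other positions instead.
    """
    if len(scale) < 2:
        raise ValueError("an ordered scale needs at least two categories")
    last = len(scale) - 1
    return {category: index / last for index, category in enumerate(scale)}


def ordinal_distance(x, y, scales, weights) -> float:
    """Weighted Manhattan distance over rank positions, scaled to [0, 1]."""
    total = 0.0
    for xi, yi, scale, weight in zip(x, y, scales, weights):
        positions = rank_positions(scale)
        total += weight * abs(positions[xi] - positions[yi])
    # Dividing by the weight sum gives an upper bound of 1.
    return total / sum(weights)


def ordinal_similarity(x, y, scales, weights) -> float:
    """Similarity as a decreasing function of the distance (assumed 1 - d)."""
    return 1.0 - ordinal_distance(x, y, scales, weights)


# Hypothetical example: two ordinal features with different scales.
scales = [
    ["low", "medium", "high"],
    ["bad", "fair", "good", "excellent"],
]
weights = [2.0, 1.0]  # the first feature counts twice as much

a = ("low", "good")
b = ("high", "good")
c = ("medium", "fair")

print(ordinal_distance(a, b, scales, weights))
print(ordinal_similarity(a, c, scales, weights))
```

Under these assumptions, identical objects have distance 0 and similarity 1, while objects at opposite ends of every scale have distance 1 and similarity 0. Unlike an equality-only comparison, "low" and "medium" come out closer than "low" and "high".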

Results. The developed approach has been implemented in an applied problem of determining the degree of similarity between objects described by ordered categorical features.

Conclusions. In this study, mathematical tools were developed to determine the similarity between structured data described by categorical attributes that can be ordered by a specific priority in the form of a ranking system with preferences. Their properties were analyzed. Experimental studies have shown that the data-processing logic is convenient and intuitive to understand when solving practical problems. The proposed approach may open the way to new meaningful research in data analysis. Prospects for further research lie in the experimental use of the proposed tools in practical tasks and in studying their effectiveness.

Author Biography

N. E. Kondruk, Uzhhorod National University, Uzhhorod, Ukraine

PhD, Associate Professor, Department of Cybernetics and Applied Mathematics

References

Suárez J., García S., Herrera F. A tutorial on distance metric learning: Mathematical foundations, algorithms, experimental analysis, prospects and challenges, Neurocomputing, 2021, № 425, pp. 300–322. DOI: 10.1016/j.neucom.2020.08.017

Mathisen B., Aamodt A., Bach K., Langseth H. Learning similarity measures from data, Progress in Artificial Intelligence, 2019, № 9, pp. 1–15. DOI: 10.1007/s13748-019-00201-2

Desai A., Singh H., Pudi V. et al. DISC: Data-Intensive Similarity Measure for Categorical Data, Advances in Knowledge Discovery and Data Mining, 2011, №6635, pp. 469–481. DOI: 10.1007/978-3-642-20847-8_39

Cunningham P. A Taxonomy of Similarity Mechanisms for Case-Based Reasoning, IEEE Transactions on Knowledge and Data Engineering, 2009, №21, pp. 1532–1543. DOI: 10.1109/TKDE.2008.227.

Nikpour N., Aamodt A., Bach K. Bayesian-Supported Retrieval in BNCreek: A Knowledge-Intensive Case-Based Reasoning System, Case-Based Reasoning Research and Development, 2018, №11156, pp. 323–338. DOI: 10.1007/978-3-030-01081-2_22

Gabel T., Godehardt E., Hüllermeier E., Minor M. TopDown Induction of Similarity Measures Using Similarity, Case-Based Reasoning Research and Development, 2015, №9343, pp. 149–164. DOI: 10.1007/978-3-319-24586-7_11

Hoffer E., Ailon N. Deep Metric Learning Using Triplet Network, Similarity-Based Pattern Recognition, 2014, №9370, pp. 84–92. DOI: 10.1007/978-3-319-24261-3_7

Nguyen T., Dinh T., Sriboonchitta S., Huynh V. A method for k-means-like clustering of categorical data, Journal of Ambient Intelligence and Humanized Computing, 2019, pp. 1–11. DOI: 10.1007/s12652-019-01445-5.

Mathisen B., Aamodt A., Bach K., Langseth H. Learning similarity measures from data, Progress in Artificial Intelligence, 2020, №9, pp. 129–143. DOI: 10.1007/s13748-019-00201-2

Dyussenbayev A. Age Periods Of Human Life, Advances in Social Sciences Research Journal, 2017, №4, pp. 258–263. DOI:10.14738/assrj.46.2924

Kondruk N. Clustering method based on fuzzy binary relation, Eastern-European Journal of Enterprise Technologies, 2017, № 2(4), pp. 10–16. DOI: 10.15587/1729-4061.2017.94961

Kondruk, N. E., Malyar M. M. Analysis of Cluster Structures by Different Similarity Measures, Cybernetics and Systems Analysis, 2021, №57, pp. 436–441. https://doi.org/10.1007/s10559-021-00368-4.

Kondruk N. E. Vykorystannja dovzhynnoi’ miry podibnosti v zadachah klasteryzacii’, Radio Electronics, Computer Science, Control, 2018, №3 (46), pp. 98–105. DOI: 10.15588/1607-3274-2018-3-11.

Published

2023-06-29

How to Cite

Kondruk, N. E. (2023). METHODS FOR DETERMINING SIMILARITY OF CATEGORICAL ORDERED DATA. Radio Electronics, Computer Science, Control, (2), 31. https://doi.org/10.15588/1607-3274-2023-2-4

Section

Mathematical and computer modelling