METHODS FOR DETERMINING SIMILARITY OF CATEGORICAL ORDERED DATA
DOI:
https://doi.org/10.15588/1607-3274-2023-2-4
Keywords:
distance metric, similarity measure, categorical data similarity, ordered data
Abstract
Context. Developing effective distance metrics and similarity measures for categorical features is an important task in data analysis, machine learning, and decision theory, since a significant portion of object properties is described by non-numerical values. The relationship between the values of a categorical feature is often more complex than a simple test for equality or inequality. Such values can be similar to one another to varying degrees, and an effective model must take this similarity into account when calculating a distance or similarity measure.
Objective. The aim of the study is to improve the efficiency of solving practical data analysis problems by developing mathematical tools for determining the similarity of objects based on categorical ordered features.
Method. A distance based on the weighted Manhattan distance and a similarity measure are proposed for determining the similarity of objects described by categorical ordinal features, that is, features on whose value set a linear order with preference scales can be specified for the problem domain. It is proven that the distance formula satisfies the axioms of non-negativity, symmetry, the triangle inequality, and an upper bound, and is therefore a distance metric in the space of ranked categorical features. It is also proven that the proposed similarity measure satisfies the axioms of boundedness, symmetry, and maximum and minimum similarity, and is described by a decreasing function.
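The abstract does not reproduce the paper's formulas, so the following is only a minimal sketch of the general idea and not the author's definition. It assumes that each value is replaced by its rank position on an equally spaced scale, that rank gaps are normalized by the scale length, that the weights are non-negative and sum to 1, and that similarity is taken as one minus the distance. The names `ordinal_distance`, `ordinal_similarity`, `SCALES`, and `WEIGHTS` are hypothetical.

```python
from typing import Sequence

# Hypothetical ordered scales (lowest to highest) for two ordinal features.
SCALES = [
    ["primary", "secondary", "bachelor", "master"],  # education level
    ["low", "medium", "high"],                        # satisfaction
]
# Assumed feature weights: non-negative and summing to 1.
WEIGHTS = [0.5, 0.5]


def ordinal_distance(
    x: Sequence[str],
    y: Sequence[str],
    scales: Sequence[Sequence[str]] = SCALES,
    weights: Sequence[float] = WEIGHTS,
) -> float:
    """Weighted Manhattan distance over normalized rank positions.

    Each feature contributes weight * |rank(x) - rank(y)| / (scale size - 1),
    so the result lies in [0, 1] when the weights sum to 1.
    """
    total = 0.0
    for value_x, value_y, scale, weight in zip(x, y, scales, weights):
        rank_gap = abs(scale.index(value_x) - scale.index(value_y))
        total += weight * rank_gap / (len(scale) - 1)
    return total


def ordinal_similarity(
    x: Sequence[str],
    y: Sequence[str],
    scales: Sequence[Sequence[str]] = SCALES,
    weights: Sequence[float] = WEIGHTS,
) -> float:
    """Assumed similarity: a decreasing function of the distance, 1 - d."""
    return 1.0 - ordinal_distance(x, y, scales, weights)


if __name__ == "__main__":
    a = ("secondary", "low")
    b = ("master", "high")
    print(ordinal_distance(a, b), ordinal_similarity(a, b))
```

Under these assumptions, identical objects have distance 0 and similarity 1, while objects at opposite ends of every scale have distance 1 and similarity 0. Unequally spaced preference scales could be modeled by replacing the rank positions with domain-specific numeric scores.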
Results. The developed approach has been implemented in an applied problem of determining the degree of similarity between objects described by ordered categorical features.
Conclusions. In this study, mathematical tools were developed to determine the similarity between structured data described by categorical attributes that can be ordered by a specific priority in the form of a ranking system with preferences. Their properties were analyzed. Experimental studies have shown that the data-processing logic is convenient and intuitive to follow when solving practical problems. The proposed approach may enable new meaningful research in data analysis. Prospects for further research lie in the experimental use of the proposed tools in practical tasks and in studying their effectiveness.
License
Copyright (c) 2023 Н. Е. Кондрук
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Creative Commons Licensing Notifications in the Copyright Notices
The journal allows the authors to hold the copyright without restrictions and to retain publishing rights without restrictions.
The journal allows readers to read, download, copy, distribute, print, search, or link to the full texts of its articles.
The journal allows reuse and remixing of its content in accordance with a Creative Commons CC BY-SA license.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License CC BY-SA that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.