EVALUATION OF INFORMATIVITY AND SELECTION OF INSTANCES BASED ON HASHING
DOI:
https://doi.org/10.15588/1607-3274-2020-3-12
Keywords:
Instance, attribute, informativeness, hashing, hash, reduction of the sample size.
Abstract
Context. Reducing data dimensionality when constructing diagnostic and recognition models requires selecting both the most informative instances and the most informative features. Performing these procedures separately is time-consuming because they are iterative and interdependent.
Objective. The purpose of this work is to reduce the time spent on data dimensionality reduction by creating a method for selecting the most informative instances based on hashing.
Method. A method is proposed for calculating the weights used to determine instance hashes. It determines the feature weights deterministically from the feature ranks, taking into account, for each feature, the number of equal partitions of its range, chosen as the minimum number sufficient to distinguish clusters along the feature axis with acceptable accuracy. This eliminates the need to iteratively enumerate combinations of features, to generate random projections of features, or to solve iterative optimization problems to find the best projection. As a result, the time spent on calculating the weights is significantly reduced, while the locality sensitivity of the hash is preserved. The resulting hashes can be used for both instance selection and feature selection.
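The abstract does not give the weight formula, so the following is only a minimal sketch of one deterministic scheme consistent with the description above: a mixed-radix positional code. The function names, the rank convention (0 = most important feature) and the min-max binning are assumptions made for illustration, not the author's definitions.

```python
import numpy as np

def mixed_radix_weights(partitions, ranks):
    """Assumed scheme: the least important feature gets weight 1, and each
    more important feature gets the product of the partition counts of all
    less important features (a positional, mixed-radix code)."""
    partitions = np.asarray(partitions, dtype=np.int64)
    order = np.argsort(ranks)[::-1]          # least important feature first
    weights = np.empty(len(partitions), dtype=np.int64)
    w = 1
    for j in order:
        weights[j] = w
        w *= partitions[j]
    return weights

def instance_hashes(X, partitions, weights):
    """Split each feature range into equal intervals and combine the
    interval indices into one integer hash per instance."""
    X = np.asarray(X, dtype=float)
    partitions = np.asarray(partitions, dtype=np.int64)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant features
    idx = np.floor((X - lo) / span * partitions).astype(np.int64)
    idx = np.minimum(idx, partitions - 1)    # put the maximum into the last interval
    return idx @ np.asarray(weights, dtype=np.int64)
```

In this sketch, instances that fall into the same intervals on every feature receive the same hash, and a change in a higher-ranked feature moves the hash further than a change in a lower-ranked one, which is one simple way to obtain locality-sensitive behaviour without random projections or iterative search.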
A method is proposed for determining the individual and group significance of sample instances. It uses the distance between instance hashes as a measure of similarity and, by analogy with the potential method, finds the potentials induced by the classes at each instance. From these potentials it derives the significance indicators of the instances, on the premise that an instance in the feature space is more informative the smaller the minimum difference between the class potentials induced at it.
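As a rough illustration of the potential-based idea, the sketch below takes precomputed instance hashes and class labels. The kernel 1 / (1 + distance), the use of unnormalized sums as class potentials, and the final mapping of the potential gap to a score are all assumptions chosen for simplicity; the abstract does not specify them. At least two classes are required.

```python
import numpy as np
from itertools import combinations

def class_potentials(hashes, labels):
    """Potential induced by each class at each instance, using the
    distance between hashes as the (dis)similarity measure."""
    h = np.asarray(hashes, dtype=float)
    y = np.asarray(labels)
    classes = np.unique(y)
    kernel = 1.0 / (1.0 + np.abs(h[:, None] - h[None, :]))  # assumed kernel
    np.fill_diagonal(kernel, 0.0)            # an instance does not act on itself
    pots = np.stack([kernel[:, y == c].sum(axis=1) for c in classes], axis=1)
    return classes, pots

def instance_significance(hashes, labels):
    """Higher score = smaller minimum gap between class potentials,
    i.e. the instance lies closer to a class boundary."""
    _, pots = class_potentials(hashes, labels)
    pairs = combinations(range(pots.shape[1]), 2)
    gaps = np.stack([np.abs(pots[:, a] - pots[:, b]) for a, b in pairs], axis=1)
    return 1.0 / (1.0 + gaps.min(axis=1))
```

For hashes `[0, 1, 4, 5]` with labels `[0, 0, 1, 1]`, the two middle instances receive the higher scores, since the potentials of the two classes are nearly balanced there.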
A method is proposed for estimating the informativeness of features. It normalizes the weights obtained during hash formation to derive feature informativeness indicators, giving preference to features with a smaller number of partitions.
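One hypothetical way to turn hash weights into informativeness scores is sketched below. Dividing each weight by its partition count and normalizing the result to sum to one are assumptions; the abstract states only that the weights are normalized and that features with fewer partitions are preferred.

```python
import numpy as np

def feature_informativeness(weights, partitions):
    """Assumed scoring: weight divided by the number of partitions,
    then normalized so that the scores sum to 1."""
    w = np.asarray(weights, dtype=float)
    k = np.asarray(partitions, dtype=float)
    raw = w / k                              # fewer partitions -> larger score
    return raw / raw.sum()
```

With equal weights, a feature split into 2 intervals scores twice as high as one split into 4 under this rule.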
Results. An experimental study has been carried out and has confirmed the efficiency of the proposed methods in solving practical problems.
Conclusions. The developed software can be recommended for solving data dimensionality reduction problems.
License
Copyright (c) 2020 С. А. Субботин
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.