K-NN’S NEAREST NEIGHBORS METHOD FOR CLASSIFYING TEXT DOCUMENTS BY THEIR TOPICS
DOI: https://doi.org/10.15588/1607-3274-2023-3-9
Keywords: method, cluster, classification, text document, subject, ball tree algorithm, metric
Abstract
Context. The paper optimizes the k-nearest neighbors (k-NN) method for classifying text documents by topic and solves the classification problem experimentally on the basis of this method.
Objective. The aim of the study is to investigate the k-nearest neighbors (k-NN) method for classifying text documents by topic. The task is to classify the documents of a dataset by topic in optimal time and with high accuracy.
Method. The k-nearest neighbors (k-NN) method is a metric algorithm for automatic object classification or regression. The k-NN algorithm stores all available data and categorizes a new point based on its distance to every point of the training set, using a chosen distance metric such as the Euclidean distance. Because k-NN stores the entire training set and defers computation to the moment of classification, it belongs to the class of “lazy” algorithms. The algorithm is nonparametric and makes no assumptions about the distribution of the data. The task of the k-NN algorithm is to assign a category to a test document x based on the categories of its k nearest neighbors in the training dataset. The similarity between the test document x and each nearest neighbor is scored in favor of the category to which that neighbor belongs. If several of the k nearest neighbors belong to the same category, the similarity score of that category for the test document x is the sum of the scores contributed by each of these neighbors. The categories are then ranked by score, and the test document is assigned to the category with the highest score.
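As an illustration of this procedure, the sketch below builds a TF-IDF representation of a document corpus and classifies held-out documents with scikit-learn's KNeighborsClassifier. It is a minimal example, not the paper's implementation: the 20 Newsgroups corpus and all parameter values are stand-ins chosen for illustration, since the abstract does not name the dataset or the settings used.

# A minimal sketch of k-NN topic classification with scikit-learn.
# The 20 Newsgroups corpus is used only as an illustrative stand-in;
# the abstract does not name the dataset used in the paper.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

# Turn raw documents into TF-IDF vectors so that a distance metric applies.
vectorizer = TfidfVectorizer(max_features=20000, stop_words="english")
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

# "Lazy" learner: fit() only stores the training vectors; the actual work
# (finding the k nearest neighbors and voting on a category) happens in predict().
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, train.target)

print("accuracy:", accuracy_score(test.target, knn.predict(X_test)))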
Results. The k-NN method for classifying text documents has been implemented successfully. Experiments were conducted with the factors that affect the efficiency of k-NN, such as the choice of neighbor-search algorithm and of distance metric. The experiments showed that an appropriate choice of these settings improves both classification accuracy and the efficiency of the model.
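The kind of comparison described above can be reproduced in outline as follows: the same classifier is refitted with different neighbor-search algorithms and distance metrics. The grid of settings is hypothetical and continues from the previous sketch; because scikit-learn's ball tree does not accept sparse input, the TF-IDF vectors are first reduced with truncated SVD.

# A sketch of comparing neighbor-search algorithms and distance metrics.
# X_train, X_test, train, test come from the previous sketch; the grid of
# configurations is hypothetical, not the list of settings tested in the paper.
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# ball_tree requires dense input, so reduce the sparse TF-IDF matrix first.
svd = TruncatedSVD(n_components=100, random_state=0)
X_train_dense = svd.fit_transform(X_train)
X_test_dense = svd.transform(X_test)

for algorithm in ("brute", "ball_tree"):
    for metric in ("euclidean", "manhattan"):
        knn = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm, metric=metric)
        knn.fit(X_train_dense, train.target)
        acc = accuracy_score(test.target, knn.predict(X_test_dense))
        print(f"{algorithm:9s} {metric:9s} accuracy={acc:.3f}")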
Conclusions. Comparing the results across different metrics and algorithms showed that the choice of a particular algorithm and metric can have a significant impact on prediction accuracy. Applying the ball tree algorithm, as well as using different metrics such as the Manhattan or Euclidean distance, can improve the results. Clustering the data before applying k-NN was shown to have a positive effect: it groups the data better and reduces the impact of noise and misclassified points, which improves accuracy and class distribution.
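One plausible reading of “clustering before applying k-NN” is sketched below as a hypothetical pre-processing step: the training vectors are clustered with k-means and the points farthest from their centroids are discarded as likely noise before the k-NN model is fitted. The abstract does not specify the clustering procedure actually used, so this is an assumption.

# Hypothetical clustering-based noise filtering before k-NN; the abstract does
# not describe the exact procedure. Continues from the previous sketches.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

kmeans = KMeans(n_clusters=20, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_train_dense)
# Distance of every training point to its own cluster centroid.
dist = np.linalg.norm(X_train_dense - kmeans.cluster_centers_[labels], axis=1)

# Keep the 90% of points closest to their centroids; treat the rest as noise.
keep = dist < np.percentile(dist, 90)
knn = KNeighborsClassifier(n_neighbors=5, algorithm="ball_tree", metric="manhattan")
knn.fit(X_train_dense[keep], train.target[keep])
print("accuracy:", accuracy_score(test.target, knn.predict(X_test_dense)))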
License
Copyright (c) 2023 Н. І. Бойко, В. Ю. Михайлишин
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Creative Commons Licensing Notifications in the Copyright Notices
The journal allows the authors to hold the copyright without restrictions and to retain publishing rights without restrictions.
The journal allows readers to read, download, copy, distribute, print, search, or link to the full texts of its articles.
The journal allows reuse and remixing of its content in accordance with a Creative Commons CC BY-SA license.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License CC BY-SA that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.