K-NN’S NEAREST NEIGHBORS METHOD FOR CLASSIFYING TEXT DOCUMENTS BY THEIR TOPICS
DOI: https://doi.org/10.15588/1607-3274-2023-3-9
Keywords: method, cluster, classification, text document, subject, ball tree algorithm, metric
Abstract
Context. The paper optimizes the k-nearest neighbors (k-NN) method for classifying text documents by topic and solves the classification problem experimentally on the basis of this method.
Objective. The aim of the study is to investigate the k-nearest neighbors (k-NN) method for classifying text documents by topic. The task is to classify the documents of a dataset by topic in optimal time and with high accuracy.
Method. The k-nearest neighbors (k-NN) method is a metric algorithm for automatic object classification or regression. The k-NN algorithm stores all available data and categorizes a new point based on its distance to every point of the training set, using a chosen distance metric such as the Euclidean distance. Because k-NN stores the entire training set and defers computation to the moment of classification, it belongs to the class of “lazy” algorithms. The algorithm is nonparametric and makes no assumptions about the distribution of the data. The task of the k-NN algorithm is to assign a category to a test document x based on the categories of its k nearest neighbors in the training dataset. The similarity between the test document x and each nearest neighbor is scored in favor of the category to which that neighbor belongs. If several of the k nearest neighbors belong to the same category, the similarity score of that category for the test document x is the sum of the scores contributed by each of these neighbors. The categories are then ranked by score, and the test document is assigned to the category with the highest score.
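As an illustration of this procedure, the sketch below builds a TF-IDF representation of a document corpus and classifies held-out documents with scikit-learn's KNeighborsClassifier. It is a minimal example, not the paper's implementation: the 20 Newsgroups corpus and all parameter values are stand-ins chosen for illustration, since the abstract does not name the dataset or the settings used.

# A minimal sketch of k-NN topic classification with scikit-learn.
# The 20 Newsgroups corpus is used only as an illustrative stand-in;
# the abstract does not name the dataset used in the paper.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

# Turn raw documents into TF-IDF vectors so that a distance metric applies.
vectorizer = TfidfVectorizer(max_features=20000, stop_words="english")
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

# "Lazy" learner: fit() only stores the training vectors; the actual work
# (finding the k nearest neighbors and voting on a category) happens in predict().
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, train.target)

print("accuracy:", accuracy_score(test.target, knn.predict(X_test)))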
Results. The k-NN method for classifying text documents has been implemented successfully. Experiments were conducted with the factors that affect the efficiency of k-NN, such as the choice of neighbor-search algorithm and of distance metric. The experiments showed that an appropriate choice of these settings improves both classification accuracy and the efficiency of the model.
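The kind of comparison described above can be reproduced in outline as follows: the same classifier is refitted with different neighbor-search algorithms and distance metrics. The grid of settings is hypothetical and continues from the previous sketch; because scikit-learn's ball tree does not accept sparse input, the TF-IDF vectors are first reduced with truncated SVD.

# A sketch of comparing neighbor-search algorithms and distance metrics.
# X_train, X_test, train, test come from the previous sketch; the grid of
# configurations is hypothetical, not the list of settings tested in the paper.
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# ball_tree requires dense input, so reduce the sparse TF-IDF matrix first.
svd = TruncatedSVD(n_components=100, random_state=0)
X_train_dense = svd.fit_transform(X_train)
X_test_dense = svd.transform(X_test)

for algorithm in ("brute", "ball_tree"):
    for metric in ("euclidean", "manhattan"):
        knn = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm, metric=metric)
        knn.fit(X_train_dense, train.target)
        acc = accuracy_score(test.target, knn.predict(X_test_dense))
        print(f"{algorithm:9s} {metric:9s} accuracy={acc:.3f}")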
Conclusions. Comparing the results across different metrics and algorithms showed that the choice of a particular algorithm and metric can have a significant impact on prediction accuracy. Applying the ball tree algorithm, as well as using different metrics such as the Manhattan or Euclidean distance, can improve the results. Clustering the data before applying k-NN was shown to have a positive effect: it groups the data better and reduces the impact of noise and misclassified points, which improves accuracy and class distribution.
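One plausible reading of “clustering before applying k-NN” is sketched below as a hypothetical pre-processing step: the training vectors are clustered with k-means and the points farthest from their centroids are discarded as likely noise before the k-NN model is fitted. The abstract does not specify the clustering procedure actually used, so this is an assumption.

# Hypothetical clustering-based noise filtering before k-NN; the abstract does
# not describe the exact procedure. Continues from the previous sketches.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

kmeans = KMeans(n_clusters=20, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_train_dense)
# Distance of every training point to its own cluster centroid.
dist = np.linalg.norm(X_train_dense - kmeans.cluster_centers_[labels], axis=1)

# Keep the 90% of points closest to their centroids; treat the rest as noise.
keep = dist < np.percentile(dist, 90)
knn = KNeighborsClassifier(n_neighbors=5, algorithm="ball_tree", metric="manhattan")
knn.fit(X_train_dense[keep], train.target[keep])
print("accuracy:", accuracy_score(test.target, knn.predict(X_test_dense)))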
License
Copyright (c) 2023 Н. І. Бойко, В. Ю. Михайлишин
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Creative Commons Licensing Notifications in the Copyright Notices
The journal allows the authors to hold the copyright without restrictions and to retain publishing rights without restrictions.
The journal allows readers to read, download, copy, distribute, print, search, or link to the full texts of its articles.
The journal allows reuse and remixing of its content in accordance with a Creative Commons CC BY-SA license.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License CC BY-SA that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.