NATURAL LANGUAGE PROCESSING OF SOCIAL MEDIA TEXT DATA USING BERT AND XGBOOST

Authors

  • T. Batiuk, Lviv Polytechnic National University, Lviv, Ukraine
  • D. Dosyn, Lviv Polytechnic National University, Lviv, Ukraine

DOI:

https://doi.org/10.15588/1607-3274-2025-2-14

Keywords:

Machine learning, feature normalization, Transformers, confusion matrix, Sentence-BERT, text data classification

Abstract

Context. The growth of text data in social networks calls for effective sentiment analysis methods that account for both lexical and contextual dependencies. Traditional approaches to text processing are limited in how well they capture semantic relationships between words, which reduces classification accuracy. Integrating deep neural networks for text vectorization with ensemble machine learning algorithms and methods for interpreting results can improve the quality of sentiment analysis.
Objective. The aim of the study is to develop and evaluate a new approach to text message sentiment classification that combines Sentence-BERT for deep semantic vectorization, XGBoost for high-accuracy classification, SHAP for explaining the contribution of features, sentence embedding similarity for assessing semantic similarity, and λ-regularization to improve the generalization ability of the model. The study is aimed at analyzing the impact of these methods on the quality of classification, identifying the most significant features and optimizing parameters.
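The abstract does not say how λ-regularization is applied. One common reading, assumed here, is the L2 penalty on leaf weights in the XGBoost objective, where the optimal weight of a leaf is w* = -G / (H + λ), with G and H the sums of first and second derivatives of the loss over the examples in that leaf. The sketch below uses plain Python and a logistic loss to show how a larger λ shrinks the leaf weight; the function names and toy data are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function mapping a raw margin to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def leaf_weight(labels, margins, lam):
    """Optimal leaf weight w* = -G / (H + lam) for the logistic loss.

    labels  : 0/1 class labels of the examples that fall into the leaf
    margins : current raw model outputs (before the sigmoid) for those examples
    lam     : L2 regularization strength (assumed to be the paper's lambda)
    """
    grad_sum = 0.0
    hess_sum = 0.0
    for y, m in zip(labels, margins):
        p = sigmoid(m)
        grad_sum += p - y          # first derivative of log loss w.r.t. the margin
        hess_sum += p * (1.0 - p)  # second derivative of log loss w.r.t. the margin
    return -grad_sum / (hess_sum + lam)


# Toy leaf: three positive examples and one negative, all with margin 0.
labels = [1, 1, 1, 0]
margins = [0.0, 0.0, 0.0, 0.0]

for lam in (0.0, 1.0, 10.0):
    print(f"lambda={lam:>4}: leaf weight = {leaf_weight(labels, margins, lam):.4f}")
```

A larger λ pulls every leaf weight toward zero, so no single leaf can move a prediction very far. This is the mechanism by which such a penalty is usually expected to help generalization.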
Method. The study uses Sentence-BERT to map text data into a vector space that preserves deep semantic relationships. XGBoost is used for sentiment classification and provides high accuracy and stability even on imbalanced datasets. The SHAP method is used to explain the contribution of features, which makes it possible to determine which factors have the greatest impact on the prediction. Additionally, sentence embedding similarity is used to compare texts.
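Sentence embedding similarity is most often measured as the cosine of the angle between two embedding vectors. The abstract does not name the metric, so cosine similarity is an assumption here. The sketch below uses small hand-written vectors in place of real Sentence-BERT outputs, which typically have several hundred dimensions; the texts and numbers are purely illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical 4-dimensional stand-ins for Sentence-BERT embeddings.
embeddings = {
    "great phone, love it": [0.9, 0.1, 0.3, 0.0],
    "really happy with this phone": [0.8, 0.2, 0.4, 0.1],
    "terrible service, never again": [-0.7, 0.6, 0.1, 0.2],
}

texts = list(embeddings)
for i in range(len(texts)):
    for j in range(i + 1, len(texts)):
        sim = cosine_similarity(embeddings[texts[i]], embeddings[texts[j]])
        print(f"{texts[i]!r} vs {texts[j]!r}: {sim:.3f}")
```

With these toy vectors the two positive texts score close to 1, while the negative text scores below 0 against the first one. This is the pattern one would hope to see from real embeddings of semantically similar and dissimilar messages.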
Results. The proposed approach demonstrates high efficiency in sentiment classification tasks. The ROC-AUC value confirms the ability of the model to accurately distinguish between sentiment classes. The use of SHAP ensures the interpretability of the results, making it possible to explain the influence of each feature on the classification. Sentence embedding similarity confirms the efficiency of Sentence-BERT in detecting semantically similar texts, and λ-regularization improves the generalization ability of the model.
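For readers unfamiliar with the metric, ROC-AUC for a binary task equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it directly from that definition in plain Python; the labels and scores are made-up illustrative values, not results from the study.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels : 0/1 ground-truth classes
    scores : predicted scores, higher meaning "more likely positive"
    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one example of each class")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Illustrative labels and classifier scores (not data from the paper).
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.7]
print(f"ROC-AUC = {roc_auc(labels, scores):.3f}")
```

The pairwise form is quadratic in the number of examples and is meant only to make the definition concrete. Library implementations use a sort-based computation that gives the same value.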
Conclusions. The study demonstrates scientific novelty through a comprehensive combination of Sentence-BERT, XGBoost, SHAP, sentence embedding similarity, and λ-regularization to improve the accuracy and interpretability of sentiment analysis. The results obtained confirm the effectiveness of the proposed approach, which makes it promising for application in public opinion monitoring, automated content moderation, and personalized recommendation systems. Further research can be aimed at adapting the model to specific domains and improving interpretation methods.

Author Biographies

T. Batiuk, Lviv Polytechnic National University, Lviv, Ukraine

Post-graduate student of Information Systems and Networks Department

D. Dosyn, Lviv Polytechnic National University, Lviv, Ukraine

Doctor of Sciences, Professor of Information Systems and Networks Department

Published

2025-06-29

How to Cite

Batiuk, T., & Dosyn, D. (2025). NATURAL LANGUAGE PROCESSING OF SOCIAL MEDIA TEXT DATA USING BERT AND XGBOOST. Radio Electronics, Computer Science, Control, (2), 154–167. https://doi.org/10.15588/1607-3274-2025-2-14

Section

Progressive information technologies