NATURAL LANGUAGE PROCESSING OF SOCIAL MEDIA TEXT DATA USING BERT AND XGBOOST
DOI:
https://doi.org/10.15588/1607-3274-2025-2-14
Keywords:
Machine learning, feature normalization, Transformers, confusion matrix, Sentence-BERT, text data classification
Abstract
Context. The growth of text data in social networks calls for effective sentiment analysis methods that account for both lexical and contextual dependencies. Traditional text processing approaches are limited in their ability to capture semantic relationships between words, which reduces classification accuracy. Combining deep neural networks for text vectorization with ensemble machine learning algorithms and methods for interpreting results can improve the quality of sentiment analysis.
Objective. The aim of the study is to develop and evaluate a new approach to text message sentiment classification that combines Sentence-BERT for deep semantic vectorization, XGBoost for high-accuracy classification, SHAP for explaining the contribution of features, sentence embedding similarity for assessing semantic similarity, and λ-regularization to improve the generalization ability of the model. The study is aimed at analyzing the impact of these methods on the quality of classification, identifying the most significant features and optimizing parameters.
Method. The study uses Sentence-BERT to map text data into a vector space that preserves deep semantic relationships. XGBoost performs the sentiment classification and provides high accuracy and stability even on unevenly distributed datasets. The SHAP method explains the contribution of each feature, which makes it possible to determine which factors have the greatest impact on a prediction. Additionally, sentence embedding similarity is used to compare texts.
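The comparison of texts by sentence embedding similarity can be illustrated with a minimal sketch. It assumes cosine similarity as the measure, which the abstract does not specify, and it uses toy three-dimensional vectors in place of real Sentence-BERT embeddings. The function name `cosine_similarity` and all values are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return the cosine of the angle between two embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors standing in for sentence embeddings (assumed values).
emb_a = [1.0, 2.0, 0.0]
emb_b = [2.0, 4.0, 0.0]  # same direction as emb_a
emb_c = [0.0, 0.0, 3.0]  # orthogonal to emb_a

sim_ab = cosine_similarity(emb_a, emb_b)  # close to 1: same direction
sim_ac = cosine_similarity(emb_a, emb_c)  # 0: no shared direction
```

Under this assumption, texts whose embeddings point in nearly the same direction score close to 1, while unrelated texts score close to 0.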
Results. The proposed approach demonstrates high efficiency in sentiment classification tasks. The ROC-AUC value confirms the ability of the model to distinguish accurately between the sentiment classes of a text. The use of SHAP ensures the interpretability of the results by explaining the influence of each feature on the classification. Sentence embedding similarity confirms the efficiency of Sentence-BERT in detecting semantically similar texts, and λ-regularization improves the generalization ability of the model.
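For readers unfamiliar with the ROC-AUC metric mentioned above, the following sketch computes it for a binary case directly from its pairwise definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The labels and scores are invented for illustration and are not results from the study. The function name `roc_auc` is hypothetical.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))


# Invented labels (1 = positive sentiment) and model scores.
example_labels = [1, 0, 1, 0]
example_scores = [0.9, 0.8, 0.4, 0.1]
auc = roc_auc(example_labels, example_scores)  # 3 of 4 pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes. The quadratic loop is chosen for clarity and is practical only for small samples.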
Conclusions. The study demonstrates scientific novelty through a comprehensive combination of Sentence-BERT, XGBoost, SHAP, sentence embedding similarity, and λ-regularization to improve the accuracy and interpretability of sentiment analysis. The results obtained confirm the effectiveness of the proposed approach, which makes it promising for application in public opinion monitoring, automated content moderation, and personalized recommendation systems. Further research can be aimed at adapting the model to specific domains and improving interpretation methods.
License
Copyright (c) 2025 T. Batiuk, D. Dosyn

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.