INFORMATION TECHNOLOGY FOR RECOGNIZING PROPAGANDA, FAKES AND DISINFORMATION IN TEXTUAL CONTENT BASED ON NLP AND MACHINE LEARNING METHODS
DOI:
https://doi.org/10.15588/1607-3274-2024-2-13
Keywords:
disinformation, fake, propaganda, linguistic analysis, natural language processing, machine learning, cyber warfare, artificial intelligence, semantic analysis, information security
Abstract
Context. The research applies artificial intelligence to develop and improve cyber warfare tools, in particular for countering disinformation, fakes and propaganda on the Internet and for identifying sources of disinformation and the inauthentic behavior (bots) of coordinated groups. The project addresses the pressing problem of information manipulation in the media: countering distortion and disinformation effectively requires a reliable tool for recognizing these phenomena in textual data, so that a strategy for preventing their spread can then be developed.
Objective. The study aims to develop a tool for the automatic recognition of political propaganda in textual data, built on supervised machine learning and implemented using natural language processing methods.
Method. Propaganda is recognized at two levels: the level of the whole document and the level of individual sentences. Features were constructed using the TF-IDF statistical measure, the Bag of Words vectorization model, part-of-speech tagging, the word2vec model for obtaining vector representations of words, and the recognition of trigger words (intensifiers, absolute pronouns and “shiny” words). Logistic regression was used as the main modeling algorithm.
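The feature construction and modeling steps named above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the Python standard library, covers just TF-IDF weighting and a trigger-word feature (part-of-speech tags and word2vec vectors are omitted), and trains logistic regression with plain gradient descent. The trigger-word list, the four toy texts, the function names and the hyperparameters are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical trigger-word list (intensifiers and absolute words);
# the lexicon actually used in the study is not given in the abstract.
TRIGGER_WORDS = {"always", "never", "everyone", "nobody", "absolutely", "totally"}

def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def featurize(tokens, idf, vocab):
    """TF-IDF values over the vocabulary plus the share of trigger words."""
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    tfidf = [counts[term] / total * idf[term] for term in vocab]
    trigger_share = sum(counts[word] for word in TRIGGER_WORDS) / total
    return tfidf + [trigger_share]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=500):
    """Fit logistic regression weights and bias by batch gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict_proba(x, w, b):
    """Probability that the text behind feature vector x is propaganda."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Toy corpus invented for this sketch: 1 = propaganda, 0 = not propaganda.
texts = [
    "everyone knows they always lie and never tell the truth",
    "nobody can absolutely trust these totally corrupt traitors",
    "the council approved the budget for road repairs on monday",
    "the weather service expects light rain in the region tomorrow",
]
labels = [1, 1, 0, 0]

docs = [tokenize(t) for t in texts]
idf = fit_idf(docs)
vocab = sorted(idf)
X = [featurize(d, idf, vocab) for d in docs]
w, b = train_logreg(X, labels)
probs = [predict_proba(x, w, b) for x in X]
print([round(p, 2) for p in probs])
```

The same functions apply at either level described in the method: passing whole articles as `texts` gives a document-level classifier, while passing individual sentences gives a sentence-level one.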
Results. Machine learning models have been developed to recognize propaganda, fakes and disinformation at the document (article) level and at the sentence level. Both models achieve satisfactory scores, but the document-level model performs better: its score is about 20 percentage points higher (0.94 versus 0.74), or roughly 1.27 times the sentence-level score.
Conclusions. The created models recognize propaganda, fakes and disinformation in textual content using NLP and machine learning methods, with excellent results at the document level. The document-level (article-level) model correctly classified 6097 non-propaganda articles and 694 propaganda articles, and misclassified 123 propaganda articles and 285 non-propaganda articles. Its score is 0.9433, which equals the share of correctly classified articles (6791 of 7199). The sentence-level model correctly classified 205 propaganda sentences and 1917 non-propaganda sentences and misclassified 731 sentences. Its score is 0.7438 (2122 of 2853 correct).
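The abstract does not name the metric behind the two scores, but the reported counts reproduce them as plain accuracy. The short check below shows the arithmetic. It also derives precision and recall for the propaganda class at the document level, which the abstract does not report; they follow from the same counts on the assumption that propaganda is the positive class. The function and variable names are illustrative.

```python
def accuracy(correct, wrong):
    """Share of correctly classified items."""
    return correct / (correct + wrong)

# Document-level counts from the abstract (propaganda = positive class).
tp, tn = 694, 6097   # correctly classified propaganda / non-propaganda articles
fn, fp = 123, 285    # misclassified propaganda / non-propaganda articles

doc_accuracy = accuracy(tp + tn, fn + fp)
doc_precision = tp / (tp + fp)   # not reported in the abstract, derived here
doc_recall = tp / (tp + fn)      # not reported in the abstract, derived here

# Sentence-level counts: only the total number of errors is given.
sent_accuracy = accuracy(205 + 1917, 731)

print(f"document accuracy:  {doc_accuracy:.4f}")
print(f"document precision: {doc_precision:.4f}")
print(f"document recall:    {doc_recall:.4f}")
print(f"sentence accuracy:  {sent_accuracy:.4f}")
print(f"score ratio:        {doc_accuracy / sent_accuracy:.2f}")
```

Because only about one article in nine in the document-level test set is propaganda (817 of 7199), accuracy alone is dominated by the non-propaganda class, which is why per-class precision and recall are worth examining alongside it.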
License
Copyright (c) 2024 В. А. Висоцька
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.