MULTILINGUAL TEXT CLASSIFIER USING PRE-TRAINED UNIVERSAL SENTENCE ENCODER MODEL
DOI:
https://doi.org/10.15588/1607-3274-2022-3-10
Keywords:
few shot learning, low-data learning, pre-trained models, USE, neural networks, data mining, data set, text data classifier.
Abstract
Context. Online platforms and environments generate an ever-increasing volume of content, so automating the moderation of user-generated content remains a relevant task. Of particular note are cases in which, for one reason or another, very little data is available to train the classifier. To achieve good results under such conditions, it is important to build pre-trained models, trained on large and diverse data, into the classifier. This paper deals with the use of the pre-trained multilingual Universal Sentence Encoder (USE) model as a component of the developed classifier, and with the effect of hyperparameters on classification accuracy when training on a small amount of data (~0.05% of the data set).
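The idea of using a pre-trained encoder as a fixed component and training only a small classifier on top of its sentence vectors can be sketched as follows. This is a minimal illustration, not the authors' architecture: the logistic-regression head, the learning rate and the function `toy_encode` are assumptions made for the example. `toy_encode` is a hypothetical stand-in that produces two crude features per text; in the setting described above its place would be taken by the multilingual USE model.

```python
import numpy as np

def train_head(embeddings, labels, epochs=200, batch_size=8, lr=0.5, seed=0):
    """Fit a binary logistic-regression head on fixed sentence embeddings
    using mini-batch gradient descent. The encoder itself is never updated."""
    rng = np.random.default_rng(seed)
    n, dim = embeddings.shape
    w = np.zeros(dim)
    b = 0.0
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            x, y = embeddings[idx], labels[idx]
            p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted P(class 1)
            grad = p - y                             # d(loss)/d(logit)
            w -= lr * (x.T @ grad) / len(idx)
            b -= lr * grad.mean()
    return w, b

def predict(embeddings, w, b):
    """Return 0/1 class predictions for a matrix of embeddings."""
    return (embeddings @ w + b > 0).astype(int)

def toy_encode(texts):
    """Hypothetical stand-in for a pre-trained sentence encoder:
    [number of '!' characters, word count / 10] per text."""
    return np.array([[t.count("!"), len(t.split()) / 10.0] for t in texts],
                    dtype=float)

train_texts = ["buy now!!!", "win big!!", "free!!! money!",
               "see you tomorrow", "the meeting is at noon",
               "thanks for the report"]
train_labels = np.array([1, 1, 1, 0, 0, 0])

w, b = train_head(toy_encode(train_texts), train_labels)
train_predictions = predict(toy_encode(train_texts), w, b)
```

Because the encoder is passed in as a plain function, replacing `toy_encode` with a real pre-trained model would leave the training loop unchanged; only the embedding dimension would differ.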
Objective. The goal of this paper is to investigate how the pre-trained multilingual model and the choice of hyperparameters for training the text data classifier influence the classification result.
Method. To solve this problem, a relatively new approach is used: few-shot learning, that is, learning from a relatively small number of examples. Since text is still the dominant way of transmitting information, studying how to construct a text data classifier that learns from a small number of examples (~0.002–0.05% of the data set) is a topical problem.
Results. It is shown that even with a small number of training examples (36 per class), the use of USE together with an optimal training configuration makes it possible to achieve high classification accuracy on English and Russian data, which is extremely important when it is impossible to collect a large data set of one's own. The influence of the USE-based approach and of a set of different hyperparameter configurations on the result of the text data classifier is evaluated on English and Russian data sets.
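One way to build a training subset with a fixed number of examples per class, such as the 36 per class mentioned above, is sketched below. The function name `sample_per_class`, the seeding scheme and the synthetic data are assumptions for illustration; the abstract does not say how the authors drew their examples.

```python
import random
from collections import Counter, defaultdict

def sample_per_class(texts, labels, k=36, seed=0):
    """Draw k examples per class without replacement and return them
    as a shuffled list of (text, label) pairs."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)
    subset = []
    for label in sorted(by_label):
        pool = by_label[label]
        if len(pool) < k:
            raise ValueError(f"class {label!r} has only {len(pool)} examples")
        subset.extend((text, label) for text in rng.sample(pool, k))
    rng.shuffle(subset)  # avoid feeding the classifier one class at a time
    return subset

# Synthetic two-class data set, used only to exercise the function.
texts = [f"document {i}" for i in range(200)]
labels = [i % 2 for i in range(200)]

few_shot = sample_per_class(texts, labels, k=36)
class_counts = Counter(label for _, label in few_shot)
```

Fixing the seed keeps the subset reproducible, which matters when different hyperparameter configurations are to be compared on the same few examples.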
Conclusions. The experiments show that the correct selection of hyperparameters is highly significant. In particular, this paper considers the batch size, the optimizer, the number of training epochs and the percentage of the data set taken to train the classifier. In the course of the experiments, an optimal hyperparameter configuration was selected with which 86.46% classification accuracy on the Russian-language data set and 91.13% on the English-language data set can be achieved in ten seconds of training (training time can be significantly affected by the hardware used).
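A search over the four hyperparameters named above can be organized as an exhaustive grid, sketched below. All candidate values and the scoring function `toy_score` are hypothetical placeholders; they are not the configurations or accuracies from the experiments. In practice the scoring function would train the classifier with the given configuration and return its validation accuracy.

```python
from itertools import product

def hyperparameter_grid(batch_sizes, optimizers, epoch_counts, train_fractions):
    """Enumerate every combination of the four hyperparameters as a dict."""
    keys = ("batch_size", "optimizer", "epochs", "train_fraction")
    combos = product(batch_sizes, optimizers, epoch_counts, train_fractions)
    return [dict(zip(keys, combo)) for combo in combos]

def best_config(configs, evaluate):
    """Return the configuration with the highest score under `evaluate`."""
    return max(configs, key=evaluate)

# Hypothetical candidate values, chosen only for illustration.
grid = hyperparameter_grid(
    batch_sizes=[8, 16],
    optimizers=["adam", "sgd"],
    epoch_counts=[10, 20],
    train_fractions=[0.0002, 0.0005],
)

def toy_score(config):
    """Placeholder for 'train and measure accuracy': prefers more epochs
    and smaller batches. Not a real accuracy measurement."""
    return config["epochs"] - config["batch_size"]

best = best_config(grid, toy_score)
```

An exhaustive grid is practical here only because each training run is short; with longer runs, random search over the same ranges would be a common alternative.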
License
Copyright (c) 2022 O. V. Orlovskiy, Khalili Sohrab, S. E. Ostapov, K. P. Hazdyuk, L. M. Shumylyak
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.