EVALUATION OF QUANTIZED LARGE LANGUAGE MODELS IN THE TEXT SUMMARIZATION PROBLEM
DOI: https://doi.org/10.15588/1607-3274-2025-2-12
Keywords: limited resources, natural language processing, text summarization, large language models, quantization, multicriteria analysis
Abstract
Context. The paper considers the problem of improving the memory and energy efficiency of deep artificial neural networks, together with the multi-criteria evaluation of the output quality of large language models (LLMs) in the text summarization task, taking user judgments into account. The object of the study is the process of automated text summarization based on LLMs.
Objective. The goal of the work is to find a compromise between the complexity of an LLM, its performance, and its operational efficiency in the text summarization problem.
Method. A multi-criteria LLM evaluation algorithm is proposed that allows choosing the most appropriate LLM for text summarization by finding an acceptable compromise between model complexity, performance, and summarization quality. In natural language processing, significant accuracy gains are often achieved with excessively deep and over-parameterized networks, which severely limits their use in real-time inference tasks, where high accuracy is required under limited resources. The proposed algorithm selects an acceptable LLM based on multiple criteria, such as the accuracy metrics BLEU, ROUGE-1, ROUGE-2, ROUGE-L and BERTScore, summarization speed, or other criteria defined by the user for a specific practical task of intelligent data analysis. The algorithm includes analysis and improvement of the consistency of user judgments and evaluation of the LLMs with respect to each criterion.
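The abstract names the criteria but does not reproduce the aggregation and consistency-analysis procedure itself. Below is a minimal Python sketch of one plausible realization: it assumes an AHP-style pairwise comparison matrix for deriving criteria weights, Saaty's consistency ratio as the consistency check, and a weighted sum over min-max-normalized criterion values for ranking; the matrix entries, criterion values, and model list are purely illustrative and are not taken from the paper.

# Hypothetical sketch of a multi-criteria LLM ranking step (not the authors'
# exact algorithm): derive criteria weights from a pairwise comparison matrix,
# check the consistency of the judgments, then rank models by a weighted sum
# of normalized criterion values.
import numpy as np

# Pairwise comparison matrix over 3 example criteria
# (summary quality, inference speed, model size); values are illustrative.
P = np.array([
    [1.0, 3.0, 5.0],
    [1/3, 1.0, 2.0],
    [1/5, 1/2, 1.0],
])

# Principal eigenvector gives the criteria weights.
eigvals, eigvecs = np.linalg.eig(P)
k = np.argmax(eigvals.real)
weights = np.abs(eigvecs[:, k].real)
weights /= weights.sum()

# Saaty's consistency ratio: CR < 0.1 is commonly treated as acceptable.
n = P.shape[0]
ci = (eigvals.real[k] - n) / (n - 1)
ri = {3: 0.58, 4: 0.90, 5: 1.12}[n]   # random consistency index
cr = ci / ri
print(f"weights={weights.round(3)}, CR={cr:.3f}")

# Per-model criterion values (ROUGE-L, tokens/s, 1/size_GB); illustrative only.
models = {
    "LLaMa-3-8B-4bit": np.array([0.28, 35.0, 1 / 5.7]),
    "Gemma-2B-4bit":   np.array([0.22, 90.0, 1 / 1.6]),
    "Phi-2-4bit":      np.array([0.24, 70.0, 1 / 1.8]),
}

# Min-max normalize each criterion across models, then take the weighted sum.
X = np.vstack(list(models.values()))
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)
scores = X_norm @ weights
for name, score in sorted(zip(models, scores), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")

In such a setup, judgments with CR ≥ 0.1 would be returned to the user for revision before ranking, which corresponds to the "analysis and improvement of the consistency of user judgments" mentioned above.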
Results. Software is developed for automatically extracting texts from online articles and summarizing them. Nineteen quantized and non-quantized LLMs of various sizes were evaluated, including LLaMa-3-8B-4bit, Gemma-2B-4bit, Gemma-1.1-7B-4bit, Qwen-1.5-4B-4bit, Stable LM-2-1.6B-4bit, Phi-2-4bit, Mistral-7B-4bit, GPT-3.5 Turbo and other LLMs, in terms of BLEU, ROUGE-1, ROUGE-2, ROUGE-L and BERTScore on two different datasets: XSum and CNN/Daily Mail 3.0.0.
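The abstract does not show the evaluation pipeline itself. The following minimal sketch assumes the Hugging Face datasets and evaluate libraries and uses a placeholder summarize() function standing in for whichever (quantized) LLM is being scored; the split size and the lead-sentence baseline are illustrative, not the authors' setup.

# Minimal sketch (assumed tooling, not the authors' exact pipeline): score one
# model's summaries on CNN/Daily Mail 3.0.0 with BLEU, ROUGE-1/2/L and BERTScore.
import evaluate
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", "3.0.0", split="test[:10]")
references = dataset["highlights"]

# `summarize` is a stand-in for the evaluated LLM (quantized or not); replace it
# with the actual model call (e.g., a llama.cpp or OpenAI API invocation).
def summarize(article: str) -> str:
    return article.split("\n")[0]          # placeholder: lead-sentence baseline

predictions = [summarize(a) for a in dataset["article"]]

bleu = evaluate.load("bleu").compute(predictions=predictions,
                                     references=[[r] for r in references])
rouge = evaluate.load("rouge").compute(predictions=predictions,
                                       references=references)
bert = evaluate.load("bertscore").compute(predictions=predictions,
                                          references=references, lang="en")

print({"BLEU": bleu["bleu"],
       "ROUGE-1": rouge["rouge1"], "ROUGE-2": rouge["rouge2"],
       "ROUGE-L": rouge["rougeL"],
       "BERTScore-F1": sum(bert["f1"]) / len(bert["f1"])})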
Conclusions. The conducted experiments have confirmed the functionality of the proposed software and allow recommending it for practical use in text summarization tasks. Prospects for further research include a deeper analysis of the metrics and criteria for evaluating the quality of generated texts and experimental study of the proposed algorithm on a larger number of practical natural language processing tasks.
Copyright (c) 2025 N. I. Nedashkovskaya, R. I. Yeremichuk

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.