EVALUATION OF QUANTIZED LARGE LANGUAGE MODELS IN THE TEXT SUMMARIZATION PROBLEM

Authors

  • N. I. Nedashkovskaya, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Kyiv, Ukraine
  • R. I. Yeremichuk, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Kyiv, Ukraine

DOI:

https://doi.org/10.15588/1607-3274-2025-2-12

Keywords:

limited resources, natural language processing, text summarization, large language models, quantization, multicriteria analysis

Abstract

Context. This work considers the problem of increasing the efficiency of deep artificial neural networks in terms of memory and energy consumption, together with the multi-criteria evaluation of the quality of large language model (LLM) results that takes user judgments into account in the text summarization task. The object of the study is the process of automated text summarization based on LLMs.
Objective. The goal of the work is to find a compromise between the complexity of an LLM, its performance, and its operational efficiency in the text summarization problem.
Method. An LLM evaluation algorithm based on multiple criteria is proposed. It allows choosing the most appropriate LLM for text summarization by finding an acceptable compromise between the complexity of the model, its performance, and the quality of the produced summaries. A significant improvement in the accuracy of neural-network results in natural language processing tasks is often achieved by using models that are excessively deep and over-parameterized, which severely limits their use for real-time inference, where high accuracy is required under conditions of limited resources. The proposed algorithm selects an acceptable LLM based on multiple criteria, such as the accuracy metrics BLEU, ROUGE-1, ROUGE-2, ROUGE-L and BERTScore, text summarization speed, or other criteria defined by the user in a specific practical task of intelligent data analysis. The algorithm includes analysis and improvement of the consistency of user judgments and evaluation of the LLMs with respect to each criterion.
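To illustrate the kind of multi-criteria aggregation described above, the following Python sketch derives criteria weights from a user's pairwise comparison matrix, checks the consistency of those judgments, and ranks candidate models by a weighted sum of normalized scores. All criterion names, judgments and per-model scores are hypothetical, and the paper's actual weighting and consistency-improvement steps may differ; this is only an assumed, minimal variant.

# Minimal multi-criteria ranking sketch (hypothetical numbers, not the authors' algorithm).
import numpy as np

RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12}      # Saaty random consistency indices

def weights_and_consistency(P):
    """Principal-eigenvector weights and consistency ratio of a pairwise matrix P."""
    n = P.shape[0]
    eigvals, eigvecs = np.linalg.eig(P)
    k = int(np.argmax(eigvals.real))
    w = np.abs(eigvecs[:, k].real)
    w = w / w.sum()
    ci = (eigvals[k].real - n) / (n - 1)               # consistency index
    cr = ci / RI[n] if RI[n] else 0.0                  # consistency ratio
    return w, cr

# Hypothetical user judgments over four criteria: ROUGE-L, BERTScore, BLEU, tokens/s.
P = np.array([[1,   2,   3,   5],
              [1/2, 1,   2,   4],
              [1/3, 1/2, 1,   3],
              [1/5, 1/4, 1/3, 1]])
w, cr = weights_and_consistency(P)
if cr >= 0.1:
    raise ValueError("judgments too inconsistent; ask the user to revise the comparisons")

# Hypothetical per-model measurements, columns in the criterion order above.
scores = {
    "llama-3-8b-4bit": [0.29, 0.87, 0.11, 35.0],
    "gemma-2b-4bit":   [0.24, 0.85, 0.08, 90.0],
    "mistral-7b-4bit": [0.28, 0.86, 0.10, 40.0],
}
M = np.array(list(scores.values()))
M = M / M.max(axis=0)                                  # normalize each benefit criterion to [0, 1]
for name, s in sorted(zip(scores, M @ w), key=lambda t: -t[1]):
    print(f"{name}: weighted score {s:.3f}")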
Results. Software has been developed for automatically extracting texts from online articles and summarizing them. Nineteen quantized and non-quantized LLMs of various sizes were evaluated, including LLaMA-3-8B-4bit, Gemma-2B-4bit, Gemma-1.1-7B-4bit, Qwen-1.5-4B-4bit, Stable LM-2-1.6B-4bit, Phi-2-4bit, Mistral-7B-4bit, GPT-3.5 Turbo and other LLMs, in terms of BLEU, ROUGE-1, ROUGE-2, ROUGE-L and BERTScore on two different datasets: XSum and CNN/Daily Mail 3.0.0.
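As an illustration of how such an evaluation can be reproduced, the sketch below scores machine summaries of a CNN/Daily Mail 3.0.0 sample with BLEU, ROUGE and BERTScore, assuming the Hugging Face datasets and evaluate packages. The summarize() function is only a placeholder for whatever quantized or non-quantized LLM backend is used (for example a llama.cpp model); this is not the authors' software.

# Evaluation sketch using Hugging Face `datasets` and `evaluate` (placeholder summarizer).
from datasets import load_dataset
import evaluate

def summarize(text: str) -> str:
    # Placeholder: call a quantized LLM here (e.g. llama-cpp-python, transformers, or an API).
    return text.split(".")[0] + "."                    # naive first-sentence baseline

sample = load_dataset("cnn_dailymail", "3.0.0", split="test[:20]")
predictions = [summarize(article) for article in sample["article"]]
references = sample["highlights"]

rouge = evaluate.load("rouge").compute(predictions=predictions, references=references)
bleu = evaluate.load("bleu").compute(predictions=predictions, references=references)
bert = evaluate.load("bertscore").compute(predictions=predictions,
                                          references=references, lang="en")

print({"rouge1": rouge["rouge1"], "rouge2": rouge["rouge2"], "rougeL": rouge["rougeL"],
       "bleu": bleu["bleu"], "bertscore_f1": sum(bert["f1"]) / len(bert["f1"])})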
Conclusions. The conducted experiments have confirmed the functionality of the proposed software and allow us to recommend it for practical use in solving text summarization problems. Prospects for further research include a deeper analysis of the metrics and criteria for evaluating the quality of generated texts and experimental study of the proposed algorithm on a larger number of practical natural language processing tasks.

Author Biographies

N. I. Nedashkovskaya, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Kyiv, Ukraine

Dr. Sc., Associate Professor at the Department of Mathematical Methods of System Analysis, Institute for Applied Systems Analysis

R. I. Yeremichuk, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Kyiv, Ukraine

Bachelor of Systems Analysis

Published

2025-06-29

How to Cite

Nedashkovskaya, N. I., & Yeremichuk, R. I. (2025). EVALUATION OF QUANTIZED LARGE LANGUAGE MODELS IN THE TEXT SUMMARIZATION PROBLEM. Radio Electronics, Computer Science, Control, (2), 133–147. https://doi.org/10.15588/1607-3274-2025-2-12

Section

Neuroinformatics and intelligent systems