EVALUATING FAULT RECOVERY IN DISTRIBUTED APPLICATIONS FOR STREAM PROCESSING APPLICATIONS: BUSINESS INSIGHTS BASED ON METRICS

Authors

  • A. V. Bashtovyi, Lviv Polytechnic National University, Lviv, Ukraine
  • A. V. Fechan, Lviv Polytechnic National University, Lviv, Ukraine

DOI:

https://doi.org/10.15588/1607-3274-2025-3-2

Keywords:

fault tolerance, Kafka Streams, benchmarking, distributed systems, performance measurement, stream processing, SLO (service level objectives)

Abstract

Context. Stream processing frameworks are widely used across industries like finance, e-commerce, and IoT to process real-time data streams efficiently. However, most benchmarking methodologies fail to replicate production-like environments, resulting in an incomplete evaluation of fault recovery performance. The object of this study is to evaluate stream processing frameworks under realistic conditions, considering preloaded state stores and business-oriented metrics.
Objective. The aim of this study is to propose a novel benchmarking methodology that simulates production environments with varying disk load states and introduces SLO-based metrics to assess the fault recovery performance of stream processing frameworks.
Method. The methodology is based on a series of experiments run on synthetic data generated by an application using Kafka Streams in a Docker-based virtualized environment. The experiments evaluate system performance under three disk load scenarios: 0%, 50%, and 80% disk utilization. Synthetic failures are introduced during runtime, and key metrics such as throughput, latency, and consumer lag are tracked using JMX, Prometheus, and Grafana. The Business Fault Tolerance Impact (BFTI) metric is introduced to aggregate technical indicators into a single simplified value that reflects the business impact of fault recovery.
Results. The developed indicators have been implemented in software and investigated for solving the problems of Fisher’s Iris classification. An approach for evaluating fault tolerance in distributed stream processing systems has been implemented, and the effect of different disk utilization levels on system performance has been investigated.
Conclusions. The findings underscore the importance of simulating real-world production environments in stream processing benchmarks. The experiments demonstrate that disk load significantly affects fault recovery performance. Systems with disk utilization exceeding 80% show recovery times 2.7 times longer and latency up to five times higher than at 0% disk load. The introduction of SLO-based metrics highlights the connection between system performance and business outcomes, providing stakeholders with more intuitive insights into application resilience. The BFTI metric provides a novel approach to translating technical performance into business-relevant indicators. Future work should explore adaptive SLO-based metrics, framework comparisons, and long-term performance studies to further bridge the gap between technical benchmarks and business needs.
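
The abstract does not give the BFTI formula, so the following is only a minimal sketch of how SLO-based indicators of the kind described in the Method paragraph could be derived from latency and consumer-lag samples and folded into one value. The `Sample` type, the SLO thresholds, the recovery budget, the weight, and the `impact_score` aggregation are all hypothetical and should not be read as the authors' definition of BFTI.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    t: float           # seconds since the start of the run
    latency_ms: float  # end-to-end latency observed at time t
    lag: int           # consumer lag (records) observed at time t


def recovery_time(samples: List[Sample], failure_t: float, lag_slo: int) -> Optional[float]:
    """Seconds from the failure until lag is back within the SLO for good.

    Returns 0.0 if lag never violates the SLO after the failure, and None
    if the run ends while lag is still above the SLO.
    """
    after = [s for s in samples if s.t >= failure_t]
    last_bad = None
    for i, s in enumerate(after):
        if s.lag > lag_slo:
            last_bad = i
    if last_bad is None:
        return 0.0
    if last_bad == len(after) - 1:
        return None
    return after[last_bad + 1].t - failure_t


def slo_compliance(samples: List[Sample], failure_t: float,
                   latency_slo_ms: float, lag_slo: int) -> float:
    """Fraction of post-failure samples that meet both the latency and lag SLOs."""
    after = [s for s in samples if s.t >= failure_t]
    if not after:
        return 1.0
    ok = sum(1 for s in after if s.latency_ms <= latency_slo_ms and s.lag <= lag_slo)
    return ok / len(after)


def impact_score(compliance: float, recovery_s: Optional[float],
                 recovery_budget_s: float, weight: float = 0.5) -> float:
    """Hypothetical aggregation into [0, 1]; 0 means no impact, 1 means worst case.

    This weighting is an illustrative assumption, not the BFTI formula.
    """
    if recovery_s is None:
        recovery_part = 1.0  # never recovered within the run
    else:
        recovery_part = min(recovery_s / recovery_budget_s, 1.0)
    return weight * (1.0 - compliance) + (1.0 - weight) * recovery_part
```

In a setup like the one described, the samples would typically come from metrics scraped during each disk load scenario, and the resulting scores could then be compared across the 0%, 50%, and 80% runs.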

Author Biographies

A. V. Bashtovyi, Lviv Polytechnic National University, Lviv, Ukraine

Post-graduate student of the Department of Software

A. V. Fechan, Lviv Polytechnic National University, Lviv, Ukraine

Dr. Sc., Professor of the Software Department

Published

2025-09-22

How to Cite

Bashtovyi, A. V., & Fechan, A. V. (2025). EVALUATING FAULT RECOVERY IN DISTRIBUTED APPLICATIONS FOR STREAM PROCESSING APPLICATIONS: BUSINESS INSIGHTS BASED ON METRICS. Radio Electronics, Computer Science, Control, (3), 17–27. https://doi.org/10.15588/1607-3274-2025-3-2

Section

Mathematical and computer modelling