EVALUATING FAULT RECOVERY IN DISTRIBUTED APPLICATIONS FOR STREAM PROCESSING APPLICATIONS: BUSINESS INSIGHTS BASED ON METRICS
DOI:
https://doi.org/10.15588/1607-3274-2025-3-2
Keywords:
fault-tolerance, Kafka Streams, benchmarking, distributed systems, performance measurement, stream processing, SLO (service level objectives)
Abstract
Context. Stream processing frameworks are widely used across industries like finance, e-commerce, and IoT to process real-time data streams efficiently. However, most benchmarking methodologies fail to replicate production-like environments, resulting in an incomplete evaluation of fault recovery performance. The object of this study is to evaluate stream processing frameworks under realistic conditions, considering preloaded state stores and business-oriented metrics.
Objective. The aim of this study is to propose a novel benchmarking methodology that simulates production environments with varying disk load states and introduces SLO-based metrics to assess the fault recovery performance of stream processing frameworks.
Method. The methodology involves a series of experiments on synthetic data generated by an application using Kafka Streams in a Docker-based virtualized environment. The experiments evaluate system performance under three disk load scenarios: 0%, 50%, and 80% disk utilization. Synthetic failures are introduced during runtime, and key metrics such as throughput, latency, and consumer lag are tracked using JMX, Prometheus, and Grafana. The Business Fault Tolerance Impact (BFTI) metric is introduced to aggregate technical indicators into a single simplified value that reflects the business impact of fault recovery.
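As an illustration of how tracked metrics can be turned into SLO-oriented indicators, the sketch below derives a recovery time and an SLO violation share from sampled consumer lag and latency. It is not the BFTI formula, which the abstract does not state. The `Sample` type, the function names, the thresholds, and the sample values are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Sample:
    t: float           # seconds since the start of the run
    lag: int           # consumer lag in records
    latency_ms: float  # latency observed at time t


def recovery_time(samples: Sequence[Sample], fault_t: float, lag_slo: int) -> Optional[float]:
    """Seconds from fault injection until lag is back within the SLO and stays there.

    Returns 0.0 if the SLO is never breached after the fault,
    and None if the run ends while the SLO is still breached.
    """
    breached = False
    recovered_at: Optional[float] = None
    for s in samples:
        if s.t < fault_t:
            continue
        if s.lag > lag_slo:
            breached = True
            recovered_at = None  # a new breach cancels any earlier recovery
        elif breached and recovered_at is None:
            recovered_at = s.t
    if not breached:
        return 0.0
    return None if recovered_at is None else recovered_at - fault_t


def violation_share(samples: Sequence[Sample], lag_slo: int, latency_slo_ms: float) -> float:
    """Fraction of samples that violate either the lag or the latency SLO."""
    if not samples:
        return 0.0
    bad = sum(1 for s in samples if s.lag > lag_slo or s.latency_ms > latency_slo_ms)
    return bad / len(samples)


# Hypothetical run: a fault is injected at t = 15 s.
samples = [
    Sample(0, 10, 50.0),
    Sample(10, 12, 55.0),
    Sample(20, 900, 400.0),
    Sample(30, 600, 300.0),
    Sample(40, 80, 90.0),
    Sample(50, 15, 60.0),
]

print(recovery_time(samples, fault_t=15.0, lag_slo=100))
print(violation_share(samples, lag_slo=100, latency_slo_ms=100.0))
```

Running the same two functions on runs recorded at 0%, 50%, and 80% disk utilization would give directly comparable numbers per scenario, under the assumption that the samples are taken at a fixed interval.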
Results. The developed indicators have been implemented in software and investigated for solving the problems of Fisher’s Iris classification. The approach for evaluating fault tolerance in distributed stream processing systems has been implemented; additionally, the effect of different disk utilization levels on system performance has been investigated.
Conclusions. The findings underscore the importance of simulating real-world production environments in stream processing benchmarks. The experiments demonstrate that disk load significantly affects fault recovery performance. Systems with disk utilization exceeding 80% show recovery times 2.7 times longer and latency degradation of up to fivefold compared with 0% disk load. The introduction of SLO-based metrics highlights the connection between system performance and business outcomes, providing stakeholders with more intuitive insights into application resilience. The BFTI metric provides a novel approach to translating technical performance into business-relevant indicators. Future work should explore adaptive SLO-based metrics, framework comparisons, and long-term performance studies to further bridge the gap between technical benchmarks and business needs.
License
Copyright (c) 2025 A. V. Bashtovyi, A. V. Fechan

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.