
UMINF 23.04

Towards Self-Driving Microservices

In recent years, microservice architecture has become a popular approach to software system design and development. Applications are built from many small services, each running as multiple instances that operate as independent processes. Because microservices are distributed, communication between services is challenging and grows more complex as the number of services increases. This complexity can lead to short-term failures that degrade application performance, so auto-tuning of inter-service communication is needed to prevent such failures. Service meshes were introduced to provide the technical capabilities required in such scenarios. In essence, a service mesh is an infrastructure layer consisting of a set of configurable proxies integrated with microservices. It enables traffic management policies such as circuit breaking and retry mechanisms, which enhance microservice resilience against transient failures. However, static configuration or misconfiguration of these mechanisms is unsuitable for the dynamic environment of microservices and can lead to serious performance problems, such as retry storms.
The goal of this thesis is three-fold. First, it investigates the impact and effectiveness of service traffic management on application reliability and availability in the presence of transient failures. Second, it focuses on auto-tuning service traffic management to increase carried throughput while maintaining response time. Third, it proposes measures to improve research reproducibility in the area of distributed systems, so that findings can be independently verified by others, and it offers detailed guidelines on best practices for implementing research software.
To achieve these goals, this thesis surveys the current state of the art in service meshes and eBPF-powered microservices, identifying current challenges and potential future directions. It analyzes the effects of circuit breaker and retry mechanisms on microservice performance and proposes adaptive controllers for both. The results demonstrate the need for such controllers, which increase throughput while maintaining the tail response time of the application. Additionally, the thesis proposes a microservice benchmark generator that enables systematic benchmark generation and improves reproducibility. It also provides recommendations for improving artifact evaluation in distributed systems research by compiling all existing recommendations.
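
The circuit breaking and retry mechanisms discussed in the abstract can be illustrated with a minimal Python sketch. It is not the adaptive controllers proposed in the thesis: the thresholds here are static, which is exactly the kind of configuration the abstract argues is unsuitable for dynamic environments. All names (`CircuitBreaker`, `call_with_retry`, `worst_case_amplification`) and default values are hypothetical and chosen only for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the service."""


class CircuitBreaker:
    """Statically configured breaker (hypothetical, for illustration only)."""

    def __init__(self, failure_threshold=3, reset_timeout=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            # Timeout elapsed: half-open, let one trial request through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()       # open (or re-open) the circuit
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def call_with_retry(breaker, fn, max_attempts=3):
    """Retry a failing call, but never retry against an open circuit."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            raise
        except Exception as exc:
            last_error = exc
    raise last_error


def worst_case_amplification(retries_per_hop, hops):
    """Worst-case requests reaching the last service when every hop retries."""
    return (1 + retries_per_hop) ** hops
```

The last function hints at why retry storms arise: if each of three chained services retries twice, one user request can turn into up to 27 requests at the deepest service. The breaker bounds that growth by failing fast once the failure threshold is reached.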

Keywords

Microservices, Autonomic Computing, Service Mesh, Reliability, Circuit Breaker, Retry, Microservice Resiliency, Microservice Benchmarking, Reproducibility, Repeatability.

Authors

Mohammad Reza Saleh Sedghpour

Entry responsible: Mohammad Saleh

Page Responsible: Frank Drewes
2025-07-03