Consistent Scalable Processing of Data Streams in a Distributed Environment
This thesis investigates consistency challenges in distributed stream processing systems. Prior work on this topic has made significant progress, and many of its ideas are implemented in state-of-the-art stream processing engines (SPEs). We focus on formal modeling to better characterize existing problems and to explore potential improvements.
We introduce a formal model of delivery guarantees and show that deterministic SPEs can theoretically achieve lower latency than non-deterministic ones when providing the exactly-once guarantee. This is supported by experimental results demonstrating that a novel deterministic implementation outperforms current alternatives.
We also present a formal model for substream management and identify a lower bound on the additional network traffic required to detect substream termination. We implement a corresponding framework that meets this bound and demonstrates improved performance over existing approaches.
These results contribute formal foundations and practical techniques for improving the performance and predictability of distributed stream processing systems.