Improving reliability efficiency through log mining and multi-objective optimizations

a. Project Description

Site Reliability Engineering (SRE) is a set of practices, developed at Google, that aims to maintain the quality and reliability of software systems at runtime and in production. This is potentially one of the most critical phases of a software system’s lifecycle, because the system’s behaviour at this point can directly affect the user experience and the provider’s business goals.

One major challenge for reliability teams is the sheer number of events that can occur at runtime and the volume of data these events generate. Insufficient processing throughput and missed events may lead to increased maintenance costs and, most importantly, to lost revenue. Another important challenge is that SRE is often in constructive competition with the development and business teams: while the SRE team pushes for more stability and reliability, the development team pushes for more changes to introduce new features, which may disrupt current users. As a result, both teams are constrained to operate under a given budget. For SRE, this is called an error budget; only a certain number of errors can be fixed before a cost cap is reached. Beyond this budget, the engineers are allowed to focus more of their effort on extending the software. In this setting, proper analysis of incoming events, proper error triage, and the ability to predict the cost of fixing errors are crucial.

The objective of this project is to develop a decision-support platform that can efficiently mine event logs and prioritize the events and errors to be addressed. In a second phase, the platform will use multi-objective optimization under budget and time constraints to choose which events to address and to optimize the mean time to repair (MTTR) and the cost of fixing errors.

b. Tasks and responsibilities

The hired student will work towards the development of a prototype tool for mining logs from cloud applications, supporting decisions on event prioritization and resolution, and reducing mean time to repair and loss of revenue. The student will develop both the theoretical foundation and the implementation of these mechanisms. The student will aim to publish in top-tier journals, including IEEE Transactions on Cloud Computing, IEEE Transactions on Big Data, IEEE Transactions on Knowledge and Data Engineering, and ACM Transactions on Autonomous and Adaptive Systems, and in conferences such as SEAMS, ACSOS, ICSE, ICPC and others. The student will also be responsible for supervising and mentoring MSc and BSc students working on the project. The position is open for Winter, Summer or Fall 2024.

c. Required Skills

The student will be asked to demonstrate adequate understanding of, or expertise in, the following topics through relevant courses (at the undergraduate or graduate level) or through relevant publications in international conferences or journals. The student should consider applying if they have the expert-level skills and at least 50% of the good-level skills.

d. Application process

When contacting the professor to inquire about the position, the student is asked to submit the following documents: