SRE | Marios Fokaefs

Improving reliability efficiency through log mining and multi-objective optimizations

a. Project Description

Site Reliability Engineering (SRE) is a set of practices, developed at Google, that aims to maintain the quality and reliability of software systems at runtime and in production. This is potentially one of the most critical phases of a software system’s lifecycle, because at this point failures can directly affect the user experience and the provider’s business goals. One major challenge that reliability teams face is the sheer number of events that can happen during runtime and the volume of data that these events generate. Insufficient processing throughput and missed events may lead to increased maintenance costs and, most importantly, to lost revenue. Another important challenge is that SRE is often in constructive competition with the development and business teams: while the SRE team pushes for more stability and reliability, the development team pushes for more changes to introduce new features, which may disrupt current users. As a result, both teams are constrained to operate under a given budget. For SRE, this is called an error budget; only a certain number of errors can be resolved before a cost cap is reached. Beyond this budget, the engineers are allowed to focus more of their efforts on extending the software. In this setting, a proper analysis of incoming events, proper error triage and the ability to predict the cost of fixing errors are crucial.

The objective of this project is to develop a decision-support platform that can efficiently mine event logs and prioritize the events and errors to be addressed. In a second phase, the platform will use multi-objective optimization under budget and time constraints to choose which events to address and to optimize the mean time to repair (MTTR) and the cost of fixing errors.
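To give applicants a feel for the selection problem described above, the following is a minimal sketch and not the project's actual method. It deliberately simplifies the multi-objective formulation to a single objective (total impact avoided) under two constraints (budget and repair time), and solves it as a 0/1 knapsack with dynamic programming. The `Event` fields, the event names and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    name: str    # hypothetical event label
    cost: int    # assumed cost to fix, in integer budget units
    hours: int   # assumed repair time, in whole hours
    impact: int  # assumed revenue loss avoided if the event is fixed


def select_events(events, budget, time_limit):
    """Pick events maximizing total impact within the budget and time limit.

    Returns a tuple (best_impact, chosen_names).
    """
    # best[b][t] holds the best (impact, names) using cost <= b and hours <= t.
    best = [[(0, ()) for _ in range(time_limit + 1)] for _ in range(budget + 1)]
    for ev in events:
        # Iterate downward so that each event is selected at most once.
        for b in range(budget, ev.cost - 1, -1):
            for t in range(time_limit, ev.hours - 1, -1):
                prev_impact, prev_names = best[b - ev.cost][t - ev.hours]
                candidate = prev_impact + ev.impact
                if candidate > best[b][t][0]:
                    best[b][t] = (candidate, prev_names + (ev.name,))
    return best[budget][time_limit]


# Hypothetical events mined from logs, with made-up estimates.
events = [
    Event("db-timeout", cost=4, hours=3, impact=9),
    Event("cache-miss-storm", cost=3, hours=2, impact=6),
    Event("login-5xx", cost=5, hours=4, impact=10),
    Event("slow-report", cost=2, hours=1, impact=3),
]

best_impact, chosen = select_events(events, budget=9, time_limit=6)
print(best_impact, chosen)
```

In a genuinely multi-objective setting, MTTR and cost would be treated as competing objectives rather than as fixed limits, for example by computing a Pareto front instead of a single best selection.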

b. Tasks and responsibilities

The hired student will work towards the development of a prototype tool that mines logs from cloud applications, supports decisions on event prioritization and resolution, and reduces the mean time to repair and the loss of revenue. The student will develop the theoretical foundation as well as the implementation of such mechanisms. The student will aim to publish in top-tier journals, including IEEE Transactions on Cloud Computing, IEEE Transactions on Big Data, IEEE Transactions on Knowledge and Data Engineering and ACM Transactions on Autonomous and Adaptive Systems, and in conferences such as SEAMS, ACSOS, ICSE, ICPC and others. The student will also be responsible for supervising and mentoring MSc and BSc students working on the project. The position is open for Winter, Summer or Fall 2024.

c. Required Skills

The student will be asked to demonstrate adequate understanding of, or expertise in, the following topics through relevant courses (at the undergraduate or graduate level) or through relevant publications in international conferences or journals. The student should consider applying if they have the expert-level skills and at least 50% of the good-level skills.

Expert programming skills, preferably in Python.
Expert knowledge of cloud computing and distributed systems.
Good knowledge of at least one of these optimization techniques: linear programming, dynamic programming or control theory.
Good knowledge of data/text mining.
Good knowledge of distributed data analytics systems, such as MapReduce.
Good knowledge of container technology, such as Docker or Kubernetes.
Adequate knowledge of machine learning models and methods.
Adequate knowledge of DevOps.
Adequate knowledge of statistical methods and tests.
Basic knowledge of finance and economics.

d. Application process

When contacting the professor to inquire about the position, the student is also asked to submit the following documents:

A copy of the most recent version of their CV or Resume.
A copy of the transcripts of their undergraduate and master's studies.
The aforementioned documents are also required by the EECS application process for the PhD program (along with a statement of purpose). The candidate is highly encouraged to complete the EECS application in parallel with contacting the professor. More information about the EECS application can be found here: https://lassonde.yorku.ca/eecs/academics/graduate/future-students/#phd
The names and contact information of 3 referees.
A review of one of the following three articles. The review (maximum one page) should contain a summary of the paper, its strengths and weaknesses, and comments on how the work presented in the paper could be improved or extended.
- Hwang, J., Shwartz, L., Wang, Q., Batta, R., Kumar, H. and Nidd, M., 2021, May. Fixme: Enhance software reliability with hybrid approaches in cloud. In 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP) (pp. 228-237). IEEE.
- Wang, H., Wu, Z., Jiang, H., Huang, Y., Wang, J., Kopru, S. and Xie, T., 2021, November. Groot: An event-graph-based approach for root cause analysis in industrial settings. In 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE) (pp. 419-429). IEEE.
- Hao, W., Yen, I.L. and Thuraisingham, B., 2009, July. Dynamic service and data migration in the clouds. In 2009 33rd annual IEEE international computer software and applications conference (Vol. 2, pp. 134-139). IEEE.
An example of a proposal (as evidence of writing ability) written by the student for a research project relevant to the position or on a topic selected by the student. The proposal should include background, motivation, methodology and a plan for evaluation. The proposal should be a maximum of 2 pages.
The candidate should submit these documents by email to the professor with the subject “SRE PhD 2024”. No email will be considered unless it has this subject and the required attachments (CV, transcripts, review, proposal). In the email, the student should express their interest in the position and point to the evidence of the required skills as it appears in the attached documents.
