Assist in implementing and maintaining monitoring and alerting systems using native cloud monitoring services (e.g., AWS CloudWatch, Azure Monitor, Google Cloud Operations).
Respond to and remediate operational issues impacting performance or availability.
Configure Prometheus to scrape and store metrics from cloud resources.
Develop dashboards and alerts in Grafana to enable proactive incident detection.
Collaborate with teams to identify key reliability and performance indicators.
Support incident response by tuning alert thresholds and helping diagnose the issues behind triggered alerts.
Participate in documenting monitoring procedures and best practices.
Learn and apply SRE principles focused on reliability, scalability, and automation.
Qualifications & Experience
Currently pursuing or recently completed a degree in Computer Science, Engineering, or a related field.
Basic understanding of cloud computing concepts (AWS, Azure, or GCP).
Familiarity with monitoring and observability tools like Prometheus and Grafana is a plus.
Knowledge of scripting or programming languages (e.g., Python, Bash) is desirable.
Strong problem-solving skills and eagerness to learn.
Good communication skills and the ability to work collaboratively.