LAB 1: Implementing Service Level Objectives (SLO), Service Level Indicators (SLI), and Service Level Agreements (SLA)

Aim:
Instrument a sample web service, collect SLIs, define SLOs, and link SLAs to SLO performance.

Objectives:
● Measure key SLIs: availability, error rate, latency, and throughput.
● Define SLOs with clear, quantifiable thresholds.
● Distinguish between SLIs, SLOs, and SLAs.
● Visualize service performance using Prometheus and Grafana.

Software & Tools Required:
● Python – to expose metrics
● Prometheus – for metric collection
● Grafana – for building dashboards

Procedure

Step 1: Create a web application using Python and Flask.

Step 2: Run Prometheus with prometheus.yml as the config file.

Config file:

    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    scrape_configs:
      - job_name: "prometheus"
        static_configs:
          - targets:
              - "localhost:9090"

      - job_name: "webapp"
        metrics_path: "/metrics"
        static_configs:
          - targets:
              - "localhost:8080"
            labels:
              app: "flask-webapp"
        scrape_interval: 5s
        scrape_timeout: 3s

Step 3: Set Up Grafana
● Run the Grafana server.
● Go to Connections > Data sources > Add new data source.
● Select Prometheus and enter http://localhost:9090 as the URL.
● Click "Save & test" to save the data source.

Step 4: Calculate SLIs (PromQL)

Availability:
    100 * sum(rate(http_requests_total{status="200"}[5m])) / sum(rate(http_requests_total[5m]))

Error Rate:
    sum(rate(http_requests_total{status="500"}[5m])) / sum(rate(http_requests_total[5m]))

Latency (p95):
    histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

Throughput (RPS):
    sum(rate(http_requests_total[1m]))

Step 5: Define SLOs
● Availability ≥ 99% (1h)
● Error Rate ≤ 2% (1h)
● p95 Latency ≤ 400 ms
● Throughput ≥ 10 RPS (normal load)

Step 6: Apply Load & Observe
Run:
    hey -n 1000 -c 20 http://localhost:8080/
Observe: the request spike, latency changes, 500 errors, and SLI/SLO compliance.
Document the graphs in the observation section.

Lab 2 – Chaos Engineering & System Monitoring

Aim
To implement system observability and monitoring
by exposing CPU and memory metrics using Prometheus, visualizing them in Grafana, and preparing the system for chaos engineering experiments to test resilience and reliability.

Observation

1. System Metrics Application
A Flask-based monitoring application (cpu_monitor.py) was created to expose system performance metrics. The application performs the following functions:
● Collects CPU usage using psutil.cpu_percent()
● Collects memory usage using psutil.virtual_memory().percent
● Exposes metrics in Prometheus format at http://localhost:8082/metrics

Example output observed:
    system_cpu_percent 4.6
    system_memory_percent 61.3

This confirms that the application successfully generates custom Prometheus metrics.

2. Prometheus Integration
Prometheus was configured to scrape metrics from the monitoring service. Configuration added in prometheus.yml:

    - job_name: 'system'
      static_configs:
        - targets: ['localhost:8082']

Prometheus successfully scraped the following metrics:
● system_cpu_percent
● system_memory_percent
● Python runtime metrics
● Garbage-collection metrics

3. Grafana Visualization
The Grafana dashboard was connected to the Prometheus data source and used to visualize metrics.

Observed HTTP request metric: http_requests_total{status="200"}

Observations from the dashboard:
● Successful HTTP responses (status 200) were recorded.
● Requests increased gradually over time.
● The bar chart showed the request count increasing from 1 to 2.

4. System Resource Status
From the system monitoring page:
    CPU Usage: ~6.4%
    Memory Usage: ~61.2%

Observations:
● CPU utilization remained low, indicating the system was idle.
● Memory utilization was moderate.
● No abnormal spikes were observed during monitoring.

5. Metrics Endpoint Verification
Accessing the endpoint http://localhost:8082/metrics displayed Prometheus-formatted metrics including:
● Python GC statistics
● Python runtime info
● Custom CPU and memory metrics

This confirms the monitoring system is successfully exporting metrics to Prometheus.

Result
The experiment was successfully implemented and verified.
● A Flask monitoring service was created to expose system metrics.
● Prometheus successfully scraped the metrics from the monitoring endpoint.
● The Grafana dashboard visualized system performance metrics in real time.
● CPU and memory usage were successfully monitored.
● The system is now ready for chaos engineering experiments to observe system behavior under stress conditions.

Thus, the objective of implementing observability and monitoring infrastructure for resilience testing was achieved successfully.

OUTPUT:

LAB 3: Mapping SRE Practices with Production Environments

1. AIM
To simulate a production-like microservices environment, implement comprehensive monitoring, map service dependencies, and establish SLO-driven alerting based on SRE best practices.

2. OBJECTIVES
• Set up a multi-tier microservices architecture
• Implement Prometheus monitoring across all services
• Create service dependency maps
• Define service-level objectives (SLOs) for each dependency
• Configure production-grade alerting rules
• Build an integrated Grafana dashboard showing service health
• Practice incident response in a simulated production environment

3. SOFTWARE & TOOLS REQUIRED
    Tool         Purpose
    Python       Application services
    Flask        Web framework
    Prometheus   Metrics collection
    Grafana      Visualization
    Requests     HTTP client
    psutil       System metrics

4. LAB ARCHITECTURE SETUP
5. PROMETHEUS CONFIGURATION
6. ALERTS YML CONFIG
7. STARTING PROMETHEUS
8. STARTING THE NEWS PORTAL AND OTHER SERVICES
9. VERIFYING THAT ALL ENDPOINTS ARE WORKING
10. FULL GRAFANA DASHBOARD
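Appendix (reference sketch): section 6 of Lab 3 names an "ALERTS YML CONFIG" but does not reproduce it. As an illustration only, a Prometheus alerting rule that ties the Lab 1 error-rate SLO to an alert could look like the fragment below; the alert name, evaluation window, and labels are assumptions.

```yaml
groups:
  - name: slo-alerts
    rules:
      - alert: ErrorRateSLOViolation
        # Fires when the 5-minute error rate exceeds the 2% SLO for 5 minutes.
        expr: |
          sum(rate(http_requests_total{status="500"}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.02
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above the 2% SLO"
```

Such a file is referenced from prometheus.yml via a rule_files entry and evaluated every evaluation_interval (15s in the Lab 1 config).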
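Appendix (reference sketch): Step 1 of Lab 1 asks for a Flask web application whose metrics match the PromQL queries in Step 4. A minimal sketch is shown below, assuming the prometheus_client library; the route, the simulated failure rate, and the simulated latency are illustrative assumptions, not part of the lab hand-out.

```python
# Minimal instrumented Flask service for Lab 1, Step 1 (illustrative sketch).
# Assumes the flask and prometheus_client packages are installed.
import random
import time

from flask import Flask, Response
from prometheus_client import (CONTENT_TYPE_LATEST, Counter, Histogram,
                               generate_latest)

app = Flask(__name__)

# Exposed as http_requests_total{status="..."} — the counter the Step 4
# availability and error-rate queries expect.
REQUESTS = Counter("http_requests", "Total HTTP requests", ["status"])
# Exposed as http_request_duration_seconds_bucket, used by the p95 query.
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")

@app.route("/")
def index():
    start = time.time()
    time.sleep(random.uniform(0.01, 0.1))   # simulated work
    failed = random.random() < 0.02         # simulate ~2% server errors
    status = "500" if failed else "200"
    REQUESTS.labels(status=status).inc()
    LATENCY.observe(time.time() - start)
    return ("error", 500) if failed else ("ok", 200)

@app.route("/metrics")
def metrics():
    # Prometheus scrapes this endpoint, as configured under the "webapp" job.
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

# app.run(port=8080)  # uncomment to serve on the port scraped by prometheus.yml
```

Uncommenting the last line and running the script serves the app on port 8080, matching the "webapp" target in the Lab 1 prometheus.yml.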
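Appendix (reference sketch): the Step 4 PromQL expressions for availability and error rate reduce to simple ratios of request counts. The same arithmetic in plain Python serves as a sanity check; the counter values below are hypothetical examples, not measured data.

```python
# Sanity check for the Lab 1 SLI definitions.
# The request counts below are hypothetical example values.

def availability_pct(ok: int, total: int) -> float:
    """Availability SLI: percentage of requests that returned status 200."""
    return 100.0 * ok / total

def error_rate(errors: int, total: int) -> float:
    """Error-rate SLI: fraction of requests that returned status 500."""
    return errors / total

total_requests = 1000
ok_requests = 985
error_requests = 15

print(availability_pct(ok_requests, total_requests))  # 98.5 -> violates the 99% SLO
print(error_rate(error_requests, total_requests))     # 0.015 -> within the 2% SLO
```

Note how the same traffic can violate one SLO (availability) while satisfying another (error rate), because the thresholds are defined independently in Step 5.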
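Appendix (reference sketch): the Lab 2 observations describe cpu_monitor.py by its behavior (psutil.cpu_percent(), psutil.virtual_memory().percent, metrics on port 8082) but do not reproduce the script. A minimal version consistent with those observations might look like the following; it assumes the flask, prometheus_client, and psutil packages, and the exact structure of the original script is not known.

```python
# Sketch of a cpu_monitor.py-style exporter (illustrative; assumes flask,
# prometheus_client, and psutil are installed).
from flask import Flask, Response
from prometheus_client import CONTENT_TYPE_LATEST, Gauge, generate_latest
import psutil

app = Flask(__name__)

# Metric names match those observed in the lab output:
# system_cpu_percent and system_memory_percent.
CPU_GAUGE = Gauge("system_cpu_percent", "System CPU utilization in percent")
MEM_GAUGE = Gauge("system_memory_percent", "System memory utilization in percent")

@app.route("/metrics")
def metrics():
    # Refresh the gauges on every scrape so Prometheus sees current values.
    CPU_GAUGE.set(psutil.cpu_percent(interval=None))
    MEM_GAUGE.set(psutil.virtual_memory().percent)
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

# app.run(port=8082)  # uncomment to serve on the port the 'system' job scrapes
```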