Introduction
Service Level Indicators (SLIs) are quantitative measures of a service's behavior that directly reflect user experience. Instrumenting a Java application with SLIs shows how reliable, fast, and available the service is from the user's perspective, and gives Service Level Objectives (SLOs) something concrete to be measured against.
Architecture Overview
[User Requests]  →  [Java Application]   →  [SLI Measurement]   →  [Metrics Export]  →  [SLO Monitoring]
       ↓                    ↓                      ↓                     ↓                    ↓
HTTP Traffic         Business Logic          Latency Tracking      Prometheus           Alerting
API Calls            Database Operations     Error Counting        Metrics              Dashboards
Background Jobs      External Services       Availability Calc     Time Series          Reporting
Step 1: Project Dependencies and Configuration
Maven Configuration
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0">
<modelVersion>4.0.0</modelVersion>
<groupId>com.example</groupId>
<artifactId>sli-monitoring</artifactId>
<version>1.0.0</version>
<packaging>jar</packaging>
<properties>
<maven.compiler.source>17</maven.compiler.source>
<maven.compiler.target>17</maven.compiler.target>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<spring-boot.version>3.2.0</spring-boot.version>
<micrometer.version>1.12.0</micrometer.version>
<resilience4j.version>2.1.0</resilience4j.version>
</properties>
<dependencies>
<!-- Spring Boot -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
<version>${spring-boot.version}</version>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
<version>${spring-boot.version}</version>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-jpa</artifactId>
<version>${spring-boot.version}</version>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-aop</artifactId>
<version>${spring-boot.version}</version>
</dependency>
<!-- Metrics -->
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-core</artifactId>
<version>${micrometer.version}</version>
</dependency>
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-prometheus</artifactId>
<version>${micrometer.version}</version>
</dependency>
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-observation</artifactId>
<version>${micrometer.version}</version>
</dependency>
<!-- Resilience4j for circuit breaker and retry -->
<dependency>
<groupId>io.github.resilience4j</groupId>
<artifactId>resilience4j-spring-boot3</artifactId>
<version>${resilience4j.version}</version>
</dependency>
<dependency>
<groupId>io.github.resilience4j</groupId>
<artifactId>resilience4j-micrometer</artifactId>
<version>${resilience4j.version}</version>
</dependency>
<!-- Database -->
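<dependency>
<groupId>org.postgresql</groupId>
<artifactId>postgresql</artifactId>
<!-- Explicit version: this POM declares no Spring Boot parent or BOM that would manage it -->
<version>42.6.0</version>
<scope>runtime</scope>
</dependency>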
<!-- Cache -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-cache</artifactId>
<version>${spring-boot.version}</version>
</dependency>
<!-- JSON -->
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>2.15.0</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.datatype</groupId>
<artifactId>jackson-datatype-jsr310</artifactId>
<version>2.15.0</version>
</dependency>
<!-- Testing -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<version>${spring-boot.version}</version>
<scope>test</scope>
</dependency>
</dependencies>
</project>
Application Configuration
application.yml
spring:
  application:
    name: order-service
  datasource:
    url: jdbc:postgresql://localhost:5432/orders
    username: postgres
    password: password
  jpa:
    hibernate:
      ddl-auto: validate
    show-sql: false

# SLI Configuration
sli:
  enabled: true
  objectives:
    availability:
      target: 99.9
      window: 28d
    latency:
      p95_target: 500ms
      p99_target: 1000ms
      window: 28d
    throughput:
      target: 1000rpm
      window: 7d
  endpoints:
    - path: "/api/orders/**"
      name: "order-api"
      objectives: ["availability", "latency", "throughput"]
    - path: "/api/payments/**"
      name: "payment-api"
      objectives: ["availability", "latency"]
    - path: "/api/inventory/**"
      name: "inventory-api"
      objectives: ["availability"]

# Management endpoints
management:
  endpoints:
    web:
      exposure:
        include: health,metrics,prometheus,info,sli
    enabled-by-default: true
  endpoint:
    health:
      show-details: always
      show-components: always
    metrics:
      enabled: true
    prometheus:
      enabled: true
    sli:
      enabled: true
  # Spring Boot 3.x reads the Prometheus export settings from management.prometheus.metrics.export
  prometheus:
    metrics:
      export:
        enabled: true
        step: 30s
  metrics:
    distribution:
      percentiles-histogram:
        http.server.requests: true
        sli.request.duration: true
      percentiles:
        http.server.requests: 0.5, 0.95, 0.99, 0.999
        sli.request.duration: 0.5, 0.95, 0.99, 0.999
      # "slo" replaces the older "sla" property name
      slo:
        http.server.requests: 100ms, 500ms, 1s
    tags:
      application: ${spring.application.name}
      environment: ${ENVIRONMENT:development}

# Resilience4j Configuration
resilience4j:
  circuitbreaker:
    instances:
      orderService:
        slidingWindowSize: 100
        slidingWindowType: COUNT_BASED
        failureRateThreshold: 50
        waitDurationInOpenState: 10s
        permittedNumberOfCallsInHalfOpenState: 10
  retry:
    instances:
      paymentService:
        maxAttempts: 3
        waitDuration: 500ms

server:
  port: 8080

logging:
  level:
    com.example.sli: DEBUG
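The configuration above states the availability target as a percentage (99.9) and windows as short strings such as 28d, while the model classes in Step 2 work with a fraction (0.999) and java.time.Duration. The following is a minimal sketch of how those values could be normalized. SliConfigValues and its methods are hypothetical helpers, not part of the original project, and Spring Boot's own property binding may already cover the duration case.

```java
import java.time.Duration;

/** Hypothetical helper for illustration; not part of the original project. */
final class SliConfigValues {
    private SliConfigValues() {}

    /** Converts a percentage such as 99.9 into the fraction 0.999. */
    static double percentToFraction(double percent) {
        if (percent < 0.0 || percent > 100.0) {
            throw new IllegalArgumentException("Percentage out of range: " + percent);
        }
        return percent / 100.0;
    }

    /** Parses simple values such as "28d", "24h", "30m", "30s" or "500ms". */
    static Duration parseDuration(String text) {
        String s = text.trim().toLowerCase();
        if (s.length() < 2) {
            throw new IllegalArgumentException("Unsupported duration: " + text);
        }
        if (s.endsWith("ms")) {
            return Duration.ofMillis(number(s, 2));
        }
        long n = number(s, 1);
        switch (s.charAt(s.length() - 1)) {
            case 'd': return Duration.ofDays(n);
            case 'h': return Duration.ofHours(n);
            case 'm': return Duration.ofMinutes(n);
            case 's': return Duration.ofSeconds(n);
            default: throw new IllegalArgumentException("Unsupported duration: " + text);
        }
    }

    /** Reads the numeric part that precedes a unit suffix of the given length. */
    private static long number(String s, int suffixLength) {
        return Long.parseLong(s.substring(0, s.length() - suffixLength));
    }
}
```

With this sketch, SliConfigValues.percentToFraction(99.9) yields a value suitable for ServiceLevelObjective.setTarget, and SliConfigValues.parseDuration("28d") yields a Duration suitable for setWindow.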
Step 2: Core SLI Models and Enums
SLI Types and Objectives
SliType.java
package com.example.sli.model;
public enum SliType {
AVAILABILITY("availability", "Percentage of successful requests"),
LATENCY("latency", "Request duration percentiles"),
THROUGHPUT("throughput", "Requests per second/minute"),
ERROR_RATE("error_rate", "Percentage of failed requests"),
SATURATION("saturation", "Resource utilization"),
CORRECTNESS("correctness", "Data accuracy and validity"),
FRESHNESS("freshness", "Data timeliness"),
COVERAGE("coverage", "Feature availability"),
CAPACITY("capacity", "System capacity utilization");
private final String code;
private final String description;
SliType(String code, String description) {
this.code = code;
this.description = description;
}
public String getCode() { return code; }
public String getDescription() { return description; }
public static SliType fromCode(String code) {
for (SliType type : values()) {
if (type.code.equals(code)) {
return type;
}
}
throw new IllegalArgumentException("Unknown SLI type: " + code);
}
}
SLO Configuration
ServiceLevelObjective.java
package com.example.sli.model;
import com.fasterxml.jackson.annotation.JsonFormat;
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.Map;
public class ServiceLevelObjective {
private String name;
private String description;
private SliType sliType;
private double target; // 0.999 for 99.9%
@JsonFormat(pattern = "yyyy-MM-dd HH:mm:ss")
private LocalDateTime created;
@JsonFormat(pattern = "yyyy-MM-dd HH:mm:ss")
private LocalDateTime updated;
private Duration window; // 28d, 7d, 24h
private Map<String, String> labels;
private boolean enabled = true;
// Threshold configurations
private double warningThreshold; // 0.99 for 99%
private double criticalThreshold; // 0.98 for 98%
// For latency objectives
private Duration latencyTarget; // P95 target
private Duration latencyWarning;
private Duration latencyCritical;
// Constructors
public ServiceLevelObjective() {}
public ServiceLevelObjective(String name, SliType sliType, double target, Duration window) {
this.name = name;
this.sliType = sliType;
this.target = target;
this.window = window;
this.created = LocalDateTime.now();
this.updated = LocalDateTime.now();
}
// Getters and Setters
public String getName() { return name; }
public void setName(String name) { this.name = name; }
public String getDescription() { return description; }
public void setDescription(String description) { this.description = description; }
public SliType getSliType() { return sliType; }
public void setSliType(SliType sliType) { this.sliType = sliType; }
public double getTarget() { return target; }
public void setTarget(double target) { this.target = target; }
public LocalDateTime getCreated() { return created; }
public void setCreated(LocalDateTime created) { this.created = created; }
public LocalDateTime getUpdated() { return updated; }
public void setUpdated(LocalDateTime updated) { this.updated = updated; }
public Duration getWindow() { return window; }
public void setWindow(Duration window) { this.window = window; }
public Map<String, String> getLabels() { return labels; }
public void setLabels(Map<String, String> labels) { this.labels = labels; }
public boolean isEnabled() { return enabled; }
public void setEnabled(boolean enabled) { this.enabled = enabled; }
public double getWarningThreshold() { return warningThreshold; }
public void setWarningThreshold(double warningThreshold) { this.warningThreshold = warningThreshold; }
public double getCriticalThreshold() { return criticalThreshold; }
public void setCriticalThreshold(double criticalThreshold) { this.criticalThreshold = criticalThreshold; }
public Duration getLatencyTarget() { return latencyTarget; }
public void setLatencyTarget(Duration latencyTarget) { this.latencyTarget = latencyTarget; }
public Duration getLatencyWarning() { return latencyWarning; }
public void setLatencyWarning(Duration latencyWarning) { this.latencyWarning = latencyWarning; }
public Duration getLatencyCritical() { return latencyCritical; }
public void setLatencyCritical(Duration latencyCritical) { this.latencyCritical = latencyCritical; }
// Utility methods
public double getTargetPercentage() {
return target * 100;
}
public String getWindowDescription() {
if (window.toDays() > 0) {
return window.toDays() + " days";
} else if (window.toHours() > 0) {
return window.toHours() + " hours";
} else {
return window.toMinutes() + " minutes";
}
}
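// Latency and error rate improve as the value falls; the other SLI types improve as it rises.
private boolean isLowerBetter() {
return sliType == SliType.LATENCY || sliType == SliType.ERROR_RATE;
}
public boolean isWithinTarget(double actualValue) {
return isLowerBetter() ? actualValue <= target : actualValue >= target;
}
public SloStatus calculateStatus(double actualValue) {
if (isLowerBetter()) {
// Latency values are in seconds; prefer the Duration thresholds when they are set.
double warning = (sliType == SliType.LATENCY && latencyWarning != null)
? latencyWarning.toMillis() / 1000.0 : warningThreshold;
double critical = (sliType == SliType.LATENCY && latencyCritical != null)
? latencyCritical.toMillis() / 1000.0 : criticalThreshold;
if (actualValue <= target) return SloStatus.MEETING;
if (actualValue <= warning) return SloStatus.WARNING;
if (actualValue <= critical) return SloStatus.CRITICAL;
return SloStatus.BREACHED;
}
if (actualValue >= target) return SloStatus.MEETING;
if (actualValue >= warningThreshold) return SloStatus.WARNING;
if (actualValue >= criticalThreshold) return SloStatus.CRITICAL;
return SloStatus.BREACHED;
}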
}
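The tiered comparison used for SLO status can be illustrated on its own. The sketch below is a stand-alone approximation with hypothetical names (StatusTiers and its local Status enum are not part of the project). Treating latency as a "lower is better" value is an assumption about how such an objective would be evaluated.

```java
/** Hypothetical stand-alone sketch; names are illustrative only. */
final class StatusTiers {
    enum Status { MEETING, WARNING, CRITICAL, BREACHED }

    private StatusTiers() {}

    /** For SLIs where larger values are better, e.g. availability as a fraction. */
    static Status higherIsBetter(double actual, double target, double warning, double critical) {
        if (actual >= target) return Status.MEETING;
        if (actual >= warning) return Status.WARNING;
        if (actual >= critical) return Status.CRITICAL;
        return Status.BREACHED;
    }

    /** For SLIs where smaller values are better, e.g. p95 latency in seconds. */
    static Status lowerIsBetter(double actual, double target, double warning, double critical) {
        if (actual <= target) return Status.MEETING;
        if (actual <= warning) return Status.WARNING;
        if (actual <= critical) return Status.CRITICAL;
        return Status.BREACHED;
    }
}
```

For example, with an availability target of 0.999, a warning threshold of 0.995 and a critical threshold of 0.99, a measured availability of 0.997 falls into the WARNING tier. With latency thresholds of 0.5, 0.75 and 1.0 seconds, a measured p95 of 0.2 seconds is MEETING.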
SLI Measurement Result
SliMeasurement.java
package com.example.sli.model;
import com.fasterxml.jackson.annotation.JsonFormat;
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.Map;
public class SliMeasurement {
private String sliName;
private SliType sliType;
private double value;
private double target;
@JsonFormat(pattern = "yyyy-MM-dd HH:mm:ss")
private LocalDateTime timestamp;
private Duration window;
private Map<String, String> labels;
private SloStatus status;
private String unit;
private String description;
// For latency measurements
private Double p50;
private Double p95;
private Double p99;
private Double p999;
// For throughput measurements
private Double requestsPerSecond;
private Long totalRequests;
// For error budget
private Double errorBudgetRemaining;
private Double errorBudgetConsumed;
private Double errorBudgetPercentage;
// Constructors
public SliMeasurement() {}
public SliMeasurement(String sliName, SliType sliType, double value, double target,
Duration window, Map<String, String> labels) {
this.sliName = sliName;
this.sliType = sliType;
this.value = value;
this.target = target;
this.window = window;
this.labels = labels;
this.timestamp = LocalDateTime.now();
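// Provisional status that assumes "higher is better"; callers should override it
// with setStatus(...) for latency and error-rate measurements.
this.status = value >= target ? SloStatus.MEETING : SloStatus.BREACHED;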
}
// Getters and Setters
public String getSliName() { return sliName; }
public void setSliName(String sliName) { this.sliName = sliName; }
public SliType getSliType() { return sliType; }
public void setSliType(SliType sliType) { this.sliType = sliType; }
public double getValue() { return value; }
public void setValue(double value) { this.value = value; }
public double getTarget() { return target; }
public void setTarget(double target) { this.target = target; }
public LocalDateTime getTimestamp() { return timestamp; }
public void setTimestamp(LocalDateTime timestamp) { this.timestamp = timestamp; }
public Duration getWindow() { return window; }
public void setWindow(Duration window) { this.window = window; }
public Map<String, String> getLabels() { return labels; }
public void setLabels(Map<String, String> labels) { this.labels = labels; }
public SloStatus getStatus() { return status; }
public void setStatus(SloStatus status) { this.status = status; }
public String getUnit() { return unit; }
public void setUnit(String unit) { this.unit = unit; }
public String getDescription() { return description; }
public void setDescription(String description) { this.description = description; }
public Double getP50() { return p50; }
public void setP50(Double p50) { this.p50 = p50; }
public Double getP95() { return p95; }
public void setP95(Double p95) { this.p95 = p95; }
public Double getP99() { return p99; }
public void setP99(Double p99) { this.p99 = p99; }
public Double getP999() { return p999; }
public void setP999(Double p999) { this.p999 = p999; }
public Double getRequestsPerSecond() { return requestsPerSecond; }
public void setRequestsPerSecond(Double requestsPerSecond) { this.requestsPerSecond = requestsPerSecond; }
public Long getTotalRequests() { return totalRequests; }
public void setTotalRequests(Long totalRequests) { this.totalRequests = totalRequests; }
public Double getErrorBudgetRemaining() { return errorBudgetRemaining; }
public void setErrorBudgetRemaining(Double errorBudgetRemaining) { this.errorBudgetRemaining = errorBudgetRemaining; }
public Double getErrorBudgetConsumed() { return errorBudgetConsumed; }
public void setErrorBudgetConsumed(Double errorBudgetConsumed) { this.errorBudgetConsumed = errorBudgetConsumed; }
public Double getErrorBudgetPercentage() { return errorBudgetPercentage; }
public void setErrorBudgetPercentage(Double errorBudgetPercentage) { this.errorBudgetPercentage = errorBudgetPercentage; }
// Utility methods
public double getValuePercentage() {
return value * 100;
}
public double getTargetPercentage() {
return target * 100;
}
public boolean isBreaching() {
return status == SloStatus.BREACHED || status == SloStatus.CRITICAL;
}
public String getFormattedValue() {
switch (sliType) {
case AVAILABILITY:
case ERROR_RATE:
return String.format("%.3f%%", value * 100);
case LATENCY:
return String.format("%.2f ms", value * 1000);
case THROUGHPUT:
return String.format("%.1f req/s", value);
default:
return String.format("%.3f", value);
}
}
}
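SliMeasurement carries p50, p95, p99 and p999 fields for latency. As a rough illustration of what those values mean, the sketch below computes a nearest-rank percentile over raw samples. LatencyPercentiles is a hypothetical class, and this is not how the application obtains its percentiles: there the values come from the metrics library, which derives its own approximations.

```java
import java.util.Arrays;

/** Hypothetical illustration of the nearest-rank percentile definition. */
final class LatencyPercentiles {
    private LatencyPercentiles() {}

    /**
     * Returns the smallest sample such that at least the given fraction of all
     * samples is less than or equal to it. The quantile must be in (0, 1].
     */
    static double percentile(double[] samples, double quantile) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("No samples");
        }
        if (quantile <= 0.0 || quantile > 1.0) {
            throw new IllegalArgumentException("Quantile must be in (0, 1]: " + quantile);
        }
        double[] sorted = samples.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(quantile * sorted.length); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

Given ten samples of 10, 20, ..., 100 milliseconds, the median by this definition is the fifth sorted value (50) and the 95th percentile is the tenth (100), which shows why tail percentiles are sensitive to a few slow requests when the sample count is small.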
SLO Status Enum
SloStatus.java
package com.example.sli.model;
public enum SloStatus {
MEETING("MEETING", "All SLOs are being met", "success"),
WARNING("WARNING", "SLOs are at warning levels", "warning"),
CRITICAL("CRITICAL", "SLOs are at critical levels", "error"),
BREACHED("BREACHED", "SLOs have been breached", "error"),
UNKNOWN("UNKNOWN", "SLO status cannot be determined", "unknown");
private final String code;
private final String description;
private final String severity;
SloStatus(String code, String description, String severity) {
this.code = code;
this.description = description;
this.severity = severity;
}
public String getCode() { return code; }
public String getDescription() { return description; }
public String getSeverity() { return severity; }
public boolean isHealthy() {
return this == MEETING || this == WARNING;
}
public boolean requiresAttention() {
return this == CRITICAL || this == BREACHED;
}
}
Step 3: Core SLI Measurement Framework
SLI Registry and Manager
SliRegistry.java
package com.example.sli.registry;
import com.example.sli.model.ServiceLevelObjective;
import com.example.sli.model.SliType;
import org.springframework.stereotype.Component;
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
@Component
public class SliRegistry {
private final Map<String, ServiceLevelObjective> objectives;
private final List<ServiceLevelObjective> globalObjectives;
public SliRegistry() {
this.objectives = new ConcurrentHashMap<>();
this.globalObjectives = new CopyOnWriteArrayList<>();
initializeDefaultObjectives();
}
private void initializeDefaultObjectives() {
// Global availability objective
ServiceLevelObjective availability = new ServiceLevelObjective(
"global-availability",
SliType.AVAILABILITY,
0.999, // 99.9%
Duration.ofDays(28)
);
availability.setWarningThreshold(0.995); // 99.5%
availability.setCriticalThreshold(0.99); // 99%
availability.setDescription("Global service availability target");
// Global latency objective
ServiceLevelObjective latency = new ServiceLevelObjective(
"global-latency-p95",
SliType.LATENCY,
0.5, // 500ms in seconds
Duration.ofDays(28)
);
latency.setLatencyTarget(Duration.ofMillis(500));
latency.setLatencyWarning(Duration.ofMillis(750));
latency.setLatencyCritical(Duration.ofMillis(1000));
latency.setDescription("95th percentile response time target");
registerGlobalObjective(availability);
registerGlobalObjective(latency);
}
public void registerObjective(ServiceLevelObjective objective) {
objectives.put(objective.getName(), objective);
}
public void registerGlobalObjective(ServiceLevelObjective objective) {
globalObjectives.add(objective);
registerObjective(objective);
}
public ServiceLevelObjective getObjective(String name) {
return objectives.get(name);
}
public List<ServiceLevelObjective> getAllObjectives() {
return List.copyOf(objectives.values());
}
public List<ServiceLevelObjective> getGlobalObjectives() {
return List.copyOf(globalObjectives);
}
public List<ServiceLevelObjective> getObjectivesByType(SliType type) {
return objectives.values().stream()
.filter(obj -> obj.getSliType() == type)
.filter(ServiceLevelObjective::isEnabled)
.toList();
}
public boolean removeObjective(String name) {
ServiceLevelObjective removed = objectives.remove(name);
if (removed != null) {
globalObjectives.remove(removed);
return true;
}
return false;
}
public void updateObjective(ServiceLevelObjective objective) {
objective.setUpdated(java.time.LocalDateTime.now());
objectives.put(objective.getName(), objective);
}
}
SLI Measurement Service
SliMeasurementService.java
package com.example.sli.service;
import com.example.sli.model.ServiceLevelObjective;
import com.example.sli.model.SliMeasurement;
import com.example.sli.model.SliType;
import com.example.sli.model.SloStatus;
import com.example.sli.registry.SliRegistry;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Service;
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import io.micrometer.core.instrument.Counter;
@Service
public class SliMeasurementService {
private static final Logger logger = LoggerFactory.getLogger(SliMeasurementService.class);
private final SliRegistry sliRegistry;
private final MeterRegistry meterRegistry;
// In-memory storage for SLI measurements (in production, use time series database)
private final Map<String, List<SliMeasurement>> measurementStore;
// Micrometer counters, as returned by MeterRegistry.counter(...)
private final Map<String, Counter> successCounters;
private final Map<String, Counter> failureCounters;
private final Map<String, Timer> latencyTimers;
public SliMeasurementService(SliRegistry sliRegistry, MeterRegistry meterRegistry) {
this.sliRegistry = sliRegistry;
this.meterRegistry = meterRegistry;
this.measurementStore = new ConcurrentHashMap<>();
this.successCounters = new ConcurrentHashMap<>();
this.failureCounters = new ConcurrentHashMap<>();
this.latencyTimers = new ConcurrentHashMap<>();
initializeMetrics();
}
private void initializeMetrics() {
// Initialize metrics for all registered objectives
sliRegistry.getAllObjectives().forEach(this::initializeObjectiveMetrics);
}
private void initializeObjectiveMetrics(ServiceLevelObjective objective) {
String sliName = objective.getName();
// Initialize counters
successCounters.put(sliName, meterRegistry.counter("sli.success",
"sli_name", sliName,
"sli_type", objective.getSliType().getCode()));
failureCounters.put(sliName, meterRegistry.counter("sli.failure",
"sli_name", sliName,
"sli_type", objective.getSliType().getCode()));
// Initialize timer for latency objectives
if (objective.getSliType() == SliType.LATENCY) {
Timer timer = Timer.builder("sli.latency")
.description("SLI latency measurements")
.tags("sli_name", sliName)
.publishPercentiles(0.5, 0.95, 0.99, 0.999)
.publishPercentileHistogram()
.register(meterRegistry);
latencyTimers.put(sliName, timer);
}
// Initialize measurement store
measurementStore.put(sliName, new ArrayList<>());
logger.info("Initialized metrics for SLI: {}", sliName);
}
public void recordSuccess(String sliName, Map<String, String> labels) {
ServiceLevelObjective objective = sliRegistry.getObjective(sliName);
if (objective == null || !objective.isEnabled()) {
logger.warn("SLI objective not found or disabled: {}", sliName);
return;
}
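Counter counter = successCounters.get(sliName);
if (counter != null) {
counter.increment();
}
// Record one detailed counter carrying all provided labels as tags.
// Note: the Prometheus registry expects meters that share a name to share the same
// tag keys, so pass a consistent label set for a given SLI.
if (labels != null && !labels.isEmpty()) {
io.micrometer.core.instrument.Tags tags = io.micrometer.core.instrument.Tags.of("sli_name", sliName);
for (Map.Entry<String, String> label : labels.entrySet()) {
tags = tags.and("label_" + label.getKey(), label.getValue());
}
meterRegistry.counter("sli.success.detailed", tags).increment();
}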
logger.debug("Recorded success for SLI: {}", sliName);
}
public void recordFailure(String sliName, String errorType, Map<String, String> labels) {
ServiceLevelObjective objective = sliRegistry.getObjective(sliName);
if (objective == null || !objective.isEnabled()) {
logger.warn("SLI objective not found or disabled: {}", sliName);
return;
}
Counter counter = failureCounters.get(sliName);
if (counter != null) {
counter.increment();
}
// Record the failure once, tagged with its error type and any additional labels.
// Note: the Prometheus registry expects meters that share a name to share the same
// tag keys, so pass a consistent label set for a given SLI.
io.micrometer.core.instrument.Tags tags =
io.micrometer.core.instrument.Tags.of("sli_name", sliName, "error_type", errorType);
if (labels != null) {
for (Map.Entry<String, String> label : labels.entrySet()) {
tags = tags.and("label_" + label.getKey(), label.getValue());
}
}
meterRegistry.counter("sli.failure.detailed", tags).increment();
logger.debug("Recorded failure for SLI: {} - Error: {}", sliName, errorType);
}
public void recordLatency(String sliName, long duration, TimeUnit unit, Map<String, String> labels) {
ServiceLevelObjective objective = sliRegistry.getObjective(sliName);
if (objective == null || !objective.isEnabled()) {
logger.warn("SLI objective not found or disabled: {}", sliName);
return;
}
Timer timer = latencyTimers.get(sliName);
if (timer != null) {
timer.record(duration, unit);
}
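// Record detailed latency with labels.
// Timer.Builder.tags(...) does not accept a Map, so convert the labels to Tags first.
if (labels != null && !labels.isEmpty()) {
io.micrometer.core.instrument.Tags tags = io.micrometer.core.instrument.Tags.of("sli_name", sliName);
for (Map.Entry<String, String> label : labels.entrySet()) {
tags = tags.and(label.getKey(), label.getValue());
}
Timer detailedTimer = Timer.builder("sli.latency.detailed")
.tags(tags)
.register(meterRegistry);
detailedTimer.record(duration, unit);
}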
logger.debug("Recorded latency for SLI: {} - {} {}", sliName, duration, unit);
}
public SliMeasurement calculateCurrentSli(String sliName, Duration window) {
ServiceLevelObjective objective = sliRegistry.getObjective(sliName);
if (objective == null) {
throw new IllegalArgumentException("SLI objective not found: " + sliName);
}
return calculateSliMeasurement(objective, window);
}
private SliMeasurement calculateSliMeasurement(ServiceLevelObjective objective, Duration window) {
String sliName = objective.getName();
SliType sliType = objective.getSliType();
switch (sliType) {
case AVAILABILITY:
return calculateAvailability(sliName, objective, window);
case LATENCY:
return calculateLatency(sliName, objective, window);
case THROUGHPUT:
return calculateThroughput(sliName, objective, window);
case ERROR_RATE:
return calculateErrorRate(sliName, objective, window);
default:
throw new UnsupportedOperationException("SLI type not supported: " + sliType);
}
}
private SliMeasurement calculateAvailability(String sliName, ServiceLevelObjective objective, Duration window) {
long successes = getSuccessCount(sliName, window);
long failures = getFailureCount(sliName, window);
long total = successes + failures;
double availability = total > 0 ? (double) successes / total : 1.0;
SliMeasurement measurement = new SliMeasurement(
sliName, SliType.AVAILABILITY, availability, objective.getTarget(), window, null);
measurement.setTotalRequests(total);
measurement.setErrorBudgetRemaining(calculateErrorBudget(availability, objective.getTarget(), total));
measurement.setErrorBudgetConsumed(1.0 - measurement.getErrorBudgetRemaining());
measurement.setErrorBudgetPercentage(measurement.getErrorBudgetRemaining() * 100);
measurement.setStatus(objective.calculateStatus(availability));
measurement.setUnit("percent");
measurement.setDescription("Request availability over " + window.toMinutes() + " minutes");
return measurement;
}
private SliMeasurement calculateLatency(String sliName, ServiceLevelObjective objective, Duration window) {
// In production, this would query your time series database
// For demonstration, we'll use simulated values
Timer timer = latencyTimers.get(sliName);
double p95Latency = 0.2; // 200ms - would come from actual measurements
SliMeasurement measurement = new SliMeasurement(
sliName, SliType.LATENCY, p95Latency, objective.getTarget(), window, null);
// Set percentiles (simulated)
measurement.setP50(0.1); // 100ms
measurement.setP95(p95Latency);
measurement.setP99(0.5); // 500ms
measurement.setP999(1.0); // 1000ms
measurement.setStatus(objective.calculateStatus(p95Latency));
measurement.setUnit("seconds");
measurement.setDescription("95th percentile latency over " + window.toMinutes() + " minutes");
return measurement;
}
private SliMeasurement calculateThroughput(String sliName, ServiceLevelObjective objective, Duration window) {
long successes = getSuccessCount(sliName, window);
long failures = getFailureCount(sliName, window);
long total = successes + failures;
double throughput = total > 0 ? (double) total / window.getSeconds() : 0.0;
SliMeasurement measurement = new SliMeasurement(
sliName, SliType.THROUGHPUT, throughput, objective.getTarget(), window, null);
measurement.setRequestsPerSecond(throughput);
measurement.setTotalRequests(total);
measurement.setStatus(objective.calculateStatus(throughput));
measurement.setUnit("requests/second");
measurement.setDescription("Request throughput over " + window.toMinutes() + " minutes");
return measurement;
}
private SliMeasurement calculateErrorRate(String sliName, ServiceLevelObjective objective, Duration window) {
long successes = getSuccessCount(sliName, window);
long failures = getFailureCount(sliName, window);
long total = successes + failures;
double errorRate = total > 0 ? (double) failures / total : 0.0;
SliMeasurement measurement = new SliMeasurement(
sliName, SliType.ERROR_RATE, errorRate, objective.getTarget(), window, null);
measurement.setTotalRequests(total);
measurement.setStatus(objective.calculateStatus(errorRate));
measurement.setUnit("percent");
measurement.setDescription("Error rate over " + window.toMinutes() + " minutes");
return measurement;
}
private long getSuccessCount(String sliName, Duration window) {
// In production, query time series database for the window.
// Here the cumulative counter value is returned, so the window is not applied.
Counter counter = successCounters.get(sliName);
return counter != null ? (long) counter.count() : 0;
}
private long getFailureCount(String sliName, Duration window) {
// In production, query time series database for the window.
// Here the cumulative counter value is returned, so the window is not applied.
Counter counter = failureCounters.get(sliName);
return counter != null ? (long) counter.count() : 0;
}
private double calculateErrorBudget(double actual, double target, long totalRequests) {
if (totalRequests == 0) return 1.0;
double allowedErrors = (1 - target) * totalRequests;
double actualErrors = (1 - actual) * totalRequests;
// A 100% target leaves no budget; avoid dividing by zero.
if (allowedErrors <= 0) return actualErrors <= 0 ? 1.0 : 0.0;
double remainingBudget = Math.max(0, allowedErrors - actualErrors);
return remainingBudget / allowedErrors;
}
public List<SliMeasurement> getAllCurrentMeasurements(Duration window) {
return sliRegistry.getAllObjectives().stream()
.filter(ServiceLevelObjective::isEnabled)
.map(objective -> calculateSliMeasurement(objective, window))
.toList();
}
public void storeMeasurement(SliMeasurement measurement) {
String sliName = measurement.getSliName();
List<SliMeasurement> measurements = measurementStore.computeIfAbsent(sliName, k -> new ArrayList<>());
measurements.add(measurement);
// Keep only last 1000 measurements per SLI
if (measurements.size() > 1000) {
measurements.remove(0);
}
}
public List<SliMeasurement> getMeasurementHistory(String sliName, Duration period) {
List<SliMeasurement> measurements = measurementStore.get(sliName);
if (measurements == null) {
return List.of();
}
LocalDateTime cutoff = LocalDateTime.now().minus(period);
return measurements.stream()
.filter(m -> m.getTimestamp().isAfter(cutoff))
.toList();
}
}
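The error budget calculation in the service is easier to follow with concrete numbers. With a 99.9% target and 1,000,000 requests, the budget is (1 - 0.999) × 1,000,000 = 1,000 failed requests; if 400 requests have failed, 600 remain, which is 60% of the budget. The sketch below restates that arithmetic in isolation. ErrorBudgetExample is a hypothetical class, and it takes the failure count directly instead of deriving it from the measured availability.

```java
/** Hypothetical stand-alone restatement of the error budget arithmetic. */
final class ErrorBudgetExample {
    private ErrorBudgetExample() {}

    /**
     * Returns the fraction of the error budget still available, between 0.0 and 1.0.
     *
     * @param totalRequests  requests observed in the window
     * @param failedRequests failed requests observed in the window
     * @param target         availability target as a fraction, e.g. 0.999
     */
    static double remainingFraction(long totalRequests, long failedRequests, double target) {
        if (totalRequests == 0) {
            return 1.0; // nothing observed, nothing spent
        }
        double allowedFailures = (1.0 - target) * totalRequests;
        if (allowedFailures <= 0.0) {
            // A 100% target leaves no budget at all.
            return failedRequests == 0 ? 1.0 : 0.0;
        }
        double remaining = Math.max(0.0, allowedFailures - failedRequests);
        return remaining / allowedFailures;
    }
}
```

Calling ErrorBudgetExample.remainingFraction(1_000_000, 400, 0.999) gives approximately 0.6, and any failure count at or above the budget of 1,000 gives 0.0.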
Step 4: Aspect-Oriented SLI Measurement
SLI Measurement Aspect
SliMeasurementAspect.java
package com.example.sli.aspect;
import com.example.sli.model.SliType;
import com.example.sli.service.SliMeasurementService;
import io.micrometer.core.instrument.Timer;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.aspectj.lang.reflect.MethodSignature;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;
@Aspect
@Component
public class SliMeasurementAspect {
private static final Logger logger = LoggerFactory.getLogger(SliMeasurementAspect.class);
private final SliMeasurementService sliMeasurementService;
public SliMeasurementAspect(SliMeasurementService sliMeasurementService) {
this.sliMeasurementService = sliMeasurementService;
}
@Around("@annotation(MeasureAvailability)")
public Object measureAvailability(ProceedingJoinPoint joinPoint) throws Throwable {
MeasureAvailability annotation = getAnnotation(joinPoint, MeasureAvailability.class);
String sliName = annotation.value();
Map<String, String> labels = extractLabels(joinPoint, annotation.labels());
return measureOperation(joinPoint, sliName, SliType.AVAILABILITY, labels);
}
@Around("@annotation(MeasureLatency)")
public Object measureLatency(ProceedingJoinPoint joinPoint) throws Throwable {
MeasureLatency annotation = getAnnotation(joinPoint, MeasureLatency.class);
String sliName = annotation.value();
Map<String, String> labels = extractLabels(joinPoint, annotation.labels());
return measureOperation(joinPoint, sliName, SliType.LATENCY, labels);
}
@Around("@annotation(MeasureThroughput)")
public Object measureThroughput(ProceedingJoinPoint joinPoint) throws Throwable {
MeasureThroughput annotation = getAnnotation(joinPoint, MeasureThroughput.class);
String sliName = annotation.value();
Map<String, String> labels = extractLabels(joinPoint, annotation.labels());
return measureOperation(joinPoint, sliName, SliType.THROUGHPUT, labels);
}
private Object measureOperation(ProceedingJoinPoint joinPoint, String sliName,
SliType sliType, Map<String, String> labels) throws Throwable {
long startTime = System.nanoTime();
boolean success = false;
try {
Object result = joinPoint.proceed();
success = true;
return result;
} catch (Throwable t) {
// Catch Throwable rather than Exception so that Errors thrown by proceed() also count as failures
recordFailure(sliName, sliType, t.getClass().getSimpleName(), labels);
throw t;
} finally {
long duration = System.nanoTime() - startTime;
if (success) {
recordSuccess(sliName, sliType, labels);
}
if (sliType == SliType.LATENCY) {
recordLatency(sliName, duration, TimeUnit.NANOSECONDS, labels);
}
logger.debug("Measured {} operation: {} - Success: {}, Duration: {} ns",
sliType, sliName, success, duration);
}
}
private void recordSuccess(String sliName, SliType sliType, Map<String, String> labels) {
try {
sliMeasurementService.recordSuccess(sliName, labels);
} catch (Exception e) {
logger.warn("Failed to record success for SLI: {}", sliName, e);
}
}
private void recordFailure(String sliName, SliType sliType, String errorType, Map<String, String> labels) {
try {
Map<String, String> enhancedLabels = new HashMap<>(labels);
enhancedLabels.put("error_type", errorType);
sliMeasurementService.recordFailure(sliName, errorType, enhancedLabels);
} catch (Exception e) {
logger.warn("Failed to record failure for SLI: {}", sliName, e);
}
}
private void recordLatency(String sliName, long duration, TimeUnit unit, Map<String, String> labels) {
try {
sliMeasurementService.recordLatency(sliName, duration, unit, labels);
} catch (Exception e) {
logger.warn("Failed to record latency for SLI: {}", sliName, e);
}
}
private <T extends java.lang.annotation.Annotation> T getAnnotation(ProceedingJoinPoint joinPoint, Class<T> annotationClass) {
MethodSignature signature = (MethodSignature) joinPoint.getSignature();
Method method = signature.getMethod();
return method.getAnnotation(annotationClass);
}
private Map<String, String> extractLabels(ProceedingJoinPoint joinPoint, String[] labelDefinitions) {
Map<String, String> labels = new HashMap<>();
for (String labelDef : labelDefinitions) {
String[] parts = labelDef.split("=", 2);
if (parts.length == 2) {
labels.put(parts[0].trim(), parts[1].trim());
}
}
// Add method name as label
MethodSignature signature = (MethodSignature) joinPoint.getSignature();
labels.put("method", signature.getMethod().getName());
labels.put("class", signature.getDeclaringType().getSimpleName());
return labels;
}
}
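To make the label handling in extractLabels concrete, the following is a minimal standalone sketch of the same key=value parsing, runnable without Spring. The class name LabelParsingSketch and the sample label strings are hypothetical and exist only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical standalone sketch; mirrors the key=value parsing done in extractLabels.
class LabelParsingSketch {

    // Parses entries such as "region=eu"; entries without '=' are skipped.
    static Map<String, String> parse(String... labelDefinitions) {
        Map<String, String> labels = new LinkedHashMap<>();
        for (String labelDef : labelDefinitions) {
            // A limit of 2 keeps any further '=' characters inside the value
            String[] parts = labelDef.split("=", 2);
            if (parts.length == 2) {
                labels.put(parts[0].trim(), parts[1].trim());
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // prints {region=eu, query=a=b}
        System.out.println(parse("region = eu", "query=a=b", "malformed"));
    }
}
```

Malformed entries are dropped silently in this sketch, as in the aspect; logging them may be worth considering, since a typo in an annotation would otherwise go unnoticed.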
SLI Measurement Annotations
MeasureAvailability.java
package com.example.sli.aspect;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
@Target(ElementType.METHOD)
@Retention(RetentionPolicy.RUNTIME)
public @interface MeasureAvailability {
String value();
String[] labels() default {};
}
MeasureLatency.java
package com.example.sli.aspect;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
@Target(ElementType.METHOD)
@Retention(RetentionPolicy.RUNTIME)
public @interface MeasureLatency {
String value();
String[] labels() default {};
}
MeasureThroughput.java
package com.example.sli.aspect;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
@Target(ElementType.METHOD)
@Retention(RetentionPolicy.RUNTIME)
public @interface MeasureThroughput {
String value();
String[] labels() default {};
}
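The aspect reads these annotations through method.getAnnotation, which only works because they are declared with RetentionPolicy.RUNTIME. The sketch below illustrates that lookup with plain reflection. The Measured annotation and the Sample class are hypothetical stand-ins with the same shape as the annotations above, not part of the article's code.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;

class AnnotationLookupSketch {

    // Hypothetical stand-in with the same shape as the @Measure* annotations.
    @Target(ElementType.METHOD)
    @Retention(RetentionPolicy.RUNTIME)
    @interface Measured {
        String value();
        String[] labels() default {};
    }

    static class Sample {
        @Measured(value = "sample-latency", labels = {"tier=gold"})
        public void annotated() { }

        public void plain() { }
    }

    // Returns the SLI name declared on the method, or null when it is not annotated.
    static String sliNameOf(Class<?> type, String methodName) {
        try {
            Method method = type.getMethod(methodName);
            Measured measured = method.getAnnotation(Measured.class);
            return measured == null ? null : measured.value();
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("No such method: " + methodName, e);
        }
    }
}
```

With the default CLASS retention, the same lookup would return null at runtime and the aspect could not read the SLI name.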
Step 5: Business Service with SLI Integration
Order Service with SLI Measurements
OrderService.java
package com.example.sli.service;
import com.example.sli.aspect.MeasureAvailability;
import com.example.sli.aspect.MeasureLatency;
import com.example.sli.aspect.MeasureThroughput;
import com.example.sli.model.ServiceLevelObjective;
import com.example.sli.model.SliType;
import com.example.sli.registry.SliRegistry;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Service;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.CompletableFuture;
@Service
public class OrderService {
private static final Logger logger = LoggerFactory.getLogger(OrderService.class);
private final SliRegistry sliRegistry;
private final SliMeasurementService sliMeasurementService;
private final Random random = new Random();
public OrderService(SliRegistry sliRegistry, SliMeasurementService sliMeasurementService) {
this.sliRegistry = sliRegistry;
this.sliMeasurementService = sliMeasurementService;
initializeOrderSLOs();
}
private void initializeOrderSLOs() {
// Order processing availability
ServiceLevelObjective orderAvailability = new ServiceLevelObjective(
"order-processing-availability",
SliType.AVAILABILITY,
0.995, // 99.5%
Duration.ofDays(28)
);
orderAvailability.setDescription("Order processing service availability");
orderAvailability.setLabels(Map.of("service", "order", "component", "processing"));
// Order processing latency
ServiceLevelObjective orderLatency = new ServiceLevelObjective(
"order-processing-latency-p95",
SliType.LATENCY,
2.0, // 2 seconds
Duration.ofDays(28)
);
orderLatency.setDescription("95th percentile order processing latency");
orderLatency.setLabels(Map.of("service", "order", "component", "processing"));
// Order creation throughput
ServiceLevelObjective orderThroughput = new ServiceLevelObjective(
"order-creation-throughput",
SliType.THROUGHPUT,
50.0, // 50 orders per second
Duration.ofDays(7)
);
orderThroughput.setDescription("Order creation throughput");
orderThroughput.setLabels(Map.of("service", "order", "component", "creation"));
sliRegistry.registerObjective(orderAvailability);
sliRegistry.registerObjective(orderLatency);
sliRegistry.registerObjective(orderThroughput);
}
@MeasureAvailability("order-processing-availability")
@MeasureLatency("order-processing-latency-p95")
@MeasureThroughput("order-creation-throughput")
public Order processOrder(OrderRequest request) {
simulateProcessing(100, 500); // Simulate processing time
// Simulate occasional failures
if (random.nextDouble() < 0.02) { // 2% failure rate
throw new OrderProcessingException("Simulated order processing failure");
}
Order order = createOrder(request);
logger.info("Order processed successfully: {}", order.getId());
return order;
}
@MeasureAvailability("order-retrieval-availability")
@MeasureLatency("order-retrieval-latency-p95")
public Order getOrder(String orderId) {
simulateProcessing(50, 200);
if (random.nextDouble() < 0.01) { // 1% failure rate
throw new OrderNotFoundException("Order not found: " + orderId);
}
return findOrder(orderId);
}
@MeasureAvailability("order-cancellation-availability")
@MeasureLatency("order-cancellation-latency-p95")
public boolean cancelOrder(String orderId, String reason) {
simulateProcessing(80, 300);
if (random.nextDouble() < 0.05) { // 5% failure rate
throw new OrderCancellationException("Order cancellation failed: " + orderId);
}
return performCancellation(orderId, reason);
}
@MeasureAvailability("order-bulk-processing-availability")
@MeasureLatency("order-bulk-processing-latency-p95")
@MeasureThroughput("order-bulk-processing-throughput")
public BulkOrderResult processBulkOrders(List<OrderRequest> requests) {
simulateProcessing(1000, 5000);
int successCount = 0;
int failureCount = 0;
List<Order> successfulOrders = new ArrayList<>();
List<OrderFailure> failedOrders = new ArrayList<>();
for (OrderRequest request : requests) {
try {
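// Self-invocation: this call goes to 'this', not to the Spring AOP proxy,
// so the @Measure* advice on processOrder is not applied here
Order order = processOrder(request);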
successfulOrders.add(order);
successCount++;
} catch (Exception e) {
failedOrders.add(new OrderFailure(request, e.getMessage()));
failureCount++;
}
}
return new BulkOrderResult(successfulOrders, failedOrders, successCount, failureCount);
}
public CompletableFuture<Order> processOrderAsync(OrderRequest request) {
return CompletableFuture.supplyAsync(() -> {
// processOrder is called directly on this instance, so the annotation-based
// advice does not apply; the async path is therefore measured manually
long startTime = System.nanoTime();
boolean success = false;
try {
Order order = processOrder(request);
success = true;
return order;
} finally {
long duration = System.nanoTime() - startTime;
Map<String, String> labels = Map.of(
"operation", "async_order_processing",
"order_type", request.getType()
);
if (success) {
sliMeasurementService.recordSuccess("order-async-processing", labels);
} else {
sliMeasurementService.recordFailure("order-async-processing", "async_error", labels);
}
sliMeasurementService.recordLatency("order-async-processing", duration,
java.util.concurrent.TimeUnit.NANOSECONDS, labels);
}
});
}
// Helper methods with simulated implementations
private void simulateProcessing(int minMs, int maxMs) {
try {
int delay = minMs + random.nextInt(maxMs - minMs);
Thread.sleep(delay);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
throw new RuntimeException("Processing interrupted", e);
}
}
private Order createOrder(OrderRequest request) {
return new Order(
"ORD-" + System.currentTimeMillis() + "-" + random.nextInt(1000),
request,
OrderStatus.CREATED
);
}
private Order findOrder(String orderId) {
return new Order(orderId, new OrderRequest("standard", "basic", 100.0, 0.0), OrderStatus.COMPLETED);
}
private boolean performCancellation(String orderId, String reason) {
return random.nextDouble() > 0.1; // 90% success rate
}
// Data classes
public static class OrderRequest {
private String type;
private String customerTier;
private double amount;
private double discount;
public OrderRequest(String type, String customerTier, double amount, double discount) {
this.type = type;
this.customerTier = customerTier;
this.amount = amount;
this.discount = discount;
}
public String getType() { return type; }
public String getCustomerTier() { return customerTier; }
}
public static class Order {
private String id;
private OrderRequest request;
private OrderStatus status;
public Order(String id, OrderRequest request, OrderStatus status) {
this.id = id;
this.request = request;
this.status = status;
}
public String getId() { return id; }
}
public static class BulkOrderResult {
private List<Order> successfulOrders;
private List<OrderFailure> failedOrders;
private int successCount;
private int failureCount;
public BulkOrderResult(List<Order> successfulOrders, List<OrderFailure> failedOrders,
int successCount, int failureCount) {
this.successfulOrders = successfulOrders;
this.failedOrders = failedOrders;
this.successCount = successCount;
this.failureCount = failureCount;
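}
// Getters so the bulk result can be serialized in API responses
public List<Order> getSuccessfulOrders() { return successfulOrders; }
public List<OrderFailure> getFailedOrders() { return failedOrders; }
public int getSuccessCount() { return successCount; }
public int getFailureCount() { return failureCount; }
}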
public static class OrderFailure {
private OrderRequest request;
private String error;
public OrderFailure(OrderRequest request, String error) {
this.request = request;
this.error = error;
}
public OrderRequest getRequest() { return request; }
public String getError() { return error; }
}
public enum OrderStatus {
CREATED, PROCESSING, COMPLETED, CANCELLED, FAILED
}
// Exception classes
public static class OrderProcessingException extends RuntimeException {
public OrderProcessingException(String message) { super(message); }
}
public static class OrderNotFoundException extends RuntimeException {
public OrderNotFoundException(String message) { super(message); }
}
public static class OrderCancellationException extends RuntimeException {
public OrderCancellationException(String message) { super(message); }
}
}
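One caveat applies to processBulkOrders and processOrderAsync: Spring AOP is proxy-based, so a call from one method of a bean to another method of the same bean does not pass through the proxy and the advice does not run. The sketch below reproduces this effect with a plain JDK dynamic proxy standing in for Spring's proxy. The Orders interface, PlainOrders class and counter are hypothetical names used only for this illustration.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class SelfInvocationSketch {

    // Hypothetical service interface for the illustration.
    interface Orders {
        String process(String request);
        int processBulk(List<String> requests);
    }

    static class PlainOrders implements Orders {
        @Override
        public String process(String request) {
            return "ORD-" + request;
        }

        @Override
        public int processBulk(List<String> requests) {
            int processed = 0;
            for (String request : requests) {
                process(request); // call on 'this', not on the proxy
                processed++;
            }
            return processed;
        }
    }

    // Wraps the target in a proxy that counts every call passing through it.
    static Orders countingProxy(Orders target, AtomicInteger intercepted) {
        InvocationHandler handler = (proxy, method, args) -> {
            intercepted.incrementAndGet();
            return method.invoke(target, args);
        };
        return (Orders) Proxy.newProxyInstance(
                Orders.class.getClassLoader(), new Class<?>[] {Orders.class}, handler);
    }

    // Number of intercepted calls when one bulk request is sent through the proxy.
    static int interceptedCallsForBulk(List<String> requests) {
        AtomicInteger intercepted = new AtomicInteger();
        countingProxy(new PlainOrders(), intercepted).processBulk(requests);
        return intercepted.get();
    }
}
```

For a bulk request with three orders, only the outer processBulk call is intercepted. If per-order measurements are wanted inside a bulk run, one option is to move processOrder into a separate bean so that the call crosses a proxy boundary.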
Step 6: REST Controller with SLI Integration
Order Controller with SLI Measurements
OrderController.java
package com.example.sli.controller;
import com.example.sli.model.SliMeasurement;
import com.example.sli.service.OrderService;
import com.example.sli.service.SliMeasurementService;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;
import java.time.Duration;
import java.util.List;
import java.util.Map;
@RestController
@RequestMapping("/api/orders")
public class OrderController {
private final OrderService orderService;
private final SliMeasurementService sliMeasurementService;
public OrderController(OrderService orderService, SliMeasurementService sliMeasurementService) {
this.orderService = orderService;
this.sliMeasurementService = sliMeasurementService;
}
@PostMapping
public ResponseEntity<?> createOrder(@RequestBody OrderService.OrderRequest request) {
try {
OrderService.Order order = orderService.processOrder(request);
return ResponseEntity.ok(order);
} catch (OrderService.OrderProcessingException e) {
// A processing failure is a server-side error, so report it as 5xx rather than 400
return ResponseEntity.internalServerError().body(Map.of("error", e.getMessage()));
}
}
@GetMapping("/{orderId}")
public ResponseEntity<?> getOrder(@PathVariable String orderId) {
try {
OrderService.Order order = orderService.getOrder(orderId);
return ResponseEntity.ok(order);
} catch (OrderService.OrderNotFoundException e) {
return ResponseEntity.notFound().build();
}
}
@DeleteMapping("/{orderId}")
public ResponseEntity<?> cancelOrder(@PathVariable String orderId,
@RequestParam String reason) {
try {
boolean success = orderService.cancelOrder(orderId, reason);
return ResponseEntity.ok(Map.of("cancelled", success));
} catch (OrderService.OrderCancellationException e) {
return ResponseEntity.badRequest().body(Map.of("error", e.getMessage()));
}
}
@PostMapping("/bulk")
public ResponseEntity<?> createBulkOrders(@RequestBody List<OrderService.OrderRequest> requests) {
try {
OrderService.BulkOrderResult result = orderService.processBulkOrders(requests);
return ResponseEntity.ok(result);
} catch (Exception e) {
return ResponseEntity.badRequest().body(Map.of("error", e.getMessage()));
}
}
@PostMapping("/async")
public ResponseEntity<?> createOrderAsync(@RequestBody OrderService.OrderRequest request) {
try {
// Fire and forget: the outcome is recorded by the order-async-processing SLI
orderService.processOrderAsync(request);
return ResponseEntity.accepted().body(Map.of("message", "Order processing started asynchronously"));
} catch (Exception e) {
return ResponseEntity.badRequest().body(Map.of("error", e.getMessage()));
}
}
// SLI monitoring endpoints
@GetMapping("/sli/current")
public ResponseEntity<List<SliMeasurement>> getCurrentSliMeasurements(
@RequestParam(defaultValue = "PT1H") String window) {
Duration duration = Duration.parse(window);
List<SliMeasurement> measurements = sliMeasurementService.getAllCurrentMeasurements(duration);
return ResponseEntity.ok(measurements);
}
@GetMapping("/sli/{sliName}")
public ResponseEntity<SliMeasurement> getSliMeasurement(
@PathVariable String sliName,
@RequestParam(defaultValue = "PT1H") String window) {
Duration duration = Duration.parse(window);
SliMeasurement measurement = sliMeasurementService.calculateCurrentSli(sliName, duration);
return ResponseEntity.ok(measurement);
}
@GetMapping("/sli/{sliName}/history")
public ResponseEntity<List<SliMeasurement>> getSliHistory(
@PathVariable String sliName,
@RequestParam(defaultValue = "PT24H") String period) {
Duration duration = Duration.parse(period);
List<SliMeasurement> history = sliMeasurementService.getMeasurementHistory(sliName, duration);
return ResponseEntity.ok(history);
}
}
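The SLI endpoints pass the window parameter straight to Duration.parse, which expects ISO-8601 text such as PT1H and throws DateTimeParseException for anything else. A minimal sketch of a more defensive parser follows. The class name, the parseWindow helper and the 28-day upper bound are assumptions made for this example, not part of the controller above.

```java
import java.time.Duration;
import java.time.format.DateTimeParseException;
import java.util.Optional;

class WindowParsingSketch {

    // Assumed upper bound for this sketch; it matches the 28-day SLO window used earlier.
    static final Duration MAX_WINDOW = Duration.ofDays(28);

    // Returns the parsed window, or empty when the text is missing, malformed or out of range.
    static Optional<Duration> parseWindow(String text) {
        if (text == null) {
            return Optional.empty();
        }
        try {
            Duration window = Duration.parse(text);
            if (window.isZero() || window.isNegative() || window.compareTo(MAX_WINDOW) > 0) {
                return Optional.empty();
            }
            return Optional.of(window);
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }
}
```

A controller method could map an empty result to a 400 response, so that a malformed window is reported as a client error instead of surfacing as an unhandled exception.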
Step 7: SLI Actuator Endpoint
Custom Actuator Endpoint
SliEndpoint.java
package com.example.sli.actuator;
import com.example.sli.model.SliMeasurement;
import com.example.sli.model.ServiceLevelObjective;
import com.example.sli.registry.SliRegistry;
import com.example.sli.service.SliMeasurementService;
import org.springframework.boot.actuate.endpoint.annotation.ReadOperation;
import org.springframework.boot.actuate.endpoint.annotation.Selector;
import org.springframework.boot.actuate.endpoint.annotation.WriteOperation;
import org.springframework.boot.actuate.endpoint.web.annotation.WebEndpoint;
import org.springframework.stereotype.Component;
import java.time.Duration;
import java.util.List;
import java.util.Map;
@Component
@WebEndpoint(id = "sli")
public class SliEndpoint {
private final SliRegistry sliRegistry;
private final SliMeasurementService sliMeasurementService;
public SliEndpoint(SliRegistry sliRegistry, SliMeasurementService sliMeasurementService) {
this.sliRegistry = sliRegistry;
this.sliMeasurementService = sliMeasurementService;
}
@ReadOperation
public Map<String, Object> sliInfo() {
List<ServiceLevelObjective> objectives = sliRegistry.getAllObjectives();
List<ServiceLevelObjective> globalObjectives = sliRegistry.getGlobalObjectives();
Duration defaultWindow = Duration.ofHours(1);
List<SliMeasurement> currentMeasurements = sliMeasurementService.getAllCurrentMeasurements(defaultWindow);
long meetingSLOs = currentMeasurements.stream()
.filter(m -> !m.isBreaching())
.count();
return Map.of(
"objectives", objectives,
"globalObjectives", globalObjectives,
"currentMeasurements", currentMeasurements,
"summary", Map.of(
"totalObjectives", objectives.size(),
"meetingSLOs", meetingSLOs,
// Derive breaches from the measurements themselves so the two counts always add up
"breachingSLOs", currentMeasurements.size() - meetingSLOs,
"defaultWindow", defaultWindow.toString()
)
);
}
@ReadOperation
public ServiceLevelObjective getObjective(@Selector String name) {
ServiceLevelObjective objective = sliRegistry.getObjective(name);
if (objective == null) {
throw new IllegalArgumentException("SLI objective not found: " + name);
}
return objective;
}
@ReadOperation
public SliMeasurement getMeasurement(@Selector String name,
@Selector String window) {
Duration duration = Duration.parse(window);
return sliMeasurementService.calculateCurrentSli(name, duration);
}
@WriteOperation
public ServiceLevelObjective updateObjective(@Selector String name,
Map<String, Object> updates) {
ServiceLevelObjective existing = sliRegistry.getObjective(name);
if (existing == null) {
throw new IllegalArgumentException("SLI objective not found: " + name);
}
// Apply updates (in real implementation, validate and apply changes properly)
if (updates.containsKey("target")) {
// JSON numbers may arrive as Integer, Long or Double, so convert via Number
existing.setTarget(((Number) updates.get("target")).doubleValue());
}
if (updates.containsKey("enabled")) {
existing.setEnabled((Boolean) updates.get("enabled"));
}
if (updates.containsKey("description")) {
existing.setDescription((String) updates.get("description"));
}
sliRegistry.updateObjective(existing);
return existing;
}
}
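The write operation receives a loosely typed map, so values should be checked before they are applied. The sketch below shows one way to read and validate the target value. The class name, the readTarget helper and the validation rules are assumptions for illustration, not the endpoint's actual behavior.

```java
import java.util.Map;

class ObjectiveUpdateSketch {

    // Reads a numeric target from a loosely typed map.
    // JSON parsers may produce Integer, Long or Double for the same field.
    static double readTarget(Map<String, Object> updates, double current) {
        Object raw = updates.get("target");
        if (raw == null) {
            return current; // key absent: keep the existing target
        }
        if (!(raw instanceof Number)) {
            throw new IllegalArgumentException("target must be numeric, got: " + raw);
        }
        double target = ((Number) raw).doubleValue();
        if (Double.isNaN(target) || target <= 0) {
            throw new IllegalArgumentException("target must be positive, got: " + target);
        }
        return target;
    }
}
```

Stricter rules may be appropriate per SLI type, for example requiring availability targets to lie between 0 and 1.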
Step 8: Prometheus Alert Rules for SLOs
SLO-Based Alerting Rules
slo-alerts.yml
groups:
  - name: slo_alerts
    rules:
      # Availability SLO Alerts
      # sum() aggregates away per-series labels (for example error_type on failures),
      # so the success and failure rates can be combined in one expression.
      - alert: OrderProcessingAvailabilityWarning
        expr: |
          (
            sum(rate(sli_success{sli_name="order-processing-availability"}[1h]))
            /
            (sum(rate(sli_success{sli_name="order-processing-availability"}[1h])) + sum(rate(sli_failure{sli_name="order-processing-availability"}[1h])))
          ) < 0.995
        for: 15m
        labels:
          severity: warning
          sli_name: order-processing-availability
          slo_target: "99.5%"
        annotations:
          summary: "Order processing availability below warning threshold"
          description: "Order processing availability is {{ $value | humanizePercentage }} (target: 99.5%)"
      - alert: OrderProcessingAvailabilityCritical
        expr: |
          (
            sum(rate(sli_success{sli_name="order-processing-availability"}[1h]))
            /
            (sum(rate(sli_success{sli_name="order-processing-availability"}[1h])) + sum(rate(sli_failure{sli_name="order-processing-availability"}[1h])))
          ) < 0.99
        for: 5m
        labels:
          severity: critical
          sli_name: order-processing-availability
          slo_target: "99.5%"
        annotations:
          summary: "Order processing availability critically low"
          description: "Order processing availability is {{ $value | humanizePercentage }} (target: 99.5%)"
      # Latency SLO Alerts
      - alert: OrderProcessingLatencyWarning
        expr: |
          histogram_quantile(0.95, rate(sli_latency_seconds_bucket{sli_name="order-processing-latency-p95"}[5m])) > 1.5
        for: 10m
        labels:
          severity: warning
          sli_name: order-processing-latency-p95
          slo_target: "2.0s p95"
        annotations:
          summary: "Order processing latency approaching SLO"
          description: "95th percentile latency is {{ $value | humanizeDuration }} (target: 2.0s)"
      - alert: OrderProcessingLatencyCritical
        expr: |
          histogram_quantile(0.95, rate(sli_latency_seconds_bucket{sli_name="order-processing-latency-p95"}[5m])) > 1.8
        for: 5m
        labels:
          severity: critical
          sli_name: order-processing-latency-p95
          slo_target: "2.0s p95"
        annotations:
          summary: "Order processing latency close to breaching SLO"
          description: "95th percentile latency is {{ $value | humanizeDuration }} (target: 2.0s)"
      # Error Budget Alerts
      # Remaining budget = (allowed failures - observed failures) / allowed failures over 28 days
      - alert: OrderProcessingErrorBudgetWarning
        expr: |
          (
            (1 - 0.995) * (sum(rate(sli_success{sli_name="order-processing-availability"}[28d])) + sum(rate(sli_failure{sli_name="order-processing-availability"}[28d]))) * 28 * 24 * 3600
            -
            sum(rate(sli_failure{sli_name="order-processing-availability"}[28d])) * 28 * 24 * 3600
          )
          /
          ((1 - 0.995) * (sum(rate(sli_success{sli_name="order-processing-availability"}[28d])) + sum(rate(sli_failure{sli_name="order-processing-availability"}[28d]))) * 28 * 24 * 3600)
          < 0.5
        for: 1h
        labels:
          severity: warning
          sli_name: order-processing-availability
        annotations:
          summary: "Order processing error budget below 50%"
          description: "Error budget remaining: {{ $value | humanizePercentage }}"
      - alert: OrderProcessingErrorBudgetCritical
        expr: |
          (
            (1 - 0.995) * (sum(rate(sli_success{sli_name="order-processing-availability"}[28d])) + sum(rate(sli_failure{sli_name="order-processing-availability"}[28d]))) * 28 * 24 * 3600
            -
            sum(rate(sli_failure{sli_name="order-processing-availability"}[28d])) * 28 * 24 * 3600
          )
          /
          ((1 - 0.995) * (sum(rate(sli_success{sli_name="order-processing-availability"}[28d])) + sum(rate(sli_failure{sli_name="order-processing-availability"}[28d]))) * 28 * 24 * 3600)
          < 0.1
        for: 30m
        labels:
          severity: critical
          sli_name: order-processing-availability
        annotations:
          summary: "Order processing error budget below 10%"
          description: "Error budget remaining: {{ $value | humanizePercentage }}"
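The error budget expressions above reduce to a simple ratio: the remaining budget is 1 minus the observed error ratio divided by the error ratio the SLO allows. The following sketch works through that arithmetic in Java. The class and method names are hypothetical, and the request counts in the note below are made-up sample numbers.

```java
class ErrorBudgetSketch {

    // Fraction of the error budget still unspent:
    // 1 - (observed error ratio / allowed error ratio).
    // A negative result means the budget is overspent.
    static double remainingBudget(long successes, long failures, double sloTarget) {
        long total = successes + failures;
        if (total == 0) {
            return 1.0; // no traffic, so nothing has been spent
        }
        double allowedErrorRatio = 1.0 - sloTarget;
        double observedErrorRatio = (double) failures / total;
        return 1.0 - observedErrorRatio / allowedErrorRatio;
    }
}
```

For example, with a 99.5% target, 998,000 successes and 2,000 failures give an observed error ratio of 0.2% against an allowed 0.5%, leaving roughly 60% of the budget. That is above both the 50% warning and the 10% critical threshold.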
Best Practices
1. SLI Design Principles
- User-Centric: Measure what users actually experience
- Simple and Focused: Each SLI should measure one specific aspect
- Actionable: SLIs should drive operational decisions
- Measurable: Ensure SLIs can be accurately measured
2. SLO Target Setting
- Start with targets you can already meet based on observed performance, then tighten them over time
- Consider business requirements and user expectations
- Account for dependencies and external factors
- Review and adjust SLOs regularly
3. Implementation Guidelines
- Use consistent naming conventions for SLIs
- Include relevant labels for dimensionality
- Implement proper error handling in measurement
- Monitor SLI measurement overhead
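The guideline on error handling in measurement can be sketched as follows: a failure in the metrics path should never change the result of the business call. This is a minimal illustration in the spirit of the try/catch wrappers in the aspect. The class name, the measured helper and the failure counter are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;
import java.util.function.Supplier;

class SafeMeasurementSketch {

    // Counts recorder failures so that broken instrumentation stays visible.
    static final AtomicInteger recorderFailures = new AtomicInteger();

    // Runs the business call and reports its outcome to the recorder.
    // Exceptions thrown by the recorder are swallowed and counted, so they
    // never replace the business result or the business exception.
    static <T> T measured(Supplier<T> businessCall, Consumer<Boolean> outcomeRecorder) {
        boolean success = false;
        try {
            T result = businessCall.get();
            success = true;
            return result;
        } finally {
            try {
                outcomeRecorder.accept(success);
            } catch (RuntimeException e) {
                recorderFailures.incrementAndGet();
            }
        }
    }
}
```

Counting recorder failures, instead of only logging them, makes it possible to alert on broken instrumentation as well.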
4. Operational Excellence
- Set up comprehensive alerting based on SLOs
- Implement error budget tracking and reporting
- Conduct regular SLO reviews with stakeholders
- Use SLIs for capacity planning and resource allocation
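For error budget tracking, a commonly used derived number is the burn rate: the observed error ratio divided by the error ratio the SLO allows. The sketch below is an illustration under that definition. The class and method names are hypothetical, and the term burn rate is introduced here and is not used elsewhere in this article.

```java
class BurnRateSketch {

    // Burn rate: observed error ratio divided by the error ratio the SLO allows.
    // A value of 1.0 spends the budget exactly over the SLO window.
    static double burnRate(double observedErrorRatio, double sloTarget) {
        return observedErrorRatio / (1.0 - sloTarget);
    }

    // Days until the window's budget is spent if the burn rate stays constant.
    // Returns infinity when nothing is being burned.
    static double daysToExhaustion(double burnRate, int windowDays) {
        return windowDays / burnRate;
    }
}
```

Applied to the example service: processOrder simulates a 2% failure rate against a 99.5% availability target, which gives a burn rate of about 4. At that rate the 28-day budget is spent in about 7 days, so the sample alerts should be expected to fire when running the demo.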
Conclusion
This SLI implementation provides:
- Complete SLI Framework: Types, objectives, measurements, and calculations
- Aspect-Oriented Integration: Non-intrusive SLI measurement with annotations
- Prometheus Metrics: Prometheus integration with SLO-based alert rules
- Business Service Integration: Real-world examples with order processing
- Monitoring and Reporting: Actuator endpoints and dashboards
Key Benefits:
- User-Centric Monitoring: Focus on what matters to users
- Proactive Alerting: Alert on warning thresholds and error budget consumption before the SLO itself is breached
- Data-Driven Decisions: Use SLIs for prioritization and resource allocation
- Service Reliability: Continuous improvement through SLO tracking
- Stakeholder Alignment: Clear, measurable service quality targets
Together, these pieces give you a Service Level Indicator framework that reports your service's reliability and performance from the user's perspective.