Incident Response Runbooks in Java: Building Automated Incident Management Systems

Incident response runbooks are predefined procedures for detecting, diagnosing, and resolving system incidents. By implementing runbooks in Java, you can automate incident detection, classification, and initial response, reducing mean time to resolution (MTTR).


Core Concepts

What are Incident Response Runbooks?

  • Structured procedures for handling specific types of incidents
  • Automated workflows for detection, analysis, and remediation
  • Integration with monitoring, alerting, and communication systems

Key Components:

  • Incident Detection: Automated monitoring and alert triggers
  • Classification: Categorizing incidents by severity and type
  • Remediation Actions: Automated recovery steps
  • Communication: Notifying stakeholders and teams
  • Documentation: Logging actions and outcomes

Dependencies and Setup

Maven Dependencies
<properties>
<spring-boot.version>3.1.0</spring-boot.version>
<quartz.version>2.3.2</quartz.version>
<resilience4j.version>2.1.0</resilience4j.version>
</properties>
<dependencies>
<!-- Spring Boot -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
<version>${spring-boot.version}</version>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-jpa</artifactId>
<version>${spring-boot.version}</version>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
<version>${spring-boot.version}</version>
</dependency>
<!-- Scheduling -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-quartz</artifactId>
<version>${spring-boot.version}</version>
</dependency>
<!-- Resilience -->
<dependency>
<groupId>io.github.resilience4j</groupId>
<artifactId>resilience4j-spring-boot3</artifactId>
<version>${resilience4j.version}</version>
</dependency>
<!-- Email -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-mail</artifactId>
<version>${spring-boot.version}</version>
</dependency>
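<!-- Monitoring (no version given: assumes the Spring Boot parent POM or BOM manages it) -->
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
<!-- Caching (CacheManager is used by ClearCacheAction) -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-cache</artifactId>
<version>${spring-boot.version}</version>
</dependency>
<!-- The examples below also need Lombok (@Slf4j), a PostgreSQL JDBC driver,
and the Fabric8 kubernetes-client (RestartServiceAction) on the classpath -->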
</dependencies>

Core Domain Models

1. Incident Model
@Entity
@Table(name = "incidents")
public class Incident {
@Id
@GeneratedValue(strategy = GenerationType.UUID)
private String id;
private String title;
private String description;
@Enumerated(EnumType.STRING)
private IncidentSeverity severity;
@Enumerated(EnumType.STRING)
private IncidentStatus status;
@Enumerated(EnumType.STRING)
private IncidentType type;
private String service;
private String component;
@ElementCollection
private Map<String, String> metadata = new HashMap<>();
private Instant detectedAt;
private Instant acknowledgedAt;
private Instant resolvedAt;
@OneToMany(mappedBy = "incident", cascade = CascadeType.ALL)
private List<IncidentAction> actions = new ArrayList<>();
@OneToMany(mappedBy = "incident", cascade = CascadeType.ALL)
private List<IncidentNotification> notifications = new ArrayList<>();
// Constructors, getters, setters
public Incident() {}
public Incident(String title, String description, IncidentSeverity severity, 
IncidentType type, String service, String component) {
this.title = title;
this.description = description;
this.severity = severity;
this.type = type;
this.service = service;
this.component = component;
this.status = IncidentStatus.OPEN;
this.detectedAt = Instant.now();
}
public void acknowledge() {
this.status = IncidentStatus.ACKNOWLEDGED;
this.acknowledgedAt = Instant.now();
}
public void resolve() {
this.status = IncidentStatus.RESOLVED;
this.resolvedAt = Instant.now();
}
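public void addAction(IncidentAction action) {
action.setIncident(this); // keep the owning side of the relationship in sync
this.actions.add(action);
}
public void addNotification(IncidentNotification notification) {
notification.setIncident(this);
this.notifications.add(notification);
}
// Used by the detectors and the REST controller to attach context to an incident
public void addMetadata(String key, String value) {
this.metadata.put(key, value);
}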
public Duration getTimeToAcknowledge() {
if (acknowledgedAt == null || detectedAt == null) return null;
return Duration.between(detectedAt, acknowledgedAt);
}
public Duration getTimeToResolve() {
if (resolvedAt == null || detectedAt == null) return null;
return Duration.between(detectedAt, resolvedAt);
}
}
public enum IncidentSeverity {
SEV1_CRITICAL,    // Service down, customer impact
SEV2_HIGH,        // Major degradation
SEV3_MEDIUM,      // Minor issues, limited impact
SEV4_LOW          // Informational, no immediate impact
}
public enum IncidentStatus {
OPEN,
ACKNOWLEDGED,
IN_PROGRESS,
RESOLVED,
CLOSED
}
public enum IncidentType {
DATABASE_UNAVAILABLE,
HIGH_CPU_USAGE,
MEMORY_LEAK,
NETWORK_LATENCY,
API_ERRORS,
DISK_SPACE,
SERVICE_UNAVAILABLE,
SECURITY_BREACH
}
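
The `acknowledge()` and `resolve()` methods above change the status unconditionally. The following is a minimal sketch of how transitions could be validated first. The transition table, the `LifecycleStatus` enum (a stand-in for `IncidentStatus`), and the `IncidentLifecycle` class are assumptions made for illustration and are not part of the original model.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Stand-in for IncidentStatus, redeclared so the sketch compiles on its own.
enum LifecycleStatus { OPEN, ACKNOWLEDGED, IN_PROGRESS, RESOLVED, CLOSED }

final class IncidentLifecycle {
    // Assumed transition table: each status maps to the statuses it may move to.
    private static final Map<LifecycleStatus, Set<LifecycleStatus>> ALLOWED =
            new EnumMap<>(LifecycleStatus.class);

    static {
        ALLOWED.put(LifecycleStatus.OPEN,
                EnumSet.of(LifecycleStatus.ACKNOWLEDGED, LifecycleStatus.RESOLVED));
        ALLOWED.put(LifecycleStatus.ACKNOWLEDGED,
                EnumSet.of(LifecycleStatus.IN_PROGRESS, LifecycleStatus.RESOLVED));
        ALLOWED.put(LifecycleStatus.IN_PROGRESS,
                EnumSet.of(LifecycleStatus.RESOLVED));
        // A resolved incident may be closed, or reopened if the problem returns.
        ALLOWED.put(LifecycleStatus.RESOLVED,
                EnumSet.of(LifecycleStatus.CLOSED, LifecycleStatus.OPEN));
        ALLOWED.put(LifecycleStatus.CLOSED,
                EnumSet.noneOf(LifecycleStatus.class));
    }

    private IncidentLifecycle() {}

    static boolean canTransition(LifecycleStatus from, LifecycleStatus to) {
        return ALLOWED.get(from).contains(to);
    }

    // Returns the new status, or throws if the move is not in the table.
    static LifecycleStatus transition(LifecycleStatus from, LifecycleStatus to) {
        if (!canTransition(from, to)) {
            throw new IllegalStateException("Illegal transition: " + from + " -> " + to);
        }
        return to;
    }
}
```

With a guard like this, `resolve()` on an already closed incident would fail loudly instead of overwriting `resolvedAt`.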
2. Incident Actions and Notifications
@Entity
@Table(name = "incident_actions")
public class IncidentAction {
@Id
@GeneratedValue(strategy = GenerationType.UUID)
private String id;
@ManyToOne
@JoinColumn(name = "incident_id")
@JsonIgnore // break the Incident <-> action cycle when the REST API serializes entities
private Incident incident;
private String actionType;
private String description;
private String parameters;
@Enumerated(EnumType.STRING)
private ActionStatus status;
private String executedBy;
private Instant executedAt;
private String result;
private String errorMessage;
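// Constructors, getters, setters
}
public enum ActionStatus {
PENDING,      // Waiting for manual execution
COMPLETED,    // Finished successfully
FAILED        // Finished with an error
}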
@Entity
@Table(name = "incident_notifications")
public class IncidentNotification {
@Id
@GeneratedValue(strategy = GenerationType.UUID)
private String id;
@ManyToOne
@JoinColumn(name = "incident_id")
@JsonIgnore // break the Incident <-> notification cycle when the REST API serializes entities
private Incident incident;
private String channel; // EMAIL, SLACK, PAGER_DUTY, SMS
private String recipient;
private String message;
private Instant sentAt;
private boolean success;
private String errorMessage;
// Constructors, getters, setters
}
3. Runbook Templates
@Entity
@Table(name = "runbook_templates")
public class RunbookTemplate {
@Id
@GeneratedValue(strategy = GenerationType.UUID)
private String id;
private String name;
private String description;
@Enumerated(EnumType.STRING)
private IncidentType incidentType;
@Enumerated(EnumType.STRING)
private IncidentSeverity severity;
@ElementCollection
@CollectionTable(name = "runbook_steps")
@OrderColumn(name = "step_order")
private List<RunbookStep> steps = new ArrayList<>();
private boolean enabled = true;
private int version = 1;
// Constructors, getters, setters
}
@Embeddable
public class RunbookStep {
@Column(name = "step_number") // "order" is a reserved word in SQL
private int order;
private String name;
private String description;
@Enumerated(EnumType.STRING)
private StepType type; // AUTOMATED, MANUAL, DECISION, NOTIFICATION
private String actionClass; // For automated steps
private String parameters;
private String successCriteria;
private Integer timeoutSeconds;
// Constructors, getters, setters
}
public enum StepType {
AUTOMATED,    // Executed automatically by system
MANUAL,       // Requires human intervention
DECISION,     // Conditional branching
NOTIFICATION  // Send notifications
}

Core Incident Management

1. Incident Detection Service
@Service
@Slf4j
public class IncidentDetectionService {
private final IncidentService incidentService;
private final RunbookService runbookService;
private final List<IncidentDetector> detectors;
public IncidentDetectionService(IncidentService incidentService, 
RunbookService runbookService,
List<IncidentDetector> detectors) {
this.incidentService = incidentService;
this.runbookService = runbookService;
this.detectors = detectors;
}
@Scheduled(fixedRate = 30000) // Check every 30 seconds
public void detectIncidents() {
log.info("Running incident detection checks");
for (IncidentDetector detector : detectors) {
try {
Optional<Incident> incidentOpt = detector.detect();
if (incidentOpt.isPresent()) {
Incident incident = incidentOpt.get();
log.info("Detected incident: {} - {}", incident.getType(), incident.getTitle());
// Check if similar incident already exists
if (!incidentService.hasActiveIncident(incident.getType(), incident.getService())) {
Incident created = incidentService.createIncident(incident);
runbookService.executeRunbook(created);
}
}
} catch (Exception e) {
// Pass the exception itself so the stack trace is logged
log.error("Error in incident detector {}", detector.getClass().getSimpleName(), e);
}
}
}
}
public interface IncidentDetector {
Optional<Incident> detect();
}
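
The detection loop above relies on `incidentService.hasActiveIncident(...)` to avoid opening duplicates, but that service is not shown. As a rough illustration of the idea, here is an in-memory sketch. `ActiveIncidentRegistry` and its methods are hypothetical names, and a real implementation would more likely query the incidents table or rely on a unique constraint.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class ActiveIncidentRegistry {
    // Keys combine incident type and service, e.g. "HIGH_CPU_USAGE|application".
    private final Set<String> activeKeys = ConcurrentHashMap.newKeySet();

    private static String key(String type, String service) {
        return type + "|" + service;
    }

    // Atomically claims the (type, service) slot.
    // Returns false when an active incident already holds it.
    boolean tryOpen(String type, String service) {
        return activeKeys.add(key(type, service));
    }

    boolean hasActiveIncident(String type, String service) {
        return activeKeys.contains(key(type, service));
    }

    // Frees the slot so a later recurrence opens a new incident.
    void markResolved(String type, String service) {
        activeKeys.remove(key(type, service));
    }
}
```

Note the design choice: `tryOpen` checks and claims in one step, whereas a separate "check, then create" sequence can let two concurrent detectors both pass the check.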
@Component
@Slf4j
public class DatabaseHealthDetector implements IncidentDetector {
private final DataSource dataSource;
private final HealthEndpoint healthEndpoint;
private static final String CHECK_SQL = "SELECT 1";
public DatabaseHealthDetector(DataSource dataSource, HealthEndpoint healthEndpoint) {
this.dataSource = dataSource;
this.healthEndpoint = healthEndpoint;
}
@Override
public Optional<Incident> detect() {
try {
// Check database connectivity
try (Connection conn = dataSource.getConnection();
Statement stmt = conn.createStatement()) {
stmt.executeQuery(CHECK_SQL);
}
// Check the actuator's view of the database (the "db" health contributor)
HealthComponent health = healthEndpoint.healthForPath("db");
if (health != null && Status.DOWN.equals(health.getStatus())) {
return createDatabaseIncident("Database health check failed");
}
} catch (Exception e) {
log.warn("Database health check failed: {}", e.getMessage());
return createDatabaseIncident("Database connection failed: " + e.getMessage());
}
return Optional.empty();
}
private Optional<Incident> createDatabaseIncident(String reason) {
Incident incident = new Incident(
"Database Unavailable",
reason,
IncidentSeverity.SEV1_CRITICAL,
IncidentType.DATABASE_UNAVAILABLE,
"database",
"primary-db"
);
incident.addMetadata("detection_time", Instant.now().toString());
incident.addMetadata("error_type", "connectivity");
return Optional.of(incident);
}
}
@Component
@Slf4j
public class HighCpuDetector implements IncidentDetector {
private final OperatingSystemMXBean osBean;
private final IncidentService incidentService;
private static final double CPU_THRESHOLD = 0.85; // 85%
private static final int SAMPLE_COUNT = 5;
private final LinkedList<Double> cpuSamples = new LinkedList<>();
public HighCpuDetector(IncidentService incidentService) {
this.osBean = ManagementFactory.getOperatingSystemMXBean();
this.incidentService = incidentService;
}
@Override
public Optional<Incident> detect() {
double cpuUsage = getCpuUsage();
cpuSamples.add(cpuUsage);
// Keep only recent samples
if (cpuSamples.size() > SAMPLE_COUNT) {
cpuSamples.removeFirst();
}
// Check if we have sustained high CPU
if (cpuSamples.size() == SAMPLE_COUNT && 
cpuSamples.stream().allMatch(usage -> usage > CPU_THRESHOLD)) {
double averageCpu = cpuSamples.stream().mapToDouble(Double::doubleValue).average().orElse(0);
Incident incident = new Incident(
"High CPU Usage Detected",
String.format("Sustained high CPU usage: %.2f%%", averageCpu * 100),
IncidentSeverity.SEV2_HIGH,
IncidentType.HIGH_CPU_USAGE,
"application",
"main-service"
);
incident.addMetadata("cpu_usage", String.valueOf(averageCpu));
incident.addMetadata("threshold", String.valueOf(CPU_THRESHOLD));
incident.addMetadata("sample_count", String.valueOf(SAMPLE_COUNT));
return Optional.of(incident);
}
return Optional.empty();
}
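private double getCpuUsage() {
if (osBean instanceof com.sun.management.OperatingSystemMXBean) {
com.sun.management.OperatingSystemMXBean sunOsBean = 
(com.sun.management.OperatingSystemMXBean) osBean;
// getSystemCpuLoad() is deprecated since JDK 14; getCpuLoad() replaces it.
// A negative value means the load is not available yet, so treat it as 0.
return Math.max(0.0, sunOsBean.getCpuLoad());
}
return 0.0;
}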
}
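
The sliding-window check in `HighCpuDetector` is hard to unit test because it reads the live CPU load. One possible way to isolate the logic, sketched here under the assumption that the reading is supplied through a `DoubleSupplier`, is a small helper class. `SustainedThresholdCheck` is a hypothetical name and not part of the original code.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.DoubleSupplier;

final class SustainedThresholdCheck {
    private final DoubleSupplier sampler;
    private final double threshold;
    private final int windowSize;
    private final Deque<Double> window = new ArrayDeque<>();

    SustainedThresholdCheck(DoubleSupplier sampler, double threshold, int windowSize) {
        if (windowSize < 1) {
            throw new IllegalArgumentException("windowSize must be >= 1");
        }
        this.sampler = sampler;
        this.threshold = threshold;
        this.windowSize = windowSize;
    }

    // Takes one sample and reports whether the window is full
    // and every sample in it exceeds the threshold.
    synchronized boolean sampleAndCheck() {
        double value = sampler.getAsDouble();
        if (Double.isNaN(value) || value < 0) {
            return false; // reading unavailable: ignore it and keep the window as is
        }
        window.addLast(value);
        if (window.size() > windowSize) {
            window.removeFirst(); // drop the oldest sample
        }
        return window.size() == windowSize
                && window.stream().allMatch(v -> v > threshold);
    }
}
```

A detector could then pass `() -> sunOsBean.getCpuLoad()` in production and a canned sequence of values in tests.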
2. Runbook Execution Engine
@Service
@Slf4j
public class RunbookService {
private final RunbookTemplateRepository templateRepository;
private final IncidentService incidentService;
private final ActionExecutor actionExecutor;
private final NotificationService notificationService;
public RunbookService(RunbookTemplateRepository templateRepository,
IncidentService incidentService,
ActionExecutor actionExecutor,
NotificationService notificationService) {
this.templateRepository = templateRepository;
this.incidentService = incidentService;
this.actionExecutor = actionExecutor;
this.notificationService = notificationService;
}
public void executeRunbook(Incident incident) {
Optional<RunbookTemplate> templateOpt = findMatchingTemplate(incident);
if (templateOpt.isEmpty()) {
log.warn("No runbook template found for incident type: {}", incident.getType());
return;
}
RunbookTemplate template = templateOpt.get();
log.info("Executing runbook '{}' for incident: {}", template.getName(), incident.getId());
RunbookExecutionContext context = new RunbookExecutionContext(incident);
for (RunbookStep step : template.getSteps()) {
if (!executeStep(step, context)) {
log.error("Runbook step failed: {} for incident: {}", step.getName(), incident.getId());
break;
}
}
}
private Optional<RunbookTemplate> findMatchingTemplate(Incident incident) {
return templateRepository.findByIncidentTypeAndSeverityAndEnabledTrue(
incident.getType(), incident.getSeverity());
}
private boolean executeStep(RunbookStep step, RunbookExecutionContext context) {
log.info("Executing runbook step: {} for incident: {}", 
step.getName(), context.getIncident().getId());
try {
switch (step.getType()) {
case AUTOMATED:
return executeAutomatedStep(step, context);
case MANUAL:
return createManualAction(step, context);
case NOTIFICATION:
return executeNotificationStep(step, context);
case DECISION:
return evaluateDecisionStep(step, context);
default:
log.warn("Unknown step type: {}", step.getType());
return false;
}
} catch (Exception e) {
log.error("Error executing runbook step {}: {}", step.getName(), e.getMessage());
return false;
}
}
private boolean executeAutomatedStep(RunbookStep step, RunbookExecutionContext context) {
IncidentAction action = new IncidentAction();
action.setActionType("AUTOMATED");
action.setDescription(step.getDescription());
action.setParameters(step.getParameters());
action.setExecutedAt(Instant.now());
action.setExecutedBy("system");
try {
String result = actionExecutor.executeAction(step.getActionClass(), step.getParameters(), context);
action.setStatus(ActionStatus.COMPLETED);
action.setResult(result);
context.getIncident().addAction(action);
incidentService.saveIncident(context.getIncident());
return true;
} catch (Exception e) {
action.setStatus(ActionStatus.FAILED);
action.setErrorMessage(e.getMessage());
context.getIncident().addAction(action);
incidentService.saveIncident(context.getIncident());
return false;
}
}
private boolean createManualAction(RunbookStep step, RunbookExecutionContext context) {
IncidentAction action = new IncidentAction();
action.setActionType("MANUAL");
action.setDescription(step.getDescription());
action.setParameters(step.getParameters());
action.setStatus(ActionStatus.PENDING);
context.getIncident().addAction(action);
incidentService.saveIncident(context.getIncident());
// Notify team about manual action required
notificationService.notifyManualActionRequired(context.getIncident(), step);
return true; // Manual steps don't block execution
}
private boolean executeNotificationStep(RunbookStep step, RunbookExecutionContext context) {
return notificationService.sendIncidentNotification(context.getIncident(), step.getParameters());
}
private boolean evaluateDecisionStep(RunbookStep step, RunbookExecutionContext context) {
// Implement decision logic based on step parameters and incident context
log.info("Evaluating decision step: {}", step.getName());
return true; // Simplified implementation
}
}
public class RunbookExecutionContext {
private final Incident incident;
private final Map<String, Object> variables = new HashMap<>();
public RunbookExecutionContext(Incident incident) {
this.incident = incident;
}
// Getters, setters
public Incident getIncident() { return incident; }
public Map<String, Object> getVariables() { return variables; }
public void setVariable(String key, Object value) {
variables.put(key, value);
}
public Object getVariable(String key) {
return variables.get(key);
}
}
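
`evaluateDecisionStep` above is a stub that always returns true. As one possible direction, the sketch below evaluates a simple numeric condition against the variables held in `RunbookExecutionContext`. The condition syntax (`<variable> <operator> <number>`, separated by whitespace) and the `DecisionEvaluator` class are assumptions for illustration. The original does not define a parameter format for decision steps.

```java
import java.util.Map;

final class DecisionEvaluator {
    private DecisionEvaluator() {}

    // Evaluates a condition of the assumed form "<variable> <op> <number>",
    // for example "cpu_usage > 0.9". Supported operators: >, >=, <, <=, ==.
    static boolean evaluate(String condition, Map<String, Object> variables) {
        String[] parts = condition.trim().split("\\s+");
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                    "Expected '<variable> <op> <number>' but got: " + condition);
        }
        Object raw = variables.get(parts[0]);
        if (!(raw instanceof Number)) {
            return false; // missing or non-numeric variable: condition not satisfied
        }
        double actual = ((Number) raw).doubleValue();
        double expected = Double.parseDouble(parts[2]);
        switch (parts[1]) {
            case ">":  return actual > expected;
            case ">=": return actual >= expected;
            case "<":  return actual < expected;
            case "<=": return actual <= expected;
            case "==": return actual == expected;
            default:
                throw new IllegalArgumentException("Unsupported operator: " + parts[1]);
        }
    }
}
```

Because `executeRunbook` stops at the first step that returns false, a decision step wired this way would act as a gate: the remaining steps run only when the condition holds.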
3. Action Execution Engine
@Service
@Slf4j
public class ActionExecutor {
private final ApplicationContext applicationContext;
private final Map<String, Class<? extends AutomatedAction>> actionRegistry = new HashMap<>();
public ActionExecutor(ApplicationContext applicationContext) {
this.applicationContext = applicationContext;
registerDefaultActions();
}
public String executeAction(String actionClass, String parameters, RunbookExecutionContext context) {
Class<? extends AutomatedAction> actionClazz = actionRegistry.get(actionClass);
if (actionClazz == null) {
throw new IllegalArgumentException("Unknown action class: " + actionClass);
}
AutomatedAction action = applicationContext.getBean(actionClazz);
return action.execute(parameters, context);
}
private void registerDefaultActions() {
actionRegistry.put("restartService", RestartServiceAction.class);
actionRegistry.put("scaleService", ScaleServiceAction.class);
actionRegistry.put("clearCache", ClearCacheAction.class);
actionRegistry.put("failoverDatabase", DatabaseFailoverAction.class);
actionRegistry.put("runDiagnostics", DiagnosticsAction.class);
}
}
public interface AutomatedAction {
String execute(String parameters, RunbookExecutionContext context);
}
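
`RunbookStep` carries a `timeoutSeconds` field, but the executor above never applies it. The following sketch shows one way a time limit could be enforced around an action, using only `java.util.concurrent`. `TimedActionRunner` is a hypothetical helper, and the choice to surface failures as `IllegalStateException` is an assumption.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimedActionRunner implements AutoCloseable {
    // Daemon threads so a stuck action cannot keep the JVM alive.
    private final ExecutorService executor = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "runbook-action");
        t.setDaemon(true);
        return t;
    });

    // Runs the action and waits at most `timeout` for its result.
    String run(Callable<String> action, Duration timeout) {
        Future<String> future = executor.submit(action);
        try {
            return future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker thread
            throw new IllegalStateException("Action timed out after " + timeout, e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(
                    "Action failed: " + e.getCause().getMessage(), e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("Interrupted while waiting for action", e);
        }
    }

    @Override
    public void close() {
        executor.shutdownNow();
    }
}
```

Cancellation only interrupts the worker thread, so an action that ignores interrupts may keep running in the background even after the timeout is reported.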
@Component
@Slf4j
public class RestartServiceAction implements AutomatedAction {
private final KubernetesClient kubernetesClient;
public RestartServiceAction(KubernetesClient kubernetesClient) {
this.kubernetesClient = kubernetesClient;
}
@Override
public String execute(String parameters, RunbookExecutionContext context) {
try {
// Parse parameters: "namespace:default,deployment:user-service"
Map<String, String> params = parseParameters(parameters);
String namespace = params.getOrDefault("namespace", "default");
String deployment = params.get("deployment");
if (deployment == null) {
throw new IllegalArgumentException("Deployment name is required");
}
log.info("Restarting deployment {}/{}", namespace, deployment);
// Kubernetes restart logic
kubernetesClient.apps().deployments()
.inNamespace(namespace)
.withName(deployment)
.rolling().restart();
return String.format("Successfully restarted deployment %s in namespace %s", deployment, namespace);
} catch (Exception e) {
log.error("Failed to restart service: {}", e.getMessage());
throw new RuntimeException("Service restart failed: " + e.getMessage(), e);
}
}
private Map<String, String> parseParameters(String parameters) {
return Arrays.stream(parameters.split(","))
.map(param -> param.split(":"))
.filter(parts -> parts.length == 2)
.collect(Collectors.toMap(
parts -> parts[0].trim(),
parts -> parts[1].trim()
));
}
}
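
The `parseParameters` helper above has a few sharp edges: it throws a `NullPointerException` for null input, silently drops any pair whose value contains a colon (such as an image tag), and `Collectors.toMap` throws on duplicate keys. A more tolerant variant might look like the sketch below. `StepParameters` is a hypothetical name, and "last value wins" for duplicates is an assumed policy.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class StepParameters {
    private StepParameters() {}

    // Parses "key:value,key:value" into an insertion-ordered map.
    static Map<String, String> parse(String parameters) {
        Map<String, String> result = new LinkedHashMap<>();
        if (parameters == null || parameters.isBlank()) {
            return result; // no parameters configured
        }
        for (String pair : parameters.split(",")) {
            // Limit of 2 keeps any further ':' characters inside the value.
            String[] parts = pair.split(":", 2);
            if (parts.length != 2) {
                continue; // no separator: skip the malformed entry
            }
            String key = parts[0].trim();
            if (key.isEmpty()) {
                continue;
            }
            result.put(key, parts[1].trim()); // assumed policy: last value wins
        }
        return result;
    }
}
```

For example, `StepParameters.parse("namespace:default, deployment:user-service")` yields a map with the keys `namespace` and `deployment`.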
@Component
@Slf4j
public class ClearCacheAction implements AutomatedAction {
private final CacheManager cacheManager;
public ClearCacheAction(CacheManager cacheManager) {
this.cacheManager = cacheManager;
}
@Override
public String execute(String parameters, RunbookExecutionContext context) {
try {
if ("all".equalsIgnoreCase(parameters) || parameters == null) {
// Clear all caches
cacheManager.getCacheNames().forEach(cacheName -> {
Cache cache = cacheManager.getCache(cacheName);
if (cache != null) {
cache.clear();
}
});
return "Cleared all caches";
} else {
// Clear specific cache
Cache cache = cacheManager.getCache(parameters);
if (cache != null) {
cache.clear();
return "Cleared cache: " + parameters;
} else {
throw new IllegalArgumentException("Cache not found: " + parameters);
}
}
} catch (Exception e) {
log.error("Failed to clear cache: {}", e.getMessage());
throw new RuntimeException("Cache clearance failed: " + e.getMessage(), e);
}
}
}
4. Notification Service
@Service
@Slf4j
public class NotificationService {
private final JavaMailSender mailSender;
private final IncidentRepository incidentRepository;
private final NotificationTemplateService templateService;
@Value("${app.notification.email.from:incidents@example.com}") // placeholder default address
private String fromEmail;
@Value("${app.notification.email.team:oncall@example.com}") // placeholder default address
private String teamEmail;
public NotificationService(JavaMailSender mailSender,
IncidentRepository incidentRepository,
NotificationTemplateService templateService) {
this.mailSender = mailSender;
this.incidentRepository = incidentRepository;
this.templateService = templateService;
}
public boolean sendIncidentNotification(Incident incident, String notificationConfig) {
try {
Map<String, String> config = parseNotificationConfig(notificationConfig);
String channel = config.getOrDefault("channel", "EMAIL");
String recipients = config.getOrDefault("recipients", teamEmail);
switch (channel.toUpperCase()) {
case "EMAIL":
return sendEmailNotification(incident, recipients, config);
case "SLACK":
return sendSlackNotification(incident, recipients, config);
case "PAGER_DUTY":
return sendPagerDutyNotification(incident, config);
default:
log.warn("Unknown notification channel: {}", channel);
return false;
}
} catch (Exception e) {
log.error("Failed to send notification: {}", e.getMessage());
return false;
}
}
private boolean sendEmailNotification(Incident incident, String recipients, Map<String, String> config) {
try {
String subject = templateService.renderSubject(incident);
String body = templateService.renderBody(incident, config);
MimeMessage message = mailSender.createMimeMessage();
MimeMessageHelper helper = new MimeMessageHelper(message, true);
helper.setFrom(fromEmail);
helper.setTo(recipients.split(","));
helper.setSubject(subject);
helper.setText(body, true); // HTML content
mailSender.send(message);
log.info("Sent email notification for incident: {}", incident.getId());
saveNotificationRecord(incident, "EMAIL", recipients, body, true);
return true;
} catch (Exception e) {
log.error("Failed to send email notification: {}", e.getMessage());
saveNotificationRecord(incident, "EMAIL", recipients, null, false, e.getMessage());
return false;
}
}
public void notifyManualActionRequired(Incident incident, RunbookStep step) {
String subject = String.format("Manual Action Required: %s", incident.getTitle());
String body = String.format(
"Incident: %s\n\nManual step required: %s\n\nDescription: %s\n\nParameters: %s",
incident.getTitle(), step.getName(), step.getDescription(), step.getParameters()
);
// Send to on-call team
sendEmailNotification(incident, teamEmail, Map.of("subject", subject, "body", body));
}
private void saveNotificationRecord(Incident incident, String channel, String recipient, 
String message, boolean success) {
saveNotificationRecord(incident, channel, recipient, message, success, null);
}
private void saveNotificationRecord(Incident incident, String channel, String recipient,
String message, boolean success, String error) {
IncidentNotification notification = new IncidentNotification();
notification.setIncident(incident);
notification.setChannel(channel);
notification.setRecipient(recipient);
notification.setMessage(message);
notification.setSentAt(Instant.now());
notification.setSuccess(success);
notification.setErrorMessage(error);
incident.addNotification(notification);
incidentRepository.save(incident);
}
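private Map<String, String> parseNotificationConfig(String config) {
if (config == null || config.isBlank()) {
return Map.of(); // steps without parameters fall back to the defaults
}
return Arrays.stream(config.split(";"))
.map(param -> param.split("=", 2)) // limit 2 keeps '=' inside values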
.filter(parts -> parts.length == 2)
.collect(Collectors.toMap(
parts -> parts[0].trim(),
parts -> parts[1].trim()
));
}
// Placeholder methods for other channels
private boolean sendSlackNotification(Incident incident, String channel, Map<String, String> config) {
log.info("Sending Slack notification for incident: {} to channel: {}", incident.getId(), channel);
// Implement Slack webhook integration
return true;
}
private boolean sendPagerDutyNotification(Incident incident, Map<String, String> config) {
log.info("Sending PagerDuty notification for incident: {}", incident.getId());
// Implement PagerDuty API integration
return true;
}
}
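
The Slack method above is a placeholder. Slack incoming webhooks accept a JSON body with a `text` field, so a minimal sender can be sketched with the JDK's built-in `HttpClient`. `SlackWebhookSender`, the message layout, and the timeout values are assumptions for illustration. A real integration would likely use a JSON library and richer message formatting.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

final class SlackWebhookSender {
    private final HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();
    private final URI webhookUrl;

    SlackWebhookSender(String webhookUrl) {
        this.webhookUrl = URI.create(webhookUrl);
    }

    // Builds the JSON body: {"text":"[SEVERITY] title: description"}
    static String buildPayload(String severity, String title, String description) {
        String text = "[" + severity + "] " + title + ": " + description;
        return "{\"text\":\"" + escapeJson(text) + "\"}";
    }

    // Escapes the characters that are not allowed raw inside a JSON string.
    static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Posts the payload; returns true on a 2xx response.
    boolean send(String payload) {
        HttpRequest request = HttpRequest.newBuilder(webhookUrl)
                .timeout(Duration.ofSeconds(10))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload, StandardCharsets.UTF_8))
                .build();
        try {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            return response.statusCode() / 100 == 2;
        } catch (IOException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The webhook URL would come from the `app.notification.slack.webhook-url` property in the configuration section.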

REST API for Incident Management

1. Incident Controller
@RestController
@RequestMapping("/api/incidents")
@Slf4j
public class IncidentController {
private final IncidentService incidentService;
private final RunbookService runbookService;
public IncidentController(IncidentService incidentService, RunbookService runbookService) {
this.incidentService = incidentService;
this.runbookService = runbookService;
}
@GetMapping
public ResponseEntity<List<Incident>> getIncidents(
@RequestParam(required = false) IncidentStatus status,
@RequestParam(required = false) IncidentSeverity severity,
@RequestParam(required = false) String service) {
List<Incident> incidents = incidentService.findIncidents(status, severity, service);
return ResponseEntity.ok(incidents);
}
@GetMapping("/{id}")
public ResponseEntity<Incident> getIncident(@PathVariable String id) {
return incidentService.findById(id)
.map(ResponseEntity::ok)
.orElse(ResponseEntity.notFound().build());
}
@PostMapping
public ResponseEntity<Incident> createIncident(@RequestBody CreateIncidentRequest request) {
Incident incident = new Incident(
request.getTitle(),
request.getDescription(),
request.getSeverity(),
request.getType(),
request.getService(),
request.getComponent()
);
if (request.getMetadata() != null) { // metadata is optional in the request body
request.getMetadata().forEach(incident::addMetadata);
}
Incident created = incidentService.createIncident(incident);
runbookService.executeRunbook(created);
return ResponseEntity.status(HttpStatus.CREATED).body(created);
}
@PostMapping("/{id}/acknowledge")
public ResponseEntity<Incident> acknowledgeIncident(@PathVariable String id) {
return incidentService.acknowledgeIncident(id)
.map(ResponseEntity::ok)
.orElse(ResponseEntity.notFound().build());
}
@PostMapping("/{id}/resolve")
public ResponseEntity<Incident> resolveIncident(@PathVariable String id) {
return incidentService.resolveIncident(id)
.map(ResponseEntity::ok)
.orElse(ResponseEntity.notFound().build());
}
@PostMapping("/{id}/actions/{actionId}/execute")
public ResponseEntity<IncidentAction> executeAction(@PathVariable String id, 
@PathVariable String actionId) {
return incidentService.executeAction(id, actionId)
.map(ResponseEntity::ok)
.orElse(ResponseEntity.notFound().build());
}
@GetMapping("/metrics")
public ResponseEntity<IncidentMetrics> getMetrics(
@RequestParam(defaultValue = "30") int days) {
IncidentMetrics metrics = incidentService.calculateMetrics(days);
return ResponseEntity.ok(metrics);
}
}
2. Runbook Management Controller
@RestController
@RequestMapping("/api/runbooks")
@Slf4j
public class RunbookController {
private final RunbookTemplateRepository templateRepository;
public RunbookController(RunbookTemplateRepository templateRepository) {
this.templateRepository = templateRepository;
}
@GetMapping
public ResponseEntity<List<RunbookTemplate>> getRunbooks() {
List<RunbookTemplate> runbooks = templateRepository.findAll();
return ResponseEntity.ok(runbooks);
}
@GetMapping("/{id}")
public ResponseEntity<RunbookTemplate> getRunbook(@PathVariable String id) {
return templateRepository.findById(id)
.map(ResponseEntity::ok)
.orElse(ResponseEntity.notFound().build());
}
@PostMapping
public ResponseEntity<RunbookTemplate> createRunbook(@RequestBody RunbookTemplate template) {
template.setId(null); // Ensure new entity
RunbookTemplate saved = templateRepository.save(template);
return ResponseEntity.status(HttpStatus.CREATED).body(saved);
}
@PutMapping("/{id}")
public ResponseEntity<RunbookTemplate> updateRunbook(@PathVariable String id, 
@RequestBody RunbookTemplate template) {
if (!templateRepository.existsById(id)) {
return ResponseEntity.notFound().build();
}
template.setId(id);
RunbookTemplate updated = templateRepository.save(template);
return ResponseEntity.ok(updated);
}
@PostMapping("/{id}/execute")
public ResponseEntity<String> executeRunbook(@PathVariable String id,
@RequestBody ExecuteRunbookRequest request) {
// Implementation to manually trigger runbook execution
return ResponseEntity.accepted().body("Runbook execution started");
}
}

Monitoring and Metrics

1. Incident Metrics
@Service
@Slf4j
public class IncidentMetricsService {
private final IncidentRepository incidentRepository;
public IncidentMetricsService(IncidentRepository incidentRepository) {
this.incidentRepository = incidentRepository;
}
public IncidentMetrics calculateMetrics(int days) {
Instant since = Instant.now().minus(Duration.ofDays(days));
List<Incident> incidents = incidentRepository.findByDetectedAtAfter(since);
IncidentMetrics metrics = new IncidentMetrics();
metrics.setTotalIncidents(incidents.size());
metrics.setOpenIncidents(countByStatus(incidents, IncidentStatus.OPEN));
metrics.setResolvedIncidents(countByStatus(incidents, IncidentStatus.RESOLVED));
metrics.setAverageTimeToAcknowledge(calculateAverageTimeToAcknowledge(incidents));
metrics.setAverageTimeToResolve(calculateAverageTimeToResolve(incidents));
metrics.setSeverityBreakdown(calculateSeverityBreakdown(incidents));
metrics.setTypeBreakdown(calculateTypeBreakdown(incidents));
return metrics;
}
private long countByStatus(List<Incident> incidents, IncidentStatus status) {
return incidents.stream()
.filter(incident -> incident.getStatus() == status)
.count();
}
private Duration calculateAverageTimeToAcknowledge(List<Incident> incidents) {
List<Duration> times = incidents.stream()
.map(Incident::getTimeToAcknowledge)
.filter(Objects::nonNull)
.collect(Collectors.toList());
if (times.isEmpty()) return Duration.ZERO;
long averageMillis = (long) times.stream()
.mapToLong(Duration::toMillis)
.average()
.orElse(0);
return Duration.ofMillis(averageMillis);
}
private Map<IncidentSeverity, Long> calculateSeverityBreakdown(List<Incident> incidents) {
return incidents.stream()
.collect(Collectors.groupingBy(
Incident::getSeverity,
Collectors.counting()
));
}
// Similar methods for other metrics calculations
}
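
The metrics service leaves `calculateAverageTimeToResolve` and `calculateTypeBreakdown` as "similar methods". A standalone sketch of both calculations follows. The `ResolutionStats` class and its nested `Sample` type are hypothetical stand-ins for the service and the `Incident` entity, reduced to the fields these two metrics need.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class ResolutionStats {
    // Minimal stand-in for Incident: type, detection time, resolution time.
    static final class Sample {
        final String type;
        final Instant detectedAt;
        final Instant resolvedAt; // null while the incident is unresolved

        Sample(String type, Instant detectedAt, Instant resolvedAt) {
            this.type = type;
            this.detectedAt = detectedAt;
            this.resolvedAt = resolvedAt;
        }

        Duration timeToResolve() {
            return resolvedAt == null ? null : Duration.between(detectedAt, resolvedAt);
        }
    }

    private ResolutionStats() {}

    // Mean time to resolve over resolved incidents only; zero if there are none.
    static Duration averageTimeToResolve(List<Sample> samples) {
        double averageMillis = samples.stream()
                .map(Sample::timeToResolve)
                .filter(Objects::nonNull)
                .mapToLong(Duration::toMillis)
                .average()
                .orElse(0);
        return Duration.ofMillis((long) averageMillis);
    }

    // Number of incidents per type, sorted by type name.
    static Map<String, Long> typeBreakdown(List<Sample> samples) {
        return samples.stream()
                .collect(Collectors.groupingBy(s -> s.type, TreeMap::new, Collectors.counting()));
    }
}
```

Unresolved incidents are excluded from the average rather than counted as zero, which mirrors how `calculateAverageTimeToAcknowledge` filters out null durations.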
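@Component
public class IncidentMetricsExporter {
    private final IncidentMetricsService metricsService;
    // Gauges read these holders on every scrape, so each gauge is registered
    // once and only the held value is refreshed on the schedule.
    private final AtomicLong totalIncidents;
    private final AtomicLong openIncidents;
    private final AtomicLong mttrSeconds;

    public IncidentMetricsExporter(MeterRegistry meterRegistry,
                                   IncidentMetricsService metricsService) {
        this.metricsService = metricsService;
        this.totalIncidents = registerGauge(meterRegistry, "incidents.total",
                "Total number of incidents");
        this.openIncidents = registerGauge(meterRegistry, "incidents.open",
                "Number of open incidents");
        this.mttrSeconds = registerGauge(meterRegistry, "incidents.mttr",
                "Mean Time To Resolution in seconds");
    }

    private static AtomicLong registerGauge(MeterRegistry registry, String name, String description) {
        AtomicLong holder = new AtomicLong();
        Gauge.builder(name, holder, AtomicLong::get)
                .description(description)
                .register(registry);
        return holder;
    }

    @Scheduled(fixedRate = 60000) // Refresh every minute
    public void exportMetrics() {
        IncidentMetrics metrics = metricsService.calculateMetrics(7); // Last 7 days
        totalIncidents.set(metrics.getTotalIncidents());
        openIncidents.set(metrics.getOpenIncidents());
        mttrSeconds.set(metrics.getAverageTimeToResolve().getSeconds());
    }
}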

Configuration

1. Application Properties
# application.yml
app:
  incident:
    auto-detection:
      enabled: true
      interval: 30000
  notification:
    email:
      from: incidents@example.com   # placeholder address
      team: oncall@example.com      # placeholder address
    slack:
      webhook-url: ${SLACK_WEBHOOK_URL:}
    pagerduty:
      api-key: ${PAGERDUTY_API_KEY:}
spring:
  datasource:
    url: jdbc:postgresql://localhost:5432/incident_management
    username: ${DB_USERNAME:postgres}
    password: ${DB_PASSWORD:password}
  jpa:
    hibernate:
      ddl-auto: update
    show-sql: false
  mail:
    host: smtp.company.com
    port: 587
    username: ${SMTP_USERNAME:}
    password: ${SMTP_PASSWORD:}
    properties:
      mail:
        smtp:
          auth: true
          starttls:
            enable: true
management:
  endpoints:
    web:
      exposure:
        include: health,metrics,prometheus
  endpoint:
    health:
      show-details: always
2. Spring Boot Configuration
@Configuration
@EnableScheduling
@EnableAsync
public class IncidentManagementConfig {
@Bean
@Primary
public ObjectMapper objectMapper() {
ObjectMapper mapper = new ObjectMapper();
mapper.registerModule(new JavaTimeModule());
mapper.disable(SerializationFeature.WRITE_DATES_AS_TIMESTAMPS);
return mapper;
}
@Bean
public TaskExecutor incidentTaskExecutor() {
ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
executor.setCorePoolSize(5);
executor.setMaxPoolSize(10);
executor.setQueueCapacity(25);
executor.setThreadNamePrefix("incident-");
return executor;
}
}

Testing

1. Unit Tests
@ExtendWith(MockitoExtension.class)
class IncidentDetectionServiceTest {
@Mock
private IncidentService incidentService;
@Mock
private RunbookService runbookService;
@Mock
private IncidentDetector detector1;
@Mock
private IncidentDetector detector2;
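private IncidentDetectionService detectionService;
@BeforeEach
void setUp() {
// @InjectMocks does not build the List<IncidentDetector> argument from the
// individual detector mocks, so the service is wired by hand
detectionService = new IncidentDetectionService(
incidentService, runbookService, List.of(detector1, detector2));
}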
@Test
void shouldDetectAndCreateIncident() {
// Given
Incident incident = new Incident("Test", "Description", 
IncidentSeverity.SEV3_MEDIUM, IncidentType.API_ERRORS, "service", "component");
when(detector1.detect()).thenReturn(Optional.of(incident));
when(detector2.detect()).thenReturn(Optional.empty());
when(incidentService.hasActiveIncident(any(), any())).thenReturn(false);
when(incidentService.createIncident(any())).thenReturn(incident);
// When
detectionService.detectIncidents();
// Then
verify(incidentService).createIncident(incident);
verify(runbookService).executeRunbook(incident);
}
}
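@SpringBootTest
@Transactional // keeps the session open so the lazy action and notification collections can be read
class RunbookServiceIntegrationTest {
// Assumes an enabled RunbookTemplate for DATABASE_UNAVAILABLE / SEV1_CRITICAL has been seeded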
@Autowired
private RunbookService runbookService;
@Autowired
private IncidentRepository incidentRepository;
@Test
void shouldExecuteRunbookForDatabaseIncident() {
// Given
Incident incident = new Incident("DB Down", "Database connection failed",
IncidentSeverity.SEV1_CRITICAL, IncidentType.DATABASE_UNAVAILABLE, "db", "primary");
Incident saved = incidentRepository.save(incident);
// When
runbookService.executeRunbook(saved);
// Then
Incident updated = incidentRepository.findById(saved.getId()).orElseThrow();
assertThat(updated.getActions()).isNotEmpty();
assertThat(updated.getNotifications()).isNotEmpty();
}
}

Best Practices

  1. Idempotent Actions: Ensure automated actions can be safely retried
  2. Comprehensive Logging: Log all runbook execution steps
  3. Circuit Breakers: Implement resilience patterns for external service calls
  4. Security: Secure runbook execution with proper authentication and authorization
  5. Testing: Thoroughly test all runbook scenarios
  6. Documentation: Maintain clear documentation for manual steps
  7. Continuous Improvement: Regularly review and update runbooks based on incident learnings

As an example of practice 3, automated actions can be wrapped in a Resilience4j retry, which is only safe when the action is idempotent (practice 1). The "actionExecution" retry uses the registry's default configuration unless one is defined for that name:
@Component
@Slf4j
public class ResilientActionExecutor {
private final ActionExecutor actionExecutor;
private final RetryRegistry retryRegistry;
public ResilientActionExecutor(ActionExecutor actionExecutor, RetryRegistry retryRegistry) {
this.actionExecutor = actionExecutor;
this.retryRegistry = retryRegistry;
}
public String executeWithRetry(String actionClass, String parameters, RunbookExecutionContext context) {
Retry retry = retryRegistry.retry("actionExecution");
return Retry.decorateSupplier(retry, () -> {
try {
return actionExecutor.executeAction(actionClass, parameters, context);
} catch (Exception e) {
log.warn("Action execution failed, retrying: {}", e.getMessage());
throw e;
}
}).get();
}
}
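
Retries like the one above are only safe when repeating an action does no harm. One simple way to approach idempotency, sketched here with a hypothetical `IdempotentActionGuard`, is to record the result of each action per incident and return the recorded result on repeat calls. The in-memory map is an assumption for brevity. Surviving restarts would need a persistent store such as the `incident_actions` table.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

final class IdempotentActionGuard {
    // Recorded results keyed by "<incidentId>:<actionName>".
    private final Map<String, String> results = new ConcurrentHashMap<>();

    // Runs the action at most once per (incident, action) pair.
    // Later calls with the same pair return the recorded result.
    String runOnce(String incidentId, String actionName, Supplier<String> action) {
        String key = incidentId + ":" + actionName;
        // If the action throws, nothing is recorded, so a retry runs it again.
        return results.computeIfAbsent(key, k -> action.get());
    }
}
```

With this guard, a retried `restartService` step for the same incident would not trigger a second restart once the first one has succeeded.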

Conclusion

Implementing incident response runbooks in Java provides:

  • Automated incident detection and classification
  • Structured response procedures with both automated and manual steps
  • Comprehensive tracking of all incident-related actions
  • Stakeholder communication through multiple channels
  • Performance metrics for continuous improvement

Such a system can reduce MTTR, make incident handling more consistent, and provide data for post-incident analysis and process improvement. The modular design lets you add new detectors, actions, and notification channels as your system evolves.
