Automating Incident Response: Building Java-Based Runbook Execution Systems

In modern DevOps and SRE practice, incident response runbooks are crucial for resolving system failures and outages quickly. Traditional runbooks are documents that a responder reads and follows by hand, but there is a growing trend towards automated runbooks: executable procedures that can diagnose, remediate, or escalate issues with minimal human intervention.

In this guide, we'll explore how to implement automated incident response runbooks in Java, creating a framework that can execute complex remediation workflows programmatically.

Why Automate Runbooks with Java?

  • Speed: Automated responses can start within milliseconds to seconds of an alert, whereas a human responder typically needs minutes
  • Consistency: Eliminate human error in following complex procedures
  • 24/7 Availability: Automated systems don't sleep or take breaks
  • Integration: Easily connect with monitoring, alerting, and orchestration systems
  • Scalability: Handle multiple incidents simultaneously across distributed systems

Architecture Overview

A Java-based runbook system typically consists of:

  1. Runbook Definitions (YAML/JSON/Java DSL)
  2. Execution Engine (Workflow orchestrator)
  3. Action Library (Reusable remediation steps)
  4. Context & State Management
  5. Integration Adapters (APIs, Databases, Cloud Platforms)

Part 1: Core Runbook Framework

1.1 Defining the Runbook Model

Let's start by creating a domain model for our automated runbooks.

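// File: src/main/java/com/incident/runbook/model/Runbook.java
import java.util.List;
import java.util.Map;

public record Runbook(
    String id,
    String name,
    String description,
    String triggerCondition,
    List<RunbookStep> steps,
    Map<String, Object> defaultParameters
) {}

// File: src/main/java/com/incident/runbook/model/RunbookStep.java
import java.util.List;
import java.util.Map;

public record RunbookStep(
    String id,
    String name,
    StepType type,
    String actionClass,
    Map<String, Object> parameters,
    String successCondition,
    String retryPolicy,
    List<String> nextSteps,
    int timeoutSeconds
) {}

// File: src/main/java/com/incident/runbook/model/StepType.java
// Public so that classes outside the model package can refer to it
public enum StepType {
    VALIDATION, REMEDIATION, ESCALATION, NOTIFICATION, MANUAL_INTERVENTION
}

// File: src/main/java/com/incident/runbook/model/ExecutionContext.java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

public class ExecutionContext {
    private final String executionId;
    private final Runbook runbook;
    private final Map<String, Object> parameters;
    // HashMap rather than ConcurrentHashMap: a step may legitimately produce a null output
    private final Map<String, Object> stepOutputs = new HashMap<>();
    private final Instant startTime = Instant.now();
    private RunbookStatus status = RunbookStatus.RUNNING;
    private String currentStep;

    public ExecutionContext(String executionId, Runbook runbook, Map<String, Object> parameters) {
        this.executionId = executionId;
        this.runbook = runbook;
        // Defensive copy so later changes by the caller cannot leak into a running execution
        this.parameters = new HashMap<>(parameters);
    }

    public String getExecutionId() { return executionId; }
    public Map<String, Object> getParameters() { return parameters; }
    public void setCurrentStep(String currentStep) { this.currentStep = currentStep; }
    // Remaining getters and setters omitted for brevity

    public void storeStepOutput(String stepId, Object output) {
        stepOutputs.put(stepId, output);
    }

    public <T> T getStepOutput(String stepId, Class<T> type) {
        // Returns null for a missing entry; throws ClassCastException on a type mismatch
        return type.cast(stepOutputs.get(stepId));
    }
}

// File: src/main/java/com/incident/runbook/model/RunbookStatus.java
// Public because the engine, which lives in another package, sets these values
public enum RunbookStatus {
    RUNNING, COMPLETED, FAILED, WAITING_FOR_INTERVENTION, ROLLED_BACK
}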
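
RunbookStep carries retryPolicy as a plain string, and the article does not define its format. As a rough sketch only, assuming a hypothetical "fixed:<attempts>:<delayMs>" syntax (an invention for illustration, as are the class and method names), a helper could parse the string and re-run a step until it reports success:

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper; the "fixed:<attempts>:<delayMs>" syntax is an assumption for illustration.
final class RetryPolicy {
    final int maxAttempts;
    final long delayMillis;

    RetryPolicy(int maxAttempts, long delayMillis) {
        if (maxAttempts < 1 || delayMillis < 0) {
            throw new IllegalArgumentException("attempts must be >= 1 and delay >= 0");
        }
        this.maxAttempts = maxAttempts;
        this.delayMillis = delayMillis;
    }

    // A null or blank policy is treated as "run once, no retry".
    static RetryPolicy parse(String spec) {
        if (spec == null || spec.isBlank()) {
            return new RetryPolicy(1, 0);
        }
        String[] parts = spec.trim().split(":");
        if (parts.length != 3 || !parts[0].equals("fixed")) {
            throw new IllegalArgumentException("Unsupported retry policy: " + spec);
        }
        return new RetryPolicy(Integer.parseInt(parts[1]), Long.parseLong(parts[2]));
    }

    // Runs the attempt until it returns true or the attempts are exhausted.
    boolean run(BooleanSupplier attempt) {
        for (int i = 1; i <= maxAttempts; i++) {
            if (attempt.getAsBoolean()) {
                return true;
            }
            if (i == maxAttempts) {
                break; // no point sleeping after the final attempt
            }
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return false;
    }
}
```

Retrying is only safe when the wrapped action is idempotent, so a policy like this is best reserved for steps that can be repeated without side effects.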

1.2 Defining the Step Interface

Create a contract for all executable runbook steps:

// File: src/main/java/com/incident/runbook/action/RunbookAction.java
public interface RunbookAction {
    String getName();
    StepResult execute(ExecutionContext context);
    StepResult rollback(ExecutionContext context);
}

// File: src/main/java/com/incident/runbook/action/StepResult.java
import java.time.Duration;

public record StepResult(
    boolean success,
    String message,
    Object output,
    Duration executionTime,
    Throwable error
) {
    // Both factories record Duration.ZERO; the caller is responsible for measuring real execution time
    public static StepResult success(String message, Object output) {
        return new StepResult(true, message, output, Duration.ZERO, null);
    }

    public static StepResult failure(String message, Throwable error) {
        return new StepResult(false, message, null, Duration.ZERO, error);
    }
}
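
Both factory methods record Duration.ZERO, so executionTime carries no information unless something fills it in. One possible approach, sketched here with a trimmed-down stand-in for StepResult (TimedResult) and a hypothetical timed helper, is to measure the call and copy the result with the elapsed time attached:

```java
import java.time.Duration;
import java.util.function.Supplier;

// Trimmed-down stand-in for StepResult so the sketch compiles on its own.
record TimedResult(boolean success, String message, Duration executionTime) {
    TimedResult withExecutionTime(Duration elapsed) {
        return new TimedResult(success, message, elapsed);
    }
}

final class StepTimer {
    // Hypothetical helper: runs the step and stamps the elapsed time onto its result.
    static TimedResult timed(Supplier<TimedResult> step) {
        long start = System.nanoTime(); // monotonic clock, unaffected by wall-clock changes
        TimedResult result = step.get();
        return result.withExecutionTime(Duration.ofNanos(System.nanoTime() - start));
    }
}
```

Because records are immutable, the helper returns a copy rather than mutating the original result.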

Part 2: Implementing Runbook Actions

2.1 Common Incident Response Actions

Database Connection Pool Reset:

// File: src/main/java/com/incident/runbook/action/database/ResetConnectionPoolAction.java
import java.time.Instant;
import java.util.Map;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@Component
public class ResetConnectionPoolAction implements RunbookAction {
    private static final Logger logger = LoggerFactory.getLogger(ResetConnectionPoolAction.class);

    @Override
    public String getName() {
        return "ResetDatabaseConnectionPool";
    }

    @Override
    public StepResult execute(ExecutionContext context) {
        try {
            String dataSourceName = (String) context.getParameters().get("dataSourceName");
            logger.info("Resetting connection pool for: {}", dataSourceName);
            // Simulate connection pool reset logic
            boolean resetSuccess = resetConnectionPool(dataSourceName);
            if (resetSuccess) {
                return StepResult.success(
                    "Successfully reset connection pool: " + dataSourceName,
                    Map.of("resetTimestamp", Instant.now())
                );
            } else {
                return StepResult.failure("Failed to reset connection pool: " + dataSourceName, null);
            }
        } catch (Exception e) {
            return StepResult.failure("Error resetting connection pool", e);
        }
    }

    @Override
    public StepResult rollback(ExecutionContext context) {
        // Connection pool reset typically doesn't need rollback
        return StepResult.success("Rollback not required for connection pool reset", null);
    }

    private boolean resetConnectionPool(String dataSourceName) {
        // Implementation would integrate with HikariCP, Tomcat JDBC, etc.
        // This is a simplified example
        try {
            Thread.sleep(1000); // Simulate reset time
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // Preserve the interrupt so a timeout can stop the step
            return false;
        }
    }
}

Service Restart Action:

// File: src/main/java/com/incident/runbook/action/infrastructure/RestartServiceAction.java
import java.time.Instant;
import java.util.Map;
import io.fabric8.kubernetes.client.KubernetesClient; // the call chain below matches the fabric8 client
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

@Component
public class RestartServiceAction implements RunbookAction {
    // The original listing logged without declaring a logger
    private static final Logger logger = LoggerFactory.getLogger(RestartServiceAction.class);

    @Autowired
    private KubernetesClient kubernetesClient;

    @Override
    public String getName() {
        return "RestartService";
    }

    @Override
    public StepResult execute(ExecutionContext context) {
        String namespace = (String) context.getParameters().get("namespace");
        String deployment = (String) context.getParameters().get("deployment");
        try {
            logger.info("Restarting deployment {}/{}", namespace, deployment);
            kubernetesClient.apps()
                .deployments()
                .inNamespace(namespace)
                .withName(deployment)
                .rolling()
                .restart();
            // Wait for rollout to complete
            boolean rolloutSuccess = waitForRollout(namespace, deployment, 300);
            if (rolloutSuccess) {
                return StepResult.success(
                    "Successfully restarted deployment: " + deployment,
                    Map.of("restartTime", Instant.now())
                );
            } else {
                return StepResult.failure("Deployment restart timeout or failed", null);
            }
        } catch (Exception e) {
            return StepResult.failure("Failed to restart deployment: " + deployment, e);
        }
    }

    private boolean waitForRollout(String namespace, String deployment, int timeoutSeconds) {
        // Placeholder: a real implementation would poll the deployment status until it is ready
        return true;
    }

    @Override
    public StepResult rollback(ExecutionContext context) {
        // Service restart rollback would involve restoring previous version
        // This could use blue-green deployment patterns
        return StepResult.success("Initiated rollback for service restart", null);
    }
}
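
The waitForRollout method in this listing is a stub that always returns true. A minimal polling sketch, assuming the readiness check is supplied as a BooleanSupplier (in a real action it might compare ready replicas with desired replicas through the Kubernetes client), could look like the following. The class and method names are illustrative, not part of the original code:

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

final class RolloutWaiter {
    // Polls the check until it passes or the timeout elapses.
    // Returns false on timeout or if the waiting thread is interrupted.
    static boolean waitUntil(BooleanSupplier ready, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (ready.getAsBoolean()) {
                return true;
            }
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false;
            }
            // Never sleep past the deadline
            long sleepMillis = Math.min(pollInterval.toMillis(), Math.max(1, remainingNanos / 1_000_000));
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }
}
```

Returning false on interruption lets an engine-level timeout cancel the step promptly instead of waiting out the full rollout deadline.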

Cache Clear Action:

// File: src/main/java/com/incident/runbook/action/cache/ClearCacheAction.java
import java.util.Map;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.cache.Cache;
import org.springframework.cache.CacheManager;
import org.springframework.stereotype.Component;

@Component
public class ClearCacheAction implements RunbookAction {
    @Autowired
    private CacheManager cacheManager;

    @Override
    public String getName() {
        return "ClearCache";
    }

    @Override
    public StepResult execute(ExecutionContext context) {
        String cacheName = (String) context.getParameters().get("cacheName");
        String pattern = (String) context.getParameters().get("pattern");
        try {
            Cache cache = cacheManager.getCache(cacheName);
            if (cache != null) {
                if ("ALL".equals(pattern)) {
                    cache.clear();
                    return StepResult.success("Cleared entire cache: " + cacheName,
                        Map.of("keysCleared", "all"));
                } else {
                    // Pattern-based cache clearing logic
                    int keysCleared = clearPattern(cache, pattern);
                    return StepResult.success("Cleared cache pattern: " + pattern,
                        Map.of("keysCleared", keysCleared));
                }
            } else {
                return StepResult.failure("Cache not found: " + cacheName, null);
            }
        } catch (Exception e) {
            return StepResult.failure("Error clearing cache: " + cacheName, e);
        }
    }

    private int clearPattern(Cache cache, String pattern) {
        // Placeholder: pattern-based clearing depends on the underlying cache store
        return 0;
    }

    @Override
    public StepResult rollback(ExecutionContext context) {
        // Cache clearing typically cannot be rolled back
        return StepResult.success("Rollback not available for cache clearing", null);
    }
}
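
The clearPattern method is stubbed to return 0. Spring's Cache interface does not expose key iteration, so pattern-based eviction has to go through the underlying store. The sketch here assumes string keys held in a ConcurrentMap and a simple glob syntax in which * matches any run of characters; both are assumptions for illustration, as is the PatternEvictor name:

```java
import java.util.concurrent.ConcurrentMap;
import java.util.regex.Pattern;

final class PatternEvictor {
    // Converts a glob such as "user:*" into a regex, quoting everything except '*'.
    static Pattern globToRegex(String glob) {
        String[] literals = glob.split("\\*", -1); // -1 keeps trailing empty parts
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(literals[i]));
        }
        return Pattern.compile(regex.toString());
    }

    // Removes every key matching the glob and returns how many entries were evicted.
    static int clearPattern(ConcurrentMap<String, ?> store, String glob) {
        Pattern pattern = globToRegex(glob);
        int removed = 0;
        for (String key : store.keySet()) {
            // remove() returns null if another thread already evicted the key
            if (pattern.matcher(key).matches() && store.remove(key) != null) {
                removed++;
            }
        }
        return removed;
    }
}
```

For remote caches, scanning the key space can be expensive, so a production version would likely use whatever key-scan facility the cache backend provides.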

Part 3: Runbook Execution Engine

3.1 Core Execution Engine

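// File: src/main/java/com/incident/runbook/engine/RunbookEngine.java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@Component
public class RunbookEngine {
    private static final Logger logger = LoggerFactory.getLogger(RunbookEngine.class);

    // Note: when Spring injects a Map<String, RunbookAction>, the keys are bean names,
    // so they must be reconciled with the actionClass values used in runbook definitions
    private final Map<String, RunbookAction> actionRegistry;
    private final RunbookRepository runbookRepository;
    private final ExecutionHistoryRepository historyRepository;

    public RunbookEngine(Map<String, RunbookAction> actionRegistry,
                         RunbookRepository runbookRepository,
                         ExecutionHistoryRepository historyRepository) {
        this.actionRegistry = actionRegistry;
        this.runbookRepository = runbookRepository;
        this.historyRepository = historyRepository;
    }

    public RunbookExecutionResult executeRunbook(String runbookId, Map<String, Object> parameters) {
        Runbook runbook = runbookRepository.findById(runbookId)
            .orElseThrow(() -> new RunbookNotFoundException(runbookId));
        ExecutionContext context = new ExecutionContext(generateExecutionId(), runbook, parameters);
        historyRepository.saveExecutionStart(context);
        try {
            for (RunbookStep step : runbook.steps()) {
                context.setCurrentStep(step.id());
                StepResult result = executeStep(step, context);
                historyRepository.saveStepResult(context.getExecutionId(), step.id(), result);
                if (!result.success()) {
                    if (shouldRollback(step, context)) {
                        performRollback(context, step.id());
                    }
                    // Record the terminal state; the original returned without persisting it
                    historyRepository.saveExecutionComplete(context, RunbookStatus.FAILED);
                    return RunbookExecutionResult.failure(context.getExecutionId(),
                        "Step failed: " + step.id(), result.error());
                }
                context.storeStepOutput(step.id(), result.output());
            }
            historyRepository.saveExecutionComplete(context, RunbookStatus.COMPLETED);
            return RunbookExecutionResult.success(context.getExecutionId());
        } catch (Exception e) {
            historyRepository.saveExecutionComplete(context, RunbookStatus.FAILED);
            return RunbookExecutionResult.failure(context.getExecutionId(), "Execution failed", e);
        }
    }

    private String generateExecutionId() {
        return UUID.randomUUID().toString();
    }

    // Placeholder policy (an assumption): only failed remediation steps trigger a rollback
    private boolean shouldRollback(RunbookStep step, ExecutionContext context) {
        return step.type() == StepType.REMEDIATION;
    }

    private StepResult executeStep(RunbookStep step, ExecutionContext context) {
        RunbookAction action = actionRegistry.get(step.actionClass());
        if (action == null) {
            return StepResult.failure("Action not found: " + step.actionClass(), null);
        }
        // Execute with timeout
        return executeWithTimeout(() -> action.execute(context), step.timeoutSeconds());
    }

    private StepResult executeWithTimeout(Supplier<StepResult> action, int timeoutSeconds) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<StepResult> future = executor.submit(action::get);
        try {
            return future.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // Interrupts the worker; the action must honor interruption to stop
            return StepResult.failure("Step execution timeout", e);
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt(); // Preserve interrupt status for the caller
            return StepResult.failure("Step execution interrupted", e);
        } catch (ExecutionException e) {
            return StepResult.failure("Step execution failed", e.getCause());
        } finally {
            executor.shutdown();
        }
    }

    private void performRollback(ExecutionContext context, String failedStepId) {
        // Implement rollback logic for all completed steps
        logger.info("Initiating rollback due to failure at step: {}", failedStepId);
    }
}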
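
In this engine, performRollback only logs. One common way to fill it in is to undo completed steps in reverse order and to continue past individual rollback failures so that later compensations still run. The sketch uses a simplified Undoable interface in place of RunbookAction; the interface and the RollbackCoordinator class are hypothetical names introduced for illustration:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Simplified stand-in for RunbookAction, for illustration only.
interface Undoable {
    String name();
    boolean rollback(); // true when the compensation succeeded
}

final class RollbackCoordinator {
    // Used as a stack: the most recently completed step is rolled back first.
    private final Deque<Undoable> completed = new ArrayDeque<>();

    // Call after each step succeeds so that it becomes eligible for rollback.
    void recordCompleted(Undoable step) {
        completed.push(step);
    }

    // Rolls back every recorded step, newest first.
    // Returns the names of steps whose rollback failed or threw.
    List<String> rollbackAll() {
        List<String> failures = new ArrayList<>();
        while (!completed.isEmpty()) {
            Undoable step = completed.pop();
            try {
                if (!step.rollback()) {
                    failures.add(step.name());
                }
            } catch (RuntimeException e) {
                failures.add(step.name()); // keep going so remaining steps are still undone
            }
        }
        return failures;
    }
}
```

A non-empty failure list is a natural trigger for escalation to a human, since the system is then in a partially reverted state.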

Part 4: Runbook Definitions & Configuration

4.1 YAML-Based Runbook Configuration

# File: src/main/resources/runbooks/database-connection-reset.yaml
id: "db-connection-reset"
name: "Database Connection Pool Reset"
description: "Reset database connection pool when connection errors exceed threshold"
triggerCondition: "db.connection.errors > 100 per 5min"
# Named defaultParameters to match the Runbook record component
defaultParameters:
  dataSourceName: "primary-db"
  maxResetAttempts: 3
steps:
  - id: "validate-db-connectivity"
    name: "Validate Database Connectivity"
    type: "VALIDATION"
    actionClass: "com.incident.runbook.action.database.ValidateConnectivityAction"
    parameters:
      timeoutMs: 5000
    successCondition: "output.connectivity == true"
    timeoutSeconds: 10
  - id: "check-connection-pool"
    name: "Check Connection Pool Status"
    type: "VALIDATION"
    actionClass: "com.incident.runbook.action.database.CheckPoolStatusAction"
    parameters:
      metrics: ["activeConnections", "idleConnections", "waitingThreads"]
    successCondition: "output.waitingThreads > 50"
    timeoutSeconds: 5
  - id: "reset-connection-pool"
    name: "Reset Connection Pool"
    type: "REMEDIATION"
    actionClass: "com.incident.runbook.action.database.ResetConnectionPoolAction"
    parameters:
      resetMode: "SOFT"
    timeoutSeconds: 30
  - id: "verify-recovery"
    name: "Verify Recovery"
    type: "VALIDATION"
    actionClass: "com.incident.runbook.action.database.ValidateConnectivityAction"
    parameters:
      timeoutMs: 5000
    successCondition: "output.connectivity == true"
    timeoutSeconds: 10
  - id: "notify-team"
    name: "Notify Team"
    type: "NOTIFICATION"
    actionClass: "com.incident.runbook.action.communication.SlackNotificationAction"
    parameters:
      channel: "#database-alerts"
      message: "Database connection pool reset completed successfully"
    timeoutSeconds: 10
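
The successCondition strings read like simple comparisons against a step's output map, but the article does not say how they are evaluated. A small evaluator, assuming only the form output.<key> <op> <literal> with the operators ==, > and <, might be sketched as follows. The SuccessCondition class and its grammar are assumptions for illustration:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SuccessCondition {
    // Accepts only: output.<key> (== | > | <) <literal>
    private static final Pattern FORM =
        Pattern.compile("output\\.(\\w+)\\s*(==|>|<)\\s*(\\S+)");

    static boolean evaluate(String condition, Map<String, ?> output) {
        if (condition == null || condition.isBlank()) {
            return true; // no condition: rely on the action's own success flag
        }
        Matcher m = FORM.matcher(condition.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unsupported condition: " + condition);
        }
        Object actual = output.get(m.group(1));
        String op = m.group(2);
        String expected = m.group(3);
        if (actual == null) {
            return false; // a missing output key never satisfies a condition
        }
        if (op.equals("==")) {
            // Textual comparison, so 5 matches "5" but 5.0 does not
            return String.valueOf(actual).equals(expected);
        }
        if (!(actual instanceof Number)) {
            return false; // ordering is only defined for numeric outputs
        }
        double left = ((Number) actual).doubleValue();
        double right = Double.parseDouble(expected);
        return op.equals(">") ? left > right : left < right;
    }
}
```

Rejecting anything outside the small grammar keeps a typo in a runbook definition from silently evaluating to a wrong answer. A production system might prefer an established expression language instead.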

4.2 Loading Runbook Configuration

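// File: src/main/java/com/incident/runbook/config/RunbookConfig.java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class RunbookConfig {
    @Bean
    public RunbookRepository runbookRepository() {
        return new YamlRunbookRepository();
    }
}

// File: src/main/java/com/incident/runbook/repository/YamlRunbookRepository.java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.yaml.YAMLFactory;
import org.springframework.core.io.Resource;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;

// Registered through RunbookConfig, so no @Component here;
// having both would make Spring create two repository beans
public class YamlRunbookRepository implements RunbookRepository {
    private final ObjectMapper yamlMapper;
    private final Map<String, Runbook> runbooks = new ConcurrentHashMap<>();

    public YamlRunbookRepository() {
        yamlMapper = new ObjectMapper(new YAMLFactory());
        loadRunbooks();
    }

    private void loadRunbooks() {
        try {
            Resource[] resources = new PathMatchingResourcePatternResolver()
                .getResources("classpath:runbooks/*.yaml");
            for (Resource resource : resources) {
                Runbook runbook = yamlMapper.readValue(resource.getInputStream(), Runbook.class);
                runbooks.put(runbook.id(), runbook);
            }
        } catch (IOException e) {
            throw new RuntimeException("Failed to load runbooks", e);
        }
    }

    @Override
    public Optional<Runbook> findById(String id) {
        return Optional.ofNullable(runbooks.get(id));
    }

    @Override
    public List<Runbook> findAll() {
        return new ArrayList<>(runbooks.values());
    }
}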
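
Nothing in the loader checks that a definition is internally consistent. A validation pass, sketched here against a pared-down step record (the StepDef type, the RunbookValidator class, and the specific rules are all assumptions), could reject duplicate step IDs, dangling nextSteps references, and non-positive timeouts before a runbook is registered:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Pared-down stand-in for RunbookStep, for illustration only.
record StepDef(String id, List<String> nextSteps, int timeoutSeconds) {}

final class RunbookValidator {
    // Returns human-readable problems; an empty list means the definition passed.
    static List<String> validate(List<StepDef> steps) {
        List<String> problems = new ArrayList<>();
        Set<String> ids = new HashSet<>();
        for (StepDef step : steps) {
            if (!ids.add(step.id())) {
                problems.add("Duplicate step id: " + step.id());
            }
            if (step.timeoutSeconds() <= 0) {
                problems.add("Non-positive timeout on step: " + step.id());
            }
        }
        // Second pass, so forward references to later steps are allowed
        for (StepDef step : steps) {
            if (step.nextSteps() == null) {
                continue; // nextSteps is optional
            }
            for (String next : step.nextSteps()) {
                if (!ids.contains(next)) {
                    problems.add("Step " + step.id() + " references unknown step: " + next);
                }
            }
        }
        return problems;
    }
}
```

Failing fast at load time is generally preferable to discovering a malformed runbook in the middle of an incident.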

Part 5: Testing Automated Runbooks

5.1 Unit Testing Runbook Actions

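// File: src/test/java/com/incident/runbook/action/ResetConnectionPoolActionTest.java
import static org.assertj.core.api.Assertions.assertThat;

import java.util.List;
import java.util.Map;
import org.junit.jupiter.api.Test;

class ResetConnectionPoolActionTest {
    // The action has no collaborators to mock, so it is created directly
    private final ResetConnectionPoolAction action = new ResetConnectionPoolAction();

    @Test
    void shouldSuccessfullyResetConnectionPool() {
        // Given: a real Runbook instance, since records are final and
        // cannot be mocked without Mockito's inline mock maker
        Runbook runbook = new Runbook("test-runbook", "Test", "", "", List.of(), Map.of());
        ExecutionContext context = new ExecutionContext(
            "test-execution",
            runbook,
            Map.of("dataSourceName", "test-db")
        );
        // When
        StepResult result = action.execute(context);
        // Then
        assertThat(result.success()).isTrue();
        assertThat(result.message()).contains("Successfully reset connection pool");
    }

    @Test
    void shouldHandleResetFailure() {
        // Test failure scenarios
    }
}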

5.2 Integration Testing Full Runbooks

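// File: src/test/java/com/incident/runbook/engine/RunbookEngineIntegrationTest.java
import static org.assertj.core.api.Assertions.assertThat;

import java.util.Map;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.TestPropertySource;

// Requires every action referenced by the runbook definition to be registered with the engine
@SpringBootTest
@TestPropertySource(properties = {
    "runbook.automation.enabled=true"
})
class RunbookEngineIntegrationTest {
    @Autowired
    private RunbookEngine runbookEngine;

    @Autowired
    private RunbookRepository runbookRepository;

    @Test
    void shouldExecuteCompleteDatabaseRecoveryRunbook() {
        // Given
        String runbookId = "db-connection-reset";
        Map<String, Object> parameters = Map.of(
            "dataSourceName", "test-database",
            "maxResetAttempts", 2
        );
        // When
        RunbookExecutionResult result = runbookEngine.executeRunbook(runbookId, parameters);
        // Then
        assertThat(result.success()).isTrue();
        assertThat(result.executionId()).isNotNull();
    }
}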

Best Practices for Java-Based Runbooks

  1. Idempotency: Design actions to be safely retryable
  2. Observability: Instrument all steps with comprehensive logging and metrics
  3. Circuit Breakers: Implement failure detection to prevent cascading failures
  4. Security: Secure runbook execution with proper authentication and authorization
  5. Versioning: Maintain version control for runbook definitions
  6. Testing: Implement comprehensive testing for all runbook actions
  7. Rollback Strategies: Always plan for failure with proper rollback procedures
  8. Human Oversight: Include manual approval steps for dangerous operations

Conclusion

Building automated incident response runbooks in Java enables organizations to respond to system failures quickly, consistently, and at scale. By creating a flexible framework with reusable actions, a comprehensive execution engine, and proper testing, you can significantly reduce mean time to resolution (MTTR) for common incidents.

The key to success is starting with well-understood, repetitive incidents and gradually expanding your automated runbook library as you build confidence in the system. Remember that automation should augment human responders, not replace them, so always include appropriate escalation paths and manual intervention points for complex scenarios.
