In today's distributed systems, failures are inevitable. Chaos Engineering is the practice of intentionally injecting failures into systems to build resilience and confidence. Chaos Monkey for Spring Boot brings this practice to the Java ecosystem by randomly attacking your Spring Boot applications, helping you discover weaknesses before they cause real outages.
What is Chaos Monkey for Spring Boot?
Chaos Monkey for Spring Boot is a Java library that implements the principles of Chaos Engineering specifically for Spring Boot applications. It randomly attacks different parts of your application to test its resilience and help you:
- Discover hidden weaknesses in your system
- Verify monitoring and alerting systems
- Build confidence in your failure handling
- Test circuit breakers and fallback mechanisms
- Prepare teams for real incidents
Types of Attacks
- Latency Attacks: Introduce artificial delays in method execution
- Exception Attacks: Throw random exceptions from methods
- AppKiller Attacks: Terminate the application entirely
- Memory Attacks: Fill up the heap memory
- CPU Attacks: Spike CPU usage
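To make the first two attack types concrete, here is a minimal plain-Java sketch of what a latency or exception attack amounts to: a wrapper that sometimes delays or fails a call before delegating to it. The class `AttackSketch` and its method names are hypothetical and are not part of the Chaos Monkey library, which applies the same idea to Spring beans through AOP.

```java
import java.util.Random;
import java.util.function.Supplier;

// Hypothetical illustration; not part of the Chaos Monkey for Spring Boot API.
final class AttackSketch {
    private final Random random;

    AttackSketch(Random random) {
        this.random = random;
    }

    // Latency attack: sleep for a random time in [minMs, maxMs], then run the call.
    <T> T withLatency(Supplier<T> call, int minMs, int maxMs) {
        int delay = minMs + random.nextInt(maxMs - minMs + 1);
        try {
            Thread.sleep(delay);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return call.get();
    }

    // Exception attack: fail with the given probability instead of running the call.
    <T> T withException(Supplier<T> call, double probability) {
        if (random.nextDouble() < probability) {
            throw new IllegalStateException("Injected failure");
        }
        return call.get();
    }
}
```

Passing the `Random` in from outside keeps the sketch testable with a fixed seed.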
Architecture Overview
[Chaos Monkey]   →   [Spring Boot Application]   →   [Monitoring]      →   [Alerts]
      |                        |                          |                   |
Inject failures        Execute business           Collect metrics      Notify team
based on config        logic with                 on performance       when thresholds
and watchers           random disruptions         and errors           are breached
Hands-On Tutorial: Implementing Chaos Monkey in Spring Boot
Let's build a resilient e-commerce service with Chaos Monkey to test its failure handling capabilities.
Step 1: Project Setup
Maven Dependencies (pom.xml):
<properties>
<chaos-monkey-spring-boot.version>2.7.1</chaos-monkey-spring-boot.version>
</properties>
<dependencies>
<!-- Spring Boot Starter Web -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<!-- Spring Boot Actuator -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<!-- Chaos Monkey for Spring Boot -->
<dependency>
<groupId>de.codecentric</groupId>
<artifactId>chaos-monkey-spring-boot</artifactId>
<version>${chaos-monkey-spring-boot.version}</version>
</dependency>
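<!-- Spring AOP (needed for the Resilience4j annotations) -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-aop</artifactId>
</dependency>
<!-- Resilience4j for circuit breaker, retry and time limiter -->
<dependency>
<groupId>io.github.resilience4j</groupId>
<artifactId>resilience4j-spring-boot2</artifactId>
<!-- Not managed by the Spring Boot BOM; 1.7.1 is an example, use the release that matches your setup -->
<version>1.7.1</version>
</dependency>
<!-- Persistence for the Product entity and JpaRepository -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-jpa</artifactId>
</dependency>
<!-- In-memory database for local runs (an assumption; swap in your own driver) -->
<dependency>
<groupId>com.h2database</groupId>
<artifactId>h2</artifactId>
<scope>runtime</scope>
</dependency>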
</dependencies>
Step 2: Configuration
application.yml:
# Chaos Monkey configuration
chaos:
  monkey:
    enabled: true
    # Watchers: which Spring bean types to attack
    watcher:
      repository: true
      service: true
      rest-controller: true
      component: true
    # Assaults: which failures to inject
    assaults:
      level: 5                        # Attack roughly every 5th request (1 = every request)
      latency-range-start: 1000       # Minimum latency in ms
      latency-range-end: 5000         # Maximum latency in ms
      exceptions-active: true
      latency-active: true
      kill-application-active: false  # Be careful with this!
      memory-active: true
      cpu-active: true
      # Memory assault tuning
      memory-milliseconds-wait-next-increase: 1000
      memory-fill-increment-fraction: 0.15
      memory-fill-target-fraction: 0.90

# Actuator endpoints
management:
  endpoints:
    web:
      exposure:
        include: health,info,chaosmonkey,metrics
  endpoint:
    chaosmonkey:
      enabled: true

# Resilience4j circuit breaker
resilience4j:
  circuitbreaker:
    instances:
      productService:
        register-health-indicator: true
        sliding-window-size: 10
        failure-rate-threshold: 50
        wait-duration-in-open-state: 10s
        permitted-number-of-calls-in-half-open-state: 3
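The `level` setting is easy to misread: in Chaos Monkey for Spring Boot a lower number means more attacks, because level N targets roughly one request in N. The following is a simplified model of that rule using a plain counter. `LevelGate` is a hypothetical class for illustration, not the library's implementation, which may pick the attacked requests randomly rather than strictly every N-th one.

```java
import java.util.concurrent.atomic.AtomicLong;

// Simplified model of the "level" setting; hypothetical, not the library's code.
final class LevelGate {
    private final int level;
    private final AtomicLong requests = new AtomicLong();

    LevelGate(int level) {
        if (level < 1) {
            throw new IllegalArgumentException("level must be >= 1");
        }
        this.level = level;
    }

    // True for every level-th request: level 1 attacks all calls, level 5 every fifth.
    boolean shouldAttack() {
        return requests.incrementAndGet() % level == 0;
    }
}
```

Under this model, `level: 5` corresponds to an attack rate of about 20%, and `level: 1` to 100%.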
Step 3: Domain Model and Services
Product Entity:
@Entity
public class Product {
@Id
@GeneratedValue(strategy = GenerationType.IDENTITY)
private Long id;
private String name;
private String description;
private BigDecimal price;
private Integer stockQuantity;
private boolean active = true;
// Constructors, getters, setters
public Product() {}
public Product(String name, String description, BigDecimal price, Integer stockQuantity) {
this.name = name;
this.description = description;
this.price = price;
this.stockQuantity = stockQuantity;
}
// Getters and setters...
}
Product Repository:
@Repository
public interface ProductRepository extends JpaRepository<Product, Long> {
// This method will be watched by Chaos Monkey
List<Product> findByActiveTrue();
// Chaos Monkey can attack this too
Optional<Product> findByName(String name);
}
Step 4: Service Layer with Resilience Patterns
Product Service:
@Service
public class ProductService {
private static final Logger logger = LoggerFactory.getLogger(ProductService.class);
private final ProductRepository productRepository;
private final InventoryService inventoryService;
public ProductService(ProductRepository productRepository, InventoryService inventoryService) {
this.productRepository = productRepository;
this.inventoryService = inventoryService;
}
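/**
* Public methods of this @Service are intercepted by the service watcher
* (chaos.monkey.watcher.service: true); no extra annotation is needed.
* This one is also protected by a circuit breaker.
*/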
@CircuitBreaker(name = "productService", fallbackMethod = "getAllProductsFallback")
public List<Product> getAllActiveProducts() {
logger.info("Fetching all active products");
return productRepository.findByActiveTrue();
}
/**
* Fallback invoked when the call fails or the circuit breaker is open
*/
public List<Product> getAllProductsFallback(Exception e) {
logger.warn("Using fallback for product service due to: {}", e.getMessage());
return List.of(
new Product("Fallback Product", "Available when service is degraded",
BigDecimal.valueOf(9.99), 100)
);
}
/**
* Method exposed to latency and exception attacks through the service watcher
*/
public Product getProductById(Long id) {
logger.info("Fetching product by id: {}", id);
return productRepository.findById(id)
.orElseThrow(() -> new ProductNotFoundException("Product not found with id: " + id));
}
/**
* Method that calls an external service, a good candidate for chaos testing
*/
@TimeLimiter(name = "inventoryService")
@Retry(name = "inventoryService", fallbackMethod = "updateStockFallback")
public CompletableFuture<String> updateProductStock(Long productId, Integer quantity) {
logger.info("Updating stock for product: {}, quantity: {}", productId, quantity);
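// Note: a direct call within the same class bypasses the Spring proxy,
// so watchers and Resilience4j aspects do not apply to this invocation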
Product product = getProductById(productId);
int newStock = product.getStockQuantity() - quantity;
if (newStock < 0) {
throw new InsufficientStockException("Not enough stock for product: " + productId);
}
product.setStockQuantity(newStock);
productRepository.save(product);
// Call external inventory service
return inventoryService.updateInventory(productId, newStock);
}
public CompletableFuture<String> updateStockFallback(Long productId, Integer quantity, Exception e) {
logger.error("Stock update failed for product: {}, using fallback", productId, e);
return CompletableFuture.completedFuture("Stock update queued for retry");
}
/**
* Method that simulates an expensive operation
*/
public Product createProduct(Product product) {
logger.info("Creating new product: {}", product.getName());
// Simulate some business logic that could be attacked
validateProduct(product);
calculateProductMetrics(product);
return productRepository.save(product);
}
private void validateProduct(Product product) {
if (product.getPrice().compareTo(BigDecimal.ZERO) <= 0) {
throw new IllegalArgumentException("Product price must be positive");
}
}
private void calculateProductMetrics(Product product) {
// Simulate expensive calculation
try {
Thread.sleep(100);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
}
}
// Custom Exceptions
class ProductNotFoundException extends RuntimeException {
public ProductNotFoundException(String message) { super(message); }
}
class InsufficientStockException extends RuntimeException {
public InsufficientStockException(String message) { super(message); }
}
Inventory Service (External Service Simulation):
@Service
public class InventoryService {
private static final Logger logger = LoggerFactory.getLogger(InventoryService.class);
/**
* Simulates an external service call that can fail.
* As a @Service bean it is covered by the service watcher.
*/
public CompletableFuture<String> updateInventory(Long productId, Integer newStock) {
logger.info("Updating external inventory for product: {}, stock: {}", productId, newStock);
// Simulate external API call
try {
Thread.sleep(200); // Network latency
// Random failures to simulate real-world conditions
if (Math.random() < 0.2) { // 20% chance of failure
throw new RuntimeException("Inventory service temporarily unavailable");
}
return CompletableFuture.completedFuture("Inventory updated successfully");
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
throw new RuntimeException("Inventory update interrupted", e);
}
}
}
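The `@Retry` annotation with a fallback method used in Step 4 boils down to a loop: try the call a fixed number of times, and hand the last error to a fallback if every attempt fails. A minimal sketch of that pattern in plain Java follows. `RetrySketch` is a hypothetical helper, and unlike Resilience4j it does not wait between attempts.

```java
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical retry helper; no backoff between attempts.
final class RetrySketch {
    private RetrySketch() {}

    // Runs the action up to maxAttempts times; after the last failure the fallback decides the result.
    static <T> T retry(Supplier<T> action, int maxAttempts,
                       Function<RuntimeException, T> fallback) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        return fallback.apply(last);
    }
}
```

A call that fails 20% of the time, like the simulated inventory service, fails three times in a row in under 1% of cases (0.2³ = 0.008), which is why a small retry count already hides most transient errors.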
Step 5: REST Controller
@RestController
@RequestMapping("/api/products")
public class ProductController {
private final ProductService productService;
public ProductController(ProductService productService) {
this.productService = productService;
}
@GetMapping
public ResponseEntity<List<Product>> getAllProducts() {
try {
List<Product> products = productService.getAllActiveProducts();
return ResponseEntity.ok(products);
} catch (Exception e) {
return ResponseEntity.status(503).build(); // Service Unavailable
}
}
@GetMapping("/{id}")
public ResponseEntity<Product> getProduct(@PathVariable Long id) {
try {
Product product = productService.getProductById(id);
return ResponseEntity.ok(product);
} catch (ProductNotFoundException e) {
return ResponseEntity.notFound().build();
} catch (Exception e) {
return ResponseEntity.status(503).build();
}
}
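@PostMapping
public ResponseEntity<Product> createProduct(@RequestBody Product product) {
try {
Product created = productService.createProduct(product);
return ResponseEntity.status(201).body(created);
} catch (IllegalArgumentException e) {
return ResponseEntity.badRequest().build(); // Validation failure
} catch (Exception e) {
return ResponseEntity.status(503).build(); // Injected or infrastructure failure
}
}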
@PutMapping("/{id}/stock")
public ResponseEntity<String> updateStock(@PathVariable Long id, @RequestBody StockUpdateRequest request) {
try {
CompletableFuture<String> result = productService.updateProductStock(id, request.getQuantity());
return ResponseEntity.accepted().body(result.get(5, TimeUnit.SECONDS));
} catch (InsufficientStockException e) {
return ResponseEntity.badRequest().body(e.getMessage());
} catch (Exception e) {
return ResponseEntity.status(503).body("Service temporarily unavailable");
}
}
// DTO for stock updates
public static class StockUpdateRequest {
private Integer quantity;
public Integer getQuantity() { return quantity; }
public void setQuantity(Integer quantity) { this.quantity = quantity; }
}
}
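The stock endpoint waits on a `CompletableFuture` with a timeout. One detail worth keeping in mind is that a failure inside the future surfaces from `get()` wrapped in an `ExecutionException`, so the original exception type has to be read from `getCause()`. The sketch below shows that mapping in isolation. `FutureOutcome` and the chosen status codes are assumptions for illustration, not part of the tutorial's controller.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper that maps the outcome of a future to an HTTP-like status code.
final class FutureOutcome {
    private FutureOutcome() {}

    static int statusFor(CompletableFuture<String> future, long timeoutMs) {
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return 202; // completed in time
        } catch (ExecutionException e) {
            // The real failure is the cause, not the ExecutionException itself
            return (e.getCause() instanceof IllegalArgumentException) ? 400 : 503;
        } catch (TimeoutException e) {
            return 503; // e.g. a latency attack longer than the timeout
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return 503;
        }
    }
}
```

A latency attack of several seconds against a five-second timeout will land in the `TimeoutException` branch only some of the time, which makes this path a good one to exercise deliberately.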
Step 6: Chaos Monkey Configuration and Management
Custom Chaos Monkey Configuration:
import de.codecentric.spring.boot.chaos.monkey.configuration.AssaultProperties;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Profile;

/**
 * Adjusts the auto-configured AssaultProperties per environment instead of
 * registering extra beans of the same type, which can clash with the
 * library's own configuration. Watchers are already enabled in application.yml.
 */
@Configuration
public class ChaosMonkeyConfig {

    /** More aggressive assaults in the test environment. */
    @Configuration
    @Profile("chaos-test")
    static class ChaosTestAssaults {
        ChaosTestAssaults(AssaultProperties props) {
            props.setLevel(2); // Attack roughly every 2nd request
            // Runtime assaults every 30 seconds. If the scheduler has already read the
            // expression at this point, set it in application-chaos-test.yml instead.
            props.setRuntimeAssaultCronExpression("*/30 * * * * *");
        }
    }

    /** Conservative assaults in production. */
    @Configuration
    @Profile("production")
    static class ProductionAssaults {
        ProductionAssaults(AssaultProperties props) {
            props.setLevel(20); // Attack roughly every 20th request
            props.setRuntimeAssaultCronExpression("0 0 2 * * *"); // 2 AM daily
            props.setKillApplicationActive(false); // Never kill the app in production
        }
    }
}
Chaos Monkey Management Controller:
import de.codecentric.spring.boot.chaos.monkey.configuration.AssaultProperties;
import de.codecentric.spring.boot.chaos.monkey.configuration.ChaosMonkeySettings;
import java.util.HashMap;
import java.util.Map;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

/**
 * Thin wrapper around the library's settings bean. The actuator endpoint
 * (/actuator/chaosmonkey) offers comparable operations out of the box.
 */
@RestController
@RequestMapping("/api/chaos")
public class ChaosMonkeyManagementController {

    private final ChaosMonkeySettings settings;

    public ChaosMonkeyManagementController(ChaosMonkeySettings settings) {
        this.settings = settings;
    }

    @PostMapping("/assaults/enable")
    public ResponseEntity<String> enableAssaults() {
        settings.getChaosMonkeyProperties().setEnabled(true);
        return ResponseEntity.ok("Chaos Monkey assaults enabled");
    }

    @PostMapping("/assaults/disable")
    public ResponseEntity<String> disableAssaults() {
        settings.getChaosMonkeyProperties().setEnabled(false);
        return ResponseEntity.ok("Chaos Monkey assaults disabled");
    }

    @PostMapping("/assaults/level")
    public ResponseEntity<String> setAssaultLevel(@RequestParam int level) {
        if (level < 1) {
            return ResponseEntity.badRequest().body("Level must be at least 1");
        }
        settings.getAssaultProperties().setLevel(level);
        return ResponseEntity.ok("Assault level set to: " + level);
    }

    @GetMapping("/status")
    public ResponseEntity<Map<String, Object>> getStatus() {
        AssaultProperties assaults = settings.getAssaultProperties();
        Map<String, Object> status = new HashMap<>();
        status.put("assaultsEnabled", settings.getChaosMonkeyProperties().isEnabled());
        status.put("assaultLevel", assaults.getLevel());
        status.put("latencyActive", assaults.isLatencyActive());
        status.put("exceptionsActive", assaults.isExceptionsActive());
        return ResponseEntity.ok(status);
    }

    @PostMapping("/assaults/custom")
    public ResponseEntity<String> triggerCustomAssault(@RequestBody CustomAssaultRequest request) {
        if (request.getType() == null) {
            return ResponseEntity.badRequest().body("Assault type is required");
        }
        switch (request.getType().toLowerCase()) {
            case "latency":
                // Custom latency assault logic
                break;
            case "exception":
                // Custom exception assault
                break;
            case "memory":
                // Custom memory assault
                break;
            default:
                return ResponseEntity.badRequest().body("Unknown assault type: " + request.getType());
        }
        return ResponseEntity.ok("Custom assault triggered: " + request.getType());
    }

    public static class CustomAssaultRequest {
        private String type;
        private Map<String, Object> parameters;

        public String getType() { return type; }
        public void setType(String type) { this.type = type; }
        public Map<String, Object> getParameters() { return parameters; }
        public void setParameters(Map<String, Object> parameters) { this.parameters = parameters; }
    }
}
Step 7: Monitoring and Health Checks
Custom Health Indicator:
import de.codecentric.spring.boot.chaos.monkey.configuration.ChaosMonkeySettings;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
public class ChaosMonkeyHealthIndicator implements HealthIndicator {

    private final ChaosMonkeySettings settings;

    public ChaosMonkeyHealthIndicator(ChaosMonkeySettings settings) {
        this.settings = settings;
    }

    @Override
    public Health health() {
        // Reports UP either way; the details only show whether chaos is switched on
        if (settings.getChaosMonkeyProperties().isEnabled()) {
            return Health.up()
                    .withDetail("chaos-monkey", "active")
                    .withDetail("assaults", "enabled")
                    .withDetail("message", "Chaos Monkey is actively testing resilience")
                    .build();
        } else {
            return Health.up()
                    .withDetail("chaos-monkey", "inactive")
                    .withDetail("assaults", "disabled")
                    .build();
        }
    }
}
Metrics Configuration:
@Configuration
public class MetricsConfig {
@Bean
public MeterRegistryCustomizer<MeterRegistry> metricsCommonTags() {
return registry -> registry.config().commonTags(
"application", "ecommerce-service",
"chaos-monkey", "enabled"
);
}
}
Running the Application
- Start the application with Chaos Monkey enabled:
java -jar target/your-app.jar --spring.profiles.active=chaos-test
- Access Chaos Monkey endpoints:
  - Current settings (actuator endpoint): GET http://localhost:8080/actuator/chaosmonkey
  - Enable assaults: POST http://localhost:8080/api/chaos/assaults/enable
  - Check status: GET http://localhost:8080/api/chaos/status
- Test the application under chaos:
# While assaults are active, test your endpoints
curl http://localhost:8080/api/products
curl http://localhost:8080/api/products/1
- Monitor the logs for chaos events and system behavior
Production Best Practices
1. Gradual Rollout Strategy
# application-production.yml
chaos:
  monkey:
    enabled: true
    assaults:
      level: 20                       # Start conservative: roughly every 20th request (about 5%)
      kill-application-active: false  # Never in production
      runtime-assault-cron-expression: "0 0 2 * * *"  # Runtime assaults at 2 AM during low traffic
2. Environment-Specific Profiles
@Configuration
@Profile("!production")
public class DevelopmentChaosConfig {
// More aggressive assaults in development
}
@Configuration
@Profile("production")
public class ProductionChaosConfig {
// Conservative, business-approved assaults only
}
3. Monitoring and Alerting
# Prometheus alert rules
# The metric name is illustrative; check /actuator/metrics for the names your setup exports
groups:
  - name: chaos-monkey
    rules:
      - alert: ChaosMonkeyActive
        expr: chaos_monkey_assaults_active > 0
        labels:
          severity: warning
        annotations:
          summary: "Chaos Monkey is actively testing system resilience"
Testing Strategies
- Canary Testing: Enable chaos on a small subset of instances
- Game Days: Scheduled chaos testing with full team participation
- Automated Chaos: Integrate with CI/CD pipeline for resilience testing
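For the automated variant, a pipeline step needs a pass/fail criterion. One simple option is to probe an endpoint repeatedly while assaults are active and require that the share of non-5xx responses stays above a threshold. The sketch below shows only the counting logic. `ResilienceCheck`, the probe abstraction, and any threshold you compare against are assumptions for illustration; in a real pipeline the probe would issue an HTTP request.

```java
import java.util.function.IntSupplier;

// Hypothetical helper for a pipeline check; names and thresholds are assumptions.
final class ResilienceCheck {
    private ResilienceCheck() {}

    // Calls the probe n times and returns the fraction of responses that are not 5xx.
    // A probe that throws is counted as a failure.
    static double availability(IntSupplier probe, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        int ok = 0;
        for (int i = 0; i < n; i++) {
            try {
                if (probe.getAsInt() < 500) {
                    ok++;
                }
            } catch (RuntimeException e) {
                // counted as a failed request
            }
        }
        return (double) ok / n;
    }
}
```

A build could then fail when, say, `availability(probe, 200)` drops below an agreed value, which turns "the fallback works" into a repeatable check.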
Benefits
- Proactive Failure Discovery: Find weaknesses before customers do
- Improved Monitoring: Verify alerts and dashboards work correctly
- Team Confidence: Build trust in system resilience
- Better Incident Response: Teams practice handling failures regularly
Conclusion
Chaos Monkey for Spring Boot transforms failure from something we fear into something we embrace and learn from. By systematically injecting failures into your Java applications, you can:
- Build more resilient systems that gracefully handle failures
- Verify your monitoring and alerting actually works
- Train your teams to respond effectively to incidents
- Gain confidence that your system can withstand real-world chaos
Remember: Chaos Engineering is not about breaking things randomly; it is about building confidence in your system's capabilities through controlled experiments. Start small, learn continuously, and gradually increase the complexity of your chaos experiments as your system's resilience improves.