Canary testing is a powerful deployment strategy that reduces risk by gradually rolling out changes to a small subset of users before full deployment. Unlike blue-green deployments that switch all traffic at once, canary releases allow you to validate new versions in production with real users while monitoring key metrics. This article will guide you through building a robust Canary Testing Framework in Java that can safely validate new deployments.
Understanding Canary Testing
Key Concepts:
- Canary: The new version deployed to a small percentage of users
- Baseline: The stable version serving the majority of users
- Traffic Splitting: Routing requests between canary and baseline
- Metrics Collection: Monitoring performance, errors, and business metrics
- Automated Rollback: Reverting if the canary shows problems
Architecture Overview
A complete canary testing framework consists of:
- Traffic Router: Distributes requests between canary and baseline
- Metrics Collector: Gathers performance and business metrics
- Analysis Engine: Evaluates canary health based on metrics
- Deployment Controller: Manages traffic splitting and rollbacks
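One way to picture how these four components cooperate is a single control-loop step. The sketch below is only an illustration under assumptions, not the framework built in the following sections: the names `MetricsSource`, `HealthCheck`, and `CanaryLoop` are hypothetical, the router is reduced to a callback that receives the new canary percentage, and the 1.5x ramp factor is borrowed from the deployment manager shown later.

```java
import java.util.function.DoubleConsumer;

// Hypothetical stand-in for the Metrics Collector.
interface MetricsSource {
    double canaryErrorRatePercent();
}

// Hypothetical stand-in for the Analysis Engine.
interface HealthCheck {
    boolean isHealthy(double errorRatePercent);
}

// Hypothetical stand-in for the Deployment Controller.
final class CanaryLoop {
    private final MetricsSource metrics;
    private final HealthCheck healthCheck;
    private final DoubleConsumer trafficRouter; // receives the new canary percentage
    private double canaryPercent;

    CanaryLoop(MetricsSource metrics, HealthCheck healthCheck,
               DoubleConsumer trafficRouter, double initialPercent) {
        this.metrics = metrics;
        this.healthCheck = healthCheck;
        this.trafficRouter = trafficRouter;
        this.canaryPercent = initialPercent;
    }

    // One evaluation step: collect, analyze, then ramp up or roll back to 0%.
    double step() {
        double errorRate = metrics.canaryErrorRatePercent();
        canaryPercent = healthCheck.isHealthy(errorRate)
                ? Math.min(canaryPercent * 1.5, 100.0)
                : 0.0;
        trafficRouter.accept(canaryPercent);
        return canaryPercent;
    }
}
```

With a 2% error-rate limit, a canary reporting 0.5% errors moves from 5% to 7.5% of traffic in one step, while one reporting 7.5% errors drops to 0%.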
Building the Canary Testing Framework
1. Core Dependencies
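<properties>
    <micrometer.version>1.11.5</micrometer.version>
    <resilience4j.version>2.1.0</resilience4j.version>
</properties>
<!-- Dependencies without a <version> assume the Spring Boot parent POM (or BOM) manages them -->
<dependencies>
    <!-- Metrics collection -->
    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-core</artifactId>
        <version>${micrometer.version}</version>
    </dependency>
    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-registry-prometheus</artifactId>
        <version>${micrometer.version}</version>
    </dependency>
    <!-- Circuit breaker for rollback logic -->
    <dependency>
        <groupId>io.github.resilience4j</groupId>
        <artifactId>resilience4j-spring-boot3</artifactId>
        <version>${resilience4j.version}</version>
    </dependency>
    <!-- HTTP client for canary analysis -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-webflux</artifactId>
    </dependency>
    <!-- Actuator endpoints that expose health and metrics -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-actuator</artifactId>
    </dependency>
    <!-- Bean Validation, needed for @NotBlank on request bodies -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-validation</artifactId>
    </dependency>
    <!-- Lombok, needed for @Data, @AllArgsConstructor, and @Slf4j in the listings below -->
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <optional>true</optional>
    </dependency>
</dependencies>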
2. Core Configuration Model
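import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import lombok.AllArgsConstructor;
import lombok.Data;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.stereotype.Component;

// @Component registers the properties class as a bean so it can be constructor-injected
@Component
@ConfigurationProperties(prefix = "canary")
@Data
public class CanaryConfig {
    private boolean enabled = false;
    private double trafficPercentage = 5.0; // initial share of traffic sent to the canary, in percent
    private Duration analysisWindow = Duration.ofMinutes(10);
    private int minimumRequests = 100; // per version, before any verdict is given
    private Map<String, Double> metricsThresholds = new HashMap<>();
    private RollbackConfig rollback = new RollbackConfig();

    @Data
    public static class RollbackConfig {
        private boolean autoRollback = true;
        // Maximum tolerated increase over baseline, in percentage points
        private double errorRateThreshold = 2.0;
        // Maximum tolerated P99 latency increase over baseline, in milliseconds
        private double p99LatencyThreshold = 500.0;
        // Maximum tolerated relative drop in a business metric, in percent
        private double businessMetricThreshold = -10.0;
    }
}

// Canary deployment metadata
@Data
@AllArgsConstructor
public class CanaryDeployment {
    private String id;
    private String serviceName;
    private String baselineVersion;
    private String canaryVersion;
    private CanaryStatus status;
    private Instant startTime;
    private double currentTrafficPercentage;
    private Map<String, Object> metadata;
}

// Public because it appears in the public API of CanaryDeployment
public enum CanaryStatus {
    PENDING, RUNNING, PROMOTED, ROLLED_BACK, FAILED
}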
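The status enum implies a lifecycle, but nothing in the model stops a deployment from moving, say, from PROMOTED back to RUNNING. A minimal sketch of a transition guard follows. The class `CanaryLifecycle` and the transition table are assumptions made for illustration; the nested `Status` enum simply mirrors the values of `CanaryStatus` so that the example compiles on its own.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the framework listings.
final class CanaryLifecycle {
    // Mirrors the CanaryStatus values from the configuration model.
    enum Status { PENDING, RUNNING, PROMOTED, ROLLED_BACK, FAILED }

    // Assumed rules: PROMOTED, ROLLED_BACK, and FAILED are terminal states.
    private static final Map<Status, Set<Status>> ALLOWED = Map.of(
            Status.PENDING, EnumSet.of(Status.RUNNING, Status.FAILED),
            Status.RUNNING, EnumSet.of(Status.PROMOTED, Status.ROLLED_BACK, Status.FAILED),
            Status.PROMOTED, EnumSet.noneOf(Status.class),
            Status.ROLLED_BACK, EnumSet.noneOf(Status.class),
            Status.FAILED, EnumSet.noneOf(Status.class));

    static boolean canTransition(Status from, Status to) {
        return ALLOWED.get(from).contains(to);
    }

    // Returns the new status, or throws if the move is not allowed.
    static Status transition(Status from, Status to) {
        if (!canTransition(from, to)) {
            throw new IllegalStateException("Illegal transition: " + from + " -> " + to);
        }
        return to;
    }
}
```

A guard like this could be called from the setters that change a deployment's status, so that a late analysis run cannot revive a deployment that an operator has already rolled back.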
3. Traffic Routing with Spring Boot
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;
import org.springframework.stereotype.Component;

@Component
public class CanaryTrafficRouter {
    private final CanaryConfig config;
    private final MeterRegistry meterRegistry;
    // Track active canary deployments, keyed by service name
    private final Map<String, CanaryDeployment> activeDeployments = new ConcurrentHashMap<>();

    public CanaryTrafficRouter(CanaryConfig config, MeterRegistry meterRegistry) {
        this.config = config;
        this.meterRegistry = meterRegistry;
    }

    public <T> T route(String serviceName, String operation,
                       Supplier<T> baseline, Supplier<T> canary) {
        if (!config.isEnabled() || !shouldRouteToCanary(serviceName)) {
            return baseline.get();
        }
        // Route to canary based on traffic percentage. ThreadLocalRandom avoids
        // contention on a shared Random under concurrent requests.
        boolean useCanary =
                ThreadLocalRandom.current().nextDouble() * 100 < getTrafficPercentage(serviceName);
        String version = useCanary ? "canary" : "baseline";
        meterRegistry.counter("canary.requests",
                "service", serviceName, "version", version).increment();
        return executeWithMetrics(serviceName, operation, useCanary ? canary : baseline, version);
    }

    private <T> T executeWithMetrics(String serviceName, String operation,
                                     Supplier<T> supplier, String version) {
        Timer.Sample sample = Timer.start(meterRegistry);
        try {
            T result = supplier.get();
            meterRegistry.counter("canary.success",
                    "service", serviceName, "version", version).increment();
            return result;
        } catch (RuntimeException e) {
            meterRegistry.counter("canary.errors",
                    "service", serviceName, "version", version,
                    "error", e.getClass().getSimpleName()).increment();
            throw e;
        } finally {
            // Percentiles must be requested explicitly, otherwise the timer
            // records only count, total time, and max
            sample.stop(Timer.builder("canary.duration")
                    .tags("service", serviceName, "operation", operation, "version", version)
                    .publishPercentiles(0.5, 0.95, 0.99)
                    .register(meterRegistry));
        }
    }

    private boolean shouldRouteToCanary(String serviceName) {
        CanaryDeployment deployment = activeDeployments.get(serviceName);
        return deployment != null && deployment.getStatus() == CanaryStatus.RUNNING;
    }

    private double getTrafficPercentage(String serviceName) {
        CanaryDeployment deployment = activeDeployments.get(serviceName);
        return deployment != null ? deployment.getCurrentTrafficPercentage() : 0.0;
    }

    public void startDeployment(CanaryDeployment deployment) {
        deployment.setStatus(CanaryStatus.RUNNING);
        deployment.setStartTime(Instant.now());
        activeDeployments.put(deployment.getServiceName(), deployment);
    }

    public void updateTrafficPercentage(String serviceName, double percentage) {
        CanaryDeployment deployment = activeDeployments.get(serviceName);
        if (deployment != null) {
            deployment.setCurrentTrafficPercentage(percentage);
        }
    }

    public void promoteDeployment(String serviceName) {
        CanaryDeployment deployment = activeDeployments.remove(serviceName);
        if (deployment != null) {
            deployment.setStatus(CanaryStatus.PROMOTED);
        }
    }

    public void rollbackDeployment(String serviceName) {
        CanaryDeployment deployment = activeDeployments.remove(serviceName);
        if (deployment != null) {
            deployment.setStatus(CanaryStatus.ROLLED_BACK);
        }
    }
}
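The router above draws a fresh random number for every request, so the same user can bounce between baseline and canary from one call to the next. If that matters for your service, one common alternative is to derive the decision from a stable key. The following is a minimal sketch under assumptions: the class `StickyCanarySelector` and the idea of keying on a user ID are hypothetical additions, and CRC32 is used only because it ships with the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical helper: pins each user to one version for a given percentage.
final class StickyCanarySelector {
    private static final int BUCKETS = 10_000; // 0.01% granularity

    // Maps a user ID to a stable bucket in [0, BUCKETS).
    static int bucketOf(String userId) {
        CRC32 crc = new CRC32();
        crc.update(userId.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % BUCKETS);
    }

    // Users whose bucket falls below the cut-off go to the canary.
    static boolean routeToCanary(String userId, double canaryPercent) {
        return bucketOf(userId) < canaryPercent / 100.0 * BUCKETS;
    }
}
```

Because the cut-off only grows as the percentage is raised, a user who was routed to the canary at 5% stays on the canary at 10% and beyond, which keeps their experience consistent during the ramp-up.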
4. Metrics Collection and Analysis
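import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.util.Arrays;
import java.util.Collection;
import java.util.concurrent.TimeUnit;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;
import org.springframework.web.reactive.function.client.WebClient;

@Component
@Slf4j
public class CanaryAnalysisEngine {
    private final CanaryConfig config;
    private final MeterRegistry meterRegistry;
    // Not used by the in-process lookup below; intended for an external metrics backend
    private final WebClient metricsClient;

    public CanaryAnalysisEngine(CanaryConfig config, MeterRegistry meterRegistry) {
        this.config = config;
        this.meterRegistry = meterRegistry;
        this.metricsClient = WebClient.builder().build();
    }

    public CanaryAnalysisResult analyze(String serviceName) {
        try {
            CanaryMetrics baselineMetrics = fetchMetrics(serviceName, "baseline");
            CanaryMetrics canaryMetrics = fetchMetrics(serviceName, "canary");
            if (baselineMetrics.getTotalRequests() < config.getMinimumRequests() ||
                    canaryMetrics.getTotalRequests() < config.getMinimumRequests()) {
                return CanaryAnalysisResult.insufficientData();
            }
            return performAnalysis(baselineMetrics, canaryMetrics);
        } catch (Exception e) {
            log.error("Failed to analyze canary for service: {}", serviceName, e);
            return CanaryAnalysisResult.failed(e.getMessage());
        }
    }

    private CanaryMetrics fetchMetrics(String serviceName, String version) {
        // In a real implementation, this would query Prometheus, Datadog, etc.
        // These in-process counters are cumulative since startup, not limited to analysisWindow.
        // The router adds an "error" tag to error counters and an "operation" tag to timers,
        // so search by the shared tags and aggregate every matching meter.
        double errorCount = meterRegistry.find("canary.errors")
                .tags("service", serviceName, "version", version)
                .counters().stream().mapToDouble(Counter::count).sum();
        double successCount = meterRegistry.counter("canary.success",
                "service", serviceName, "version", version).count();
        double totalRequests = errorCount + successCount;
        Collection<Timer> timers = meterRegistry.find("canary.duration")
                .tags("service", serviceName, "version", version).timers();
        return new CanaryMetrics(
                totalRequests,
                errorCount,
                worstPercentileMillis(timers, 0.5),  // p50
                worstPercentileMillis(timers, 0.95), // p95
                worstPercentileMillis(timers, 0.99)  // p99
        );
    }

    // Percentiles cannot be merged exactly across timers, so report the worst
    // value over all operations. Requires publishPercentiles(...) on the timers.
    private double worstPercentileMillis(Collection<Timer> timers, double percentile) {
        return timers.stream()
                .flatMap(timer -> Arrays.stream(timer.takeSnapshot().percentileValues()))
                .filter(value -> Math.abs(value.percentile() - percentile) < 1e-9)
                .mapToDouble(value -> value.value(TimeUnit.MILLISECONDS))
                .max().orElse(0.0);
    }

    private CanaryAnalysisResult performAnalysis(CanaryMetrics baseline,
                                                 CanaryMetrics canary) {
        CanaryAnalysisResult result = new CanaryAnalysisResult();
        // Calculate error rate difference (percentage points)
        double baselineErrorRate = baseline.getErrorRate();
        double canaryErrorRate = canary.getErrorRate();
        double errorRateDiff = canaryErrorRate - baselineErrorRate;
        // Calculate latency differences (milliseconds)
        double p99LatencyDiff = canary.getP99Latency() - baseline.getP99Latency();
        double p95LatencyDiff = canary.getP95Latency() - baseline.getP95Latency();
        result.setErrorRateDifference(errorRateDiff);
        result.setP99LatencyDifference(p99LatencyDiff);
        result.setP95LatencyDifference(p95LatencyDiff);
        result.setTotalRequests(canary.getTotalRequests());
        // Evaluate against thresholds
        boolean errorRateOk = errorRateDiff <= config.getRollback().getErrorRateThreshold();
        boolean latencyOk = p99LatencyDiff <= config.getRollback().getP99LatencyThreshold();
        result.setHealthy(errorRateOk && latencyOk);
        // Both figures are canary-minus-baseline differences, not absolute values
        result.setMessage(String.format(
                "Error rate delta: %.2f pp, P99 latency delta: %.2f ms",
                errorRateDiff, p99LatencyDiff));
        return result;
    }
}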
@Data
class CanaryMetrics {
    private final double totalRequests;
    private final double errorCount;
    private final double p50Latency;
    private final double p95Latency;
    private final double p99Latency;

    public double getErrorRate() {
        return totalRequests > 0 ? (errorCount / totalRequests) * 100 : 0.0;
    }
}

@Data
class CanaryAnalysisResult {
    private boolean healthy = false;
    private boolean sufficientData = true;
    private String message = "";
    private double errorRateDifference;
    private double p99LatencyDifference;
    private double p95LatencyDifference;
    private double totalRequests;

    public static CanaryAnalysisResult insufficientData() {
        CanaryAnalysisResult result = new CanaryAnalysisResult();
        result.setSufficientData(false);
        result.setMessage("Insufficient data for analysis");
        return result;
    }

    public static CanaryAnalysisResult failed(String error) {
        CanaryAnalysisResult result = new CanaryAnalysisResult();
        result.setMessage("Analysis failed: " + error);
        return result;
    }
}
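To make the threshold logic concrete, here is a small worked example in plain Java. It is a sketch that restates the comparison done in `performAnalysis` without Micrometer or Spring; the class `ThresholdCheck` and the sample request counts are made up for illustration.

```java
// Hypothetical stand-alone restatement of the health check.
final class ThresholdCheck {
    // Error rate in percent, matching CanaryMetrics.getErrorRate().
    static double errorRatePercent(long errors, long total) {
        return total > 0 ? (errors * 100.0) / total : 0.0;
    }

    // Healthy when both canary-minus-baseline deltas stay within their limits.
    static boolean isHealthy(double baselineErrorPct, double canaryErrorPct,
                             double baselineP99Ms, double canaryP99Ms,
                             double maxErrorIncreasePct, double maxP99IncreaseMs) {
        double errorDelta = canaryErrorPct - baselineErrorPct;
        double latencyDelta = canaryP99Ms - baselineP99Ms;
        return errorDelta <= maxErrorIncreasePct && latencyDelta <= maxP99IncreaseMs;
    }

    public static void main(String[] args) {
        double baseline = errorRatePercent(40, 10_000); // 0.4%
        double badCanary = errorRatePercent(15, 500);   // 3.0%, delta 2.6 pp
        double goodCanary = errorRatePercent(5, 500);   // 1.0%, delta 0.6 pp
        System.out.println(isHealthy(baseline, badCanary, 180, 230, 2.0, 500.0));
        System.out.println(isHealthy(baseline, goodCanary, 180, 230, 2.0, 500.0));
    }
}
```

With the default limits of 2.0 percentage points and 500 ms, the first canary is rejected because its error rate is 2.6 points above baseline, while the second passes on both error rate and latency. Note that the latency limit is a difference, so a 500 ms default tolerates a very large regression on a fast endpoint; a tighter value is likely appropriate in practice.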
5. Canary Deployment Manager
import jakarta.annotation.PreDestroy;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

@Service
@Slf4j
public class CanaryDeploymentManager {
    private final CanaryTrafficRouter trafficRouter;
    private final CanaryAnalysisEngine analysisEngine;
    private final CanaryConfig config;
    private final ScheduledExecutorService scheduler;
    private final Map<String, CanaryDeployment> deployments = new ConcurrentHashMap<>();
    // Handles of the periodic analysis tasks, so they can be cancelled when a deployment ends
    private final Map<String, ScheduledFuture<?>> analysisTasks = new ConcurrentHashMap<>();

    public CanaryDeploymentManager(CanaryTrafficRouter trafficRouter,
                                   CanaryAnalysisEngine analysisEngine,
                                   CanaryConfig config) {
        this.trafficRouter = trafficRouter;
        this.analysisEngine = analysisEngine;
        this.config = config;
        this.scheduler = Executors.newScheduledThreadPool(2);
    }

    public CanaryDeployment startDeployment(String serviceName,
                                            String baselineVersion,
                                            String canaryVersion,
                                            Map<String, Object> metadata) {
        String deploymentId = generateDeploymentId(serviceName);
        CanaryDeployment deployment = new CanaryDeployment(
                deploymentId, serviceName, baselineVersion, canaryVersion,
                CanaryStatus.PENDING, Instant.now(), config.getTrafficPercentage(), metadata
        );
        deployments.put(deploymentId, deployment);
        trafficRouter.startDeployment(deployment);
        // Schedule periodic analysis and keep the handle for later cancellation
        analysisTasks.put(deploymentId, scheduler.scheduleAtFixedRate(
                () -> analyzeDeployment(deploymentId), 1, 1, TimeUnit.MINUTES));
        log.info("Started canary deployment: {} for service: {}",
                deploymentId, serviceName);
        return deployment;
    }

    public void analyzeDeployment(String deploymentId) {
        try {
            CanaryDeployment deployment = deployments.get(deploymentId);
            if (deployment == null || deployment.getStatus() != CanaryStatus.RUNNING) {
                // Promoted, rolled back, or unknown: stop the periodic task
                cancelAnalysis(deploymentId);
                return;
            }
            CanaryAnalysisResult result = analysisEngine.analyze(deployment.getServiceName());
            if (!result.isSufficientData()) {
                log.info("Insufficient data for deployment: {}", deploymentId);
                return;
            }
            log.info("Canary analysis for {}: healthy={}, message={}",
                    deploymentId, result.isHealthy(), result.getMessage());
            if (result.isHealthy()) {
                handleHealthyCanary(deployment, result);
            } else {
                handleUnhealthyCanary(deployment, result);
            }
        } catch (RuntimeException e) {
            // An exception escaping a scheduled task suppresses all of its future runs
            log.error("Analysis run failed for deployment: {}", deploymentId, e);
        }
    }

    private void handleHealthyCanary(CanaryDeployment deployment,
                                     CanaryAnalysisResult result) {
        double currentTraffic = deployment.getCurrentTrafficPercentage();
        // Gradually increase traffic if healthy
        if (currentTraffic < 100.0) {
            double newTraffic = Math.min(currentTraffic * 1.5, 100.0);
            deployment.setCurrentTrafficPercentage(newTraffic);
            trafficRouter.updateTrafficPercentage(deployment.getServiceName(), newTraffic);
            log.info("Increased traffic to {}% for deployment: {}",
                    newTraffic, deployment.getId());
        } else {
            // Already at 100% traffic and still healthy: promote
            trafficRouter.promoteDeployment(deployment.getServiceName());
            deployment.setStatus(CanaryStatus.PROMOTED);
            cancelAnalysis(deployment.getId());
            log.info("Promoted deployment: {}", deployment.getId());
        }
    }

    private void handleUnhealthyCanary(CanaryDeployment deployment,
                                       CanaryAnalysisResult result) {
        if (config.getRollback().isAutoRollback()) {
            trafficRouter.rollbackDeployment(deployment.getServiceName());
            deployment.setStatus(CanaryStatus.ROLLED_BACK);
            cancelAnalysis(deployment.getId());
            log.warn("Auto-rollback triggered for deployment: {}. Reason: {}",
                    deployment.getId(), result.getMessage());
        } else {
            log.warn("Canary unhealthy but auto-rollback disabled: {}. Reason: {}",
                    deployment.getId(), result.getMessage());
        }
    }

    private void cancelAnalysis(String deploymentId) {
        ScheduledFuture<?> task = analysisTasks.remove(deploymentId);
        if (task != null) {
            task.cancel(false);
        }
    }

    @PreDestroy
    public void shutdown() {
        scheduler.shutdownNow();
    }

    private String generateDeploymentId(String serviceName) {
        return serviceName + "-" + Instant.now().toEpochMilli();
    }

    public Optional<CanaryDeployment> getDeployment(String deploymentId) {
        return Optional.ofNullable(deployments.get(deploymentId));
    }

    public List<CanaryDeployment> getActiveDeployments() {
        return deployments.values().stream()
                .filter(d -> d.getStatus() == CanaryStatus.RUNNING)
                .collect(Collectors.toList());
    }
}
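It helps to know how long a full ramp-up takes before picking a starting percentage. The manager multiplies the canary share by 1.5 after every healthy analysis and caps it at 100%. The sketch below simply replays that rule; the class `RampSchedule` is a hypothetical helper, not part of the framework.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that lists the traffic percentages of a ramp-up.
final class RampSchedule {
    static List<Double> steps(double startPercent, double factor) {
        if (startPercent <= 0 || startPercent > 100 || factor <= 1) {
            throw new IllegalArgumentException("need 0 < start <= 100 and factor > 1");
        }
        List<Double> steps = new ArrayList<>();
        double current = startPercent;
        steps.add(current);
        // Same rule as handleHealthyCanary: multiply, then cap at 100%.
        while (current < 100.0) {
            current = Math.min(current * factor, 100.0);
            steps.add(current);
        }
        return steps;
    }

    public static void main(String[] args) {
        System.out.println(steps(5.0, 1.5));
    }
}
```

Starting from the default 5%, the schedule has nine entries and ends at 100%, so the canary needs eight healthy analyses to reach full traffic and a ninth to be promoted. At one analysis per minute that is at least nine minutes, and longer whenever a run is skipped for insufficient data.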
6. Spring Boot Integration
import jakarta.validation.Valid;
import jakarta.validation.constraints.NotBlank;
import java.util.HashMap;
import java.util.Map;
import lombok.Data;
import lombok.extern.slf4j.Slf4j;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

@RestController
@RequestMapping("/api/canary")
@Slf4j
public class CanaryController {
    private final CanaryDeploymentManager deploymentManager;
    private final CanaryTrafficRouter trafficRouter;

    public CanaryController(CanaryDeploymentManager deploymentManager,
                            CanaryTrafficRouter trafficRouter) {
        this.deploymentManager = deploymentManager;
        this.trafficRouter = trafficRouter;
    }

    // @Valid is required for the @NotBlank constraints on the request to be enforced
    @PostMapping("/deployments")
    public ResponseEntity<CanaryDeployment> startDeployment(
            @Valid @RequestBody StartDeploymentRequest request) {
        CanaryDeployment deployment = deploymentManager.startDeployment(
                request.getServiceName(),
                request.getBaselineVersion(),
                request.getCanaryVersion(),
                request.getMetadata()
        );
        return ResponseEntity.accepted().body(deployment);
    }

    @GetMapping("/deployments/{deploymentId}")
    public ResponseEntity<CanaryDeployment> getDeployment(
            @PathVariable String deploymentId) {
        return deploymentManager.getDeployment(deploymentId)
                .map(ResponseEntity::ok)
                .orElse(ResponseEntity.notFound().build());
    }

    // Manual override: returns 404 instead of 202 when the deployment is unknown
    @PostMapping("/deployments/{deploymentId}/promote")
    public ResponseEntity<Void> promoteDeployment(@PathVariable String deploymentId) {
        return deploymentManager.getDeployment(deploymentId)
                .map(deployment -> {
                    trafficRouter.promoteDeployment(deployment.getServiceName());
                    deployment.setStatus(CanaryStatus.PROMOTED);
                    return ResponseEntity.accepted().<Void>build();
                })
                .orElse(ResponseEntity.notFound().build());
    }

    @PostMapping("/deployments/{deploymentId}/rollback")
    public ResponseEntity<Void> rollbackDeployment(@PathVariable String deploymentId) {
        return deploymentManager.getDeployment(deploymentId)
                .map(deployment -> {
                    trafficRouter.rollbackDeployment(deployment.getServiceName());
                    deployment.setStatus(CanaryStatus.ROLLED_BACK);
                    return ResponseEntity.accepted().<Void>build();
                })
                .orElse(ResponseEntity.notFound().build());
    }
}

@Data
class StartDeploymentRequest {
    @NotBlank
    private String serviceName;
    @NotBlank
    private String baselineVersion;
    @NotBlank
    private String canaryVersion;
    private Map<String, Object> metadata = new HashMap<>();
}
7. Service-Level Integration
@Service
public class OrderService {
    private final CanaryTrafficRouter trafficRouter;
    private final OrderRepository baselineRepository;
    private final OrderRepository canaryRepository;

    public OrderService(CanaryTrafficRouter trafficRouter,
                        @Qualifier("baselineOrderRepository") OrderRepository baselineRepository,
                        @Qualifier("canaryOrderRepository") OrderRepository canaryRepository) {
        this.trafficRouter = trafficRouter;
        this.baselineRepository = baselineRepository;
        this.canaryRepository = canaryRepository;
    }

    public Order findOrder(String orderId) {
        return trafficRouter.route("order-service", "findOrder",
                () -> baselineRepository.findOrder(orderId),
                () -> canaryRepository.findOrder(orderId)
        );
    }

    public Order createOrder(Order order) {
        return trafficRouter.route("order-service", "createOrder",
                () -> baselineRepository.save(order),
                () -> canaryRepository.save(order)
        );
    }
}
8. Configuration
application.yml:
canary:
  enabled: true
  traffic-percentage: 5.0
  analysis-window: 10m
  minimum-requests: 100
  rollback:
    auto-rollback: true
    error-rate-threshold: 2.0
    p99-latency-threshold: 500.0
management:
  endpoints:
    web:
      exposure:
        include: health,metrics,prometheus
  # Spring Boot 3.x key; Spring Boot 2.x used management.metrics.export.prometheus.enabled
  prometheus:
    metrics:
      export:
        enabled: true
Testing the Framework
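import static org.assertj.core.api.Assertions.assertThat;

import java.time.Instant;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Set;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.TestPropertySource;

@SpringBootTest
@TestPropertySource(properties = {
    "canary.enabled=true",
    "canary.traffic-percentage=50.0"
})
class CanaryTrafficRouterTest {
    @Autowired
    private CanaryTrafficRouter trafficRouter;

    @Test
    void routesToBaselineWithoutActiveDeployment() {
        String result = trafficRouter.route("idle-service", "testOp",
                () -> "baseline-result", () -> "canary-result");
        assertThat(result).isEqualTo("baseline-result");
    }

    @Test
    void splitsTrafficWhileDeploymentIsRunning() {
        // Given a running deployment that sends 50% of requests to the canary
        trafficRouter.startDeployment(new CanaryDeployment("test-1", "test-service",
                "1.0.0", "1.1.0", CanaryStatus.PENDING, Instant.now(), 50.0, new HashMap<>()));
        // When many requests are routed
        Set<String> results = new HashSet<>();
        for (int i = 0; i < 1000; i++) {
            results.add(trafficRouter.route("test-service", "testOp",
                    () -> "baseline-result", () -> "canary-result"));
        }
        // Then both versions are reached (missing one at 50% is vanishingly unlikely)
        assertThat(results).containsExactlyInAnyOrder("baseline-result", "canary-result");
    }
}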
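Because routing is probabilistic, it can also be useful to check that the observed split is close to the configured percentage. The sketch below does this in plain Java with a seeded generator so the run is repeatable. It reuses the router's decision rule (a uniform draw in [0, 100) compared with the canary percentage), but the class `SplitRatioCheck`, the seed, and the tolerance are assumptions chosen for illustration.

```java
import java.util.Random;

// Hypothetical stand-alone check of the percentage-based decision rule.
final class SplitRatioCheck {
    // Same comparison the router performs for each request.
    static boolean toCanary(Random random, double canaryPercent) {
        return random.nextDouble() * 100 < canaryPercent;
    }

    // Simulates the given number of requests and returns the canary share in percent.
    static double observedCanaryShare(long seed, int requests, double canaryPercent) {
        Random random = new Random(seed);
        int canaryCount = 0;
        for (int i = 0; i < requests; i++) {
            if (toCanary(random, canaryPercent)) {
                canaryCount++;
            }
        }
        return canaryCount * 100.0 / requests;
    }

    public static void main(String[] args) {
        double share = observedCanaryShare(42L, 100_000, 5.0);
        System.out.println("Observed canary share: " + share + "%");
    }
}
```

With 100,000 simulated requests at 5%, the expected count is 5,000 with a standard deviation of roughly 69, so a tolerance of half a percentage point leaves a wide safety margin against flaky failures.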
Best Practices
- Start Small: Begin with 1-5% traffic and gradually increase
- Monitor Business Metrics: Track conversion rates, revenue, etc.
- Set Appropriate Timeouts: Bound each analysis run so a slow metrics query cannot delay a rollback decision
- Implement Manual Overrides: Allow operators to intervene
- Log Thoroughly: Maintain audit trails of canary decisions
- Test Rollback Procedures: Ensure rollbacks work smoothly
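On the business-metrics point: `RollbackConfig` already defines `businessMetricThreshold` (-10.0, meaning a 10% relative drop), but the analysis engine shown earlier never evaluates it. A minimal sketch of such a check follows. The class `BusinessMetricCheck` and the conversion-rate figures are hypothetical, and how the metric is measured per version is left open.

```java
// Hypothetical check for a business metric such as conversion rate.
final class BusinessMetricCheck {
    // Relative change of the canary against the baseline, in percent.
    static double relativeChangePercent(double baseline, double canary) {
        if (baseline == 0.0) {
            throw new IllegalArgumentException("baseline must be non-zero");
        }
        return (canary - baseline) / baseline * 100.0;
    }

    // thresholdPercent is negative for a tolerated drop, e.g. -10.0.
    static boolean withinThreshold(double baseline, double canary, double thresholdPercent) {
        return relativeChangePercent(baseline, canary) >= thresholdPercent;
    }

    public static void main(String[] args) {
        // Baseline converts 4.0% of sessions; thresholds use the -10% default.
        System.out.println(withinThreshold(4.0, 3.8, -10.0)); // about a 5% drop
        System.out.println(withinThreshold(4.0, 3.4, -10.0)); // about a 15% drop
    }
}
```

A conversion rate falling from 4.0% to 3.8% is a relative drop of about 5% and stays within the limit, while a fall to 3.4% is a drop of about 15% and would mark the canary unhealthy. Business metrics are usually noisier than error rates, so they tend to need more traffic before the comparison is meaningful.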
Conclusion
Building a canary testing framework in Java enables safe, gradual deployment of new versions with automated health checks and rollbacks. This framework provides:
- Traffic splitting between baseline and canary versions
- Comprehensive metrics collection for performance monitoring
- Automated analysis based on configurable thresholds
- Gradual traffic increase for successful canaries
- Automatic rollback for failing deployments
By implementing this canary testing framework, you can significantly reduce deployment risks and build confidence in your release process while maintaining system stability and user experience.