Network Partition Testing in Java: Building Resilient Distributed Systems

Network partition testing verifies that a distributed system handles network failures gracefully. It simulates lost or degraded connectivity between components, including the split-brain case in which isolated nodes each believe they are the leader, and checks that fault tolerance and recovery mechanisms behave as designed.

Core Concepts

What are Network Partitions?

  • Network Isolation: When network segments become disconnected
  • Split-Brain: Multiple components think they're the primary/leader
  • Partition Tolerance: System's ability to continue operating during partitions
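
A common guard against split-brain is to let a node act as leader only while it can reach a strict majority of the cluster. The sketch below illustrates that rule under simple assumptions; `QuorumCheck` and its method names are hypothetical and not taken from any library.

```java
import java.util.Set;

/** Hypothetical helper: decides whether a node may keep acting as leader. */
final class QuorumCheck {
    private final int clusterSize;

    QuorumCheck(int clusterSize) {
        if (clusterSize < 1) {
            throw new IllegalArgumentException("clusterSize must be positive");
        }
        this.clusterSize = clusterSize;
    }

    /** Smallest number of nodes that forms a strict majority. */
    int majority() {
        return clusterSize / 2 + 1;
    }

    /**
     * @param reachablePeers peers this node can currently reach (excluding itself)
     * @return true if this node plus its reachable peers form a majority
     */
    boolean mayLead(Set<String> reachablePeers) {
        return reachablePeers.size() + 1 >= majority();
    }
}
```

In a five-node cluster split 3/2, only the side with three nodes passes the check, so at most one side keeps a leader. With an even split (for example 2/2 in a four-node cluster) neither side may lead.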

CAP Theorem Implications

  • Consistency: All nodes see the same data
  • Availability: Every request receives a response
  • Partition Tolerance: System continues despite network partitions
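
To make the trade-off concrete: while a partition lasts, a replica either refuses requests it cannot replicate (favoring consistency) or answers from local state that may diverge (favoring availability). The following is a minimal sketch of those two behaviors; `Replica` and `Mode` are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical replica that behaves differently under a partition depending on its mode. */
final class Replica {
    enum Mode { CP, AP }

    private final Mode mode;
    private final Map<String, String> localState = new HashMap<>();
    private boolean partitioned;

    Replica(Mode mode) {
        this.mode = mode;
    }

    void setPartitioned(boolean partitioned) {
        this.partitioned = partitioned;
    }

    /** CP: reject writes that cannot be replicated. AP: accept locally and reconcile later. */
    boolean write(String key, String value) {
        if (partitioned && mode == Mode.CP) {
            return false; // unavailable, but no divergent state is created
        }
        localState.put(key, value);
        return true;
    }

    /** Returns the local value, which may be stale on an AP replica during a partition. */
    String read(String key) {
        return localState.get(key);
    }
}
```

A partition test for a CP design would assert that writes are rejected during the fault, while a test for an AP design would assert that writes succeed and are reconciled after the partition heals.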

Testing Approaches and Tools

1. Using Toxiproxy for Network Testing

Dependencies

<dependency>
<groupId>org.testcontainers</groupId>
<artifactId>toxiproxy</artifactId>
<version>1.19.0</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>eu.rekawek.toxiproxy</groupId>
<artifactId>toxiproxy-java</artifactId>
<version>2.1.7</version>
<scope>test</scope>
</dependency>

Example 1: Basic Toxiproxy Testing

import eu.rekawek.toxiproxy.Proxy;
import eu.rekawek.toxiproxy.ToxiproxyClient;
import eu.rekawek.toxiproxy.model.ToxicDirection;
import org.junit.jupiter.api.Test;
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.containers.Network;
import org.testcontainers.containers.ToxiproxyContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;
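import java.io.IOException;
import java.util.concurrent.TimeUnit;
// AssertJ and Awaitility must be on the test classpath
import static org.assertj.core.api.Assertions.assertThat;
import static org.assertj.core.api.Assertions.assertThatThrownBy;
import static org.awaitility.Awaitility.await;
// RedisClient and RedisConnectionException stand in for the application's own Redis wrapper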
@Testcontainers
public class ToxiproxyNetworkPartitionTest {
private static final Network network = Network.newNetwork();
@Container
private static final ToxiproxyContainer toxiproxy = new ToxiproxyContainer("ghcr.io/shopify/toxiproxy:2.5.0")
.withNetwork(network);
@Container
private static final GenericContainer<?> redis = new GenericContainer<>("redis:7-alpine")
.withNetwork(network)
.withNetworkAliases("redis")
.withExposedPorts(6379);
@Test
void testRedisConnectionDuringNetworkPartition() throws IOException {
// Create Toxiproxy client
ToxiproxyClient client = new ToxiproxyClient(toxiproxy.getHost(), toxiproxy.getControlPort());
// Create proxy for Redis
Proxy proxy = client.createProxy("redis-proxy", 
"0.0.0.0:8666", 
"redis:6379");
// Get proxied Redis connection details
String redisHost = toxiproxy.getHost();
int redisPort = toxiproxy.getMappedPort(8666);
// Test normal connection
RedisClient redisClient = new RedisClient(redisHost, redisPort);
assertThat(redisClient.ping()).isTrue();
// Simulate a degraded link: add latency, then a timeout toxic that stops data and closes the connection
proxy.toxics()
.latency("network-latency", ToxicDirection.DOWNSTREAM, 5000) // 5 seconds latency
.setJitter(1000);
proxy.toxics()
.timeout("connection-timeout", ToxicDirection.DOWNSTREAM, 10000); // 10 seconds timeout
// Test behavior during partition
assertThatThrownBy(() -> redisClient.set("key", "value"))
.isInstanceOf(RedisConnectionException.class);
// Remove network issues
proxy.toxics().get("network-latency").remove();
proxy.toxics().get("connection-timeout").remove();
// Test recovery
await().atMost(30, TimeUnit.SECONDS)
.untilAsserted(() -> {
assertThat(redisClient.ping()).isTrue();
redisClient.set("key", "value");
assertThat(redisClient.get("key")).isEqualTo("value");
});
}
}
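
If an assertion in the middle of such a test throws, the toxics are never removed and later tests inherit the fault. One way around this is to tie the fault to a try-with-resources scope. The sketch below is library-agnostic: `FaultScope` and `FaultScopeDemo` are hypothetical helpers, and the two `Runnable`s stand in for calls such as adding and removing a toxic.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: applies a fault on creation and always removes it on close. */
final class FaultScope implements AutoCloseable {
    private final Runnable remove;

    private FaultScope(Runnable remove) {
        this.remove = remove;
    }

    /** Runs apply immediately and remembers remove for close(). */
    static FaultScope inject(Runnable apply, Runnable remove) {
        apply.run();
        return new FaultScope(remove);
    }

    @Override
    public void close() {
        remove.run();
    }
}

/** Shows that the fault is removed even when the test body fails. */
final class FaultScopeDemo {
    static List<String> run() {
        List<String> events = new ArrayList<>();
        try (FaultScope ignored = FaultScope.inject(
                () -> events.add("fault applied"),
                () -> events.add("fault removed"))) {
            events.add("body");
            throw new IllegalStateException("assertion failed");
        } catch (IllegalStateException e) {
            // the resource has already been closed when this handler runs
            events.add("caught");
        }
        return events;
    }
}
```

The design choice here is to make cleanup structural rather than a step the test author has to remember.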

2. Custom Network Partition Simulator

Example 2: Java-based Network Partition Simulator

@Component
@Slf4j
public class NetworkPartitionSimulator {
private final Map<String, PartitionRule> activePartitions = new ConcurrentHashMap<>();
private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(5);
public static class PartitionRule {
private final String sourceComponent;
private final String targetComponent;
private final PartitionType type;
private final Duration duration;
private final double packetLossPercentage;
private final Duration latency;
public PartitionRule(String sourceComponent, String targetComponent, 
PartitionType type, Duration duration) {
this(sourceComponent, targetComponent, type, duration, 0.0, Duration.ZERO);
}
public PartitionRule(String sourceComponent, String targetComponent,
PartitionType type, Duration duration,
double packetLossPercentage, Duration latency) {
this.sourceComponent = sourceComponent;
this.targetComponent = targetComponent;
this.type = type;
this.duration = duration;
this.packetLossPercentage = packetLossPercentage;
this.latency = latency;
}
// getters
}
public enum PartitionType {
COMPLETE,    // No communication
LATENCY,     // Delayed communication
PACKET_LOSS, // Random packet loss
UNRELIABLE   // Combination of issues
}
@PreDestroy
public void cleanup() {
scheduler.shutdown();
try {
if (!scheduler.awaitTermination(30, TimeUnit.SECONDS)) {
scheduler.shutdownNow();
}
} catch (InterruptedException e) {
scheduler.shutdownNow();
Thread.currentThread().interrupt();
}
}
public String createPartition(PartitionRule rule) {
String partitionId = UUID.randomUUID().toString();
activePartitions.put(partitionId, rule);
log.info("Created network partition: {} between {} and {}", 
partitionId, rule.getSourceComponent(), rule.getTargetComponent());
// Schedule automatic removal if duration is specified
if (!rule.getDuration().isZero()) {
scheduler.schedule(() -> removePartition(partitionId), 
rule.getDuration().toMillis(), TimeUnit.MILLISECONDS);
}
return partitionId;
}
public void removePartition(String partitionId) {
PartitionRule removed = activePartitions.remove(partitionId);
if (removed != null) {
log.info("Removed network partition: {} between {} and {}", 
partitionId, removed.getSourceComponent(), removed.getTargetComponent());
}
}
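public boolean shouldBlockCommunication(String fromComponent, String toComponent) {
// Only COMPLETE partitions block traffic; the other types delay or drop it instead
return activePartitions.values().stream()
.filter(rule -> rule.getType() == PartitionType.COMPLETE)
.anyMatch(rule -> matchesRule(rule, fromComponent, toComponent));
}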
public boolean shouldDelayCommunication(String fromComponent, String toComponent) {
return activePartitions.values().stream()
.filter(rule -> matchesRule(rule, fromComponent, toComponent))
.anyMatch(rule -> rule.getType() == PartitionType.LATENCY || 
rule.getType() == PartitionType.UNRELIABLE);
}
public Duration getCommunicationDelay(String fromComponent, String toComponent) {
return activePartitions.values().stream()
.filter(rule -> matchesRule(rule, fromComponent, toComponent))
.filter(rule -> rule.getLatency() != null && !rule.getLatency().isZero())
.map(PartitionRule::getLatency)
.findFirst()
.orElse(Duration.ZERO);
}
public boolean shouldDropPacket(String fromComponent, String toComponent) {
return activePartitions.values().stream()
.filter(rule -> matchesRule(rule, fromComponent, toComponent))
.filter(rule -> rule.getPacketLossPercentage() > 0.0)
.anyMatch(rule -> Math.random() < rule.getPacketLossPercentage() / 100.0);
}
private boolean matchesRule(PartitionRule rule, String from, String to) {
return (rule.getSourceComponent().equals(from) && rule.getTargetComponent().equals(to)) ||
(rule.getSourceComponent().equals(to) && rule.getTargetComponent().equals(from));
}
public List<PartitionRule> getActivePartitions() {
return new ArrayList<>(activePartitions.values());
}
public void clearAllPartitions() {
log.info("Clearing all network partitions");
activePartitions.clear();
}
}
// HTTP Client with Partition Awareness
@Component
@Slf4j
public class PartitionAwareHttpClient {
private final NetworkPartitionSimulator partitionSimulator;
private final RestTemplate restTemplate;
private final ObjectMapper objectMapper;
public PartitionAwareHttpClient(NetworkPartitionSimulator partitionSimulator,
RestTemplate restTemplate,
ObjectMapper objectMapper) {
this.partitionSimulator = partitionSimulator;
this.restTemplate = restTemplate;
this.objectMapper = objectMapper;
}
public <T> T executeWithPartitionAwareness(String serviceName, String url, 
HttpMethod method, Object request,
Class<T> responseType) {
String currentService = getCurrentServiceName();
// Check if communication should be blocked
if (partitionSimulator.shouldBlockCommunication(currentService, serviceName)) {
log.warn("Network partition blocked communication from {} to {}", 
currentService, serviceName);
throw new NetworkPartitionException(
"Network partition prevents communication to " + serviceName);
}
// Check for packet loss
if (partitionSimulator.shouldDropPacket(currentService, serviceName)) {
log.warn("Packet loss simulated from {} to {}", currentService, serviceName);
throw new NetworkPartitionException(
"Packet loss simulated for communication to " + serviceName);
}
// Apply latency if needed
Duration delay = partitionSimulator.getCommunicationDelay(currentService, serviceName);
if (!delay.isZero()) {
log.info("Applying network latency of {}ms from {} to {}", 
delay.toMillis(), currentService, serviceName);
try {
Thread.sleep(delay.toMillis());
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
throw new NetworkPartitionException("Latency simulation interrupted");
}
}
// Execute the actual HTTP call
try {
HttpHeaders headers = new HttpHeaders();
headers.setContentType(MediaType.APPLICATION_JSON);
HttpEntity<Object> entity = new HttpEntity<>(request, headers);
ResponseEntity<String> response = restTemplate.exchange(
url, method, entity, String.class);
return objectMapper.readValue(response.getBody(), responseType);
} catch (ResourceAccessException e) {
log.error("Network error calling {}: {}", serviceName, e.getMessage());
throw new NetworkPartitionException("Network error: " + e.getMessage(), e);
} catch (Exception e) {
log.error("Unexpected error calling {}: {}", serviceName, e.getMessage());
throw new RuntimeException("Service call failed", e);
}
}
private String getCurrentServiceName() {
// In real implementation, this would come from configuration
return System.getProperty("service.name", "unknown-service");
}
}
// Custom Exception
class NetworkPartitionException extends RuntimeException {
public NetworkPartitionException(String message) {
super(message);
}
public NetworkPartitionException(String message, Throwable cause) {
super(message, cause);
}
}
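
Because the simulator draws from `Math.random()`, a packet-loss test can pass or fail from run to run. A possible refinement, sketched here under the assumption that the drop decision is factored into its own class, is to inject a seeded `Random` so that a failing sequence can be replayed. `PacketLossDecider` is a hypothetical name.

```java
import java.util.Random;

/** Hypothetical drop decision with a seeded random source for reproducible tests. */
final class PacketLossDecider {
    private final Random random;
    private final double lossPercentage;

    PacketLossDecider(double lossPercentage, long seed) {
        if (lossPercentage < 0.0 || lossPercentage > 100.0) {
            throw new IllegalArgumentException("lossPercentage must be within 0..100");
        }
        this.lossPercentage = lossPercentage;
        this.random = new Random(seed);
    }

    /** Returns true if the next simulated packet should be dropped. */
    boolean shouldDrop() {
        // nextDouble() is in [0, 1), so 0% never drops and 100% always drops
        return random.nextDouble() < lossPercentage / 100.0;
    }
}
```

Logging the seed at the start of a chaos run is usually enough to reproduce the same drop pattern later.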

3. Database Partition Testing

Example 3: Database Connection Partition Testing

@SpringBootTest
@Testcontainers
@Slf4j
public class DatabasePartitionTest {
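// Both containers share a network so Toxiproxy can reach PostgreSQL by alias
private static final Network network = Network.newNetwork();
@Container
private static final PostgreSQLContainer<?> postgres = 
new PostgreSQLContainer<>("postgres:15-alpine")
.withNetwork(network)
.withNetworkAliases("postgres");
@Container
private static final ToxiproxyContainer toxiproxy = 
new ToxiproxyContainer("ghcr.io/shopify/toxiproxy:2.5.0")
.withNetwork(network);
private static Proxy dbProxy;
@Autowired
private DataSource dataSource;
@Autowired
private PlatformTransactionManager transactionManager;
@BeforeAll
static void setup() throws Exception {
// Create proxy for database
ToxiproxyClient client = new ToxiproxyClient(
toxiproxy.getHost(), toxiproxy.getControlPort());
// The upstream is resolved inside the Toxiproxy container, so use the
// network alias and the container port rather than the host-mapped port
dbProxy = client.createProxy("postgres-proxy", 
"0.0.0.0:8666", 
"postgres:5432");
}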
@DynamicPropertySource
static void configureProperties(DynamicPropertyRegistry registry) {
// Use proxied database connection
registry.add("spring.datasource.url", () -> 
String.format("jdbc:postgresql://%s:%d/%s",
toxiproxy.getHost(),
toxiproxy.getMappedPort(8666),
postgres.getDatabaseName()));
registry.add("spring.datasource.username", postgres::getUsername);
registry.add("spring.datasource.password", postgres::getPassword);
}
@Test
void testDatabaseOperationsDuringNetworkPartition() throws Exception {
// Setup test data
JdbcTemplate jdbcTemplate = new JdbcTemplate(dataSource);
jdbcTemplate.execute("CREATE TABLE IF NOT EXISTS test_table (id SERIAL PRIMARY KEY, data VARCHAR)");
jdbcTemplate.execute("INSERT INTO test_table (data) VALUES ('test-data')");
// Test normal operation
String result = jdbcTemplate.queryForObject(
"SELECT data FROM test_table WHERE id = 1", String.class);
assertThat(result).isEqualTo("test-data");
// Simulate network partition - high latency and timeout
dbProxy.toxics()
.latency("db-latency", ToxicDirection.DOWNSTREAM, 10000) // 10 seconds
.setJitter(2000);
dbProxy.toxics()
.timeout("db-timeout", ToxicDirection.DOWNSTREAM, 5000); // 5 seconds timeout
// Test behavior during partition
long startTime = System.currentTimeMillis();
assertThatThrownBy(() -> {
jdbcTemplate.queryForObject(
"SELECT data FROM test_table WHERE id = 1", String.class);
}).isInstanceOf(DataAccessResourceFailureException.class);
long duration = System.currentTimeMillis() - startTime;
assertThat(duration).isGreaterThanOrEqualTo(4900); // Should timeout around 5 seconds
// Remove network issues
dbProxy.toxics().get("db-latency").remove();
dbProxy.toxics().get("db-timeout").remove();
// Test recovery
await().atMost(30, TimeUnit.SECONDS)
.untilAsserted(() -> {
String recoveredResult = jdbcTemplate.queryForObject(
"SELECT data FROM test_table WHERE id = 1", String.class);
assertThat(recoveredResult).isEqualTo("test-data");
});
}
@Test
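void testTransactionBehaviorDuringPartition() throws Exception {
// Do not rely on another test having created the table
new JdbcTemplate(dataSource).execute(
"CREATE TABLE IF NOT EXISTS test_table (id SERIAL PRIMARY KEY, data VARCHAR)");
TransactionTemplate transactionTemplate = 
new TransactionTemplate(transactionManager);
// Give the transaction a deadline shorter than the injected latency; whether it
// fires also depends on driver settings such as the JDBC socket timeout
transactionTemplate.setTimeout(5); // seconds
// Simulate partition during the transaction
dbProxy.toxics()
.latency("commit-latency", ToxicDirection.DOWNSTREAM, 15000);
assertThatThrownBy(() -> {
transactionTemplate.execute(status -> {
JdbcTemplate jdbcTemplate = new JdbcTemplate(dataSource);
jdbcTemplate.execute("INSERT INTO test_table (data) VALUES ('transaction-test')");
// This should fail due to timeout
return null;
});
// The exact type depends on where the deadline is noticed (statement or transaction)
}).isInstanceOfAny(TransactionTimedOutException.class, DataAccessException.class);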
// Verify transaction was rolled back
dbProxy.toxics().get("commit-latency").remove();
Integer count = new JdbcTemplate(dataSource).queryForObject(
"SELECT COUNT(*) FROM test_table WHERE data = 'transaction-test'", Integer.class);
assertThat(count).isZero(); // Transaction should have rolled back
}
}
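
The timing assertions in these tests only hold if the client side actually enforces a deadline. As a library-agnostic illustration, the sketch below bounds an arbitrary call with `Future.get(timeout)`. `BoundedCall` and `DeadlineExceededException` are hypothetical names, and this is not how JDBC or Spring implement their own timeouts.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BoundedCall {
    private BoundedCall() {}

    /** Unchecked signal that the deadline passed before the task finished. */
    static final class DeadlineExceededException extends RuntimeException {
        DeadlineExceededException(String message) {
            super(message);
        }
    }

    /** Runs the task on a worker thread and gives up once the timeout has elapsed. */
    static <T> T call(Callable<T> task, Duration timeout) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout.toMillis(), TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker so it does not linger
            throw new DeadlineExceededException("no result within " + timeout);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Creating an executor per call keeps the sketch short; production code would more likely reuse a shared pool.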

4. Microservices Partition Testing

Example 4: Spring Cloud Microservices Partition Testing

@SpringBootTest
@ActiveProfiles("test")
@Testcontainers
@Slf4j
public class MicroservicesPartitionTest {
@Container
private static final ToxiproxyContainer toxiproxy = 
new ToxiproxyContainer("ghcr.io/shopify/toxiproxy:2.5.0")
.withNetwork(Network.SHARED); // same network as the services, so their aliases resolve
@Container
private static final GenericContainer<?> userService = 
new GenericContainer<>("user-service:latest")
.withNetwork(Network.SHARED)
.withNetworkAliases("user-service")
.withExposedPorts(8080);
@Container
private static final GenericContainer<?> orderService = 
new GenericContainer<>("order-service:latest")
.withNetwork(Network.SHARED)
.withNetworkAliases("order-service")
.withExposedPorts(8080);
@Container
private static final GenericContainer<?> paymentService = 
new GenericContainer<>("payment-service:latest")
.withNetwork(Network.SHARED)
.withNetworkAliases("payment-service")
.withExposedPorts(8080);
private static Proxy userServiceProxy;
private static Proxy orderServiceProxy;
private static Proxy paymentServiceProxy;
@BeforeAll
static void setupProxies() throws Exception {
ToxiproxyClient client = new ToxiproxyClient(
toxiproxy.getHost(), toxiproxy.getControlPort());
userServiceProxy = client.createProxy("user-service-proxy",
"0.0.0.0:8667", "user-service:8080");
orderServiceProxy = client.createProxy("order-service-proxy",
"0.0.0.0:8668", "order-service:8080");
paymentServiceProxy = client.createProxy("payment-service-proxy",
"0.0.0.0:8669", "payment-service:8080");
}
@Test
void testOrderProcessingDuringServicePartition() throws Exception {
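// Create partition between order service and payment service.
// Assumes order-service reaches payment-service through the Toxiproxy listener on port 8669
paymentServiceProxy.toxics()
.latency("payment-partition", ToxicDirection.DOWNSTREAM, 30000); // 30 seconds
// Each toxic needs its own toxics() call; latency(...) returns the toxic, not the list
paymentServiceProxy.toxics()
.timeout("payment-timeout", ToxicDirection.DOWNSTREAM, 10000); // 10 seconds timeout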
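// Test order creation. The breaker is tuned to open after one failed call and to
// allow a trial call after 5 seconds; Resilience4j's defaults are far more lenient
CircuitBreakerRegistry breakers = CircuitBreakerRegistry.of(CircuitBreakerConfig.custom()
.slidingWindowSize(1)
.minimumNumberOfCalls(1)
.permittedNumberOfCallsInHalfOpenState(1)
.waitDurationInOpenState(Duration.ofSeconds(5))
.build());
OrderServiceClient orderClient = new OrderServiceClient(
"http://" + toxiproxy.getHost() + ":" + toxiproxy.getMappedPort(8668),
new RestTemplate(), breakers, RetryRegistry.ofDefaults());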
CreateOrderRequest orderRequest = new CreateOrderRequest(
"user-123", List.of("item-1", "item-2"), new BigDecimal("99.99"));
// This should fail due to payment service being unavailable
assertThatThrownBy(() -> orderClient.createOrder(orderRequest))
.isInstanceOf(ServiceUnavailableException.class);
// Verify circuit breaker pattern
await().atMost(10, TimeUnit.SECONDS)
.until(() -> orderClient.getCircuitBreakerState() == CircuitBreaker.State.OPEN);
// Remove partition
paymentServiceProxy.toxics().get("payment-partition").remove();
paymentServiceProxy.toxics().get("payment-timeout").remove();
// Verify recovery. A breaker only closes after successful trial calls, so keep
// sending requests (rejected ones are ignored) until it reports CLOSED
await().atMost(30, TimeUnit.SECONDS)
.ignoreExceptions()
.until(() -> {
orderClient.createOrder(orderRequest);
return orderClient.getCircuitBreakerState() == CircuitBreaker.State.CLOSED;
});
// Test successful order creation after recovery
OrderResponse order = orderClient.createOrder(orderRequest);
assertThat(order.getStatus()).isEqualTo(OrderStatus.COMPLETED);
}
@Test
void testDataConsistencyDuringPartition() throws Exception {
// Delay responses from the user service so the client gives up before the reply arrives
userServiceProxy.toxics()
.latency("user-db-partition", ToxicDirection.DOWNSTREAM, 15000);
// Try to update user profile
UserServiceClient userClient = new UserServiceClient(
"http://" + toxiproxy.getHost() + ":" + toxiproxy.getMappedPort(8667));
UpdateUserRequest updateRequest = new UpdateUserRequest("user-456", "user-456@example.com");
// The client should time out, although the request itself still reaches the service
assertThatThrownBy(() -> userClient.updateUser(updateRequest))
.isInstanceOf(ServiceUnavailableException.class);
// Remove partition
userServiceProxy.toxics().get("user-db-partition").remove();
// Verify eventual consistency: the update applied on the server becomes visible
await().atMost(30, TimeUnit.SECONDS)
.untilAsserted(() -> {
UserProfile profile = userClient.getUser("user-456");
assertThat(profile.getEmail()).isEqualTo("user-456@example.com");
});
}
}
// Service Client with Resilience Patterns
@Component
@Slf4j
public class OrderServiceClient {
private final RestTemplate restTemplate;
private final CircuitBreaker circuitBreaker;
private final Retry retry;
private final String serviceUrl;
public OrderServiceClient(@Value("${order.service.url}") String serviceUrl,
RestTemplate restTemplate,
CircuitBreakerRegistry circuitBreakerRegistry,
RetryRegistry retryRegistry) {
this.serviceUrl = serviceUrl;
this.restTemplate = restTemplate;
this.circuitBreaker = circuitBreakerRegistry.circuitBreaker("order-service");
this.retry = retryRegistry.retry("order-service");
}
public OrderResponse createOrder(CreateOrderRequest request) {
return circuitBreaker.executeSupplier(() ->
retry.executeSupplier(() -> {
try {
ResponseEntity<OrderResponse> response = restTemplate.postForEntity(
serviceUrl + "/api/orders", request, OrderResponse.class);
return response.getBody();
} catch (ResourceAccessException e) {
log.warn("Network error calling order service: {}", e.getMessage());
throw new ServiceUnavailableException("Order service unavailable", e);
}
})
);
}
// Resilience4j reports the state as CircuitBreaker.State (CLOSED, OPEN, HALF_OPEN, ...)
public CircuitBreaker.State getCircuitBreakerState() {
return circuitBreaker.getState();
}
}
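
The test in Example 4 waits for the breaker to open and later to close. For readers unfamiliar with the pattern, the following is a deliberately simplified state machine, not Resilience4j's implementation: it opens after a fixed number of consecutive failures, moves to half-open once a wait period has elapsed, and closes again on the first success. `SimpleCircuitBreaker` and its parameters are hypothetical.

```java
import java.util.function.LongSupplier;

/** Simplified, single-threaded circuit breaker used only to illustrate the state transitions. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clock; // injectable time source, in milliseconds
    private State state = State.CLOSED;
    private int consecutiveFailures;
    private long openedAt;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** Returns the current state, moving OPEN to HALF_OPEN once the wait period has passed. */
    State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    /** Callers should skip the remote call when this returns false. */
    boolean allowRequest() {
        return state() != State.OPEN;
    }

    void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    void recordFailure() {
        consecutiveFailures++;
        // a failed trial call in HALF_OPEN reopens the breaker immediately
        if (state() == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }
}
```

Injecting the clock lets a test step through the transitions without sleeping, which is also why the breaker cannot close on its own: something has to send a successful trial request.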

5. Advanced Partition Scenarios

Example 5: Chaos Engineering with Custom Partitions

@Service
@Slf4j
public class ChaosEngineeringService {
private final NetworkPartitionSimulator partitionSimulator;
private final ScheduledExecutorService chaosExecutor;
private final Random random = new Random();
public ChaosEngineeringService(NetworkPartitionSimulator partitionSimulator) {
this.partitionSimulator = partitionSimulator;
this.chaosExecutor = Executors.newScheduledThreadPool(3);
}
public void startRandomPartitions(ChaosConfig config) {
log.info("Starting chaos engineering with config: {}", config);
// Schedule random partitions
chaosExecutor.scheduleAtFixedRate(() -> {
if (random.nextDouble() < config.getPartitionProbability()) {
createRandomPartition(config);
}
}, 0, config.getIntervalSeconds(), TimeUnit.SECONDS);
}
public void stopChaos() {
log.info("Stopping chaos engineering");
chaosExecutor.shutdown();
partitionSimulator.clearAllPartitions();
}
private void createRandomPartition(ChaosConfig config) {
List<String> services = config.getTargetServices();
if (services.size() < 2) return;
// Pick two random services to partition
String service1 = services.get(random.nextInt(services.size()));
String service2;
do {
service2 = services.get(random.nextInt(services.size()));
} while (service1.equals(service2));
PartitionType type = getRandomPartitionType();
// The +1 makes the maximum inclusive and keeps the bound positive when min == max
Duration duration = Duration.ofSeconds(
config.getMinPartitionDuration() + 
random.nextInt(config.getMaxPartitionDuration() - config.getMinPartitionDuration() + 1));
double packetLoss = type == PartitionType.PACKET_LOSS || type == PartitionType.UNRELIABLE ?
random.nextDouble() * 50.0 : 0.0; // Up to 50% packet loss
Duration latency = type == PartitionType.LATENCY || type == PartitionType.UNRELIABLE ?
Duration.ofMillis(1000 + random.nextInt(9000)) : Duration.ZERO; // 1-10 seconds latency
PartitionRule rule = new PartitionRule(service1, service2, type, duration, packetLoss, latency);
partitionSimulator.createPartition(rule);
log.info("Created random partition: {} <-> {} (type: {}, duration: {}s)", 
service1, service2, type, duration.getSeconds());
}
private PartitionType getRandomPartitionType() {
PartitionType[] types = PartitionType.values();
return types[random.nextInt(types.length)];
}
@PreDestroy
public void cleanup() {
stopChaos();
}
}
// Chaos Configuration
@Data
@ConfigurationProperties(prefix = "chaos")
public class ChaosConfig {
private boolean enabled = false;
private double partitionProbability = 0.1; // 10% chance
private int intervalSeconds = 60; // Check every minute
private int minPartitionDuration = 30; // seconds
private int maxPartitionDuration = 300; // seconds
private List<String> targetServices = List.of();
}
// Chaos Controller for Testing
@RestController
@RequestMapping("/api/chaos")
@Slf4j
@ConditionalOnProperty(name = "chaos.enabled", havingValue = "true")
public class ChaosController {
private final ChaosEngineeringService chaosService;
private final NetworkPartitionSimulator partitionSimulator;
public ChaosController(ChaosEngineeringService chaosService,
NetworkPartitionSimulator partitionSimulator) {
this.chaosService = chaosService;
this.partitionSimulator = partitionSimulator;
}
@PostMapping("/start")
public ResponseEntity<String> startChaos(@RequestBody ChaosConfig config) {
chaosService.startRandomPartitions(config);
return ResponseEntity.ok("Chaos engineering started");
}
@PostMapping("/stop")
public ResponseEntity<String> stopChaos() {
chaosService.stopChaos();
return ResponseEntity.ok("Chaos engineering stopped");
}
@PostMapping("/partitions")
public ResponseEntity<String> createPartition(@RequestBody PartitionRequest request) {
PartitionRule rule = new PartitionRule(
request.getSourceService(),
request.getTargetService(),
request.getType(),
Duration.ofSeconds(request.getDurationSeconds()),
request.getPacketLossPercentage(),
Duration.ofMillis(request.getLatencyMs())
);
String partitionId = partitionSimulator.createPartition(rule);
return ResponseEntity.ok("Partition created: " + partitionId);
}
@DeleteMapping("/partitions/{partitionId}")
public ResponseEntity<String> removePartition(@PathVariable String partitionId) {
partitionSimulator.removePartition(partitionId);
return ResponseEntity.ok("Partition removed: " + partitionId);
}
@GetMapping("/partitions")
public ResponseEntity<List<PartitionRule>> getActivePartitions() {
return ResponseEntity.ok(partitionSimulator.getActivePartitions());
}
}
@Data
class PartitionRequest {
private String sourceService;
private String targetService;
private NetworkPartitionSimulator.PartitionType type;
private int durationSeconds;
private double packetLossPercentage;
private long latencyMs;
}
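
`Random.nextInt(bound)` rejects a non-positive bound, so a duration drawn as `min + nextInt(max - min)` fails when the two limits are equal, and a misconfigured range (maximum below minimum) fails with an unhelpful message. A small sketch of an inclusive, validated range follows; `DurationRange` is a hypothetical helper.

```java
import java.time.Duration;
import java.util.Random;

final class DurationRange {
    private DurationRange() {}

    /** Picks a whole number of seconds uniformly from [minSeconds, maxSeconds]. */
    static Duration randomSeconds(Random random, int minSeconds, int maxSeconds) {
        if (minSeconds < 0 || maxSeconds < minSeconds) {
            throw new IllegalArgumentException("expected 0 <= minSeconds <= maxSeconds");
        }
        // +1 makes the upper bound inclusive and keeps the bound positive when min == max
        int span = maxSeconds - minSeconds + 1;
        return Duration.ofSeconds(minSeconds + (long) random.nextInt(span));
    }
}
```

With the defaults from `ChaosConfig` (30 and 300 seconds) this yields durations between 30 and 300 seconds inclusive.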

6. Monitoring and Observability

Example 6: Partition-Aware Monitoring

@Component
@Slf4j
public class PartitionMonitoringService {
private final MeterRegistry meterRegistry;
private final NetworkPartitionSimulator partitionSimulator;
private final Map<String, Timer> serviceCallTimers = new ConcurrentHashMap<>();
private final Counter partitionEventsCounter;
private final Counter failedCallsCounter;
public PartitionMonitoringService(MeterRegistry meterRegistry,
NetworkPartitionSimulator partitionSimulator) {
this.meterRegistry = meterRegistry;
this.partitionSimulator = partitionSimulator;
this.partitionEventsCounter = Counter.builder("network.partition.events")
.description("Number of network partition events")
.register(meterRegistry);
this.failedCallsCounter = Counter.builder("service.calls.failed")
.description("Number of failed service calls")
.register(meterRegistry);
}
public void recordServiceCall(String fromService, String toService, 
Duration duration, boolean success) {
String timerName = "service.call.duration";
Timer timer = serviceCallTimers.computeIfAbsent(fromService + "." + toService, 
key -> Timer.builder(timerName)
.tag("from", fromService)
.tag("to", toService)
.register(meterRegistry));
timer.record(duration);
if (!success) {
failedCallsCounter.increment();
}
// Attribute the failure to a partition only if the call failed while one was active
if (!success && partitionSimulator.shouldBlockCommunication(fromService, toService)) {
partitionEventsCounter.increment();
log.warn("Service call {} -> {} failed due to network partition", 
fromService, toService);
}
}
public void recordPartitionEvent(String source, String target, 
NetworkPartitionSimulator.PartitionType type) {
Counter.builder("network.partition.created")
.tag("source", source)
.tag("target", target)
.tag("type", type.name())
.register(meterRegistry)
.increment();
}
public Map<String, Object> getPartitionMetrics() {
Map<String, Object> metrics = new HashMap<>();
metrics.put("activePartitions", partitionSimulator.getActivePartitions().size());
metrics.put("totalPartitionEvents", partitionEventsCounter.count());
metrics.put("failedServiceCalls", failedCallsCounter.count());
// Add service-specific metrics
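serviceCallTimers.forEach((servicePair, timer) -> {
// Micrometer returns a HistogramSnapshot; values are reported in milliseconds here
HistogramSnapshot snapshot = timer.takeSnapshot();
metrics.put(servicePair + ".mean", snapshot.mean(TimeUnit.MILLISECONDS));
metrics.put(servicePair + ".max", snapshot.max(TimeUnit.MILLISECONDS));
});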
return metrics;
}
}
// Health Check with Partition Awareness
@Component
@Slf4j
public class PartitionAwareHealthIndicator implements HealthIndicator {
private final NetworkPartitionSimulator partitionSimulator;
private final List<ServiceHealthChecker> serviceHealthCheckers;
public PartitionAwareHealthIndicator(NetworkPartitionSimulator partitionSimulator,
List<ServiceHealthChecker> serviceHealthCheckers) {
this.partitionSimulator = partitionSimulator;
this.serviceHealthCheckers = serviceHealthCheckers;
}
@Override
public Health health() {
Map<String, Object> details = new HashMap<>();
boolean healthy = true;
// Check active partitions
List<NetworkPartitionSimulator.PartitionRule> activePartitions = 
partitionSimulator.getActivePartitions();
details.put("activePartitions", activePartitions.size());
if (!activePartitions.isEmpty()) {
healthy = false;
details.put("partitionDetails", activePartitions.stream()
.map(rule -> rule.getSourceComponent() + " <-> " + rule.getTargetComponent())
.collect(Collectors.toList()));
}
// Check service health considering partitions
for (ServiceHealthChecker checker : serviceHealthCheckers) {
String serviceName = checker.getServiceName();
boolean serviceHealthy = checker.isHealthy();
// If there's a partition to this service, mark it as degraded
if (partitionSimulator.shouldBlockCommunication("current-service", serviceName)) {
details.put(serviceName + ".status", "DEGRADED");
details.put(serviceName + ".reason", "Network partition");
healthy = false;
} else if (!serviceHealthy) {
details.put(serviceName + ".status", "DOWN");
healthy = false;
} else {
details.put(serviceName + ".status", "UP");
}
}
if (healthy) {
return Health.up().withDetails(details).build();
} else {
return Health.down().withDetails(details).build();
}
}
}

Best Practices for Network Partition Testing

1. Test Scenarios to Cover

public class NetworkPartitionTestScenarios {
/**
* Test scenarios for comprehensive partition testing
*/
public static final List<PartitionScenario> SCENARIOS = List.of(
// Basic partition scenarios
new PartitionScenario("complete-partition", 
"Complete network isolation between services",
PartitionType.COMPLETE, Duration.ofMinutes(5)),
// High latency scenarios  
new PartitionScenario("high-latency",
"High network latency (5-10 seconds)",
PartitionType.LATENCY, Duration.ofMinutes(3)),
// Packet loss scenarios
new PartitionScenario("packet-loss",
"Random packet loss (10-30%)",
PartitionType.PACKET_LOSS, Duration.ofMinutes(2)),
// Unreliable network scenarios
new PartitionScenario("unreliable-network",
"Combination of latency and packet loss",
PartitionType.UNRELIABLE, Duration.ofMinutes(4)),
// Cascading failure scenarios
new PartitionScenario("cascading-failure",
"Partition that triggers cascading failures",
PartitionType.COMPLETE, Duration.ofMinutes(10))
);
@Data
@AllArgsConstructor
public static class PartitionScenario {
private String name;
private String description;
private NetworkPartitionSimulator.PartitionType type;
private Duration duration;
}
}

2. Automated Partition Testing Pipeline

@Component
@Slf4j
public class AutomatedPartitionTesting {
private final NetworkPartitionSimulator partitionSimulator;
private final TestScenarioExecutor testExecutor;
private final ResultsCollector resultsCollector;
public AutomatedPartitionTesting(NetworkPartitionSimulator partitionSimulator,
TestScenarioExecutor testExecutor,
ResultsCollector resultsCollector) {
this.partitionSimulator = partitionSimulator;
this.testExecutor = testExecutor;
this.resultsCollector = resultsCollector;
}
public void runPartitionTestSuite() {
log.info("Starting automated partition test suite");
for (PartitionTestScenario scenario : getTestScenarios()) {
runPartitionTest(scenario);
}
log.info("Partition test suite completed");
generateTestReport();
}
private void runPartitionTest(PartitionTestScenario scenario) {
log.info("Running partition test: {}", scenario.getName());
String partitionId = null;
try {
// Setup partition
partitionId = partitionSimulator.createPartition(scenario.getPartitionRule());
// Run tests during partition
TestResults results = testExecutor.executeTests(scenario.getTests());
// Remove partition
partitionSimulator.removePartition(partitionId);
// Verify recovery
TestResults recoveryResults = testExecutor.executeTests(scenario.getRecoveryTests());
// Collect results
resultsCollector.recordScenarioResults(scenario, results, recoveryResults);
} catch (Exception e) {
log.error("Partition test failed: {}", scenario.getName(), e);
resultsCollector.recordTestFailure(scenario, e);
} finally {
// Never leave a partition behind after a failed run; removing twice is harmless
if (partitionId != null) {
partitionSimulator.removePartition(partitionId);
}
}
}
private List<PartitionTestScenario> getTestScenarios() {
// Define comprehensive test scenarios
return List.of(
new PartitionTestScenario("database-partition",
"Partition between application and database",
new NetworkPartitionSimulator.PartitionRule("app", "database", 
PartitionType.COMPLETE, Duration.ofMinutes(2)),
List.of("testDatabaseOperations", "testTransactionHandling"),
List.of("testRecovery", "testDataConsistency")),
new PartitionTestScenario("service-mesh-partition",
"Partition between microservices",
new NetworkPartitionSimulator.PartitionRule("order-service", "payment-service",
PartitionType.UNRELIABLE, Duration.ofMinutes(3), 20.0, Duration.ofSeconds(5)),
List.of("testOrderProcessing", "testCircuitBreaker"),
List.of("testServiceRecovery", "testMessageReconciliation"))
);
}
private void generateTestReport() {
TestReport report = resultsCollector.generateReport();
log.info("Partition Test Report:\n{}", report);
// Alert if critical tests failed
if (report.getFailureCount() > 0) {
log.error("CRITICAL: {} partition tests failed", report.getFailureCount());
}
}
}

Conclusion

Network partition testing is essential for building resilient distributed systems.

Key Testing Strategies:

  • Toxiproxy: For realistic network failure simulation
  • Custom Simulators: For fine-grained control over partition behavior
  • Chaos Engineering: For automated, random partition testing
  • Monitoring Integration: For observability during partitions

Critical Test Scenarios:

  1. Complete Isolation: Verify graceful degradation
  2. High Latency: Test timeout and retry mechanisms
  3. Packet Loss: Validate message reliability
  4. Cascading Failures: Ensure failure containment
  5. Recovery Testing: Verify system restoration

Best Practices:

  • Start with controlled environments before production
  • Use feature flags to enable/disable partition testing
  • Monitor system behavior comprehensively during tests
  • Test recovery mechanisms thoroughly
  • Document expected behaviors and failure modes

Network partition testing helps build systems that can withstand real-world network failures, ensuring business continuity and customer satisfaction even under adverse conditions.
