End-to-End Visibility: Mastering Distributed Tracing with Spring Cloud Sleuth and Zipkin

In modern microservices architectures, a single user request can traverse dozens of services, making debugging and performance monitoring incredibly challenging. Distributed tracing provides the solution by giving you complete visibility into the flow of requests across service boundaries. This article explores how to implement distributed tracing in Java using Spring Cloud Sleuth for trace instrumentation and Zipkin for visualization and analysis.


The Challenge: Debugging in Microservices

Consider a simple e-commerce request flow:

User Request → API Gateway → Auth Service → Product Service → Order Service → Payment Service → Notification Service

Without distributed tracing:

  • No correlation between logs across services
  • Impossible to track the entire request journey
  • Difficult to identify performance bottlenecks
  • Debugging requires manual log correlation across multiple systems

With distributed tracing:

  • Complete visibility of the request flow
  • Correlation IDs that span all services
  • Performance metrics for each service call
  • Visual representation of service dependencies

Core Concepts: Trace, Span, and Annotations

Trace: A complete request journey from start to end

  • Contains a unique Trace ID
  • Represents the entire workflow

Span: A single operation within a trace

  • Contains a unique Span ID
  • Has a parent span (except the root span)
  • Represents work done by a single service

Annotations: Key events within a span

  • cs (Client Sent): When a request is initiated
  • sr (Server Received): When a request is received
  • ss (Server Sent): When a response is sent
  • cr (Client Received): When a response is received

Visual Representation:

Trace: 5aaf3e (entire request)
├── Span A: API Gateway (root)
│   ├── cs: Gateway starts request
│   └── cr: Gateway receives response
├── Span B: Auth Service (child of A)
│   ├── sr: Auth receives request  
│   └── ss: Auth sends response
└── Span C: Product Service (child of A)
    ├── sr: Product receives request
    └── ss: Product sends response
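The four annotations make basic latency math possible: the gap between cs and cr is what the caller experienced, while ss minus sr is the time actually spent inside the service. A minimal sketch, with made-up timestamps in milliseconds:

```java
// Deriving latency metrics from the four annotation timestamps.
// The timestamp values below are illustrative, not from a real trace.
public class SpanTimings {
    public static void main(String[] args) {
        long cs = 0;   // Client Sent
        long sr = 12;  // Server Received
        long ss = 57;  // Server Sent
        long cr = 70;  // Client Received

        long totalLatency = cr - cs;                  // what the caller observed
        long serverTime = ss - sr;                    // time spent inside the service
        long networkTime = totalLatency - serverTime; // round-trip network overhead

        System.out.println("total=" + totalLatency + "ms server=" + serverTime
                + "ms network=" + networkTime + "ms");
    }
}
```

This is essentially the arithmetic Zipkin performs when it renders a trace timeline.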

Spring Cloud Sleuth: Automatic Instrumentation

Spring Cloud Sleuth automatically instruments your Spring applications to add trace and span IDs to logs and propagate them across service boundaries.

Key Features:

  • Automatic context propagation via HTTP headers
  • Integration with Spring ecosystem (Web, Feign, Messaging, etc.)
  • Log correlation with trace/span IDs
  • Multiple sampler strategies for controlling trace volume
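Context propagation works by attaching B3 headers to every outgoing request: the child call keeps the trace ID, gets a fresh span ID, and records the caller's span ID as its parent. A sketch of that header shape (the IDs are hardcoded purely for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of what Sleuth's B3 propagation puts on an outgoing request.
// All IDs below are invented example values.
public class B3PropagationSketch {
    static Map<String, String> childHeaders(String traceId, String parentSpanId, String newSpanId) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("X-B3-TraceId", traceId);      // unchanged across the whole trace
        headers.put("X-B3-SpanId", newSpanId);     // fresh ID for this hop
        headers.put("X-B3-ParentSpanId", parentSpanId);
        headers.put("X-B3-Sampled", "1");          // propagate the sampling decision
        return headers;
    }

    public static void main(String[] args) {
        Map<String, String> h =
                childHeaders("5aaf3e2d8c4b1a90", "a1b2c3d4e5f60718", "0f1e2d3c4b5a6978");
        System.out.println(h.get("X-B3-TraceId"));
        System.out.println(h.get("X-B3-ParentSpanId"));
    }
}
```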

Project Setup and Dependencies

Maven Dependencies:

<!-- Spring Boot Starter -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
</dependency>

<!-- Spring Cloud Sleuth -->
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-sleuth</artifactId>
    <version>3.1.9</version>
</dependency>

<!-- Zipkin Integration (optional) -->
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-sleuth-zipkin</artifactId>
    <version>3.1.9</version>
</dependency>

<!-- For WebClient support -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-webflux</artifactId>
</dependency>

<!-- For Feign Client support -->
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-openfeign</artifactId>
</dependency>

Gradle Dependencies:

dependencies {
    implementation 'org.springframework.boot:spring-boot-starter-web'
    implementation 'org.springframework.cloud:spring-cloud-starter-sleuth:3.1.9'
    implementation 'org.springframework.cloud:spring-cloud-sleuth-zipkin:3.1.9'
    implementation 'org.springframework.boot:spring-boot-starter-webflux'
}

Basic Sleuth Configuration

application.yml:

spring:
  application:
    name: product-service    # Service name appears in traces
  sleuth:
    sampler:
      probability: 1.0       # Sample 100% of requests (1.0 = 100%)
    # Optional: per-integration toggles
    web:
      enabled: true
    feign:
      enabled: true
    messaging:
      enabled: true
  # Zipkin configuration (if using Zipkin)
  zipkin:
    base-url: http://localhost:9411
    sender:
      type: web              # Options: web, rabbit, kafka

logging:
  pattern:
    level: "%5p [${spring.application.name:},%X{traceId:-},%X{spanId:-}]"
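With this pattern, every log line carries the log level, service name, trace ID, and span ID, which is what makes cross-service grep-based correlation work. A small sketch of what a correlated line looks like (the IDs are invented):

```java
// Sketch of the log line produced by the Sleuth logging pattern above.
// Service name, trace ID, and span ID values are illustrative.
public class LogPatternDemo {
    public static void main(String[] args) {
        String level = "INFO";
        String app = "product-service";
        String traceId = "5aaf3e2d8c4b1a90";
        String spanId = "0f1e2d3c4b5a6978";
        String message = "Fetching product with ID: 42";
        // %5s mirrors the %5p level padding in the Logback pattern
        System.out.println(String.format("%5s [%s,%s,%s] %s",
                level, app, traceId, spanId, message));
    }
}
```

Every service in the trace logs the same trace ID, so a single grep for `5aaf3e2d8c4b1a90` across log aggregators pulls up the whole request journey.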

Sampler Configuration Options:

@Configuration
public class SleuthConfig {

    // Note: register only ONE Sampler bean at a time; these are alternatives.

    // Sample all requests (for development)
    @Bean
    public Sampler alwaysSampler() {
        return Sampler.ALWAYS_SAMPLE;
    }

    // Sample based on probability (for production)
    @Bean
    public Sampler probabilitySampler() {
        return Sampler.create(0.5f); // 50% sampling
    }

    // Custom sampler (Brave's Sampler decides per trace ID)
    @Bean
    public Sampler customSampler() {
        return new Sampler() {
            @Override
            public boolean isSampled(long traceId) {
                // Custom sampling logic
                return Math.random() < 0.1; // roughly 10% sampling
            }
        };
    }
}
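To see why the sampling rate matters for trace volume, here is a deliberately simplified, deterministic sketch of a 10% rate-limited sampler. Brave's real probability sampler uses a different algorithm; this only illustrates the effect on volume:

```java
// Simplified sketch of rate-limited sampling: record one request in every ten.
// Not Brave's actual algorithm - purely to illustrate the volume reduction.
public class SamplingSketch {
    static boolean isSampled(long requestCount) {
        return requestCount % 10 == 0; // keep 10% of requests
    }

    public static void main(String[] args) {
        int sampled = 0;
        for (long i = 0; i < 1000; i++) {
            if (isSampled(i)) sampled++;
        }
        System.out.println("sampled=" + sampled + "/1000");
    }
}
```

At a 0.1 probability, only about one trace in ten is exported to Zipkin, which keeps tracing overhead and storage costs manageable under production load.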

Service Implementation Examples

Product Service:

@SpringBootApplication
@RestController
@Slf4j
public class ProductServiceApplication {

    public static void main(String[] args) {
        SpringApplication.run(ProductServiceApplication.class, args);
    }

    @GetMapping("/products/{id}")
    public Product getProduct(@PathVariable String id) {
        log.info("Fetching product with ID: {}", id);
        // Simulate business logic
        if ("999".equals(id)) {
            log.warn("Product {} not found", id);
            throw new ProductNotFoundException("Product not found: " + id);
        }
        // Simulate database call
        Product product = findProductInDatabase(id);
        log.debug("Found product: {}", product.getName());
        return product;
    }

    private Product findProductInDatabase(String id) {
        // Simulate database latency
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new Product(id, "Sample Product", 29.99);
    }
}

class Product {
    private String id;
    private String name;
    private double price;

    public Product(String id, String name, double price) {
        this.id = id;
        this.name = name;
        this.price = price;
    }
    // Getters and setters...
}

Order Service (with Feign Client):

@SpringBootApplication
@RestController
@EnableFeignClients
@Slf4j
public class OrderServiceApplication {

    @Autowired
    private ProductServiceClient productServiceClient;

    @Autowired
    private WebClient.Builder webClientBuilder;

    public static void main(String[] args) {
        SpringApplication.run(OrderServiceApplication.class, args);
    }

    @PostMapping("/orders")
    public Order createOrder(@RequestBody OrderRequest request) {
        log.info("Creating order for user: {}", request.getUserId());

        // Method 1: Feign client (Sleuth propagates trace headers automatically)
        Product product = productServiceClient.getProduct(request.getProductId());
        log.debug("Retrieved product: {}", product.getName());

        // Method 2: WebClient (also instrumented automatically, as long as the
        // WebClient.Builder is injected as a Spring bean rather than created manually)
        Product product2 = webClientBuilder.build()
                .get()
                .uri("http://localhost:8081/products/{id}", request.getProductId())
                .retrieve()
                .bodyToMono(Product.class)
                .block();

        // Create order logic
        Order order = new Order(UUID.randomUUID().toString(), request.getUserId(), product);
        log.info("Order created: {}", order.getId());
        return order;
    }
}

// Feign client interface
@FeignClient(name = "product-service", url = "http://localhost:8081")
interface ProductServiceClient {
    @GetMapping("/products/{id}")
    Product getProduct(@PathVariable String id);
}

class OrderRequest {
    private String userId;
    private String productId;
    // Getters and setters...
}

class Order {
    private String id;
    private String userId;
    private Product product;
    private LocalDateTime createdAt;

    public Order(String id, String userId, Product product) {
        this.id = id;
        this.userId = userId;
        this.product = product;
        this.createdAt = LocalDateTime.now();
    }
    // Getters and setters...
}

Manual Tracing with @NewSpan and @ContinueSpan

For custom business operations, you can manually create spans:

@Service
@Slf4j
public class OrderProcessingService {

    private final Tracer tracer;

    public OrderProcessingService(Tracer tracer) {
        this.tracer = tracer;
    }

    // Automatically creates a new span
    @NewSpan("process-payment")
    public PaymentResult processPayment(Order order) {
        log.info("Processing payment for order: {}", order.getId());
        // Custom business logic
        boolean success = chargeCreditCard(order);

        // Add custom tags to the current span (tag values must be strings)
        tracer.currentSpan().tag("order.amount", String.valueOf(order.getTotalAmount()));
        tracer.currentSpan().tag("payment.success", String.valueOf(success));

        if (!success) {
            tracer.currentSpan().tag("error", "true");
            log.error("Payment failed for order: {}", order.getId());
        }
        return new PaymentResult(success, success ? "Payment processed" : "Payment failed");
    }

    // Continues the existing span, logging an event with the given name
    @ContinueSpan(log = "validate-inventory")
    public boolean validateInventory(Order order) {
        log.debug("Validating inventory for product: {}", order.getProductId());
        // Manual span creation for specific operations
        Span inventorySpan = tracer.nextSpan().name("check-stock-level").start();
        try (Tracer.SpanInScope ws = tracer.withSpan(inventorySpan)) {
            // Complex inventory checking logic
            int stockLevel = checkStockLevel(order.getProductId());
            inventorySpan.tag("stock.level", String.valueOf(stockLevel));
            return stockLevel >= order.getQuantity();
        } finally {
            inventorySpan.end();
        }
    }

    // Manual tracing without annotations
    public void shipOrder(Order order) {
        Span shippingSpan = tracer.nextSpan().name("ship-order").start();
        try (Tracer.SpanInScope ws = tracer.withSpan(shippingSpan)) {
            log.info("Shipping order: {}", order.getId());
            // Tags stay local to this span; use baggage (BaggageField) instead
            // for custom context that must propagate to downstream services
            shippingSpan.tag("shipping.carrier", "UPS");
            shippingSpan.tag("order.weight", "2.5kg");
            // Simulate shipping logic
            arrangeShipping(order);
        } catch (Exception e) {
            shippingSpan.error(e); // Mark span as error
            log.error("Shipping failed for order: {}", order.getId(), e);
            throw e;
        } finally {
            shippingSpan.end();
        }
    }

    private boolean chargeCreditCard(Order order) {
        // Simulate payment processing
        return Math.random() > 0.1; // 90% success rate
    }

    private int checkStockLevel(String productId) {
        // Simulate stock check
        return (int) (Math.random() * 100);
    }

    private void arrangeShipping(Order order) {
        // Simulate shipping arrangement
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}

Setting Up Zipkin for Visualization

Running Zipkin with Docker:

# Quick start with Docker
docker run -d -p 9411:9411 --name zipkin openzipkin/zipkin

# With persistence (MySQL)
docker run -d -p 9411:9411 \
  -e STORAGE_TYPE=mysql \
  -e MYSQL_HOST=localhost \
  -e MYSQL_TCP_PORT=3306 \
  -e MYSQL_DB=zipkin \
  -e MYSQL_USER=zipkin \
  -e MYSQL_PASS=zipkin \
  --name zipkin openzipkin/zipkin

# With RabbitMQ integration
docker run -d -p 9411:9411 \
  -e RABBIT_URI=amqp://localhost \
  --name zipkin openzipkin/zipkin

Alternative: Using Java

# Download and run
curl -sSL https://zipkin.io/quickstart.sh | bash -s
java -jar zipkin.jar

Zipkin Configuration:

spring:
  zipkin:
    base-url: http://localhost:9411
    sender:
      type: web              # Options: web, rabbit, kafka
    # Optional: override the service name reported to Zipkin
    service:
      name: product-service
    # Connection settings
    connect-timeout: 5s
    read-timeout: 10s
  # Alternative: use a message broker (better for production)
  rabbitmq:
    host: localhost
    port: 5672
    username: guest
    password: guest

---
# For Kafka transport
spring:
  zipkin:
    sender:
      type: kafka
  kafka:
    bootstrap-servers: localhost:9092

Analyzing Traces in Zipkin

Key Zipkin Features:

  1. Service Dependency Diagram: Visual map of service interactions
  2. Trace Timeline: Detailed view of request flow with timing
  3. Span Details: Deep dive into individual operations
  4. Error Identification: Quickly spot failed requests
  5. Performance Analysis: Identify slow services and bottlenecks

Example Trace Analysis:

Order Creation Request (Trace ID: 7d3e4a)
├── API Gateway (15ms)
├── Auth Service (8ms) 
├── Product Service (45ms) ← BOTTLENECK!
│   └── Database Query (40ms)
├── Inventory Service (12ms)
└── Payment Service (25ms)

Custom Zipkin Queries:

  • serviceName:product-service and duration>100ms - Slow product service calls
  • http.path:/products/999 and error - Failed product requests
  • spanName:process-payment - Custom business spans

Advanced Configuration and Best Practices

Production Configuration:

spring:
  sleuth:
    sampler:
      probability: 0.1          # Sample 10% in production
    # Baggage configuration
    baggage:
      remote-fields: version,user-id,country-code
      correlation:
        enabled: true
        fields: user-id,country-code
    # Propagation options
    propagation:
      type: B3                  # Options: B3, W3C (Trace Context)
  zipkin:
    base-url: ${ZIPKIN_URL:http://zipkin-prod:9411}
    sender:
      type: kafka               # Use a message broker in production

logging:
  level:
    org.springframework.cloud.sleuth: INFO
    brave: WARN

Custom Trace Configuration:

@Configuration
@Slf4j
public class TracingConfiguration {

    @Bean
    public CurrentTraceContext currentTraceContext() {
        // MDCScopeDecorator copies trace/span IDs into the logging MDC
        return ThreadLocalCurrentTraceContext.newBuilder()
                .addScopeDecorator(MDCScopeDecorator.get())
                .build();
    }

    @Bean
    public Tracing tracing(Sampler sampler) {
        return Tracing.newBuilder()
                .localServiceName("order-service")
                .sampler(sampler)
                .currentTraceContext(currentTraceContext())
                .addSpanHandler(new LoggingSpanHandler())
                .build();
    }

    // Custom span handler for additional processing
    @Bean
    public SpanHandler customSpanHandler() {
        return new SpanHandler() {
            @Override
            public boolean end(TraceContext context, MutableSpan span, Cause cause) {
                // Add custom logic before the span is exported
                if (span.error() != null) {
                    log.warn("Span {} ended with error: {}", span.name(), span.error());
                }
                // Add custom tags to all spans (guard against a missing env variable)
                String env = System.getenv("DEPLOYMENT_ENV");
                span.tag("deployment.env", env != null ? env : "unknown");
                span.tag("application.version", "1.0.0");
                return true; // Keep processing this span
            }
        };
    }
}

// Custom span handler for logging
class LoggingSpanHandler extends SpanHandler {
    private static final Logger log = LoggerFactory.getLogger(LoggingSpanHandler.class);

    @Override
    public boolean end(TraceContext context, MutableSpan span, Cause cause) {
        // Brave records timestamps in microseconds, so convert to milliseconds
        log.debug("Span completed: {} (Duration: {}ms)",
                span.name(), (span.finishTimestamp() - span.startTimestamp()) / 1000);
        return true;
    }
}

Testing Distributed Tracing

Unit Test Example:

@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
@AutoConfigureTestDatabase
@Slf4j
class OrderServiceTracingTest {

    @Autowired
    private TestRestTemplate restTemplate;

    @Autowired
    private Tracer tracer;

    @Test
    void whenCreateOrder_thenTraceIdPropagated() {
        // Given
        OrderRequest request = new OrderRequest("user-123", "product-456");

        // When
        ResponseEntity<Order> response = restTemplate.postForEntity(
                "/orders", request, Order.class);

        // Then - verify the trace headers (note: Sleuth does not add these to
        // responses by default; the service must be configured to echo them)
        assertThat(response.getHeaders())
                .containsKey("X-B3-TraceId")
                .containsKey("X-B3-SpanId");
    }

    @Test
    void whenManualSpanCreated_thenAppearsInTracer() {
        // Given
        Span span = tracer.nextSpan().name("test-operation").start();

        // When
        try (Tracer.SpanInScope ws = tracer.withSpan(span)) {
            // Perform operation
            log.info("Testing manual span");

            // Then
            assertThat(tracer.currentSpan()).isNotNull();
            assertThat(tracer.currentSpan().context().spanId())
                    .isEqualTo(span.context().spanId());
        } finally {
            span.end();
        }
    }
}

Integration Test:

@Testcontainers
@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
class DistributedTracingIntegrationTest {

    @Container
    static GenericContainer<?> zipkin = new GenericContainer<>("openzipkin/zipkin")
            .withExposedPorts(9411);

    @Autowired
    private TestRestTemplate restTemplate;

    // Point Sleuth's Zipkin reporter at the container
    @DynamicPropertySource
    static void zipkinProperties(DynamicPropertyRegistry registry) {
        registry.add("spring.zipkin.base-url",
                () -> "http://" + zipkin.getHost() + ":" + zipkin.getFirstMappedPort());
    }

    @Test
    void whenRequestFlowsThroughServices_thenCompleteTraceInZipkin() throws Exception {
        // Given
        String traceId = performDistributedRequest();

        // When - wait for the trace to be reported to Zipkin
        Thread.sleep(2000);

        // Then - verify the trace exists in Zipkin
        String zipkinUrl = String.format("http://%s:%d/api/v2/trace/%s",
                zipkin.getHost(), zipkin.getFirstMappedPort(), traceId);
        ResponseEntity<String> response = restTemplate.getForEntity(zipkinUrl, String.class);
        assertThat(response.getStatusCode()).isEqualTo(HttpStatus.OK);
        assertThat(response.getBody()).contains(traceId);
    }

    private String performDistributedRequest() {
        // Implementation that triggers a multi-service call
        return "mock-trace-id";
    }
}

Troubleshooting Common Issues

Problem: Traces not appearing in Zipkin

# Solution: Check configuration
spring:
  sleuth:
    enabled: true
    sampler:
      probability: 1.0               # Ensure sampling is enabled
  zipkin:
    base-url: http://localhost:9411  # Correct Zipkin URL
    enabled: true

Problem: Headers not propagating

// Solution: manually propagate headers for clients Sleuth does not instrument
@Bean
public WebClient webClient(Tracer tracer) {
    return WebClient.builder()
            .filter((request, next) -> {
                Span currentSpan = tracer.currentSpan();
                if (currentSpan == null) {
                    return next.exchange(request);
                }
                // ClientRequest is immutable, so copy it and add the headers
                ClientRequest traced = ClientRequest.from(request)
                        .header("X-B3-TraceId", currentSpan.context().traceId())
                        .header("X-B3-SpanId", currentSpan.context().spanId())
                        .build();
                return next.exchange(traced);
            })
            .build();
}

Best Practices for Production

  1. Sampling Strategy: Use probability sampling (0.01-0.1) in production
  2. Storage Backend: Use scalable storage (Elasticsearch, Cassandra) for Zipkin
  3. Transport: Prefer message brokers (Kafka, RabbitMQ) over direct HTTP
  4. Error Handling: Implement fallback mechanisms for tracing failures
  5. Security: Secure Zipkin endpoint and consider data sensitivity
  6. Monitoring: Monitor tracing system health and performance

# Production-ready configuration
spring:
  sleuth:
    sampler:
      probability: 0.05     # 5% sampling rate
  zipkin:
    sender:
      type: kafka
    kafka:
      topic: zipkin
  kafka:
    bootstrap-servers: kafka-cluster:9092

management:
  endpoints:
    web:
      exposure:
        include: health,metrics,prometheus

Conclusion

Spring Cloud Sleuth and Zipkin provide a powerful combination for distributed tracing in Java microservices:

Key Benefits:

  • End-to-End Visibility: Track requests across service boundaries
  • Performance Insights: Identify bottlenecks and optimize performance
  • Debugging Efficiency: Quickly pinpoint failures in complex workflows
  • Dependency Analysis: Understand service relationships and impacts

Implementation Strategy:

  1. Start with automatic Sleuth instrumentation
  2. Add Zipkin for visualization and analysis
  3. Implement manual spans for critical business operations
  4. Configure appropriate sampling for production
  5. Monitor and optimize based on actual usage patterns

By implementing distributed tracing, you transform from blind debugging to systematic observability, making your microservices architecture truly manageable and maintainable at scale.
