In production Java environments, logging is essential for debugging, monitoring, and auditing. However, high-volume applications can generate massive amounts of log data, leading to significant costs, performance overhead, and "noise" that obscures critical issues. Log sampling—the practice of capturing only a representative subset of log entries—is a crucial strategy for maintaining observability while keeping performance overhead and operational costs under control.
What is Log Sampling?
Log sampling is a technique where instead of logging every single event, the application selectively records a percentage of log entries or applies specific criteria to determine which logs to keep. This reduces the volume of log data while still providing meaningful insights into application behavior, especially for high-frequency events.
Why Log Sampling is Essential for Production Java Applications
- Performance Overhead Reduction: Every log statement involves CPU cycles for string formatting, I/O operations, and network traffic. In high-throughput systems, this can directly impact latency and throughput.
- Cost Management: Log storage, processing, and analysis services charge based on volume. Sampling can reduce these costs by 90% or more while retaining diagnostic capability.
- Signal vs. Noise Ratio: When every request is logged, truly important errors and anomalies can be buried in repetitive, normal traffic. Sampling helps surface significant events.
- Infrastructure Strain: Reducing log volume decreases pressure on logging pipelines, log shippers, and central log aggregators, preventing backpressure and data loss.
Common Log Sampling Strategies for Java
1. Head-Based Sampling (Deterministic)
A consistent sampling decision is made at the start of a request/trace and applied throughout its lifecycle.
- Fixed Rate Sampling: Log a fixed percentage of requests (e.g., 1% or 10%).
- Implementation: Typically configured in the logging framework or application performance monitoring (APM) agent.
Example using Logback with a marker-based turbo filter and MDC (Mapped Diagnostic Context) for request context:
<!-- logback.xml -->
<configuration>
    <!-- Accept events carrying the SAMPLED marker; others follow normal level filtering -->
    <turboFilter class="ch.qos.logback.classic.turbo.MarkerFilter">
        <Name>SAMPLING_FILTER</Name>
        <Marker>SAMPLED</Marker>
        <OnMatch>ACCEPT</OnMatch>
    </turboFilter>
    <appender name="FILE" class="ch.qos.logback.core.FileAppender">
        <file>application.log</file>
        <encoder>
            <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
        </encoder>
    </appender>
    <root level="INFO">
        <appender-ref ref="FILE" />
    </root>
</configuration>
// Java service with sampling logic
import java.util.concurrent.ThreadLocalRandom;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;
import org.slf4j.Marker;
import org.slf4j.MarkerFactory;

@Service
public class PaymentService {

    private static final Logger logger = LoggerFactory.getLogger(PaymentService.class);
    // The SAMPLED marker matches the turbo filter in logback.xml
    private static final Marker SAMPLED = MarkerFactory.getMarker("SAMPLED");
    private static final double SAMPLING_RATE = 0.01; // 1% sampling

    public PaymentResult processPayment(PaymentRequest request) {
        boolean shouldSample = ThreadLocalRandom.current().nextDouble() < SAMPLING_RATE;
        if (shouldSample) {
            MDC.put("sampled", "true");
            logger.info(SAMPLED, "Processing payment: {}", request);
        }
        try {
            // Business logic (executePayment is an illustrative helper)
            PaymentResult result = executePayment(request);
            if (shouldSample) {
                logger.info(SAMPLED, "Payment processed successfully: {}", result);
            }
            return result;
        } catch (Exception e) {
            // Always log errors - sampling doesn't apply to exceptions
            logger.error("Payment processing failed", e);
            throw e;
        } finally {
            MDC.clear();
        }
    }
}
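Note that a random check like the one above makes an independent decision per call; for the "consistent decision applied throughout the lifecycle" promised by head-based sampling, derive the decision from a stable request or trace ID so every log statement in the same request agrees. A minimal sketch of that idea (the class name and the `requestId` correlation ID are illustrative assumptions, not part of any framework):

```java
public class DeterministicSampler {

    private final double samplingRate;

    public DeterministicSampler(double samplingRate) {
        this.samplingRate = samplingRate;
    }

    /**
     * The same requestId always yields the same decision, so a whole
     * request is either fully sampled or fully skipped.
     */
    public boolean shouldSample(String requestId) {
        // Map the ID's hash onto [0, 1) and compare with the rate.
        int h = requestId.hashCode();
        double bucket = (h & 0x7fffffff) / (double) Integer.MAX_VALUE;
        return bucket < samplingRate;
    }
}
```

In practice the `requestId` would come from an incoming header or a tracing context, so upstream and downstream services can even agree on the same decision.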
2. Tail-Based Sampling (Adaptive)
Make sampling decisions based on the outcome or characteristics of the request.
- Error-First Sampling: Always log errors and exceptions, but sample successful requests.
- Latency-Based Sampling: Log slow requests above a certain threshold.
- Implementation: Requires buffering log data during request processing and making decisions at the end.
Example using Spring AOP for Tail-Based Sampling:
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@Aspect
@Component
public class LogSamplingAspect {

    private static final Logger logger = LoggerFactory.getLogger(LogSamplingAspect.class);
    private static final double SUCCESS_SAMPLING_RATE = 0.05; // 5% for successful requests

    @Around("execution(* com.example.service.*.*(..))")
    public Object logServiceMethod(ProceedingJoinPoint joinPoint) throws Throwable {
        long startTime = System.currentTimeMillis();
        String methodName = joinPoint.getSignature().getName();
        Object[] args = joinPoint.getArgs();
        try {
            Object result = joinPoint.proceed();
            long duration = System.currentTimeMillis() - startTime;
            // Always log slow requests
            if (duration > 1000) { // 1 second threshold
                logger.warn("Slow request detected - Method: {}, Duration: {}ms, Args: {}",
                        methodName, duration, Arrays.toString(args));
            }
            // Sample successful requests
            else if (ThreadLocalRandom.current().nextDouble() < SUCCESS_SAMPLING_RATE) {
                logger.info("Request sampled - Method: {}, Duration: {}ms",
                        methodName, duration);
            }
            return result;
        } catch (Exception e) {
            // Always log errors
            logger.error("Error in method: {} with args: {}", methodName,
                    Arrays.toString(args), e);
            throw e;
        }
    }
}
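The aspect decides at the moment each call returns, but full tail-based sampling buffers a request's log entries and only decides their fate once the outcome is known, as noted above. A minimal sketch of that buffering idea (class and method names are illustrative, not part of any framework):

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Buffers a request's log messages and releases them only when the
 * request turns out to be interesting (failed or slow); otherwise
 * the buffered entries are dropped.
 */
public class RequestLogBuffer {

    private final List<String> buffer = new ArrayList<>();

    public void add(String message) {
        buffer.add(message);
    }

    /** Called once at the end of the request. */
    public List<String> decide(boolean failed, long durationMs, long slowThresholdMs) {
        if (failed || durationMs > slowThresholdMs) {
            return List.copyOf(buffer); // flush everything for this request
        }
        return List.of();               // drop the buffered entries
    }
}
```

A production version would typically hold one buffer per request (e.g. in a `ThreadLocal` or request scope) and forward the flushed entries to the real logger.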
3. Rate-Limited Sampling
Ensure that logging doesn't exceed a certain rate, regardless of traffic volume.
Example using a Rate Limiter:
import com.google.common.util.concurrent.RateLimiter; // Guava

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@Component
public class RateLimitedLogger {

    private final Logger logger = LoggerFactory.getLogger(getClass());
    private final RateLimiter rateLimiter = RateLimiter.create(10.0); // 10 logs per second

    public void debug(String message) {
        if (rateLimiter.tryAcquire()) {
            logger.debug(message);
        }
    }

    public void info(String message) {
        if (rateLimiter.tryAcquire()) {
            logger.info(message);
        }
    }
}
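Guava's `RateLimiter` smooths permits over time; if you would rather avoid the dependency, a small token bucket achieves a similar effect with an explicit burst capacity. A dependency-free sketch (names are illustrative):

```java
/**
 * Simple token bucket: allows up to `capacity` logs in a burst,
 * refilling at `refillPerSecond` tokens per second.
 */
public class TokenBucketLogGate {

    private final long capacity;
    private final double refillPerSecond;
    private double tokens;
    private long lastRefillNanos;

    public TokenBucketLogGate(long capacity, double refillPerSecond) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    /** Returns true if a log entry may be emitted now. */
    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        double elapsedSec = (now - lastRefillNanos) / 1_000_000_000.0;
        tokens = Math.min(capacity, tokens + elapsedSec * refillPerSecond);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

Unlike a fixed percentage, this gate adapts to traffic: at low volume everything is logged, while a traffic spike is capped at the configured rate.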
Best Practices for Java Log Sampling
- Never Sample Errors: Always log exceptions, errors, and warnings at full volume.
- Sample Strategically: Use higher sampling rates for development/testing environments and lower rates for production.
- Maintain Context: When sampling, ensure you capture complete request context for the sampled entries.
- Use Structured Logging: JSON-structured logs make sampled data more valuable and easier to analyze.
- Dynamic Configuration: Implement the ability to change sampling rates without redeploying the application.
Example with Structured Logging and Dynamic Sampling:
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import org.slf4j.LoggerFactory;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class StructuredLogger {

    private static final ObjectMapper mapper = new ObjectMapper();
    // volatile: updated by the scheduler thread, read by request threads
    private volatile double samplingRate = 0.01; // configurable at runtime

    public void logSampledEvent(String eventType, Map<String, Object> data) {
        if (ThreadLocalRandom.current().nextDouble() < samplingRate) {
            Map<String, Object> logEntry = Map.of(
                    "timestamp", Instant.now().toString(),
                    "eventType", eventType,
                    "data", data,
                    "sampled", true);
            try {
                System.out.println(mapper.writeValueAsString(logEntry));
            } catch (JsonProcessingException e) {
                // Fall back to traditional logging
                LoggerFactory.getLogger(getClass())
                        .info("Sampled event: {} - {}", eventType, data);
            }
        }
    }

    // Requires @EnableScheduling on a configuration class
    @Scheduled(fixedRate = 30000) // refresh every 30 seconds
    public void refreshSamplingRate() {
        // Fetch from a configuration service, feature-flag system, or environment
        this.samplingRate = Double.parseDouble(
                System.getProperty("log.sampling.rate", "0.01"));
    }
}
Integration with Modern Observability Stacks
- OpenTelemetry: Use the OpenTelemetry Java SDK for distributed tracing with built-in sampling capabilities.
- APM Tools: Tools like Datadog, New Relic, and Dynatrace offer sophisticated sampling configurations.
- Log Management Platforms: Splunk, Elasticsearch, and Loki support sampling at the ingestion or query level.
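For instance, the OpenTelemetry Java SDK ships with ratio-based, head-based trace sampling out of the box. A configuration sketch (assumes the `opentelemetry-sdk` artifact is on the classpath):

```java
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.samplers.Sampler;

public class OtelSamplingConfig {

    public static SdkTracerProvider tracerProvider() {
        // Keep roughly 10% of root traces; child spans honor the
        // parent's sampling decision so traces stay complete.
        return SdkTracerProvider.builder()
                .setSampler(Sampler.parentBased(Sampler.traceIdRatioBased(0.10)))
                .build();
    }
}
```

Because the decision rides along in the trace context, every service in a call chain samples the same traces, which is hard to achieve with purely local log sampling.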
When to Avoid Log Sampling
- Regulatory Requirements: When complete audit trails are legally mandated.
- Low-Volume Applications: When log volume doesn't impact performance or costs.
- Debugging Specific Issues: Temporarily disable sampling when investigating particular problems.
Conclusion
Log sampling is not about logging less—it's about logging smarter. For Java applications in production, implementing a thoughtful sampling strategy is essential for balancing the need for comprehensive observability with the practical constraints of performance, cost, and operational complexity. By combining head-based and tail-based sampling approaches, focusing on strategic data capture, and integrating with modern observability tools, Java teams can maintain excellent system visibility while ensuring their applications remain performant and cost-effective at scale.