CRIU (Checkpoint/Restore In Userspace) is a Linux technology that allows freezing a running application and checkpointing it to disk as a collection of files. This checkpoint can later be used to restore the application exactly where it left off. When combined with the JVM, this enables powerful state management capabilities for Java applications.
Understanding CRIU for JVM
What is CRIU?
CRIU can capture the complete state of a running process including:
- Memory pages
- CPU registers
- Open files
- Network connections
- Process relationships
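Outside the JVM, CRIU is driven from the command line. The following is a minimal sketch, written in Java to match the rest of this article, that only builds the argument lists for a dump and a restore; it does not execute them. The class name, method names, and image directory are illustrative assumptions. The options shown (-t, -D, --shell-job) are standard CRIU options, and running CRIU normally requires root privileges.

```java
import java.nio.file.Path;
import java.util.List;

public class CriuCommands {

    /** Arguments for checkpointing the process tree rooted at pid into imageDir. */
    static List<String> dump(long pid, Path imageDir) {
        return List.of("criu", "dump",
                "-t", Long.toString(pid),   // root of the process tree to freeze
                "-D", imageDir.toString(),  // directory that receives the image files
                "--shell-job");             // the process was started from a terminal
    }

    /** Arguments for restoring the process tree stored in imageDir. */
    static List<String> restore(Path imageDir) {
        return List.of("criu", "restore",
                "-D", imageDir.toString(),
                "--shell-job");
    }

    public static void main(String[] args) {
        Path images = Path.of("/tmp/criu-images"); // hypothetical location
        long self = ProcessHandle.current().pid();
        System.out.println(String.join(" ", dump(self, images)));
        System.out.println(String.join(" ", restore(images)));
    }
}
```

The lists can be handed to ProcessBuilder once the image directory exists and the caller has the required privileges.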
JVM Integration
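JVM integration comes from CRaC (Coordinated Restore at Checkpoint), an OpenJDK project that builds on CRIU. The jdk.crac API used in the examples below is not part of mainline OpenJDK; it ships in CRaC-enabled JDK builds such as Azul Zulu with CRaC, and the org.crac library provides the same API as a portable wrapper.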
Prerequisites and Setup
System Requirements
# Check if your system supports CRIU
which criu
# Output should show criu path
# Check kernel version (3.11+ required)
uname -r
# Install CRIU on Ubuntu/Debian
sudo apt-get install criu
# Install CRIU on RHEL/CentOS
sudo yum install criu
JDK Requirements
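// Requires a CRaC-enabled JDK build; stock OpenJDK builds do not ship jdk.crac
public class CheckpointDemo {
    public static void main(String[] args) {
        System.out.println("Java Version: " + System.getProperty("java.version"));
        System.out.println("CRaC API present: " + isCracApiPresent());
    }
    // The API has no support-query method, so probe for the class instead
    private static boolean isCracApiPresent() {
        try {
            Class.forName("jdk.crac.Core");
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }
}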
Basic CRIU Usage in Java
Example 1: Simple Checkpoint/Restore
import jdk.crac.*;
import java.io.*;
import java.util.concurrent.atomic.AtomicInteger;
public class SimpleCheckpointDemo {
private static final AtomicInteger counter = new AtomicInteger(0);
// Strong reference: the global context holds registered resources only weakly
private static final Resource RESOURCE = new SimpleResource();
public static void main(String[] args) throws Exception {
System.out.println("Application started");
// Register resource for checkpoint/restore
Context<Resource> context = Core.getGlobalContext();
context.register(RESOURCE);
// Simulate work
for (int i = 0; i < 5; i++) {
int current = counter.incrementAndGet();
System.out.println("Counter: " + current);
Thread.sleep(1000);
// Checkpoint on 3rd iteration
if (current == 3) {
System.out.println("=== Creating checkpoint ===");
Core.checkpointRestore();
System.out.println("=== Restored from checkpoint ===");
}
}
System.out.println("Application completed. Final counter: " + counter.get());
}
static class SimpleResource implements Resource {
@Override
public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
System.out.println("Before checkpoint: Saving state...");
// Close network connections, files, etc.
}
@Override
public void afterRestore(Context<? extends Resource> context) throws Exception {
System.out.println("After restore: Reinitializing resources...");
// Reopen connections, restore state
}
}
}
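Running an example like the one above takes two launches: one that writes the checkpoint image and one that restores from it. The sketch below assumes a CRaC-enabled JDK, where the -XX:CRaCCheckpointTo and -XX:CRaCRestoreFrom options select the image directory. The launcher class, its method names, and the "cr" directory are hypothetical; the code only decides which command line applies and does not start a JVM.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

public class CracLauncher {

    /** True if dir exists and contains at least one entry. */
    static boolean hasImage(Path dir) throws IOException {
        if (!Files.isDirectory(dir)) {
            return false;
        }
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.findAny().isPresent();
        }
    }

    /** Restore when an image is present, otherwise start fresh and allow a checkpoint. */
    static List<String> command(Path imageDir, String mainClass) throws IOException {
        if (hasImage(imageDir)) {
            return List.of("java", "-XX:CRaCRestoreFrom=" + imageDir);
        }
        return List.of("java", "-XX:CRaCCheckpointTo=" + imageDir, mainClass);
    }

    public static void main(String[] args) throws IOException {
        List<String> cmd = command(Path.of("cr"), "SimpleCheckpointDemo");
        System.out.println(String.join(" ", cmd));
    }
}
```

In the CRaC builds this sketch assumes, the checkpointing process exits once the image is written, so the code following Core.checkpointRestore() continues in the restored process rather than in the original one.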
Example 2: File-Based Checkpoint
import jdk.crac.*;
import java.nio.file.*;
import java.io.*;
public class FileBasedCheckpoint {
private static final String CHECKPOINT_DIR = "./checkpoint";
private static int sessionId = 0;
// Strong reference: the global context holds registered resources only weakly
private static final Resource HANDLER = new CheckpointHandler();
public static void main(String[] args) throws Exception {
// Read session ID from file if it exists
Path sessionFile = Paths.get(CHECKPOINT_DIR, "session.id");
if (Files.exists(sessionFile)) {
String content = Files.readString(sessionFile);
sessionId = Integer.parseInt(content.trim());
System.out.println("Restored session ID: " + sessionId);
} else {
sessionId = (int) (Math.random() * 1000);
System.out.println("New session ID: " + sessionId);
}
// Register checkpoint handler
Core.getGlobalContext().register(HANDLER);
// Simulate application work
for (int i = 0; i < 10; i++) {
System.out.printf("Session %d: Iteration %d%n", sessionId, i);
Thread.sleep(500);
if (i == 3 && !Files.exists(sessionFile)) {
saveSessionId(sessionFile);
System.out.println("=== Creating checkpoint ===");
Core.checkpointRestore();
}
}
cleanup(sessionFile);
System.out.println("Application finished successfully");
}
private static void saveSessionId(Path sessionFile) throws IOException {
Files.createDirectories(sessionFile.getParent());
Files.writeString(sessionFile, String.valueOf(sessionId));
}
private static void cleanup(Path sessionFile) throws IOException {
Files.deleteIfExists(sessionFile);
}
static class CheckpointHandler implements Resource {
@Override
public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
System.out.println("Preparing for checkpoint...");
// Ensure all data is flushed to disk
}
@Override
public void afterRestore(Context<? extends Resource> context) throws Exception {
System.out.println("Restoring after checkpoint...");
// Reinitialize any necessary resources
}
}
}
Advanced CRIU Patterns
Example 3: Database Connection Management
import jdk.crac.*;
import java.sql.*;
import java.util.concurrent.*;
public class DatabaseCheckpointDemo {
private static Connection connection;
private static final BlockingQueue<String> messageQueue = new LinkedBlockingQueue<>();
private static volatile boolean running = true;
// Strong reference: the global context holds registered resources only weakly
private static final Resource DATABASE_RESOURCE = new DatabaseResource();
public static void main(String[] args) throws Exception {
setupDatabaseConnection();
Core.getGlobalContext().register(DATABASE_RESOURCE);
// Start message processor
Thread processor = new Thread(() -> {
while (running) {
try {
String message = messageQueue.poll(100, TimeUnit.MILLISECONDS);
if (message != null) {
processMessage(message);
}
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
break;
}
}
});
processor.start();
// Simulate message production
for (int i = 0; i < 20; i++) {
messageQueue.offer("Message-" + i);
Thread.sleep(200);
if (i == 8) {
System.out.println("=== Checkpointing with " + messageQueue.size() + " pending messages ===");
Core.checkpointRestore();
}
}
running = false;
processor.join();
cleanup();
}
private static void setupDatabaseConnection() throws SQLException {
// In real application, use proper connection pool
String url = "jdbc:h2:mem:test;DB_CLOSE_DELAY=-1";
connection = DriverManager.getConnection(url);
System.out.println("Database connection established");
}
private static void processMessage(String message) {
try {
System.out.println("Processing: " + message);
// Simulate database operation
Thread.sleep(50);
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
}
private static void cleanup() throws SQLException {
if (connection != null) {
connection.close();
System.out.println("Database connection closed");
}
}
static class DatabaseResource implements Resource {
@Override
public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
System.out.println("Closing database connection for checkpoint...");
if (connection != null && !connection.isClosed()) {
connection.close();
}
}
@Override
public void afterRestore(Context<? extends Resource> context) throws Exception {
System.out.println("Reestablishing database connection after restore...");
setupDatabaseConnection();
}
}
}
Example 4: Network Service with CRIU
import jdk.crac.*;
import java.net.*;
import java.io.*;
import java.util.concurrent.*;
public class NetworkServiceCheckpoint {
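private static ServerSocket serverSocket;
// schedule(...) is used below, which requires a ScheduledExecutorService
private static ScheduledExecutorService executor;
private static final int PORT = 8080;
private static volatile boolean accepting = true;
// Strong reference: the global context holds registered resources only weakly
private static final Resource NETWORK_RESOURCE = new NetworkResource();
public static void main(String[] args) throws Exception {
executor = Executors.newScheduledThreadPool(10);
Core.getGlobalContext().register(NETWORK_RESOURCE);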
startServer();
// Schedule checkpoint after 10 seconds
executor.schedule(() -> {
try {
System.out.println("=== Scheduled checkpoint ===");
Core.checkpointRestore();
} catch (Exception e) {
e.printStackTrace();
}
}, 10, TimeUnit.SECONDS);
// Keep server running
Thread.sleep(30000);
stopServer();
}
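private static void startServer() throws IOException {
// Capture the socket locally so an old acceptor never touches a replacement socket
ServerSocket socket = new ServerSocket(PORT);
serverSocket = socket;
System.out.println("Server started on port " + PORT);
Thread acceptor = new Thread(() -> {
while (accepting) {
try {
Socket client = socket.accept();
executor.submit(new ClientHandler(client));
} catch (IOException e) {
// A closed socket (shutdown or checkpoint) ends this acceptor instead of spinning
if (socket.isClosed()) {
break;
}
e.printStackTrace();
}
}
});
acceptor.start();
}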
private static void stopServer() throws IOException {
accepting = false;
if (serverSocket != null) {
serverSocket.close();
}
executor.shutdown();
System.out.println("Server stopped");
}
static class ClientHandler implements Runnable {
private final Socket client;
public ClientHandler(Socket client) {
this.client = client;
}
@Override
public void run() {
try (BufferedReader in = new BufferedReader(
new InputStreamReader(client.getInputStream()));
PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
String inputLine;
while ((inputLine = in.readLine()) != null) {
System.out.println("Received: " + inputLine);
out.println("Echo: " + inputLine);
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
static class NetworkResource implements Resource {
@Override
public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
System.out.println("Preparing network service for checkpoint...");
// Close server socket - new connections will be rejected during checkpoint
if (serverSocket != null) {
serverSocket.close();
}
}
@Override
public void afterRestore(Context<? extends Resource> context) throws Exception {
System.out.println("Restoring network service...");
// Reopen server socket
startServer();
}
}
}
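Closing the listening socket is only part of preparing a service like this: handlers that are still running keep client sockets open, and open sockets are exactly what a checkpoint cannot tolerate. One possible approach, sketched here with a hypothetical PoolDrain helper that is not part of the example above, is to drain the worker pool before the checkpoint and create a fresh pool after restore.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolDrain {

    /** Returns true if every submitted task finished within the timeout. */
    static boolean drain(ExecutorService pool, long timeout, TimeUnit unit)
            throws InterruptedException {
        pool.shutdown(); // stop accepting new tasks, let running ones finish
        if (pool.awaitTermination(timeout, unit)) {
            return true;
        }
        // Timed out: interrupt running tasks and discard the queued ones
        List<Runnable> neverStarted = pool.shutdownNow();
        System.err.println("Abandoned " + neverStarted.size() + " queued task(s)");
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(() -> System.out.println("handling request"));
        System.out.println("drained: " + drain(pool, 5, TimeUnit.SECONDS));
    }
}
```

A drained executor cannot be restarted, so a beforeCheckpoint hook that uses this helper needs a matching afterRestore hook that builds a new pool.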
CRIU with Spring Boot Applications
Example 5: Spring Boot Integration
import jdk.crac.*;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.ConfigurableApplicationContext;
import org.springframework.web.bind.annotation.*;
import jakarta.annotation.PreDestroy; // javax.annotation.PreDestroy on Spring Boot 2.x
import java.util.concurrent.atomic.AtomicLong;
@SpringBootApplication
@RestController
public class SpringBootCRIUDemo {
private static ConfigurableApplicationContext context;
private static final AtomicLong requestCount = new AtomicLong(0);
// Strong reference: the global context holds registered resources only weakly
private static final Resource SPRING_RESOURCE = new SpringResource();
public static void main(String[] args) {
context = SpringApplication.run(SpringBootCRIUDemo.class, args);
// Register CRIU resource
Core.getGlobalContext().register(SPRING_RESOURCE);
System.out.println("Spring Boot application started with CRIU support");
// Schedule checkpoint for demonstration
new Thread(() -> {
try {
Thread.sleep(15000);
System.out.println("=== Creating application checkpoint ===");
Core.checkpointRestore();
} catch (Exception e) {
e.printStackTrace();
}
}).start();
}
@GetMapping("/hello")
public String hello() {
long count = requestCount.incrementAndGet();
return String.format("Hello! Request count: %d (App instance)", count);
}
@GetMapping("/checkpoint")
public String checkpoint() {
try {
Core.checkpointRestore();
return "Checkpoint created and restored successfully";
} catch (Exception e) {
return "Checkpoint failed: " + e.getMessage();
}
}
@PreDestroy
public void cleanup() {
System.out.println("Spring application cleanup");
}
static class SpringResource implements Resource {
@Override
public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
System.out.println("Preparing Spring application for checkpoint...");
// Close application context gracefully
if (SpringBootCRIUDemo.context != null) {
SpringBootCRIUDemo.context.close();
}
}
@Override
public void afterRestore(Context<? extends Resource> context) throws Exception {
System.out.println("Restoring Spring application...");
// Restart Spring context
String[] args = new String[0];
SpringBootCRIUDemo.context = SpringApplication.run(SpringBootCRIUDemo.class, args);
}
}
}
Production-Grade CRIU Implementation
Example 6: Enterprise Checkpoint Manager
import jdk.crac.*;
import java.io.IOException;
import java.nio.file.*;
import java.time.*;
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.*;
public class EnterpriseCheckpointManager {
private static final String CHECKPOINT_BASE = "/opt/app/checkpoints";
private static final AtomicInteger checkpointId = new AtomicInteger(0);
private static final Map<String, Object> applicationState = new ConcurrentHashMap<>();
private static ScheduledExecutorService scheduler;
// Strong reference: the global context holds registered resources only weakly
private static final Resource ENTERPRISE_RESOURCE = new EnterpriseResource();
public static void main(String[] args) throws Exception {
initializeApplication();
// Register comprehensive resource management
Core.getGlobalContext().register(ENTERPRISE_RESOURCE);
// Schedule periodic checkpoints
scheduler = Executors.newScheduledThreadPool(1);
scheduler.scheduleAtFixedRate(() -> {
try {
createCheckpoint("scheduled-" + System.currentTimeMillis());
} catch (Exception e) {
System.err.println("Scheduled checkpoint failed: " + e.getMessage());
}
}, 5, 5, TimeUnit.MINUTES);
// Simulate application work
simulateWorkload();
scheduler.shutdown();
}
private static void initializeApplication() throws IOException {
Files.createDirectories(Paths.get(CHECKPOINT_BASE));
// Load last checkpoint ID if available
Path idFile = Paths.get(CHECKPOINT_BASE, "last-checkpoint.id");
if (Files.exists(idFile)) {
String content = Files.readString(idFile);
checkpointId.set(Integer.parseInt(content.trim()));
System.out.println("Loaded checkpoint ID: " + checkpointId.get());
}
// Initialize application state
applicationState.put("startTime", Instant.now());
applicationState.put("processedItems", new AtomicLong(0));
applicationState.put("activeSessions", new ConcurrentHashMap<String, Session>());
}
private static void createCheckpoint(String reason) throws Exception {
int id = checkpointId.incrementAndGet();
System.out.printf("=== Creating checkpoint %d: %s ===%n", id, reason);
// Save checkpoint metadata
saveCheckpointMetadata(id, reason);
// Perform checkpoint
Core.checkpointRestore();
System.out.printf("=== Successfully restored from checkpoint %d ===%n", id);
}
private static void saveCheckpointMetadata(int id, String reason) throws IOException {
Path metaFile = Paths.get(CHECKPOINT_BASE, "checkpoint-" + id + ".meta");
Map<String, Object> metadata = new HashMap<>();
metadata.put("id", id);
metadata.put("timestamp", Instant.now());
metadata.put("reason", reason);
metadata.put("processedItems",
((AtomicLong) applicationState.get("processedItems")).get());
metadata.put("activeSessions",
((Map<?, ?>) applicationState.get("activeSessions")).size());
Files.writeString(metaFile, metadata.toString());
// Update last checkpoint ID
Files.writeString(Paths.get(CHECKPOINT_BASE, "last-checkpoint.id"),
String.valueOf(id));
}
private static void simulateWorkload() throws InterruptedException {
Random random = new Random();
for (int i = 0; i < 100; i++) {
// Simulate processing
((AtomicLong) applicationState.get("processedItems")).incrementAndGet();
// Simulate session activity
if (random.nextDouble() < 0.3) {
String sessionId = "session-" + random.nextInt(10);
((Map<String, Session>) applicationState.get("activeSessions"))
.put(sessionId, new Session(sessionId, Instant.now()));
}
Thread.sleep(1000);
// Manual checkpoint every 20 iterations
if (i % 20 == 0 && i > 0) {
createCheckpoint("manual-at-iteration-" + i);
}
}
}
static class Session {
String id;
Instant created;
Session(String id, Instant created) {
this.id = id;
this.created = created;
}
}
static class EnterpriseResource implements Resource {
@Override
public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
System.out.println("=== Enterprise checkpoint preparation ===");
// 1. Close external connections
closeExternalConnections();
// 2. Flush all buffers
flushBuffers();
// 3. Save business state
saveBusinessState();
// 4. Stop background tasks
stopBackgroundTasks();
System.out.println("Ready for checkpoint");
}
@Override
public void afterRestore(Context<? extends Resource> context) throws Exception {
System.out.println("=== Enterprise restore procedure ===");
// 1. Reinitialize connections
reinitializeConnections();
// 2. Restore business state
restoreBusinessState();
// 3. Restart background tasks
restartBackgroundTasks();
// 4. Recover interrupted operations
recoverOperations();
System.out.println("Restore completed successfully");
}
private void closeExternalConnections() {
System.out.println("Closing external connections...");
// Close database connections, network sockets, etc.
}
private void flushBuffers() {
System.out.println("Flushing buffers...");
// Ensure all data is written to persistent storage
}
private void saveBusinessState() throws IOException {
System.out.println("Saving business state...");
Path stateFile = Paths.get(CHECKPOINT_BASE, "business-state.json");
String state = "Application state: " + applicationState.toString();
Files.writeString(stateFile, state);
}
private void stopBackgroundTasks() {
System.out.println("Stopping background tasks...");
if (scheduler != null) {
scheduler.shutdown();
}
}
private void reinitializeConnections() {
System.out.println("Reinitializing connections...");
// Reopen database connections, network sockets, etc.
}
private void restoreBusinessState() throws IOException {
System.out.println("Restoring business state...");
Path stateFile = Paths.get(CHECKPOINT_BASE, "business-state.json");
if (Files.exists(stateFile)) {
String state = Files.readString(stateFile);
System.out.println("Restored state: " + state);
}
}
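private void restartBackgroundTasks() {
System.out.println("Restarting background tasks...");
// Tasks scheduled on the old executor are not carried over to the new one;
// periodic work such as the scheduled checkpoint must be scheduled again here
scheduler = Executors.newScheduledThreadPool(1);
}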
private void recoverOperations() {
System.out.println("Recovering interrupted operations...");
// Implement recovery logic for interrupted business operations
}
}
}
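The manager above overwrites last-checkpoint.id in place, so a crash or a checkpoint taken in the middle of a write could leave a truncated file behind. A minimal sketch of one way to avoid that, assuming a Linux filesystem where a rename within one directory replaces the target atomically: write to a temporary sibling file and move it over the target. The AtomicIdFile class and its methods are hypothetical helpers, not part of the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicIdFile {

    /** Writes id to target so that readers see either the old or the new value. */
    static void write(Path target, int id) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        Files.writeString(tmp, Integer.toString(id));
        // Whether ATOMIC_MOVE replaces an existing target is platform specific;
        // on Linux it maps to rename(2), which does replace it
        Files.move(tmp, target,
                StandardCopyOption.ATOMIC_MOVE,
                StandardCopyOption.REPLACE_EXISTING);
    }

    /** Reads the stored id, or returns fallback when the file does not exist. */
    static int read(Path target, int fallback) throws IOException {
        if (!Files.exists(target)) {
            return fallback;
        }
        return Integer.parseInt(Files.readString(target).trim());
    }

    public static void main(String[] args) throws IOException {
        Path idFile = Files.createTempDirectory("checkpoints").resolve("last-checkpoint.id");
        write(idFile, 7);
        System.out.println("Stored checkpoint ID: " + read(idFile, 0));
    }
}
```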
Best Practices and Considerations
1. Resource Management
// Always properly manage resources during checkpoint/restore
public class ProperResourceManagement implements Resource {
private List<AutoCloseable> resources = new ArrayList<>();
public void addResource(AutoCloseable resource) {
resources.add(resource);
}
@Override
public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
for (AutoCloseable resource : resources) {
resource.close();
}
}
@Override
public void afterRestore(Context<? extends Resource> context) throws Exception {
// Closed AutoCloseables cannot be reopened, so drop the stale handles;
// callers must create fresh resources and add them again
resources.clear();
}
}
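A closed resource generally cannot be reopened, so restoring means creating a new instance. One way to make that automatic, sketched here under the assumption that every resource can be rebuilt by a factory, is to register factories instead of instances. To stay compilable on any JDK this class does not implement jdk.crac.Resource; its closeAll and reopenAll methods are hypothetical names meant to be called from beforeCheckpoint and afterRestore respectively.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

public class ReopenableResources {

    /** Pairs a factory with the instance it most recently produced. */
    private static final class Slot {
        final Callable<AutoCloseable> factory;
        AutoCloseable current;

        Slot(Callable<AutoCloseable> factory) {
            this.factory = factory;
        }
    }

    private final List<Slot> slots = new ArrayList<>();

    /** Opens a resource now and remembers how to open it again later. */
    public synchronized void add(Callable<AutoCloseable> factory) throws Exception {
        Slot slot = new Slot(factory);
        slot.current = factory.call();
        slots.add(slot);
    }

    /** Intended for beforeCheckpoint: close in reverse order of opening. */
    public synchronized void closeAll() throws Exception {
        for (int i = slots.size() - 1; i >= 0; i--) {
            Slot slot = slots.get(i);
            if (slot.current != null) {
                slot.current.close();
                slot.current = null;
            }
        }
    }

    /** Intended for afterRestore: recreate whatever was closed, in original order. */
    public synchronized void reopenAll() throws Exception {
        for (Slot slot : slots) {
            if (slot.current == null) {
                slot.current = slot.factory.call();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        ReopenableResources resources = new ReopenableResources();
        resources.add(() -> {
            System.out.println("opening connection");
            return () -> System.out.println("closing connection");
        });
        resources.closeAll();
        resources.reopenAll();
    }
}
```

Closing in reverse order mirrors try-with-resources, which matters when a later resource depends on an earlier one.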
2. State Consistency
// Ensure consistent state across checkpoints
public class StateConsistency {
private final Object checkpointLock = new Object();
public void performCheckpoint() throws Exception {
synchronized (checkpointLock) {
// Only effective if every state-changing operation also synchronizes on checkpointLock
Core.checkpointRestore();
}
}
}
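A single monitor serializes all work, not just checkpoints. A possible refinement, shown as a sketch with hypothetical names, uses a read/write lock: ordinary operations share the read lock and run concurrently, while the checkpoint takes the write lock and therefore waits until in-flight operations have drained. The CheckpointAction interface stands in for a call such as Core.checkpointRestore() so that the class compiles on any JDK.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class CheckpointGate {

    /** Stand-in for the checkpoint call, which throws checked exceptions. */
    @FunctionalInterface
    public interface CheckpointAction {
        void run() throws Exception;
    }

    // Fair mode keeps a waiting checkpoint from being starved by new readers
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);

    /** Runs one unit of work; many may run at once, none during a checkpoint. */
    public <T> T runOperation(Callable<T> work) throws Exception {
        lock.readLock().lock();
        try {
            return work.call();
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Waits for in-flight operations to finish, then runs the action exclusively. */
    public void checkpoint(CheckpointAction action) throws Exception {
        lock.writeLock().lock();
        try {
            action.run();
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) throws Exception {
        CheckpointGate gate = new CheckpointGate();
        int result = gate.runOperation(() -> 21 * 2);
        gate.checkpoint(() -> System.out.println("quiesced, result so far: " + result));
    }
}
```

The lock cannot be upgraded from read to write, so checkpoint must not be called from inside runOperation on the same thread; doing so would block forever.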
Use Cases for CRIU in JVM
- Fast Startup: Checkpoint after warm-up, then restore without repeating class loading and JIT warm-up
- Live Migration: Move JVM instances between servers
- Fault Recovery: Restore from known good state
- Load Balancing: Quickly scale by restoring multiple instances
- Debugging: Capture and restore problematic states
Limitations and Considerations
- Platform Specific: Primarily Linux-based
- JVM State: Not all JVM state may be perfectly captured
- External Resources: Files, network connections need special handling
- Performance: Checkpoint creation has overhead
- Storage: Checkpoints can be large
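Because image size depends heavily on heap size, it helps to measure it rather than guess. A small sketch, assuming the checkpoint was written to a directory of regular files (the "cr" default and the CheckpointSize class are hypothetical), that sums the size of everything under that directory:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class CheckpointSize {

    /** Total size in bytes of all regular files under dir, including subdirectories. */
    static long totalBytes(Path dir) throws IOException {
        try (Stream<Path> paths = Files.walk(dir)) {
            return paths.filter(Files::isRegularFile)
                    .mapToLong(p -> {
                        try {
                            return Files.size(p);
                        } catch (IOException e) {
                            throw new UncheckedIOException(e);
                        }
                    })
                    .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Path.of(args.length > 0 ? args[0] : "cr");
        if (Files.isDirectory(dir)) {
            System.out.printf("%s: %.1f MiB%n", dir, totalBytes(dir) / (1024.0 * 1024.0));
        } else {
            System.out.println("No checkpoint directory at " + dir);
        }
    }
}
```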
Conclusion
CRIU checkpoint/restore represents a paradigm shift in JVM state management, offering:
- Instant Application Startup from warmed-up states
- Improved Fault Tolerance through state restoration
- Enhanced Deployment Strategies with live migration
- Better Resource Utilization through state reuse
While CRIU requires careful implementation and has some limitations, it provides powerful capabilities for enterprise Java applications where startup time, reliability, and state management are critical concerns. CRaC-enabled JDK builds make this technology increasingly accessible for Java developers building next-generation applications.