Building a Duplicate Code Detector in Java: From Concepts to Implementation

Duplicate code is one of the most common code smells, leading to maintenance nightmares, bug propagation, and increased technical debt. While sophisticated tools like SonarQube, PMD, and IntelliJ IDEA's built-in inspector exist, understanding how to build a basic duplicate code detector provides valuable insight into code analysis techniques.

This article guides you through the process of creating a simple yet effective duplicate code detector in Java, exploring various approaches from naive to advanced.


Core Concepts and Approaches

There are several ways to detect code duplication, each with different trade-offs between accuracy, performance, and complexity:

  1. Text-Based Comparison: Direct string matching (simplest but least robust).
  2. Token-Based Comparison: Compares sequences of tokens from the lexer (more robust to formatting changes).
  3. AST-Based Comparison: Compares Abstract Syntax Trees (handles structural changes but complex).
  4. Metric/Hash-Based Comparison: Compares hash fingerprints or code metrics rather than the code itself (scales well to large codebases).

Approach 1: Text-Based Detector (Naive but Illustrative)

This approach uses direct string comparison with normalization. It's fragile but demonstrates the basic workflow.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.*;
import java.util.stream.Stream;

public class TextBasedDuplicateDetector {

    public static void main(String[] args) throws IOException {
        String basePath = "src/main/java/com/example";
        List<CodeBlock> allBlocks = extractCodeBlocks(Paths.get(basePath));
        List<DuplicatePair> duplicates = findDuplicates(allBlocks, 10); // Minimum 10 lines

        System.out.println("Found " + duplicates.size() + " duplicate pairs:");
        for (DuplicatePair pair : duplicates) {
            System.out.println("Duplicate found:");
            System.out.println("  File 1: " + pair.block1.filePath + " at line " + pair.block1.startLine);
            System.out.println("  File 2: " + pair.block2.filePath + " at line " + pair.block2.startLine);
            System.out.println("  Size: " + pair.block1.lines.size() + " lines");
        }
    }

    public static List<CodeBlock> extractCodeBlocks(Path rootDir) throws IOException {
        List<CodeBlock> blocks = new ArrayList<>();
        // Files.walk keeps a directory handle open, so close the stream when done
        try (Stream<Path> paths = Files.walk(rootDir)) {
            paths.filter(path -> path.toString().endsWith(".java"))
                 .forEach(path -> {
                     try {
                         blocks.addAll(extractBlocksFromFile(path));
                     } catch (IOException e) {
                         System.err.println("Error reading file: " + path);
                     }
                 });
        }
        return blocks;
    }

    private static List<CodeBlock> extractBlocksFromFile(Path filePath) throws IOException {
        List<CodeBlock> blocks = new ArrayList<>();
        List<String> lines = Files.readAllLines(filePath);

        // Simple approach: consider consecutive non-empty lines as a block
        List<String> currentBlock = new ArrayList<>();
        int startLine = -1;
        for (int i = 0; i < lines.size(); i++) {
            String line = normalizeLine(lines.get(i));
            if (!line.isEmpty() && !isIgnorableLine(line)) {
                if (currentBlock.isEmpty()) {
                    startLine = i + 1;
                }
                currentBlock.add(line);
            } else {
                if (currentBlock.size() >= 3) { // Minimum block size
                    blocks.add(new CodeBlock(filePath, startLine, new ArrayList<>(currentBlock)));
                }
                currentBlock.clear();
            }
        }
        // Don't forget the last block
        if (currentBlock.size() >= 3) {
            blocks.add(new CodeBlock(filePath, startLine, new ArrayList<>(currentBlock)));
        }
        return blocks;
    }

    private static String normalizeLine(String line) {
        return line.replaceAll("//.*", "") // Remove single-line comments first
                   .trim()
                   .replaceAll("\\s+", " "); // Normalize whitespace
    }

    private static boolean isIgnorableLine(String line) {
        return line.startsWith("import ") || line.startsWith("package ")
                || line.equals("{") || line.equals("}");
    }

    public static List<DuplicatePair> findDuplicates(List<CodeBlock> blocks, int minLines) {
        List<DuplicatePair> duplicates = new ArrayList<>();
        for (int i = 0; i < blocks.size(); i++) {
            CodeBlock block1 = blocks.get(i);
            if (block1.lines.size() < minLines) continue;
            for (int j = i + 1; j < blocks.size(); j++) {
                CodeBlock block2 = blocks.get(j);
                if (block2.lines.size() < minLines) continue;
                if (areBlocksSimilar(block1, block2)) {
                    duplicates.add(new DuplicatePair(block1, block2));
                }
            }
        }
        return duplicates;
    }

    private static boolean areBlocksSimilar(CodeBlock block1, CodeBlock block2) {
        if (block1.lines.size() != block2.lines.size()) {
            return false;
        }
        for (int i = 0; i < block1.lines.size(); i++) {
            if (!block1.lines.get(i).equals(block2.lines.get(i))) {
                return false;
            }
        }
        return true;
    }

    static class CodeBlock {
        Path filePath;
        int startLine;
        List<String> lines;

        CodeBlock(Path filePath, int startLine, List<String> lines) {
            this.filePath = filePath;
            this.startLine = startLine;
            this.lines = lines;
        }
    }

    static class DuplicatePair {
        CodeBlock block1;
        CodeBlock block2;

        DuplicatePair(CodeBlock block1, CodeBlock block2) {
            this.block1 = block1;
            this.block2 = block2;
        }
    }
}

Limitations of Text-Based Approach:

  • Breaks with different variable names
  • Affected by comments and formatting
  • Cannot handle reordered statements

Approach 2: Token-Based Detector (More Robust)

This approach uses java.io.StreamTokenizer as a lightweight lexer to turn each file into a token sequence that is resilient to formatting changes. (StreamTokenizer is a general-purpose tokenizer rather than a true Java lexer, so this is still an approximation, but it is a big step up from raw text comparison.)

import java.io.*;
import java.util.*;

// Reuses CodeBlock and DuplicatePair from TextBasedDuplicateDetector (same package)
public class TokenBasedDuplicateDetector {

    public List<DuplicatePair> findTokenDuplicates(List<File> javaFiles, int minTokens) throws IOException {
        List<TokenBlock> allTokenBlocks = new ArrayList<>();
        for (File file : javaFiles) {
            allTokenBlocks.addAll(extractTokenBlocksFromFile(file, minTokens));
        }
        return findSimilarTokenBlocks(allTokenBlocks, minTokens);
    }

    private List<TokenBlock> extractTokenBlocksFromFile(File file, int minTokens) throws IOException {
        List<TokenBlock> blocks = new ArrayList<>();
        List<String> tokens = tokenizeJavaFile(file);

        // Create sliding windows of tokens. Note: this materializes many
        // overlapping windows and is memory-hungry; fine for illustration,
        // not for very large codebases.
        for (int i = 0; i <= tokens.size() - minTokens; i++) {
            for (int windowSize = minTokens; windowSize <= Math.min(50, tokens.size() - i); windowSize++) {
                List<String> tokenWindow = tokens.subList(i, i + windowSize);
                blocks.add(new TokenBlock(file, i, new ArrayList<>(tokenWindow)));
            }
        }
        return blocks;
    }

    private List<String> tokenizeJavaFile(File file) throws IOException {
        List<String> tokens = new ArrayList<>();
        // StreamTokenizer is not AutoCloseable, so only the Reader goes in try-with-resources
        try (Reader reader = new FileReader(file)) {
            StreamTokenizer tokenizer = new StreamTokenizer(reader);
            tokenizer.slashSlashComments(true);
            tokenizer.slashStarComments(true);
            tokenizer.ordinaryChar('.');
            tokenizer.ordinaryChar('-');
            tokenizer.ordinaryChar('/');

            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                if (tokenizer.ttype == StreamTokenizer.TT_WORD) {
                    tokens.add(tokenizer.sval);
                } else if (tokenizer.ttype == StreamTokenizer.TT_NUMBER) {
                    tokens.add("NUMBER"); // Normalize numeric literals
                } else {
                    tokens.add(String.valueOf((char) tokenizer.ttype));
                }
            }
        }
        return tokens;
    }

    private List<DuplicatePair> findSimilarTokenBlocks(List<TokenBlock> blocks, int minTokens) {
        Map<String, List<TokenBlock>> blockFingerprints = new HashMap<>();
        List<DuplicatePair> duplicates = new ArrayList<>();

        for (TokenBlock block : blocks) {
            if (block.tokens.size() < minTokens) continue;

            String fingerprint = generateFingerprint(block.tokens);
            List<TokenBlock> similarBlocks = blockFingerprints.computeIfAbsent(fingerprint, k -> new ArrayList<>());
            for (TokenBlock existingBlock : similarBlocks) {
                // Skip near-identical positions in the same file (overlapping windows)
                if (!existingBlock.file.equals(block.file)
                        || Math.abs(existingBlock.startPos - block.startPos) > 100) {
                    duplicates.add(new DuplicatePair(
                            new CodeBlock(existingBlock.file.toPath(), existingBlock.startPos,
                                    tokensToLines(existingBlock.tokens)),
                            new CodeBlock(block.file.toPath(), block.startPos,
                                    tokensToLines(block.tokens))
                    ));
                }
            }
            similarBlocks.add(block);
        }
        return duplicates;
    }

    private String generateFingerprint(List<String> tokens) {
        // Simple fingerprint: join tokens with a separator
        return String.join("|", tokens);
    }

    private List<String> tokensToLines(List<String> tokens) {
        // Convert tokens back to roughly readable lines for display
        String tokenString = String.join(" ", tokens);
        return Arrays.asList(tokenString.split("(?<=\\.)"));
    }

    static class TokenBlock {
        File file;
        int startPos;
        List<String> tokens;

        TokenBlock(File file, int startPos, List<String> tokens) {
            this.file = file;
            this.startPos = startPos;
            this.tokens = tokens;
        }
    }
}

Advantages of Token-Based Approach:

  • Immune to formatting changes (whitespace, line breaks)
  • Handles comments intelligently
  • More robust than text-based comparison

Approach 3: AST-Based Detection (Most Robust)

For production use, leveraging existing AST parsing libraries is recommended. Here's an example using Eclipse JDT Core:

import org.eclipse.jdt.core.dom.*;

public class ASTBasedDetector {
    // This would use the Eclipse JDT parser to create ASTs
    // and compare subtrees for structural similarity
    public void detectASTDuplicates(CompilationUnit ast1, CompilationUnit ast2) {
        // Compare method bodies and statement sequences
        // Normalize variable names and literals
        // Use tree-matching algorithms
    }
}
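Setting up the JDT parser is beyond a short snippet, but the comparison idea itself can be sketched without any parser: give every subtree a structural fingerprint that ignores names and literals, then report subtrees whose fingerprints collide. The Node class below is a hypothetical stand-in for a real AST node.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SubtreeHasher {
    // Stand-in for a real AST node: a node kind (e.g. "IfStatement") plus children.
    // Identifiers and literals are deliberately absent, so renamed variables
    // and changed constants still produce identical fingerprints.
    static class Node {
        final String kind;
        final List<Node> children;
        Node(String kind, Node... children) {
            this.kind = kind;
            this.children = Arrays.asList(children);
        }
    }

    // Canonical string fingerprint of a subtree's shape
    static String structuralHash(Node node) {
        StringBuilder sb = new StringBuilder(node.kind).append('(');
        for (Node child : node.children) {
            sb.append(structuralHash(child)).append(',');
        }
        return sb.append(')').toString();
    }

    // Bucket every subtree by its fingerprint; buckets with >1 entry are clones
    static void collectHashes(Node node, Map<String, List<Node>> buckets) {
        buckets.computeIfAbsent(structuralHash(node), k -> new ArrayList<>()).add(node);
        for (Node child : node.children) {
            collectHashes(child, buckets);
        }
    }

    public static void main(String[] args) {
        // Two structurally identical if-statements, as if from different files
        Node a = new Node("IfStatement", new Node("InfixExpression"),
                new Node("Block", new Node("ReturnStatement")));
        Node b = new Node("IfStatement", new Node("InfixExpression"),
                new Node("Block", new Node("ReturnStatement")));

        Map<String, List<Node>> buckets = new HashMap<>();
        collectHashes(a, buckets);
        collectHashes(b, buckets);
        System.out.println(buckets.get(structuralHash(a)).size()); // 2: both subtrees collide
    }
}
```

A real implementation would apply the same bucketing to JDT's ASTNode subtrees, typically restricted to statements and method bodies above a minimum size.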

Best Practices for Duplicate Detection

  1. Configurable Thresholds:
  • Minimum duplicate size (lines/tokens)
  • Similarity threshold (e.g., 80% similar)
  • Ignore lists (test files, generated code)
  2. Performance Optimization:
  • Use hashing for quick comparison
  • Implement sliding windows for large files
  • Parallel processing for large codebases
  3. Intelligent Normalization:
  • Normalize variable names
  • Ignore literal values
  • Handle different access modifiers
  4. Use Existing Tools:
  • PMD Copy-Paste Detector (CPD): Excellent token-based detector
  • SonarQube: Integrated duplicate detection
  • IntelliJ IDEA: Built-in duplicate analysis
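On the performance point, a rolling (Rabin-Karp style) hash fingerprints every window of N tokens in constant time per step instead of rehashing each window from scratch. A minimal sketch; BASE and MOD are arbitrary choices:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RollingFingerprints {
    private static final long BASE = 257;
    private static final long MOD = 1_000_000_007L;

    // Non-negative 32-bit value derived from a token
    static long value(String token) {
        return token.hashCode() & 0xffffffffL;
    }

    // Fingerprint every window of `windowSize` consecutive tokens in O(n) total:
    // each step removes the outgoing token's contribution and adds the incoming one.
    static List<Long> windowHashes(List<String> tokens, int windowSize) {
        List<Long> hashes = new ArrayList<>();
        if (tokens.size() < windowSize) return hashes;

        // BASE^(windowSize-1) mod MOD, used to remove the outgoing token
        long highPower = 1;
        for (int i = 0; i < windowSize - 1; i++) {
            highPower = highPower * BASE % MOD;
        }

        // Hash of the first window
        long hash = 0;
        for (int i = 0; i < windowSize; i++) {
            hash = (hash * BASE + value(tokens.get(i))) % MOD;
        }
        hashes.add(hash);

        // Slide the window across the rest of the token stream
        for (int i = windowSize; i < tokens.size(); i++) {
            long outgoing = value(tokens.get(i - windowSize)) * highPower % MOD;
            hash = ((hash - outgoing) % MOD + MOD) % MOD;
            hash = (hash * BASE + value(tokens.get(i))) % MOD;
            hashes.add(hash);
        }
        return hashes;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("if", "(", "x", ")", "{", "if", "(", "x", ")", "{");
        List<Long> hashes = windowHashes(tokens, 5);
        // The windows starting at positions 0 and 5 contain identical tokens
        System.out.println(hashes.get(0).equals(hashes.get(5))); // true
    }
}
```

Equal hashes only suggest a duplicate (collisions are possible), so a real detector would verify candidate pairs token-by-token before reporting them.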

Example Using PMD CPD (Production Ready)

<!-- Maven dependencies: pmd-core plus the Java language module for CPD -->
<dependency>
    <groupId>net.sourceforge.pmd</groupId>
    <artifactId>pmd-core</artifactId>
    <version>6.55.0</version>
</dependency>
<dependency>
    <groupId>net.sourceforge.pmd</groupId>
    <artifactId>pmd-java</artifactId>
    <version>6.55.0</version>
</dependency>

import net.sourceforge.pmd.cpd.*;

import java.io.File;
import java.io.IOException;
import java.util.Iterator;

public class PmdCPDExample {
    public static void main(String[] args) {
        CPDConfiguration config = new CPDConfiguration();
        config.setLanguage(LanguageFactory.createLanguage("java"));
        config.setMinimumTileSize(50); // Minimum duplicate size in tokens

        CPD cpd = new CPD(config);
        try {
            cpd.addRecursively(new File("src/main/java"));
            cpd.go();

            // In PMD 6.x, getMatches() returns an Iterator rather than an Iterable
            Iterator<Match> matches = cpd.getMatches();
            while (matches.hasNext()) {
                Match match = matches.next();
                System.out.println("Found duplication:");
                System.out.println("  Tokens: " + match.getTokenCount());
                for (Mark mark : match) {
                    System.out.println("    " + mark.getFilename() + ":" + mark.getBeginLine());
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Conclusion

Building a duplicate code detector reveals the complexities of code analysis. While simple text-based approaches work for basic cases, token-based methods using tools like PMD CPD provide the robustness needed for real-world applications. The key insights are:

  1. Normalization is crucial for handling different coding styles
  2. Tokenization provides a good balance of simplicity and robustness
  3. Existing tools like PMD CPD are production-ready and highly optimized

Understanding how these detectors work helps you make better decisions about code quality and refactoring priorities in your projects. For most practical purposes, leveraging established tools like PMD CPD combined with intelligent post-processing is the recommended approach.
