Automating Content Digestion: Text Summarization Techniques in Java

Text summarization has become an essential technology in our information-rich world, helping users quickly understand the essence of large documents, articles, and reports. While Python dominates the AI research space, Java offers robust, scalable solutions for text summarization that are ideal for enterprise applications. This article explores various text summarization approaches in Java, from simple extractive methods to advanced AI-powered abstractive techniques.


Understanding Text Summarization

Types of Summarization:

  • Extractive Summarization: Selects and combines important sentences/phrases from the original text
  • Abstractive Summarization: Generates new sentences that capture the core meaning
  • Query-Focused Summarization: Tailors summary to answer specific questions
  • Multi-Document Summarization: Creates summaries from multiple related documents

Java's Advantages for Summarization:

  • Performance: Efficient text processing for large documents
  • Enterprise Integration: Easy integration with existing Java systems
  • Scalability: Handles high-volume summarization tasks
  • Multilingual Support: Strong internationalization capabilities

Setting Up Dependencies

Maven Dependencies:

<properties>
    <opennlp.version>2.3.0</opennlp.version>
    <stanfordnlp.version>4.5.4</stanfordnlp.version>
    <deeplearning4j.version>1.0.0-M2.1</deeplearning4j.version>
</properties>

<dependencies>
    <!-- OpenNLP for NLP tasks -->
    <dependency>
        <groupId>org.apache.opennlp</groupId>
        <artifactId>opennlp-tools</artifactId>
        <version>${opennlp.version}</version>
    </dependency>
    <!-- Stanford CoreNLP -->
    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>${stanfordnlp.version}</version>
    </dependency>
    <!-- DeepLearning4J for neural approaches -->
    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-core</artifactId>
        <version>${deeplearning4j.version}</version>
    </dependency>
    <!-- ND4J backend -->
    <dependency>
        <groupId>org.nd4j</groupId>
        <artifactId>nd4j-native-platform</artifactId>
        <version>${deeplearning4j.version}</version>
    </dependency>
    <!-- HTTP client and JSON mapper used by the API-based approach (Approach 4) -->
    <dependency>
        <groupId>com.squareup.okhttp3</groupId>
        <artifactId>okhttp</artifactId>
        <version>4.12.0</version>
    </dependency>
    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-databind</artifactId>
        <version>2.15.2</version>
    </dependency>
</dependencies>

Gradle:

dependencies {
    implementation("org.apache.opennlp:opennlp-tools:2.3.0")
    implementation("edu.stanford.nlp:stanford-corenlp:4.5.4")
    implementation("org.deeplearning4j:deeplearning4j-core:1.0.0-M2.1")
    implementation("org.nd4j:nd4j-native-platform:1.0.0-M2.1")
    // HTTP client and JSON mapper used by the API-based approach (Approach 4)
    implementation("com.squareup.okhttp3:okhttp:4.12.0")
    implementation("com.fasterxml.jackson.core:jackson-databind:2.15.2")
}

Approach 1: Simple Extractive Summarization with TF-IDF

This approach uses Term Frequency-Inverse Document Frequency to identify important sentences.
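
As a quick worked example of the scoring rule used below (tf × ln(N / (1 + df)), where N is the number of sentences and df is how many sentences contain the word): in a five-sentence document where "learning" appears three times across two sentences, its score is 3 × ln(5 / 3) ≈ 1.53, while a word appearing once in a single sentence scores 1 × ln(5 / 2) ≈ 0.92. Words that are frequent but concentrated in a few sentences score highest.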

import java.util.*;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class TFIDFSummarizer {

    private static final Pattern SENTENCE_DELIMITER = Pattern.compile("[.!?]");
    private static final Pattern WORD_DELIMITER = Pattern.compile("\\s+");

    public String summarize(String text, int summarySentenceCount) {
        // Split text into sentences
        List<String> sentences = splitSentences(text);

        // Preprocess sentences
        List<List<String>> tokenizedSentences = sentences.stream()
                .map(this::tokenizeAndClean)
                .collect(Collectors.toList());

        // Calculate a TF-IDF score for each word
        Map<String, Double> wordScores = calculateWordScores(tokenizedSentences);

        // Score sentences based on word scores
        List<SentenceScore> sentenceScores = scoreSentences(tokenizedSentences, wordScores, sentences);

        // Select top sentences
        return selectTopSentences(sentenceScores, summarySentenceCount);
    }

    private List<String> splitSentences(String text) {
        return Arrays.stream(SENTENCE_DELIMITER.split(text))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    private List<String> tokenizeAndClean(String sentence) {
        return Arrays.stream(WORD_DELIMITER.split(sentence.toLowerCase()))
                .map(word -> word.replaceAll("[^a-zA-Z]", ""))
                .filter(word -> word.length() > 2) // Remove short words
                .collect(Collectors.toList());
    }

    private Map<String, Double> calculateWordScores(List<List<String>> sentences) {
        Map<String, Integer> wordFrequency = new HashMap<>();
        Map<String, Integer> wordDocumentFrequency = new HashMap<>();

        // Count word frequencies (document frequency counts each word once per sentence)
        for (List<String> sentence : sentences) {
            Set<String> uniqueWords = new HashSet<>(sentence);
            for (String word : sentence) {
                wordFrequency.put(word, wordFrequency.getOrDefault(word, 0) + 1);
            }
            for (String word : uniqueWords) {
                wordDocumentFrequency.put(word, wordDocumentFrequency.getOrDefault(word, 0) + 1);
            }
        }

        // Calculate TF-IDF-like scores, treating each sentence as a "document"
        Map<String, Double> wordScores = new HashMap<>();
        int totalSentences = sentences.size();
        for (String word : wordFrequency.keySet()) {
            double tf = wordFrequency.get(word);
            double idf = Math.log((double) totalSentences / (1 + wordDocumentFrequency.get(word)));
            wordScores.put(word, tf * idf);
        }
        return wordScores;
    }

    private List<SentenceScore> scoreSentences(List<List<String>> tokenizedSentences,
                                               Map<String, Double> wordScores,
                                               List<String> originalSentences) {
        List<SentenceScore> sentenceScores = new ArrayList<>();
        for (int i = 0; i < tokenizedSentences.size(); i++) {
            List<String> sentence = tokenizedSentences.get(i);
            double score = sentence.stream()
                    .mapToDouble(word -> wordScores.getOrDefault(word, 0.0))
                    .average()
                    .orElse(0.0);

            // Boost score for sentences at the beginning (often contain main ideas)
            double positionBonus = 1.0 - (i * 0.1 / tokenizedSentences.size());
            score *= positionBonus;

            sentenceScores.add(new SentenceScore(originalSentences.get(i), score, i));
        }
        return sentenceScores;
    }

    private String selectTopSentences(List<SentenceScore> sentenceScores, int count) {
        return sentenceScores.stream()
                .sorted((a, b) -> Double.compare(b.score, a.score))
                .limit(count)
                .sorted((a, b) -> Integer.compare(a.originalPosition, b.originalPosition))
                .map(score -> score.sentence)
                .collect(Collectors.joining(". ")) + ".";
    }

    private static class SentenceScore {
        final String sentence;
        final double score;
        final int originalPosition;

        SentenceScore(String sentence, double score, int originalPosition) {
            this.sentence = sentence;
            this.score = score;
            this.originalPosition = originalPosition;
        }
    }

    public static void main(String[] args) {
        TFIDFSummarizer summarizer = new TFIDFSummarizer();
        String sampleText = "Artificial intelligence is transforming the way we live and work. " +
                "Machine learning algorithms can now recognize patterns in data that humans would miss. " +
                "Natural language processing enables computers to understand and generate human language. " +
                "These technologies are being applied across various industries from healthcare to finance. " +
                "The future of AI holds great promise for solving complex global challenges.";

        String summary = summarizer.summarize(sampleText, 2);
        System.out.println("Original Text: " + sampleText);
        System.out.println("Summary: " + summary);
    }
}
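
One refinement worth considering: the tokenizer above only drops words of two characters or fewer, so common function words like "the", "that", or "with" survive and can inflate scores. A minimal stop-word filter could replace tokenizeAndClean (the word list here is illustrative; extend it for real use):

    // Illustrative stop-word list; a production system would use a fuller one
    private static final Set<String> STOP_WORDS = Set.of(
            "the", "and", "that", "with", "from", "this", "are", "was", "have", "has");

    private List<String> tokenizeAndClean(String sentence) {
        return Arrays.stream(WORD_DELIMITER.split(sentence.toLowerCase()))
                .map(word -> word.replaceAll("[^a-zA-Z]", ""))
                .filter(word -> word.length() > 2)          // remove short words
                .filter(word -> !STOP_WORDS.contains(word)) // drop stop words
                .collect(Collectors.toList());
    }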

Approach 2: Advanced Extractive Summarization with OpenNLP

This approach uses OpenNLP for more reliable sentence detection and POS tagging, then scores each sentence on several combined features.

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

import java.io.InputStream;
import java.util.*;
import java.util.stream.Collectors;

public class OpenNLPSummarizer {

    private SentenceDetectorME sentenceDetector;
    private TokenizerME tokenizer;
    private POSTaggerME posTagger;

    public OpenNLPSummarizer() {
        try {
            // Load models: download en-sent.bin, en-token.bin, and en-pos-maxent.bin
            // from the OpenNLP models page and place them on the classpath
            sentenceDetector = loadSentenceDetector();
            tokenizer = loadTokenizer();
            posTagger = loadPOSTagger();
        } catch (Exception e) {
            throw new RuntimeException("Failed to initialize OpenNLP models", e);
        }
    }

    public String summarize(String text, double compressionRatio) {
        String[] sentences = sentenceDetector.sentDetect(text);
        int targetSentenceCount = Math.max(1, (int) (sentences.length * compressionRatio));

        List<SentenceInfo> sentenceInfos = Arrays.stream(sentences)
                .map(this::analyzeSentence)
                .collect(Collectors.toList());

        // Score sentences using multiple features
        scoreSentences(sentenceInfos);

        // Select top sentences
        return selectTopSentences(sentenceInfos, targetSentenceCount);
    }

    private SentenceInfo analyzeSentence(String sentence) {
        String[] tokens = tokenizer.tokenize(sentence);
        String[] posTags = posTagger.tag(tokens);
        return new SentenceInfo(sentence, tokens, posTags);
    }

    private void scoreSentences(List<SentenceInfo> sentences) {
        // Feature 1: Sentence length score (medium-length sentences are often important)
        for (SentenceInfo sentence : sentences) {
            double lengthScore = calculateLengthScore(sentence.tokens.length);
            sentence.addFeature("length", lengthScore);
        }

        // Feature 2: Proper noun and entity score
        for (SentenceInfo sentence : sentences) {
            double entityScore = calculateEntityScore(sentence.posTags);
            sentence.addFeature("entities", entityScore);
        }

        // Feature 3: Keyword score (sentences with frequent words)
        Map<String, Integer> wordFreq = calculateWordFrequency(sentences);
        for (SentenceInfo sentence : sentences) {
            double keywordScore = calculateKeywordScore(sentence.tokens, wordFreq);
            sentence.addFeature("keywords", keywordScore);
        }

        // Feature 4: Position score (first sentences are often important)
        for (int i = 0; i < sentences.size(); i++) {
            double positionScore = 1.0 - (i * 0.8 / sentences.size());
            sentences.get(i).addFeature("position", positionScore);
        }

        // Final score: average of all features
        for (SentenceInfo sentence : sentences) {
            sentence.finalScore = sentence.features.values().stream()
                    .mapToDouble(Double::doubleValue)
                    .average()
                    .orElse(0.0);
        }
    }

    private double calculateLengthScore(int wordCount) {
        // Ideal sentence length between 10-25 words
        if (wordCount >= 10 && wordCount <= 25) return 1.0;
        if (wordCount >= 5 && wordCount <= 30) return 0.7;
        return 0.3;
    }

    private double calculateEntityScore(String[] posTags) {
        long entityCount = Arrays.stream(posTags)
                .filter(tag -> tag.equals("NNP") || tag.equals("NNPS")) // Proper nouns, singular and plural
                .count();
        return Math.min(entityCount / 3.0, 1.0);
    }

    private Map<String, Integer> calculateWordFrequency(List<SentenceInfo> sentences) {
        Map<String, Integer> frequency = new HashMap<>();
        for (SentenceInfo sentence : sentences) {
            for (String token : sentence.tokens) {
                String word = token.toLowerCase();
                frequency.put(word, frequency.getOrDefault(word, 0) + 1);
            }
        }
        return frequency;
    }

    private double calculateKeywordScore(String[] tokens, Map<String, Integer> wordFreq) {
        int totalWords = tokens.length;
        if (totalWords == 0) return 0.0;
        long keywordCount = Arrays.stream(tokens)
                .map(String::toLowerCase)
                .filter(word -> wordFreq.getOrDefault(word, 0) > 1)
                .count();
        return (double) keywordCount / totalWords;
    }

    private String selectTopSentences(List<SentenceInfo> sentences, int targetCount) {
        return sentences.stream()
                .sorted((a, b) -> Double.compare(b.finalScore, a.finalScore))
                .limit(targetCount)
                .sorted(Comparator.comparingInt(s -> sentences.indexOf(s))) // restore original order
                .map(s -> s.sentence)
                .collect(Collectors.joining(" "));
    }

    // Model loading methods
    private SentenceDetectorME loadSentenceDetector() throws Exception {
        try (InputStream modelIn = getClass().getResourceAsStream("/en-sent.bin")) {
            return new SentenceDetectorME(new SentenceModel(
                    Objects.requireNonNull(modelIn, "en-sent.bin not found on classpath")));
        }
    }

    private TokenizerME loadTokenizer() throws Exception {
        try (InputStream modelIn = getClass().getResourceAsStream("/en-token.bin")) {
            return new TokenizerME(new TokenizerModel(
                    Objects.requireNonNull(modelIn, "en-token.bin not found on classpath")));
        }
    }

    private POSTaggerME loadPOSTagger() throws Exception {
        try (InputStream modelIn = getClass().getResourceAsStream("/en-pos-maxent.bin")) {
            return new POSTaggerME(new POSModel(
                    Objects.requireNonNull(modelIn, "en-pos-maxent.bin not found on classpath")));
        }
    }

    private static class SentenceInfo {
        final String sentence;
        final String[] tokens;
        final String[] posTags;
        final Map<String, Double> features;
        double finalScore;

        SentenceInfo(String sentence, String[] tokens, String[] posTags) {
            this.sentence = sentence;
            this.tokens = tokens;
            this.posTags = posTags;
            this.features = new HashMap<>();
        }

        void addFeature(String name, double score) {
            features.put(name, score);
        }
    }

    public static void main(String[] args) {
        OpenNLPSummarizer summarizer = new OpenNLPSummarizer();
        String article = "The rapid advancement of artificial intelligence is reshaping numerous industries. " +
                "Healthcare professionals are using AI to diagnose diseases more accurately and quickly. " +
                "Financial institutions employ machine learning algorithms to detect fraudulent transactions. " +
                "Autonomous vehicles rely on computer vision systems to navigate roads safely. " +
                "Natural language processing enables virtual assistants to understand and respond to human queries. " +
                "The ethical implications of AI development continue to be a topic of intense debate.";

        String summary = summarizer.summarize(article, 0.4); // 40% compression
        System.out.println("Summary: " + summary);
    }
}

Approach 3: Graph-Based TextRank Algorithm

Implementing the TextRank algorithm for extractive summarization.
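
At its core, TextRank applies a PageRank-style update over a sentence-similarity graph:

    WS(Vi) = (1 - d) / N + d * Σj (wji / Σk wjk) * WS(Vj)

where d is the damping factor (0.85 below), N is the number of sentences, and wji is the similarity between sentences j and i. The implementation below uses Jaccard word overlap for the edge weights and iterates until the scores stop changing.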

import java.util.*;
import java.util.stream.Collectors;

public class TextRankSummarizer {

    public String summarize(String text, int summarySentenceCount) {
        List<String> sentences = preprocessText(text);
        double[][] similarityMatrix = buildSimilarityMatrix(sentences);
        double[] scores = textRank(similarityMatrix, 0.85, 100, 0.0001);
        return buildSummary(sentences, scores, summarySentenceCount);
    }

    private List<String> preprocessText(String text) {
        return Arrays.stream(text.split("[.!?]"))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    private double[][] buildSimilarityMatrix(List<String> sentences) {
        int n = sentences.size();
        double[][] matrix = new double[n][n];
        List<Set<String>> sentenceWords = sentences.stream()
                .map(this::extractWords)
                .collect(Collectors.toList());

        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (i == j) {
                    matrix[i][j] = 1.0;
                } else {
                    matrix[i][j] = calculateSimilarity(sentenceWords.get(i), sentenceWords.get(j));
                }
            }
        }
        return matrix;
    }

    private Set<String> extractWords(String sentence) {
        return Arrays.stream(sentence.toLowerCase().split("\\s+"))
                .map(word -> word.replaceAll("[^a-zA-Z]", ""))
                .filter(word -> word.length() > 2)
                .collect(Collectors.toSet());
    }

    // Jaccard similarity: shared words divided by total distinct words
    private double calculateSimilarity(Set<String> words1, Set<String> words2) {
        if (words1.isEmpty() || words2.isEmpty()) return 0.0;
        Set<String> intersection = new HashSet<>(words1);
        intersection.retainAll(words2);
        Set<String> union = new HashSet<>(words1);
        union.addAll(words2);
        return (double) intersection.size() / union.size();
    }

    private double[] textRank(double[][] similarityMatrix, double damping,
                              int maxIterations, double tolerance) {
        int n = similarityMatrix.length;
        double[] scores = new double[n];
        Arrays.fill(scores, 1.0 / n); // Initialize with equal scores

        // Normalize the similarity matrix so each row sums to 1
        double[][] normalizedMatrix = normalizeMatrix(similarityMatrix);

        // Power iteration until convergence or maxIterations
        for (int iter = 0; iter < maxIterations; iter++) {
            double[] newScores = new double[n];
            double maxChange = 0.0;
            for (int i = 0; i < n; i++) {
                double sum = 0.0;
                for (int j = 0; j < n; j++) {
                    if (i != j) {
                        sum += normalizedMatrix[j][i] * scores[j];
                    }
                }
                newScores[i] = (1 - damping) / n + damping * sum;
                maxChange = Math.max(maxChange, Math.abs(newScores[i] - scores[i]));
            }
            scores = newScores;
            if (maxChange < tolerance) {
                System.out.println("Converged after " + (iter + 1) + " iterations");
                break;
            }
        }
        return scores;
    }

    private double[][] normalizeMatrix(double[][] matrix) {
        int n = matrix.length;
        double[][] normalized = new double[n][n];
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int j = 0; j < n; j++) {
                sum += matrix[i][j];
            }
            if (sum > 0) {
                for (int j = 0; j < n; j++) {
                    normalized[i][j] = matrix[i][j] / sum;
                }
            }
        }
        return normalized;
    }

    private String buildSummary(List<String> sentences, double[] scores, int count) {
        List<SentenceScore> sentenceScores = new ArrayList<>();
        for (int i = 0; i < sentences.size(); i++) {
            sentenceScores.add(new SentenceScore(sentences.get(i), scores[i], i));
        }
        return sentenceScores.stream()
                .sorted((a, b) -> Double.compare(b.score, a.score))
                .limit(count)
                .sorted((a, b) -> Integer.compare(a.position, b.position))
                .map(ss -> ss.sentence)
                .collect(Collectors.joining(". ")) + ".";
    }

    private static class SentenceScore {
        final String sentence;
        final double score;
        final int position;

        SentenceScore(String sentence, double score, int position) {
            this.sentence = sentence;
            this.score = score;
            this.position = position;
        }
    }

    public static void main(String[] args) {
        TextRankSummarizer summarizer = new TextRankSummarizer();
        String document = "Machine learning is a subset of artificial intelligence. " +
                "It focuses on the development of algorithms that can learn from data. " +
                "Deep learning uses neural networks with multiple layers. " +
                "These techniques have revolutionized computer vision and natural language processing. " +
                "Supervised learning requires labeled training data. " +
                "Unsupervised learning finds patterns in unlabeled data.";

        String summary = summarizer.summarize(document, 3);
        System.out.println("TextRank Summary: " + summary);
    }
}

Approach 4: Integration with External APIs (Hugging Face)

For true abstractive summarization, you can call a hosted pre-trained model such as BART through the Hugging Face Inference API.

import com.fasterxml.jackson.databind.ObjectMapper;
import okhttp3.*;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

public class HuggingFaceSummarizer {

    private static final String API_URL = "https://api-inference.huggingface.co/models/facebook/bart-large-cnn";
    // Placeholder: supply your own token, ideally read from configuration or an environment variable
    private static final String API_TOKEN = "your_huggingface_token";

    private final OkHttpClient client;
    private final ObjectMapper mapper;

    public HuggingFaceSummarizer() {
        this.client = new OkHttpClient.Builder()
                .connectTimeout(30, TimeUnit.SECONDS)
                .readTimeout(60, TimeUnit.SECONDS)
                .build();
        this.mapper = new ObjectMapper();
    }

    public String summarize(String text, int maxLength, int minLength) {
        try {
            Map<String, Object> payload = new HashMap<>();
            payload.put("inputs", text);
            payload.put("parameters", Map.of(
                    "max_length", maxLength,
                    "min_length", minLength,
                    "do_sample", false
            ));

            String jsonPayload = mapper.writeValueAsString(payload);
            Request request = new Request.Builder()
                    .url(API_URL)
                    .post(RequestBody.create(jsonPayload, MediaType.parse("application/json")))
                    .addHeader("Authorization", "Bearer " + API_TOKEN)
                    .build();

            try (Response response = client.newCall(request).execute()) {
                if (!response.isSuccessful()) {
                    throw new IOException("Unexpected code: " + response);
                }
                String responseBody = response.body().string();
                Map[] result = mapper.readValue(responseBody, Map[].class);
                if (result.length > 0) {
                    return (String) result[0].get("summary_text");
                }
            }
        } catch (IOException e) {
            throw new RuntimeException("API call failed", e);
        }
        return "";
    }

    public static void main(String[] args) {
        HuggingFaceSummarizer summarizer = new HuggingFaceSummarizer();
        String longText = "The field of artificial intelligence has seen remarkable progress " +
                "in recent years. Deep learning models have achieved human-level performance in " +
                "various tasks including image recognition and natural language processing. " +
                "Large language models like GPT-4 can generate coherent and contextually relevant text. " +
                "These advancements are transforming industries from healthcare to education.";

        String summary = summarizer.summarize(longText, 100, 30);
        System.out.println("Abstractive Summary: " + summary);
    }
}
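
One operational caveat: hosted inference endpoints can return transient errors, for example while the model is being loaded onto a worker, so production callers typically retry. A minimal sketch (the retry policy here is an illustrative assumption, not part of the API):

    // Hypothetical retry wrapper around summarize(); retries transient failures
    // with a simple linear backoff.
    public String summarizeWithRetry(String text, int maxLength, int minLength,
                                     int maxAttempts) throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return summarize(text, maxLength, minLength);
            } catch (RuntimeException e) {
                if (attempt == maxAttempts) throw e;
                Thread.sleep(5_000L * attempt); // wait longer after each failure
            }
        }
        throw new IllegalStateException("unreachable");
    }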

Performance Optimization and Best Practices

1. Caching and Preprocessing:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CachedSummarizer {

    // ConcurrentHashMap makes computeIfAbsent safe under concurrent access;
    // note the cache is unbounded and grows with every distinct input
    private final Map<String, String> summaryCache = new ConcurrentHashMap<>();
    private final TextRankSummarizer delegate;

    public CachedSummarizer() {
        this.delegate = new TextRankSummarizer();
    }

    public String summarize(String text, int sentenceCount) {
        String key = generateKey(text, sentenceCount);
        return summaryCache.computeIfAbsent(key,
                k -> delegate.summarize(text, sentenceCount));
    }

    private String generateKey(String text, int sentenceCount) {
        // hashCode() can collide; hash the full text (e.g. SHA-256) if collisions matter
        return text.hashCode() + "_" + sentenceCount;
    }
}
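
Because the cache above is unbounded, a long-running service will accumulate entries indefinitely. A minimal LRU bound using LinkedHashMap is one option (a single-threaded sketch; wrap it with Collections.synchronizedMap or use a caching library such as Caffeine for concurrent use):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Minimal LRU cache: evicts the least-recently-accessed entry once
    // capacity is exceeded. accessOrder=true enables LRU ordering.
    public class LruSummaryCache extends LinkedHashMap<String, String> {
        private final int capacity;

        public LruSummaryCache(int capacity) {
            super(16, 0.75f, true); // accessOrder = true
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
            return size() > capacity;
        }
    }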

2. Batch Processing:

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class BatchSummarizer {

    private final TFIDFSummarizer summarizer;

    public BatchSummarizer() {
        this.summarizer = new TFIDFSummarizer();
    }

    public Map<String, String> summarizeBatch(List<String> documents, int sentenceCount) {
        return documents.parallelStream()
                .collect(Collectors.toConcurrentMap(
                        doc -> doc.substring(0, Math.min(50, doc.length())), // key: first 50 chars
                        doc -> summarizer.summarize(doc, sentenceCount),     // value: summary
                        (first, second) -> first));                          // keep first on key collision
    }
}
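
Usage is a one-liner per batch; note that the first 50 characters of each document serve as the map key, so two documents with identical openings collapse into one entry (the merge function keeps the first). The document variables here are hypothetical inputs:

    BatchSummarizer batch = new BatchSummarizer();
    List<String> documents = List.of(articleOne, articleTwo, articleThree); // hypothetical inputs
    Map<String, String> summaries = batch.summarizeBatch(documents, 2);
    summaries.forEach((key, summary) -> System.out.println(key + " -> " + summary));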

3. Evaluation Metrics:

import java.util.*;
import java.util.stream.Collectors;

public class SummaryEvaluator {

    // Rough unigram-overlap recall in the spirit of ROUGE-1;
    // use a proper ROUGE implementation for real evaluations
    public double evaluateROUGE(String generatedSummary, String referenceSummary) {
        Set<String> generatedWords = extractWords(generatedSummary);
        Set<String> referenceWords = extractWords(referenceSummary);
        if (referenceWords.isEmpty()) return 0.0;

        Set<String> intersection = new HashSet<>(generatedWords);
        intersection.retainAll(referenceWords);
        return (double) intersection.size() / referenceWords.size();
    }

    private Set<String> extractWords(String text) {
        return Arrays.stream(text.toLowerCase().split("\\s+"))
                .map(word -> word.replaceAll("[^a-zA-Z]", ""))
                .filter(word -> word.length() > 0)
                .collect(Collectors.toSet());
    }
}
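
A sketch of how the evaluator might compare the extractive approaches against a hand-written reference (the document and reference strings are hypothetical placeholders):

    SummaryEvaluator evaluator = new SummaryEvaluator();
    String reference = "A hand-written reference summary goes here."; // hypothetical
    String tfidf = new TFIDFSummarizer().summarize(document, 2);
    String textRank = new TextRankSummarizer().summarize(document, 2);
    System.out.println("TF-IDF overlap:   " + evaluator.evaluateROUGE(tfidf, reference));
    System.out.println("TextRank overlap: " + evaluator.evaluateROUGE(textRank, reference));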

Choosing the Right Approach

Approach  | Best For                                | Pros                             | Cons
----------|-----------------------------------------|----------------------------------|-------------------------------
TF-IDF    | Simple documents, quick implementation  | Fast, easy to implement          | Limited semantic understanding
TextRank  | Well-structured articles                | Captures semantic relationships  | Computationally intensive
OpenNLP   | Complex documents with entities         | Good linguistic analysis         | Requires model files
API-Based | High-quality abstractive summaries      | State-of-the-art results         | External dependency, cost
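
The table's decision logic is easy to capture in code. Here is a sketch of a small facade over the four summarizers built in this article (the mode names and default parameters are illustrative):

    // Illustrative dispatcher over the summarizers from this article
    public class SummarizerFacade {

        public enum Mode { FAST, SEMANTIC, LINGUISTIC, ABSTRACTIVE }

        public String summarize(String text, int sentenceCount, Mode mode) {
            switch (mode) {
                case FAST:        return new TFIDFSummarizer().summarize(text, sentenceCount);
                case SEMANTIC:    return new TextRankSummarizer().summarize(text, sentenceCount);
                case LINGUISTIC:  return new OpenNLPSummarizer().summarize(text, 0.4); // 40% compression
                case ABSTRACTIVE: return new HuggingFaceSummarizer().summarize(text, 100, 30);
                default:          throw new IllegalArgumentException("Unknown mode: " + mode);
            }
        }
    }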

Conclusion

Text summarization in Java offers several powerful approaches:

  • TF-IDF Summarization: Best for simple, fast implementations
  • TextRank Algorithm: Excellent for capturing semantic relationships
  • OpenNLP Integration: Provides sophisticated linguistic analysis
  • API-Based Solutions: Offers cutting-edge abstractive summarization

Key Considerations:

  1. Use Case: Choose extractive for factual content, abstractive for creative content
  2. Performance: Consider computational requirements for your application
  3. Accuracy: Evaluate different approaches with your specific content
  4. Integration: Consider how the summarizer fits into your overall architecture

By leveraging Java's robust ecosystem and performance characteristics, you can build efficient, scalable text summarization systems that meet enterprise requirements while providing meaningful content digestion capabilities for end users.
