Automating Content Digestion: Text Summarization Techniques in Java

Text summarization has become an essential technology in our information-rich world, helping users quickly understand the essence of large documents, articles, and reports. While Python dominates the AI research space, Java offers robust, scalable solutions for text summarization that are ideal for enterprise applications. This article explores various text summarization approaches in Java, from simple extractive methods to advanced AI-powered abstractive techniques.


Understanding Text Summarization

Types of Summarization:

  • Extractive Summarization: Selects and combines important sentences/phrases from the original text
  • Abstractive Summarization: Generates new sentences that capture the core meaning
  • Query-Focused Summarization: Tailors summary to answer specific questions
  • Multi-Document Summarization: Creates summaries from multiple related documents

Java's Advantages for Summarization:

  • Performance: Efficient text processing for large documents
  • Enterprise Integration: Easy integration with existing Java systems
  • Scalability: Handles high-volume summarization tasks
  • Multilingual Support: Strong internationalization capabilities

Setting Up Dependencies

Maven Dependencies:

<properties>
    <opennlp.version>2.3.0</opennlp.version>
    <stanfordnlp.version>4.5.4</stanfordnlp.version>
    <deeplearning4j.version>1.0.0-M2.1</deeplearning4j.version>
</properties>

<dependencies>
    <!-- OpenNLP for NLP tasks -->
    <dependency>
        <groupId>org.apache.opennlp</groupId>
        <artifactId>opennlp-tools</artifactId>
        <version>${opennlp.version}</version>
    </dependency>
    <!-- Stanford CoreNLP -->
    <dependency>
        <groupId>edu.stanford.nlp</groupId>
        <artifactId>stanford-corenlp</artifactId>
        <version>${stanfordnlp.version}</version>
    </dependency>
    <!-- DeepLearning4J for neural approaches -->
    <dependency>
        <groupId>org.deeplearning4j</groupId>
        <artifactId>deeplearning4j-core</artifactId>
        <version>${deeplearning4j.version}</version>
    </dependency>
    <!-- ND4J backend -->
    <dependency>
        <groupId>org.nd4j</groupId>
        <artifactId>nd4j-native-platform</artifactId>
        <version>${deeplearning4j.version}</version>
    </dependency>
    <!-- HTTP client and JSON mapper used by the API-based approach (Approach 4) -->
    <dependency>
        <groupId>com.squareup.okhttp3</groupId>
        <artifactId>okhttp</artifactId>
        <version>4.12.0</version>
    </dependency>
    <dependency>
        <groupId>com.fasterxml.jackson.core</groupId>
        <artifactId>jackson-databind</artifactId>
        <version>2.15.2</version>
    </dependency>
</dependencies>

Gradle:

dependencies {
    implementation("org.apache.opennlp:opennlp-tools:2.3.0")
    implementation("edu.stanford.nlp:stanford-corenlp:4.5.4")
    implementation("org.deeplearning4j:deeplearning4j-core:1.0.0-M2.1")
    implementation("org.nd4j:nd4j-native-platform:1.0.0-M2.1")
    // HTTP client and JSON mapper used by the API-based approach (Approach 4)
    implementation("com.squareup.okhttp3:okhttp:4.12.0")
    implementation("com.fasterxml.jackson.core:jackson-databind:2.15.2")
}

Approach 1: Simple Extractive Summarization with TF-IDF

This approach uses Term Frequency-Inverse Document Frequency to identify important sentences.
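
As a quick worked example of the scoring rule used below (tf × ln(N / (1 + df)), where N is the number of sentences and df is how many sentences contain the word): in a five-sentence document where "learning" appears three times across two sentences, its score is 3 × ln(5 / 3) ≈ 1.53, while a word appearing once in a single sentence scores 1 × ln(5 / 2) ≈ 0.92. Words that are frequent but concentrated in a few sentences score highest.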

import java.util.*;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class TFIDFSummarizer {

    private static final Pattern SENTENCE_DELIMITER = Pattern.compile("[.!?]");
    private static final Pattern WORD_DELIMITER = Pattern.compile("\\s+");

    public String summarize(String text, int summarySentenceCount) {
        // Split text into sentences
        List<String> sentences = splitSentences(text);

        // Preprocess sentences
        List<List<String>> tokenizedSentences = sentences.stream()
                .map(this::tokenizeAndClean)
                .collect(Collectors.toList());

        // Calculate a TF-IDF score for each word
        Map<String, Double> wordScores = calculateWordScores(tokenizedSentences);

        // Score sentences based on word scores
        List<SentenceScore> sentenceScores = scoreSentences(tokenizedSentences, wordScores, sentences);

        // Select top sentences
        return selectTopSentences(sentenceScores, summarySentenceCount);
    }

    private List<String> splitSentences(String text) {
        return Arrays.stream(SENTENCE_DELIMITER.split(text))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    private List<String> tokenizeAndClean(String sentence) {
        return Arrays.stream(WORD_DELIMITER.split(sentence.toLowerCase()))
                .map(word -> word.replaceAll("[^a-zA-Z]", ""))
                .filter(word -> word.length() > 2) // Remove short words
                .collect(Collectors.toList());
    }

    private Map<String, Double> calculateWordScores(List<List<String>> sentences) {
        Map<String, Integer> wordFrequency = new HashMap<>();
        Map<String, Integer> wordDocumentFrequency = new HashMap<>();

        // Count word frequencies (document frequency counts each word once per sentence)
        for (List<String> sentence : sentences) {
            Set<String> uniqueWords = new HashSet<>(sentence);
            for (String word : sentence) {
                wordFrequency.put(word, wordFrequency.getOrDefault(word, 0) + 1);
            }
            for (String word : uniqueWords) {
                wordDocumentFrequency.put(word, wordDocumentFrequency.getOrDefault(word, 0) + 1);
            }
        }

        // Calculate TF-IDF-like scores, treating each sentence as a "document"
        Map<String, Double> wordScores = new HashMap<>();
        int totalSentences = sentences.size();
        for (String word : wordFrequency.keySet()) {
            double tf = wordFrequency.get(word);
            double idf = Math.log((double) totalSentences / (1 + wordDocumentFrequency.get(word)));
            wordScores.put(word, tf * idf);
        }
        return wordScores;
    }

    private List<SentenceScore> scoreSentences(List<List<String>> tokenizedSentences,
                                               Map<String, Double> wordScores,
                                               List<String> originalSentences) {
        List<SentenceScore> sentenceScores = new ArrayList<>();
        for (int i = 0; i < tokenizedSentences.size(); i++) {
            List<String> sentence = tokenizedSentences.get(i);
            double score = sentence.stream()
                    .mapToDouble(word -> wordScores.getOrDefault(word, 0.0))
                    .average()
                    .orElse(0.0);

            // Boost score for sentences at the beginning (often contain main ideas)
            double positionBonus = 1.0 - (i * 0.1 / tokenizedSentences.size());
            score *= positionBonus;

            sentenceScores.add(new SentenceScore(originalSentences.get(i), score, i));
        }
        return sentenceScores;
    }

    private String selectTopSentences(List<SentenceScore> sentenceScores, int count) {
        return sentenceScores.stream()
                .sorted((a, b) -> Double.compare(b.score, a.score))
                .limit(count)
                .sorted((a, b) -> Integer.compare(a.originalPosition, b.originalPosition))
                .map(score -> score.sentence)
                .collect(Collectors.joining(". ")) + ".";
    }

    private static class SentenceScore {
        final String sentence;
        final double score;
        final int originalPosition;

        SentenceScore(String sentence, double score, int originalPosition) {
            this.sentence = sentence;
            this.score = score;
            this.originalPosition = originalPosition;
        }
    }

    public static void main(String[] args) {
        TFIDFSummarizer summarizer = new TFIDFSummarizer();
        String sampleText = "Artificial intelligence is transforming the way we live and work. " +
                "Machine learning algorithms can now recognize patterns in data that humans would miss. " +
                "Natural language processing enables computers to understand and generate human language. " +
                "These technologies are being applied across various industries from healthcare to finance. " +
                "The future of AI holds great promise for solving complex global challenges.";

        String summary = summarizer.summarize(sampleText, 2);
        System.out.println("Original Text: " + sampleText);
        System.out.println("Summary: " + summary);
    }
}
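
One refinement worth considering: the tokenizer above only drops words of two characters or fewer, so common function words like "the", "that", or "with" survive and can inflate scores. A minimal stop-word filter could replace tokenizeAndClean (the word list here is illustrative; extend it for real use):

    // Illustrative stop-word list; a production system would use a fuller one
    private static final Set<String> STOP_WORDS = Set.of(
            "the", "and", "that", "with", "from", "this", "are", "was", "have", "has");

    private List<String> tokenizeAndClean(String sentence) {
        return Arrays.stream(WORD_DELIMITER.split(sentence.toLowerCase()))
                .map(word -> word.replaceAll("[^a-zA-Z]", ""))
                .filter(word -> word.length() > 2)          // remove short words
                .filter(word -> !STOP_WORDS.contains(word)) // drop stop words
                .collect(Collectors.toList());
    }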

Approach 2: Advanced Extractive Summarization with OpenNLP

This approach uses OpenNLP for more reliable sentence detection and POS tagging, then scores each sentence on several combined features.

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

import java.io.InputStream;
import java.util.*;
import java.util.stream.Collectors;

public class OpenNLPSummarizer {

    private SentenceDetectorME sentenceDetector;
    private TokenizerME tokenizer;
    private POSTaggerME posTagger;

    public OpenNLPSummarizer() {
        try {
            // Load models: download en-sent.bin, en-token.bin, and en-pos-maxent.bin
            // from the OpenNLP models page and place them on the classpath
            sentenceDetector = loadSentenceDetector();
            tokenizer = loadTokenizer();
            posTagger = loadPOSTagger();
        } catch (Exception e) {
            throw new RuntimeException("Failed to initialize OpenNLP models", e);
        }
    }

    public String summarize(String text, double compressionRatio) {
        String[] sentences = sentenceDetector.sentDetect(text);
        int targetSentenceCount = Math.max(1, (int) (sentences.length * compressionRatio));

        List<SentenceInfo> sentenceInfos = Arrays.stream(sentences)
                .map(this::analyzeSentence)
                .collect(Collectors.toList());

        // Score sentences using multiple features
        scoreSentences(sentenceInfos);

        // Select top sentences
        return selectTopSentences(sentenceInfos, targetSentenceCount);
    }

    private SentenceInfo analyzeSentence(String sentence) {
        String[] tokens = tokenizer.tokenize(sentence);
        String[] posTags = posTagger.tag(tokens);
        return new SentenceInfo(sentence, tokens, posTags);
    }

    private void scoreSentences(List<SentenceInfo> sentences) {
        // Feature 1: Sentence length score (medium-length sentences are often important)
        for (SentenceInfo sentence : sentences) {
            double lengthScore = calculateLengthScore(sentence.tokens.length);
            sentence.addFeature("length", lengthScore);
        }

        // Feature 2: Proper noun and entity score
        for (SentenceInfo sentence : sentences) {
            double entityScore = calculateEntityScore(sentence.posTags);
            sentence.addFeature("entities", entityScore);
        }

        // Feature 3: Keyword score (sentences with frequent words)
        Map<String, Integer> wordFreq = calculateWordFrequency(sentences);
        for (SentenceInfo sentence : sentences) {
            double keywordScore = calculateKeywordScore(sentence.tokens, wordFreq);
            sentence.addFeature("keywords", keywordScore);
        }

        // Feature 4: Position score (first sentences are often important)
        for (int i = 0; i < sentences.size(); i++) {
            double positionScore = 1.0 - (i * 0.8 / sentences.size());
            sentences.get(i).addFeature("position", positionScore);
        }

        // Final score: average of all features
        for (SentenceInfo sentence : sentences) {
            sentence.finalScore = sentence.features.values().stream()
                    .mapToDouble(Double::doubleValue)
                    .average()
                    .orElse(0.0);
        }
    }

    private double calculateLengthScore(int wordCount) {
        // Ideal sentence length between 10-25 words
        if (wordCount >= 10 && wordCount <= 25) return 1.0;
        if (wordCount >= 5 && wordCount <= 30) return 0.7;
        return 0.3;
    }

    private double calculateEntityScore(String[] posTags) {
        long entityCount = Arrays.stream(posTags)
                .filter(tag -> tag.equals("NNP") || tag.equals("NNPS")) // Proper nouns, singular and plural
                .count();
        return Math.min(entityCount / 3.0, 1.0);
    }

    private Map<String, Integer> calculateWordFrequency(List<SentenceInfo> sentences) {
        Map<String, Integer> frequency = new HashMap<>();
        for (SentenceInfo sentence : sentences) {
            for (String token : sentence.tokens) {
                String word = token.toLowerCase();
                frequency.put(word, frequency.getOrDefault(word, 0) + 1);
            }
        }
        return frequency;
    }

    private double calculateKeywordScore(String[] tokens, Map<String, Integer> wordFreq) {
        int totalWords = tokens.length;
        if (totalWords == 0) return 0.0;
        long keywordCount = Arrays.stream(tokens)
                .map(String::toLowerCase)
                .filter(word -> wordFreq.getOrDefault(word, 0) > 1)
                .count();
        return (double) keywordCount / totalWords;
    }

    private String selectTopSentences(List<SentenceInfo> sentences, int targetCount) {
        return sentences.stream()
                .sorted((a, b) -> Double.compare(b.finalScore, a.finalScore))
                .limit(targetCount)
                .sorted(Comparator.comparingInt(s -> sentences.indexOf(s))) // restore original order
                .map(s -> s.sentence)
                .collect(Collectors.joining(" "));
    }

    // Model loading methods
    private SentenceDetectorME loadSentenceDetector() throws Exception {
        try (InputStream modelIn = getClass().getResourceAsStream("/en-sent.bin")) {
            return new SentenceDetectorME(new SentenceModel(
                    Objects.requireNonNull(modelIn, "en-sent.bin not found on classpath")));
        }
    }

    private TokenizerME loadTokenizer() throws Exception {
        try (InputStream modelIn = getClass().getResourceAsStream("/en-token.bin")) {
            return new TokenizerME(new TokenizerModel(
                    Objects.requireNonNull(modelIn, "en-token.bin not found on classpath")));
        }
    }

    private POSTaggerME loadPOSTagger() throws Exception {
        try (InputStream modelIn = getClass().getResourceAsStream("/en-pos-maxent.bin")) {
            return new POSTaggerME(new POSModel(
                    Objects.requireNonNull(modelIn, "en-pos-maxent.bin not found on classpath")));
        }
    }

    private static class SentenceInfo {
        final String sentence;
        final String[] tokens;
        final String[] posTags;
        final Map<String, Double> features;
        double finalScore;

        SentenceInfo(String sentence, String[] tokens, String[] posTags) {
            this.sentence = sentence;
            this.tokens = tokens;
            this.posTags = posTags;
            this.features = new HashMap<>();
        }

        void addFeature(String name, double score) {
            features.put(name, score);
        }
    }

    public static void main(String[] args) {
        OpenNLPSummarizer summarizer = new OpenNLPSummarizer();
        String article = "The rapid advancement of artificial intelligence is reshaping numerous industries. " +
                "Healthcare professionals are using AI to diagnose diseases more accurately and quickly. " +
                "Financial institutions employ machine learning algorithms to detect fraudulent transactions. " +
                "Autonomous vehicles rely on computer vision systems to navigate roads safely. " +
                "Natural language processing enables virtual assistants to understand and respond to human queries. " +
                "The ethical implications of AI development continue to be a topic of intense debate.";

        String summary = summarizer.summarize(article, 0.4); // 40% compression
        System.out.println("Summary: " + summary);
    }
}

Approach 3: Graph-Based TextRank Algorithm

Implementing the TextRank algorithm for extractive summarization.
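
At its core, TextRank applies a PageRank-style update over a sentence-similarity graph:

    WS(Vi) = (1 - d) / N + d * Σj (wji / Σk wjk) * WS(Vj)

where d is the damping factor (0.85 below), N is the number of sentences, and wji is the similarity between sentences j and i. The implementation below uses Jaccard word overlap for the edge weights and iterates until the scores stop changing.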

import java.util.*;
import java.util.stream.Collectors;

public class TextRankSummarizer {

    public String summarize(String text, int summarySentenceCount) {
        List<String> sentences = preprocessText(text);
        double[][] similarityMatrix = buildSimilarityMatrix(sentences);
        double[] scores = textRank(similarityMatrix, 0.85, 100, 0.0001);
        return buildSummary(sentences, scores, summarySentenceCount);
    }

    private List<String> preprocessText(String text) {
        return Arrays.stream(text.split("[.!?]"))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    private double[][] buildSimilarityMatrix(List<String> sentences) {
        int n = sentences.size();
        double[][] matrix = new double[n][n];
        List<Set<String>> sentenceWords = sentences.stream()
                .map(this::extractWords)
                .collect(Collectors.toList());

        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (i == j) {
                    matrix[i][j] = 1.0;
                } else {
                    matrix[i][j] = calculateSimilarity(sentenceWords.get(i), sentenceWords.get(j));
                }
            }
        }
        return matrix;
    }

    private Set<String> extractWords(String sentence) {
        return Arrays.stream(sentence.toLowerCase().split("\\s+"))
                .map(word -> word.replaceAll("[^a-zA-Z]", ""))
                .filter(word -> word.length() > 2)
                .collect(Collectors.toSet());
    }

    // Jaccard similarity: shared words divided by total distinct words
    private double calculateSimilarity(Set<String> words1, Set<String> words2) {
        if (words1.isEmpty() || words2.isEmpty()) return 0.0;
        Set<String> intersection = new HashSet<>(words1);
        intersection.retainAll(words2);
        Set<String> union = new HashSet<>(words1);
        union.addAll(words2);
        return (double) intersection.size() / union.size();
    }

    private double[] textRank(double[][] similarityMatrix, double damping,
                              int maxIterations, double tolerance) {
        int n = similarityMatrix.length;
        double[] scores = new double[n];
        Arrays.fill(scores, 1.0 / n); // Initialize with equal scores

        // Normalize the similarity matrix so each row sums to 1
        double[][] normalizedMatrix = normalizeMatrix(similarityMatrix);

        // Power iteration until convergence or maxIterations
        for (int iter = 0; iter < maxIterations; iter++) {
            double[] newScores = new double[n];
            double maxChange = 0.0;
            for (int i = 0; i < n; i++) {
                double sum = 0.0;
                for (int j = 0; j < n; j++) {
                    if (i != j) {
                        sum += normalizedMatrix[j][i] * scores[j];
                    }
                }
                newScores[i] = (1 - damping) / n + damping * sum;
                maxChange = Math.max(maxChange, Math.abs(newScores[i] - scores[i]));
            }
            scores = newScores;
            if (maxChange < tolerance) {
                System.out.println("Converged after " + (iter + 1) + " iterations");
                break;
            }
        }
        return scores;
    }

    private double[][] normalizeMatrix(double[][] matrix) {
        int n = matrix.length;
        double[][] normalized = new double[n][n];
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int j = 0; j < n; j++) {
                sum += matrix[i][j];
            }
            if (sum > 0) {
                for (int j = 0; j < n; j++) {
                    normalized[i][j] = matrix[i][j] / sum;
                }
            }
        }
        return normalized;
    }

    private String buildSummary(List<String> sentences, double[] scores, int count) {
        List<SentenceScore> sentenceScores = new ArrayList<>();
        for (int i = 0; i < sentences.size(); i++) {
            sentenceScores.add(new SentenceScore(sentences.get(i), scores[i], i));
        }
        return sentenceScores.stream()
                .sorted((a, b) -> Double.compare(b.score, a.score))
                .limit(count)
                .sorted((a, b) -> Integer.compare(a.position, b.position))
                .map(ss -> ss.sentence)
                .collect(Collectors.joining(". ")) + ".";
    }

    private static class SentenceScore {
        final String sentence;
        final double score;
        final int position;

        SentenceScore(String sentence, double score, int position) {
            this.sentence = sentence;
            this.score = score;
            this.position = position;
        }
    }

    public static void main(String[] args) {
        TextRankSummarizer summarizer = new TextRankSummarizer();
        String document = "Machine learning is a subset of artificial intelligence. " +
                "It focuses on the development of algorithms that can learn from data. " +
                "Deep learning uses neural networks with multiple layers. " +
                "These techniques have revolutionized computer vision and natural language processing. " +
                "Supervised learning requires labeled training data. " +
                "Unsupervised learning finds patterns in unlabeled data.";

        String summary = summarizer.summarize(document, 3);
        System.out.println("TextRank Summary: " + summary);
    }
}

Approach 4: Integration with External APIs (Hugging Face)

For true abstractive summarization, you can call a hosted pre-trained model such as BART through the Hugging Face Inference API.

import com.fasterxml.jackson.databind.ObjectMapper;
import okhttp3.*;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

public class HuggingFaceSummarizer {

    private static final String API_URL = "https://api-inference.huggingface.co/models/facebook/bart-large-cnn";
    // Placeholder: supply your own token, ideally read from configuration or an environment variable
    private static final String API_TOKEN = "your_huggingface_token";

    private final OkHttpClient client;
    private final ObjectMapper mapper;

    public HuggingFaceSummarizer() {
        this.client = new OkHttpClient.Builder()
                .connectTimeout(30, TimeUnit.SECONDS)
                .readTimeout(60, TimeUnit.SECONDS)
                .build();
        this.mapper = new ObjectMapper();
    }

    public String summarize(String text, int maxLength, int minLength) {
        try {
            Map<String, Object> payload = new HashMap<>();
            payload.put("inputs", text);
            payload.put("parameters", Map.of(
                    "max_length", maxLength,
                    "min_length", minLength,
                    "do_sample", false
            ));

            String jsonPayload = mapper.writeValueAsString(payload);
            Request request = new Request.Builder()
                    .url(API_URL)
                    .post(RequestBody.create(jsonPayload, MediaType.parse("application/json")))
                    .addHeader("Authorization", "Bearer " + API_TOKEN)
                    .build();

            try (Response response = client.newCall(request).execute()) {
                if (!response.isSuccessful()) {
                    throw new IOException("Unexpected code: " + response);
                }
                String responseBody = response.body().string();
                Map[] result = mapper.readValue(responseBody, Map[].class);
                if (result.length > 0) {
                    return (String) result[0].get("summary_text");
                }
            }
        } catch (IOException e) {
            throw new RuntimeException("API call failed", e);
        }
        return "";
    }

    public static void main(String[] args) {
        HuggingFaceSummarizer summarizer = new HuggingFaceSummarizer();
        String longText = "The field of artificial intelligence has seen remarkable progress " +
                "in recent years. Deep learning models have achieved human-level performance in " +
                "various tasks including image recognition and natural language processing. " +
                "Large language models like GPT-4 can generate coherent and contextually relevant text. " +
                "These advancements are transforming industries from healthcare to education.";

        String summary = summarizer.summarize(longText, 100, 30);
        System.out.println("Abstractive Summary: " + summary);
    }
}
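
One operational caveat: hosted inference endpoints can return transient errors, for example while the model is being loaded onto a worker, so production callers typically retry. A minimal sketch (the retry policy here is an illustrative assumption, not part of the API):

    // Hypothetical retry wrapper around summarize(); retries transient failures
    // with a simple linear backoff.
    public String summarizeWithRetry(String text, int maxLength, int minLength,
                                     int maxAttempts) throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return summarize(text, maxLength, minLength);
            } catch (RuntimeException e) {
                if (attempt == maxAttempts) throw e;
                Thread.sleep(5_000L * attempt); // wait longer after each failure
            }
        }
        throw new IllegalStateException("unreachable");
    }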

Performance Optimization and Best Practices

1. Caching and Preprocessing:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CachedSummarizer {

    // ConcurrentHashMap makes computeIfAbsent safe under concurrent access;
    // note the cache is unbounded and grows with every distinct input
    private final Map<String, String> summaryCache = new ConcurrentHashMap<>();
    private final TextRankSummarizer delegate;

    public CachedSummarizer() {
        this.delegate = new TextRankSummarizer();
    }

    public String summarize(String text, int sentenceCount) {
        String key = generateKey(text, sentenceCount);
        return summaryCache.computeIfAbsent(key,
                k -> delegate.summarize(text, sentenceCount));
    }

    private String generateKey(String text, int sentenceCount) {
        // hashCode() can collide; hash the full text (e.g. SHA-256) if collisions matter
        return text.hashCode() + "_" + sentenceCount;
    }
}
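
Because the cache above is unbounded, a long-running service will accumulate entries indefinitely. A minimal LRU bound using LinkedHashMap is one option (a single-threaded sketch; wrap it with Collections.synchronizedMap or use a caching library such as Caffeine for concurrent use):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Minimal LRU cache: evicts the least-recently-accessed entry once
    // capacity is exceeded. accessOrder=true enables LRU ordering.
    public class LruSummaryCache extends LinkedHashMap<String, String> {
        private final int capacity;

        public LruSummaryCache(int capacity) {
            super(16, 0.75f, true); // accessOrder = true
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
            return size() > capacity;
        }
    }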

2. Batch Processing:

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class BatchSummarizer {

    private final TFIDFSummarizer summarizer;

    public BatchSummarizer() {
        this.summarizer = new TFIDFSummarizer();
    }

    public Map<String, String> summarizeBatch(List<String> documents, int sentenceCount) {
        return documents.parallelStream()
                .collect(Collectors.toConcurrentMap(
                        doc -> doc.substring(0, Math.min(50, doc.length())), // key: first 50 chars
                        doc -> summarizer.summarize(doc, sentenceCount),     // value: summary
                        (first, second) -> first));                          // keep first on key collision
    }
}
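
Usage is a one-liner per batch; note that the first 50 characters of each document serve as the map key, so two documents with identical openings collapse into one entry (the merge function keeps the first). The document variables here are hypothetical inputs:

    BatchSummarizer batch = new BatchSummarizer();
    List<String> documents = List.of(articleOne, articleTwo, articleThree); // hypothetical inputs
    Map<String, String> summaries = batch.summarizeBatch(documents, 2);
    summaries.forEach((key, summary) -> System.out.println(key + " -> " + summary));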

3. Evaluation Metrics:

import java.util.*;
import java.util.stream.Collectors;

public class SummaryEvaluator {

    // Rough unigram-overlap recall in the spirit of ROUGE-1;
    // use a proper ROUGE implementation for real evaluations
    public double evaluateROUGE(String generatedSummary, String referenceSummary) {
        Set<String> generatedWords = extractWords(generatedSummary);
        Set<String> referenceWords = extractWords(referenceSummary);
        if (referenceWords.isEmpty()) return 0.0;

        Set<String> intersection = new HashSet<>(generatedWords);
        intersection.retainAll(referenceWords);
        return (double) intersection.size() / referenceWords.size();
    }

    private Set<String> extractWords(String text) {
        return Arrays.stream(text.toLowerCase().split("\\s+"))
                .map(word -> word.replaceAll("[^a-zA-Z]", ""))
                .filter(word -> word.length() > 0)
                .collect(Collectors.toSet());
    }
}
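
A sketch of how the evaluator might compare the extractive approaches against a hand-written reference (the document and reference strings are hypothetical placeholders):

    SummaryEvaluator evaluator = new SummaryEvaluator();
    String reference = "A hand-written reference summary goes here."; // hypothetical
    String tfidf = new TFIDFSummarizer().summarize(document, 2);
    String textRank = new TextRankSummarizer().summarize(document, 2);
    System.out.println("TF-IDF overlap:   " + evaluator.evaluateROUGE(tfidf, reference));
    System.out.println("TextRank overlap: " + evaluator.evaluateROUGE(textRank, reference));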

Choosing the Right Approach

Approach  | Best For                                | Pros                             | Cons
----------|-----------------------------------------|----------------------------------|-------------------------------
TF-IDF    | Simple documents, quick implementation  | Fast, easy to implement          | Limited semantic understanding
TextRank  | Well-structured articles                | Captures semantic relationships  | Computationally intensive
OpenNLP   | Complex documents with entities         | Good linguistic analysis         | Requires model files
API-Based | High-quality abstractive summaries      | State-of-the-art results         | External dependency, cost
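
The table's decision logic is easy to capture in code. Here is a sketch of a small facade over the four summarizers built in this article (the mode names and default parameters are illustrative):

    // Illustrative dispatcher over the summarizers from this article
    public class SummarizerFacade {

        public enum Mode { FAST, SEMANTIC, LINGUISTIC, ABSTRACTIVE }

        public String summarize(String text, int sentenceCount, Mode mode) {
            switch (mode) {
                case FAST:        return new TFIDFSummarizer().summarize(text, sentenceCount);
                case SEMANTIC:    return new TextRankSummarizer().summarize(text, sentenceCount);
                case LINGUISTIC:  return new OpenNLPSummarizer().summarize(text, 0.4); // 40% compression
                case ABSTRACTIVE: return new HuggingFaceSummarizer().summarize(text, 100, 30);
                default:          throw new IllegalArgumentException("Unknown mode: " + mode);
            }
        }
    }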

Conclusion

Text summarization in Java offers several powerful approaches:

  • TF-IDF Summarization: Best for simple, fast implementations
  • TextRank Algorithm: Excellent for capturing semantic relationships
  • OpenNLP Integration: Provides sophisticated linguistic analysis
  • API-Based Solutions: Offers cutting-edge abstractive summarization

Key Considerations:

  1. Use Case: Choose extractive for factual content, abstractive for creative content
  2. Performance: Consider computational requirements for your application
  3. Accuracy: Evaluate different approaches with your specific content
  4. Integration: Consider how the summarizer fits into your overall architecture

By leveraging Java's robust ecosystem and performance characteristics, you can build efficient, scalable text summarization systems that meet enterprise requirements while providing meaningful content digestion capabilities for end users.
