Voice assistants have become ubiquitous in our daily lives, powering everything from smart home devices to customer service systems. While Python often dominates the AI conversation, Java provides a robust, scalable platform for building enterprise-grade voice assistants. This article explores how to create a complete voice assistant in Java, covering speech recognition, natural language processing, and voice synthesis.
Architecture of a Java Voice Assistant
A typical voice assistant follows this pipeline:
Voice Input → Speech Recognition → NLP Processing → Command Execution → Text-to-Speech
Key Components:
- Speech Recognition: Convert audio to text
- Natural Language Processing: Understand user intent
- Command Dispatcher: Execute appropriate actions
- Text-to-Speech: Convert responses to audio
- Wake Word Detection: Activate on specific triggers
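Conceptually, the text-side stages of this pipeline are just a chain of transformations. The sketch below models them as java.util.function.Function compositions; the class name and lambda contents are illustrative placeholders, not part of any library:

```java
import java.util.function.Function;

// Illustrative only: each stage is a Function, so the text side of the
// assistant is the composition  Text -> Intent -> Response.
public class PipelineSketch {
    public static Function<String, String> buildTextPipeline(
            Function<String, String> nlp,        // recognized text -> intent label
            Function<String, String> dispatch) { // intent label -> response text
        return nlp.andThen(dispatch);
    }

    public static void main(String[] args) {
        Function<String, String> pipeline = buildTextPipeline(
                text -> text.toLowerCase().contains("time") ? "GET_TIME" : "UNKNOWN",
                intent -> intent.equals("GET_TIME") ? "It is noon." : "Sorry?");
        System.out.println(pipeline.apply("What time is it?")); // prints "It is noon."
    }
}
```

The real components below follow this shape: speech recognition produces the input text, the NLP engine plays the first function, and the command dispatcher plays the second.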
Setting Up Dependencies
Maven Dependencies:
<properties>
<sphinx4.version>5prealpha-SNAPSHOT</sphinx4.version>
<openai.version>0.16.1</openai.version>
<freetts.version>1.2.2</freetts.version>
</properties>
<dependencies>
<!-- CMU Sphinx for Speech Recognition -->
<dependency>
<groupId>edu.cmu.sphinx</groupId>
<artifactId>sphinx4-core</artifactId>
<version>${sphinx4.version}</version>
</dependency>
<dependency>
<groupId>edu.cmu.sphinx</groupId>
<artifactId>sphinx4-data</artifactId>
<version>${sphinx4.version}</version>
</dependency>
<!-- FreeTTS for Text-to-Speech -->
<dependency>
<groupId>com.sun.speech</groupId>
<artifactId>freetts</artifactId>
<version>${freetts.version}</version>
</dependency>
<!-- OpenAI for advanced NLP -->
<dependency>
<groupId>com.theokanning.openai-gpt3-java</groupId>
<artifactId>service</artifactId>
<version>${openai.version}</version>
</dependency>
<!-- Audio processing: javax.sound.sampled ships with the JDK, so no extra dependency is required -->
</dependencies>
<repositories>
<repository>
<id>snapshots</id>
<url>https://oss.sonatype.org/content/repositories/snapshots</url>
</repository>
</repositories>
Gradle:
dependencies {
implementation("edu.cmu.sphinx:sphinx4-core:5prealpha-SNAPSHOT")
implementation("edu.cmu.sphinx:sphinx4-data:5prealpha-SNAPSHOT")
implementation("com.sun.speech:freetts:1.2.2")
implementation("com.theokanning.openai-gpt3-java:service:0.16.1")
}
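Before wiring up Sphinx, it is worth confirming that the JVM can open a microphone line in the format the default en-us acoustic model expects: 16 kHz, 16-bit, signed, mono. A minimal check using the JDK's built-in javax.sound.sampled API (the class name is just for illustration; on a headless machine the check will simply report false):

```java
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.DataLine;
import javax.sound.sampled.TargetDataLine;

public class AudioSetupCheck {
    // 16 kHz, 16-bit, mono, signed, little-endian: the audio format the
    // default Sphinx4 en-us acoustic model is trained on
    public static AudioFormat sphinxFormat() {
        return new AudioFormat(16000f, 16, 1, true, false);
    }

    public static void main(String[] args) {
        AudioFormat format = sphinxFormat();
        DataLine.Info info = new DataLine.Info(TargetDataLine.class, format);
        System.out.println("Microphone line supported: "
                + AudioSystem.isLineSupported(info));
    }
}
```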
Component 1: Speech Recognition with CMU Sphinx
import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.LiveSpeechRecognizer;
import edu.cmu.sphinx.api.SpeechResult;
import java.io.IOException;
public class SpeechRecognitionEngine {
private LiveSpeechRecognizer recognizer;
private boolean isListening = false;
public SpeechRecognitionEngine() {
initializeRecognizer();
}
private void initializeRecognizer() {
try {
Configuration configuration = new Configuration();
// Set path to acoustic model
configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");
// Optional: use a JSGF grammar for a fixed command set
// (when useGrammar is true, the grammar replaces the language model above)
configuration.setGrammarPath("resource:/grammars");
configuration.setGrammarName("grammar");
configuration.setUseGrammar(true);
recognizer = new LiveSpeechRecognizer(configuration);
} catch (IOException e) {
System.err.println("Failed to initialize speech recognizer: " + e.getMessage());
}
}
public String listenForSpeech() {
if (recognizer == null) {
return "Speech recognizer not initialized";
}
System.out.println("Listening... Speak now!");
recognizer.startRecognition(true);
SpeechResult result = recognizer.getResult();
recognizer.stopRecognition();
if (result != null) {
String hypothesis = result.getHypothesis();
System.out.println("Recognized: " + hypothesis);
return hypothesis;
}
return "";
}
public void startContinuousListening(SpeechCallback callback) {
if (isListening || recognizer == null) return;
isListening = true;
new Thread(() -> {
recognizer.startRecognition(true);
while (isListening) {
// getResult() blocks until the next utterance has been recognized,
// so stopContinuousListening() takes effect after the current utterance
SpeechResult result = recognizer.getResult();
if (result != null) {
String text = result.getHypothesis();
if (text != null && !text.isEmpty()) {
callback.onSpeechRecognized(text);
}
}
}
recognizer.stopRecognition();
}).start();
}
public void stopContinuousListening() {
isListening = false;
}
public interface SpeechCallback {
void onSpeechRecognized(String text);
}
public static void main(String[] args) {
SpeechRecognitionEngine engine = new SpeechRecognitionEngine();
// Test single recognition
String result = engine.listenForSpeech();
System.out.println("You said: " + result);
}
}
Component 2: Text-to-Speech with FreeTTS
import com.sun.speech.freetts.Voice;
import com.sun.speech.freetts.VoiceManager;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.Clip;
import javax.sound.sampled.LineUnavailableException;
import javax.sound.sampled.UnsupportedAudioFileException;
import java.io.File;
import java.io.IOException;
public class TextToSpeechEngine {
private Voice voice;
private static final String VOICE_NAME = "kevin16";
public TextToSpeechEngine() {
initializeVoice();
}
private void initializeVoice() {
VoiceManager voiceManager = VoiceManager.getInstance();
voice = voiceManager.getVoice(VOICE_NAME);
if (voice == null) {
System.err.println("Cannot find voice: " + VOICE_NAME);
return;
}
voice.allocate();
}
public void speak(String text) {
if (voice == null) {
System.err.println("Voice not initialized");
return;
}
System.out.println("Assistant: " + text);
// Speak in a separate thread to avoid blocking
new Thread(() -> {
voice.speak(text);
}).start();
}
public void speakWithEmotion(String text, String emotion) {
if (voice == null) {
System.err.println("Voice not initialized");
return;
}
// Adjust voice parameters based on emotion
switch (emotion.toLowerCase()) {
case "happy":
voice.setPitch(180); // Higher pitch
voice.setPitchRange(15);
break;
case "serious":
voice.setPitch(100); // Lower pitch
voice.setPitchRange(5);
break;
case "excited":
voice.setRate(180); // Faster rate
break;
default:
voice.setPitch(150);
voice.setPitchRange(12);
voice.setRate(150);
}
System.out.println("Assistant: " + text);
// Speak synchronously here; using the asynchronous speak() would let the
// parameter reset below run before the utterance finishes
voice.speak(text);
// Reset to default
voice.setPitch(150);
voice.setPitchRange(12);
voice.setRate(150);
}
public void playSound(String audioFilePath) {
try {
Clip clip = AudioSystem.getClip();
clip.open(AudioSystem.getAudioInputStream(new File(audioFilePath)));
// Close the clip when playback stops so the audio line is released
clip.addLineListener(event -> {
if (event.getType() == javax.sound.sampled.LineEvent.Type.STOP) {
clip.close();
}
});
clip.start();
} catch (LineUnavailableException | IOException | UnsupportedAudioFileException e) {
System.err.println("Error playing sound: " + e.getMessage());
}
}
public void dispose() {
if (voice != null) {
voice.deallocate();
}
}
public static void main(String[] args) {
TextToSpeechEngine tts = new TextToSpeechEngine();
tts.speak("Hello! I am your Java voice assistant. How can I help you today?");
try {
Thread.sleep(5000); // Wait for speech to complete
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
tts.dispose();
}
}
Component 3: Natural Language Processing Engine
import java.util.*;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class NLPEngine {
private Map<String, List<Pattern>> intentPatterns;
public NLPEngine() {
initializeIntentPatterns();
}
private void initializeIntentPatterns() {
intentPatterns = new HashMap<>();
// Greeting patterns
List<Pattern> greetingPatterns = Arrays.asList(
Pattern.compile("(?i)(hello|hi|hey|greetings).*"),
Pattern.compile("(?i).*(good morning|good afternoon|good evening).*")
);
intentPatterns.put("GREETING", greetingPatterns);
// Time patterns
List<Pattern> timePatterns = Arrays.asList(
Pattern.compile("(?i).*what.*time.*"),
Pattern.compile("(?i).*current time.*"),
Pattern.compile("(?i).*tell me the time.*")
);
intentPatterns.put("GET_TIME", timePatterns);
// Weather patterns
List<Pattern> weatherPatterns = Arrays.asList(
Pattern.compile("(?i).*weather.*"),
Pattern.compile("(?i).*temperature.*"),
Pattern.compile("(?i)how.*weather.*")
);
intentPatterns.put("GET_WEATHER", weatherPatterns);
// Calculation patterns
List<Pattern> calculationPatterns = Arrays.asList(
Pattern.compile("(?i).*calculate.*"),
Pattern.compile("(?i).*what is (\\d+).*(\\+|plus|minus|-|times|\\*|divided by|/).*(\\d+).*"),
Pattern.compile("(?i).*compute.*")
);
intentPatterns.put("CALCULATE", calculationPatterns);
// Joke patterns
List<Pattern> jokePatterns = Arrays.asList(
Pattern.compile("(?i).*tell.*joke.*"),
Pattern.compile("(?i).*make me laugh.*"),
Pattern.compile("(?i).*funny.*")
);
intentPatterns.put("TELL_JOKE", jokePatterns);
// Exit patterns
List<Pattern> exitPatterns = Arrays.asList(
Pattern.compile("(?i).*exit.*"),
Pattern.compile("(?i).*quit.*"),
Pattern.compile("(?i).*goodbye.*"),
Pattern.compile("(?i).*stop.*")
);
intentPatterns.put("EXIT", exitPatterns);
}
public IntentResult processText(String text) {
if (text == null || text.trim().isEmpty()) {
return new IntentResult("UNKNOWN", Collections.emptyMap(), text);
}
// Check each intent pattern
for (Map.Entry<String, List<Pattern>> entry : intentPatterns.entrySet()) {
String intent = entry.getKey();
List<Pattern> patterns = entry.getValue();
for (Pattern pattern : patterns) {
Matcher matcher = pattern.matcher(text);
if (matcher.matches()) {
Map<String, String> entities = extractEntities(text, intent);
return new IntentResult(intent, entities, text);
}
}
}
return new IntentResult("UNKNOWN", Collections.emptyMap(), text);
}
private Map<String, String> extractEntities(String text, String intent) {
Map<String, String> entities = new HashMap<>();
switch (intent) {
case "CALCULATE":
extractMathEntities(text, entities);
break;
case "GET_WEATHER":
extractLocationEntities(text, entities);
break;
}
return entities;
}
private void extractMathEntities(String text, Map<String, String> entities) {
// Extract numbers and operators
Pattern mathPattern = Pattern.compile("(\\d+)\\s*(\\+|plus|minus|-|times|\\*|divided by|/)\\s*(\\d+)");
Matcher matcher = mathPattern.matcher(text.toLowerCase());
if (matcher.find()) {
entities.put("number1", matcher.group(1));
entities.put("operator", matcher.group(2));
entities.put("number2", matcher.group(3));
}
}
private void extractLocationEntities(String text, Map<String, String> entities) {
// Simple location extraction (in real app, use NER)
String[] locations = {"new york", "london", "paris", "tokyo", "berlin"};
for (String location : locations) {
if (text.toLowerCase().contains(location)) {
entities.put("location", location);
break;
}
}
if (!entities.containsKey("location")) {
entities.put("location", "current location");
}
}
public static class IntentResult {
private final String intent;
private final Map<String, String> entities;
private final String originalText;
public IntentResult(String intent, Map<String, String> entities, String originalText) {
this.intent = intent;
this.entities = entities;
this.originalText = originalText;
}
// Getters
public String getIntent() { return intent; }
public Map<String, String> getEntities() { return entities; }
public String getOriginalText() { return originalText; }
@Override
public String toString() {
return String.format("Intent: %s, Entities: %s", intent, entities);
}
}
public static void main(String[] args) {
NLPEngine nlp = new NLPEngine();
String[] testPhrases = {
"Hello there!",
"What time is it?",
"What's the weather in London?",
"Calculate 15 plus 27",
"Tell me a joke",
"Goodbye"
};
for (String phrase : testPhrases) {
IntentResult result = nlp.processText(phrase);
System.out.println("'" + phrase + "' -> " + result);
}
}
}
Component 4: Command Dispatcher and Action Engine
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.*;
public class CommandDispatcher {
private final TextToSpeechEngine tts;
private final Map<String, CommandHandler> commandHandlers;
public CommandDispatcher(TextToSpeechEngine tts) {
this.tts = tts;
this.commandHandlers = new HashMap<>();
initializeHandlers();
}
private void initializeHandlers() {
commandHandlers.put("GREETING", this::handleGreeting);
commandHandlers.put("GET_TIME", this::handleGetTime);
commandHandlers.put("GET_WEATHER", this::handleGetWeather);
commandHandlers.put("CALCULATE", this::handleCalculate);
commandHandlers.put("TELL_JOKE", this::handleTellJoke);
commandHandlers.put("EXIT", this::handleExit);
commandHandlers.put("UNKNOWN", this::handleUnknown);
}
public void dispatch(NLPEngine.IntentResult intentResult) {
CommandHandler handler = commandHandlers.get(intentResult.getIntent());
if (handler != null) {
handler.handle(intentResult);
} else {
handleUnknown(intentResult);
}
}
private void handleGreeting(NLPEngine.IntentResult intentResult) {
String[] greetings = {
"Hello! How can I assist you today?",
"Hi there! What can I do for you?",
"Greetings! I'm here to help.",
"Hello! Nice to hear from you."
};
String response = greetings[new Random().nextInt(greetings.length)];
tts.speakWithEmotion(response, "happy");
}
private void handleGetTime(NLPEngine.IntentResult intentResult) {
LocalDateTime now = LocalDateTime.now();
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("hh:mm a");
String time = now.format(formatter);
tts.speak("The current time is " + time);
}
private void handleGetWeather(NLPEngine.IntentResult intentResult) {
String location = intentResult.getEntities().getOrDefault("location", "your location");
// Mock weather data - in real app, call weather API
String[] weatherConditions = {"sunny", "cloudy", "rainy", "partly cloudy"};
int temperature = new Random().nextInt(30) + 10; // 10-40°C
String condition = weatherConditions[new Random().nextInt(weatherConditions.length)];
String response = String.format(
"The weather in %s is %s with a temperature of %d degrees Celsius",
location, condition, temperature
);
tts.speak(response);
}
private void handleCalculate(NLPEngine.IntentResult intentResult) {
Map<String, String> entities = intentResult.getEntities();
if (entities.containsKey("number1") && entities.containsKey("number2") &&
entities.containsKey("operator")) {
try {
double num1 = Double.parseDouble(entities.get("number1"));
double num2 = Double.parseDouble(entities.get("number2"));
String operator = entities.get("operator");
double result = 0;
switch (operator.toLowerCase()) {
case "+":
case "plus":
result = num1 + num2;
break;
case "-":
case "minus":
result = num1 - num2;
break;
case "*":
case "times":
result = num1 * num2;
break;
case "/":
case "divided by":
if (num2 != 0) {
result = num1 / num2;
} else {
tts.speak("I cannot divide by zero");
return;
}
break;
}
tts.speak(String.format("%.2f %s %.2f equals %.2f", num1, operator, num2, result));
} catch (NumberFormatException e) {
tts.speak("I couldn't understand those numbers");
}
} else {
tts.speak("I didn't catch the calculation details. Please try again.");
}
}
private void handleTellJoke(NLPEngine.IntentResult intentResult) {
String[] jokes = {
"Why don't scientists trust atoms? Because they make up everything!",
"Why did the scarecrow win an award? He was outstanding in his field!",
"Why don't eggs tell jokes? They'd crack each other up!",
"What do you call a fake noodle? An impasta!",
"Why did the math book look so sad? Because it had too many problems!"
};
String joke = jokes[new Random().nextInt(jokes.length)];
tts.speakWithEmotion(joke, "excited");
}
private void handleExit(NLPEngine.IntentResult intentResult) {
tts.speakWithEmotion("Goodbye! Have a great day!", "happy");
System.exit(0);
}
private void handleUnknown(NLPEngine.IntentResult intentResult) {
String[] responses = {
"I'm not sure I understand. Could you rephrase that?",
"I didn't catch that. Can you try again?",
"I'm still learning. Could you say that differently?",
"I'm not programmed to handle that request yet."
};
String response = responses[new Random().nextInt(responses.length)];
tts.speak(response);
}
@FunctionalInterface
private interface CommandHandler {
void handle(NLPEngine.IntentResult intentResult);
}
}
Complete Voice Assistant Integration
import java.util.Scanner;
public class JavaVoiceAssistant {
private final SpeechRecognitionEngine speechRecognition;
private final TextToSpeechEngine textToSpeech;
private final NLPEngine nlpEngine;
private final CommandDispatcher dispatcher;
private boolean isRunning = false;
public JavaVoiceAssistant() {
System.out.println("Initializing Java Voice Assistant...");
this.textToSpeech = new TextToSpeechEngine();
this.speechRecognition = new SpeechRecognitionEngine();
this.nlpEngine = new NLPEngine();
this.dispatcher = new CommandDispatcher(textToSpeech);
System.out.println("Voice Assistant initialized successfully!");
}
public void start() {
isRunning = true;
textToSpeech.speak("Java Voice Assistant activated. How can I help you?");
// Start continuous listening
speechRecognition.startContinuousListening(this::processVoiceInput);
System.out.println("Voice Assistant is running. Say 'exit' to quit.");
// Block the main thread until the user presses Enter
Scanner scanner = new Scanner(System.in);
System.out.print("Press Enter to stop...");
scanner.nextLine();
stop();
}
private void processVoiceInput(String recognizedText) {
if (recognizedText == null || recognizedText.trim().isEmpty()) {
return;
}
System.out.println("User: " + recognizedText);
// Process with NLP
NLPEngine.IntentResult intentResult = nlpEngine.processText(recognizedText);
System.out.println("NLP Result: " + intentResult);
// Dispatch to appropriate handler
dispatcher.dispatch(intentResult);
// Check for exit command
if ("EXIT".equals(intentResult.getIntent())) {
isRunning = false;
}
}
public void stop() {
isRunning = false;
speechRecognition.stopContinuousListening();
textToSpeech.dispose();
System.out.println("Voice Assistant stopped.");
}
public void processTextInput(String text) {
processVoiceInput(text);
}
public static void main(String[] args) {
JavaVoiceAssistant assistant = new JavaVoiceAssistant();
// Add shutdown hook for cleanup
Runtime.getRuntime().addShutdownHook(new Thread(assistant::stop));
// Test with text input first
if (args.length > 0 && "text".equals(args[0])) {
System.out.println("Text mode activated. Type your commands:");
Scanner scanner = new Scanner(System.in);
while (true) {
System.out.print("You: ");
String input = scanner.nextLine();
if ("exit".equalsIgnoreCase(input)) {
break;
}
assistant.processTextInput(input);
}
} else {
// Start voice mode
assistant.start();
}
}
}
Advanced Features
1. Wake Word Detection:
public class WakeWordDetector {
private static final String WAKE_WORD = "assistant";
private final SpeechRecognitionEngine recognizer;
public WakeWordDetector() {
this.recognizer = new SpeechRecognitionEngine();
}
public void startWakeWordDetection(Runnable onWakeWordDetected) {
recognizer.startContinuousListening(text -> {
if (text.toLowerCase().contains(WAKE_WORD)) {
System.out.println("Wake word detected!");
onWakeWordDetected.run();
}
});
}
}
2. Conversation Context Manager:
import java.util.*;
public class ConversationContext {
private final Map<String, Object> context;
private final List<String> conversationHistory;
public ConversationContext() {
this.context = new HashMap<>();
this.conversationHistory = new ArrayList<>();
}
public void addToHistory(String userInput, String assistantResponse) {
conversationHistory.add("User: " + userInput);
conversationHistory.add("Assistant: " + assistantResponse);
// Keep only recent history
if (conversationHistory.size() > 20) {
conversationHistory.subList(0, 4).clear(); // Remove oldest 2 exchanges
}
}
public String getConversationContext() {
return String.join("\n", conversationHistory);
}
public void setContext(String key, Object value) {
context.put(key, value);
}
public Object getContext(String key) {
return context.get(key);
}
}
3. Skill System for Extensibility:
import java.util.Map;
public interface AssistantSkill {
String getName();
boolean canHandle(String intent);
String execute(Map<String, String> entities);
String getDescription();
}
public class MusicSkill implements AssistantSkill {
@Override
public String getName() { return "Music Player"; }
@Override
public boolean canHandle(String intent) {
return "PLAY_MUSIC".equals(intent) || "STOP_MUSIC".equals(intent);
}
@Override
public String execute(Map<String, String> entities) {
String song = entities.get("song");
if (song != null) {
return "Playing " + song;
}
return "Playing music";
}
@Override
public String getDescription() { return "Plays music and controls playback"; }
}
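To make skills discoverable at runtime, a registry can route each intent to the first skill that claims it. A minimal sketch; the SkillRegistry class is illustrative, and its nested interface is a trimmed-down copy of AssistantSkill above so the example compiles on its own:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class SkillRegistry {
    // Trimmed-down copy of the article's AssistantSkill interface,
    // repeated here so this example is self-contained
    public interface AssistantSkill {
        boolean canHandle(String intent);
        String execute(Map<String, String> entities);
    }

    private final List<AssistantSkill> skills = new ArrayList<>();

    public void register(AssistantSkill skill) {
        skills.add(skill);
    }

    // First registered skill that claims the intent wins; empty if none match,
    // so the caller can fall back to the built-in command handlers
    public Optional<String> dispatch(String intent, Map<String, String> entities) {
        return skills.stream()
                .filter(s -> s.canHandle(intent))
                .findFirst()
                .map(s -> s.execute(entities));
    }
}
```

The CommandDispatcher could consult such a registry before its own handlers, which keeps new capabilities out of the core dispatch switch.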
Performance Optimization
1. Audio Buffer Management:
public class AudioBufferManager {
private static final int BUFFER_SIZE = 4096;
private final byte[] buffer = new byte[BUFFER_SIZE];
private int writePos = 0;
public void processAudio(byte[] audioData) {
// Circular buffer: wrap the write position so the most recent
// BUFFER_SIZE bytes of audio are always retained
for (byte b : audioData) {
buffer[writePos] = b;
writePos = (writePos + 1) % BUFFER_SIZE;
}
}
}
2. Thread Pool for Concurrent Processing:
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
public class AssistantThreadPool {
private final ExecutorService recognitionPool;
private final ExecutorService nlpPool;
private final ExecutorService ttsPool;
public AssistantThreadPool() {
this.recognitionPool = Executors.newFixedThreadPool(2);
this.nlpPool = Executors.newFixedThreadPool(2);
this.ttsPool = Executors.newSingleThreadExecutor(); // TTS should be sequential
}
public void submitRecognitionTask(Runnable task) {
recognitionPool.submit(task);
}
public void submitNLPTask(Runnable task) {
nlpPool.submit(task);
}
public void submitTTSTask(Runnable task) {
ttsPool.submit(task);
}
public void shutdown() {
// Stop accepting new tasks and let queued work finish
recognitionPool.shutdown();
nlpPool.shutdown();
ttsPool.shutdown();
}
}
Best Practices
- Error Handling:
public class AssistantErrorHandler {
public static void handleRecognitionError(Exception e) {
System.err.println("Speech recognition error: " + e.getMessage());
// Fall back to text input or retry
}
public static void handleTTSError(Exception e) {
System.err.println("Text-to-speech error: " + e.getMessage());
// Fall back to text output
}
}
- Configuration Management:
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;
public class AssistantConfig {
private final Properties properties;
public AssistantConfig() {
properties = new Properties();
// getResourceAsStream returns null when the file is absent, which would make
// load() throw NullPointerException rather than IOException; guard against it
try (InputStream in = getClass().getResourceAsStream("/assistant.properties")) {
if (in != null) {
properties.load(in);
} else {
setDefaults();
}
} catch (IOException e) {
// Use defaults
setDefaults();
}
}
private void setDefaults() {
properties.setProperty("wake.word", "assistant");
properties.setProperty("tts.voice", "kevin16");
properties.setProperty("recognition.timeout", "5000");
}
public String getWakeWord() {
return properties.getProperty("wake.word");
}
}
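A matching assistant.properties on the classpath might look like this (keys and values mirror the defaults set above):

```properties
wake.word=assistant
tts.voice=kevin16
recognition.timeout=5000
```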
Conclusion
Building a voice assistant in Java offers several advantages:
- Enterprise Integration: Seamlessly integrates with existing Java systems
- Performance: Efficient resource management for continuous operation
- Scalability: Handles multiple users and concurrent requests
- Extensibility: Modular architecture for adding new skills
Key Components Summary:
- CMU Sphinx: Reliable speech recognition
- FreeTTS: Capable text-to-speech synthesis
- Custom NLP: Intent recognition and entity extraction
- Modular Dispatcher: Clean separation of concerns
Next Steps for Enhancement:
- Integrate with cloud services (Google Speech-to-Text, AWS Polly)
- Add machine learning for improved intent recognition
- Implement voice biometrics for user identification
- Add multimodal interactions (GUI + voice)
- Develop mobile companion apps
By leveraging Java's robust ecosystem and performance characteristics, you can build sophisticated voice assistants that meet enterprise requirements while providing natural, intuitive user interactions.