Lab3: AI Voice Assistant with Speech-to-Text and Text-to-Speech
Introduction
In this lab, you will build a complete AI voice assistant for the Mini Pupper that combines:
- Speech-to-Text (STT): Convert spoken commands to text using the Google Cloud Speech-to-Text API
- Gemini AI: Process natural language and generate responses
- Text-to-Speech (TTS): Convert AI responses to spoken audio
- Movement Control: Execute robot movements based on voice commands
Prerequisites
- Completed Lab1 (Gemini API Setup)
- Completed Lab2 (AI Food App)
- Google Cloud account with Speech-to-Text and Text-to-Speech APIs enabled
- Microphone and speaker connected to Mini Pupper
Part 1: Set Up Google Cloud APIs
Enable Required APIs
- Go to Google Cloud Console
- Enable the following APIs:
- Cloud Speech-to-Text API
- Cloud Text-to-Speech API
- Vertex AI API
Create Service Account
- Go to IAM & Admin > Service Accounts
- Create a new service account
- Grant roles:
- Cloud Speech Client
- Cloud Text-to-Speech User
- Vertex AI User
- Create and download JSON key file
Configure Credentials
# Create .env file
nano .env
Add:
API_KEY_PATH=/path/to/your/service-account-key.json
LANGUAGE_CODE=en-US
LANGUAGE_NAME=en-US-Standard-E
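The app reads these values at startup. As a minimal sketch, assuming python-dotenv (installed in Part 2) and that the repository's code does something equivalent, the key file can be wired into the Google client libraries like this:

import os
from dotenv import load_dotenv

# Read .env, then point Application Default Credentials at the key file
load_dotenv()
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = os.getenv("API_KEY_PATH", "")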
Part 2: Install Dependencies
pip install google-cloud-speech google-cloud-texttospeech
pip install langchain-google-vertexai
pip install pyaudio sounddevice soundfile
pip install python-dotenv pillow
Install System Dependencies
sudo apt install portaudio19-dev python3-pyaudio
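To confirm the audio stack can see your microphone and speaker, you can query devices with the sounddevice package installed above (a quick sanity check, not part of the app):

import sounddevice as sd

# Print every audio device PortAudio can see; your USB microphone and
# speaker should both appear in this list
print(sd.query_devices())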
Part 3: Understanding the Architecture
System Architecture
┌──────────────────────────────────────────────────────────────────────┐
│                          AI Voice Assistant                          │
│                                                                      │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐   │
│  │ Microphone  │───►│  STT Task   │───►│   Input Text Queue      │   │
│  │             │    │  (Google    │    │                         │   │
│  │             │    │  Speech)    │    └───────────┬─────────────┘   │
│  └─────────────┘    └─────────────┘                │                 │
│                                                    ▼                 │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐   │
│  │  Speaker    │◄───│  TTS Task   │◄───│      Gemini Task        │   │
│  │             │    │  (Google    │    │    (AI Processing)      │   │
│  │             │    │  TTS)       │    └───────────┬─────────────┘   │
│  └─────────────┘    └─────────────┘                │                 │
│                                                    ▼                 │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐   │
│  │ LCD Display │◄───│ Image Task  │    │     Movement Task       │   │
│  │             │    │             │    │    (Robot Control)      │   │
│  └─────────────┘    └─────────────┘    └─────────────────────────┘   │
└──────────────────────────────────────────────────────────────────────┘
Task Queues
| Queue | Purpose |
|---|---|
| input_text_queue | Voice input text for Gemini |
| output_text_queue | AI response text for TTS |
| stt_queue | Control STT listening |
| movement_queue | Robot movement commands |
| image_queue | Images to display on LCD |
| gif_queue | GIFs to display |
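All six are plain thread-safe FIFOs from Python's standard library, so each task can block on .get() until work arrives. A sketch of how they might be declared (names follow the table above; the repository's layout may differ):

import queue

input_text_queue = queue.Queue()   # STT -> Gemini
output_text_queue = queue.Queue()  # Gemini -> TTS
stt_queue = queue.Queue()          # TTS -> STT: resume listening
movement_queue = queue.Queue()     # STT -> movement task
image_queue = queue.Queue()        # images for the LCD
gif_queue = queue.Queue()          # GIFs for the LCD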
Part 4: Voice Commands
Movement Commands
move_cmd_functions = {
    "action": move_api.init_movement,
    "sit": move_api.squat,
    "move forwards": move_api.move_forward,
    "move backwards": move_api.move_backward,
    "move left": move_api.move_left,
    "move right": move_api.move_right,
    "look up": move_api.look_up,
    "look down": move_api.look_down,
    "look left": move_api.look_left,
    "look right": move_api.look_right,
    "look upper left": move_api.look_upperleft,
    "look lower left": move_api.look_leftlower,
    "look upper right": move_api.look_upperright,
    "look lower right": move_api.look_rightlower,
}
System Commands
sys_cmds_functions = {
    "shut up": close_ai,       # Disable AI responses
    "speak please": open_ai,   # Enable AI responses
    "reboot": reboot,          # Reboot the system
    "power off": power_off,    # Shut down the system
}
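Part 6's stt_task matches each transcript against these tables through a get_move_cmd helper. A plausible sketch (the repository's implementation may differ) is case-insensitive substring matching:

def get_move_cmd(user_input, cmd_functions):
    """Return the first command phrase found in the transcript, else None."""
    text = user_input.lower().strip()
    for phrase in cmd_functions:
        if phrase in text:
            return phrase
    return None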
Part 5: Multi-Language Support
The assistant supports multiple languages for TTS:
from google.cloud import texttospeech
# Voice configurations
voice_EN = texttospeech.VoiceSelectionParams(language_code="en-US", name="en-US-Standard-E")
voice_JP = texttospeech.VoiceSelectionParams(language_code="ja-JP", name="ja-JP-Neural2-B")
voice_CN = texttospeech.VoiceSelectionParams(language_code="cmn-CN", name="cmn-CN-Wavenet-A")
voice_FR = texttospeech.VoiceSelectionParams(language_code="fr-FR", name="fr-FR-Standard-C")
voice_DE = texttospeech.VoiceSelectionParams(language_code="de-DE", name="de-DE-Neural2-D")
voice_ES = texttospeech.VoiceSelectionParams(language_code="es-US", name="es-US-Wavenet-A")
lang_voices = {
    "Japanese": voice_JP,
    "Chinese": voice_CN,
    "French": voice_FR,
    "German": voice_DE,
    "Spanish": voice_ES,
}
Say “translate to Japanese” or “say in French” to switch languages.
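Under the hood, switching languages just means passing a different VoiceSelectionParams to the synthesis request. A minimal sketch with the google-cloud-texttospeech client (the sample text and output path are placeholders):

client = texttospeech.TextToSpeechClient()
synthesis_input = texttospeech.SynthesisInput(text="Merci!")
audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.LINEAR16
)
response = client.synthesize_speech(
    input=synthesis_input,
    voice=lang_voices["French"],  # any voice from the table above
    audio_config=audio_config,
)
with open("output.wav", "wb") as f:
    f.write(response.audio_content)  # LINEAR16 responses include a WAV header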
Part 6: Core Tasks
Speech-to-Text Task
def stt_task():
    """Listen for voice input and convert it to text."""
    py_audio = google_api.init_pyaudio()
    speech_client = google_api.init_speech_to_text()
    while True:
        # Wait for the signal to start listening
        should_stt = stt_queue.get()
        stt_queue.task_done()
        if not should_stt:
            continue
        # Listen and transcribe
        user_input, stream = google_api.start_speech_to_text(speech_client, py_audio)
        # Process the command
        move_key = get_move_cmd(user_input, move_cmd_functions)
        if move_key:
            movement_queue.put(move_key)
            output_text_queue.put(f"OK, {move_key}.")
        else:
            input_text_queue.put(user_input)
        google_api.stop_speech_to_text(stream)
Gemini AI Task
def gemini_task():
    """Process text input with Gemini AI."""
    conversation = google_api.create_conversation()
    # Initialize with a system prompt
    init_prompt = """Answer concisely in a conversational tone.
    Keep responses brief, like natural speech."""
    google_api.ai_text_response(conversation, init_prompt)
    while True:
        input_text = input_text_queue.get()
        input_text_queue.task_done()
        if "photo" in input_text or "picture" in input_text:
            # Use the vision model ("model" is initialized elsewhere in the app)
            image = media_api.take_photo()
            response = google_api.ai_image_response(model, image, input_text)
        else:
            # Use the text model
            response = google_api.ai_text_response(conversation, input_text)
        output_text_queue.put(response)
Text-to-Speech Task
def tts_task():
    """Convert text responses to speech."""
    os.system("amixer -c 0 sset 'Headphone' 100%")
    tts_client, voice, audio_config = google_api.init_text_to_speech()
    while True:
        out_text = output_text_queue.get()
        output_text_queue.task_done()
        # Remove emojis and special characters
        out_text = remove_emojis(out_text).replace('*', '')
        if out_text:
            google_api.text_to_speech(out_text, tts_client, voice, audio_config)
        # Signal STT to start listening again
        stt_queue.put(True)
Movement Task
def move_task():
    """Execute robot movement commands."""
    while True:
        move_command = movement_queue.get()
        movement_queue.task_done()
        if move_command in move_cmd_functions:
            move_cmd_functions[move_command]()
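Each task runs forever in its own thread, communicating only through the queues. A sketch of how the threads might be wired together (the actual ai_app.py may organize startup differently):

import threading

def main():
    tasks = [stt_task, gemini_task, tts_task, move_task]
    threads = [threading.Thread(target=task, daemon=True) for task in tasks]
    for thread in threads:
        thread.start()
    stt_queue.put(True)  # kick off the first listening cycle
    for thread in threads:
        thread.join()    # the task loops never return; block here

if __name__ == "__main__":
    main()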
Part 7: Running the Application
Clone the Repository
cd ~
git clone https://github.com/lbaitemple/apps-md-robots.git
cd apps-md-robots
Configure Environment
cp env.sample .env
nano .env
# Add your API_KEY_PATH
Run the Application
python ai_app/ai_app.py
Part 8: Example Interactions
Basic Conversation
You: "Hello, how are you?"
Robot: "I'm doing great! How can I help you today?"
You: "What's the weather like?"
Robot: "I don't have access to weather data, but I can help with other questions!"
Movement Commands
You: "Move forward"
Robot: "OK, moving forward." [Robot moves forward]
You: "Look up"
Robot: "OK, looking up." [Robot looks up]
You: "Sit"
Robot: "OK, sitting down." [Robot sits]
Photo Analysis
You: "Take a picture and tell me what you see"
Robot: [Takes photo] "I can see a living room with a couch and a table..."
Language Switching
You: "Say hello in Japanese"
Robot: "こんにちは!" [Spoken in Japanese]
You: "Translate 'thank you' to French"
Robot: "Merci!" [Spoken in French]
Part 9: Customization
Add Custom Commands
# Add to move_cmd_functions
move_cmd_functions["dance"] = move_api.dance
move_cmd_functions["wave"] = move_api.wave_paw
# Add to sys_cmds_functions
sys_cmds_functions["tell me a joke"] = tell_joke
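A custom handler only needs to feed the existing pipeline. For instance, the tell_joke referenced above could be as simple as pushing a canned line onto output_text_queue so the TTS task speaks it (a hypothetical example):

def tell_joke():
    # Route a canned response through the normal TTS pipeline
    output_text_queue.put("Why did the robot dog sit down? It needed a reboot!")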
Change Default Voice
Edit .env:
LANGUAGE_CODE=en-GB
LANGUAGE_NAME=en-GB-Neural2-B
Adjust Response Style
Modify the initialization prompt in gemini_task():
init_prompt = """You are a friendly robot dog named Pupper.
Be playful and enthusiastic in your responses.
Keep answers short and fun!"""
Exercises
Exercise 1: Add Weather Command
Integrate a weather API to respond to weather queries.
Exercise 2: Custom Wake Word
Implement a wake word (like “Hey Pupper”) before processing commands.
Exercise 3: Emotion Display
Show different images on the LCD based on the conversation mood.
Exercise 4: Music Player
Add commands to play music files from the robot.
Troubleshooting
| Issue | Solution |
|---|---|
| No audio input | Check microphone: arecord -l |
| No audio output | Check speaker: aplay -l, verify volume |
| API errors | Verify credentials path in .env |
| Slow response | Check internet connection |
| Commands not recognized | Speak clearly, check language settings |
Debug Mode
Enable detailed logging:

import logging
logging.basicConfig(level=logging.DEBUG)
Summary
In this lab, you learned:
- How to set up Google Speech-to-Text and Text-to-Speech APIs
- How to build a multi-threaded voice assistant
- How to integrate voice commands with robot movement
- How to support multiple languages
- How to combine vision and voice for interactive AI