Lab3: AI Voice Assistant with Speech-to-Text and Text-to-Speech

Introduction

In this lab, you will build a complete AI voice assistant for the Mini Pupper that combines:

  • Speech-to-Text (STT): Convert spoken commands to text using Google Speech API
  • Gemini AI: Process natural language and generate responses
  • Text-to-Speech (TTS): Convert AI responses to spoken audio
  • Movement Control: Execute robot movements based on voice commands

Prerequisites

  • Completed Lab1 (Gemini API Setup)
  • Completed Lab2 (AI Food App)
  • Google Cloud account with Speech-to-Text and Text-to-Speech APIs enabled
  • Microphone and speaker connected to Mini Pupper

Part 1: Setup Google Cloud APIs

Enable Required APIs

  1. Go to Google Cloud Console
  2. Enable the following APIs:
    • Cloud Speech-to-Text API
    • Cloud Text-to-Speech API
    • Vertex AI API

Create Service Account

  1. Go to IAM & Admin > Service Accounts
  2. Create a new service account
  3. Grant roles:
    • Cloud Speech Client
    • Cloud Text-to-Speech User
    • Vertex AI User
  4. Create and download JSON key file

Configure Credentials

# Create .env file
nano .env

Add:

API_KEY_PATH=/path/to/your/service-account-key.json
LANGUAGE_CODE=en-US
LANGUAGE_NAME=en-US-Standard-E
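
python-dotenv (installed in Part 2) loads these KEY=VALUE pairs into the environment at startup via load_dotenv(). To make the file format concrete, here is a minimal parser that mimics what load_dotenv does for simple files (parse_env is an illustrative helper, not part of the repo):

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blanks and comments
    (a simplified version of what python-dotenv does)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            settings[key.strip()] = value.strip()
    return settings

sample = """API_KEY_PATH=/path/to/your/service-account-key.json
LANGUAGE_CODE=en-US
LANGUAGE_NAME=en-US-Standard-E
"""
config = parse_env(sample)
```

In the real app you simply call `from dotenv import load_dotenv; load_dotenv()` and read the values with `os.getenv("API_KEY_PATH")`.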

Part 2: Install Dependencies

pip install google-cloud-speech google-cloud-texttospeech
pip install langchain-google-vertexai
pip install pyaudio sounddevice soundfile
pip install python-dotenv pillow

Install System Dependencies

sudo apt install portaudio19-dev python3-pyaudio

Part 3: Understanding the Architecture

System Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        AI Voice Assistant                           │
│                                                                     │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐ │
│  │ Microphone  │───►│ STT Task    │───►│ Input Text Queue        │ │
│  │             │    │ (Google     │    │                         │ │
│  │             │    │  Speech)    │    └───────────┬─────────────┘ │
│  └─────────────┘    └─────────────┘                │               │
│                                                     ▼               │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐ │
│  │ Speaker     │◄───│ TTS Task    │◄───│ Gemini Task             │ │
│  │             │    │ (Google     │    │ (AI Processing)         │ │
│  │             │    │  TTS)       │    └───────────┬─────────────┘ │
│  └─────────────┘    └─────────────┘                │               │
│                                                     ▼               │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐ │
│  │ LCD Display │◄───│ Image Task  │    │ Movement Task           │ │
│  │             │    │             │    │ (Robot Control)         │ │
│  └─────────────┘    └─────────────┘    └─────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘

Task Queues

Queue               Purpose
-----------------   -----------------------------
input_text_queue    Voice input text for Gemini
output_text_queue   AI response text for TTS
stt_queue           Control STT listening
movement_queue      Robot movement commands
image_queue         Images to display on LCD
gif_queue           GIFs to display
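
These queues are plain queue.Queue objects from Python's standard library: each consumer thread blocks on get() and calls task_done() after handling an item. A minimal sketch of the hand-off pattern:

```python
import queue

# The shared queues that connect the tasks (one producer, one consumer each).
input_text_queue = queue.Queue()   # transcribed speech -> Gemini task
output_text_queue = queue.Queue()  # AI response text -> TTS task
stt_queue = queue.Queue()          # True/False signals to (re)start listening
movement_queue = queue.Queue()     # movement command keys -> movement task
image_queue = queue.Queue()        # images -> LCD task
gif_queue = queue.Queue()          # GIFs -> LCD task

# A producer puts an item; a consumer gets it and marks it done.
output_text_queue.put("OK, move forwards.")
item = output_text_queue.get()
output_text_queue.task_done()
```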

Part 4: Voice Commands

Movement Commands

move_cmd_functions = {
    "action": move_api.init_movement,
    "sit": move_api.squat,
    "move forwards": move_api.move_forward,
    "move backwards": move_api.move_backward,
    "move left": move_api.move_left,
    "move right": move_api.move_right,
    "look up": move_api.look_up,
    "look down": move_api.look_down,
    "look left": move_api.look_left,
    "look right": move_api.look_right,
    "look upper left": move_api.look_upperleft,
    "look lower left": move_api.look_leftlower,
    "look upper right": move_api.look_upperright,
    "look lower right": move_api.look_rightlower,
}

System Commands

sys_cmds_functions = {
    "shut up": close_ai,      # Disable AI responses
    "speak please": open_ai,  # Enable AI responses
    "reboot": reboot,         # Reboot the system
    "power off": power_off,   # Shutdown the system
}
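
Both tables are keyed by spoken phrases, and the get_move_cmd helper used later in the STT task matches a transcript against those keys. The repo ships its own matcher; here is a minimal sketch assuming whole-word matching (the cmds dict below is a stand-in for move_cmd_functions):

```python
import re

def get_move_cmd(user_input, cmd_functions):
    """Return the first command key spoken in the transcript, else None.
    Whole-word matching avoids false hits like "sit" inside "visit"."""
    text = user_input.lower()
    for key in cmd_functions:
        if re.search(r"\b" + re.escape(key) + r"\b", text):
            return key
    return None

# Stand-in command table for illustration.
cmds = {"move forwards": None, "look up": None, "sit": None}
```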

Part 5: Multi-Language Support

The assistant supports multiple languages for TTS:

from google.cloud import texttospeech

# Voice configurations
voice_EN = texttospeech.VoiceSelectionParams(language_code="en-US", name="en-US-Standard-E")
voice_JP = texttospeech.VoiceSelectionParams(language_code="ja-JP", name="ja-JP-Neural2-B")
voice_CN = texttospeech.VoiceSelectionParams(language_code="cmn-CN", name="cmn-CN-Wavenet-A")
voice_FR = texttospeech.VoiceSelectionParams(language_code="fr-FR", name="fr-FR-Standard-C")
voice_DE = texttospeech.VoiceSelectionParams(language_code="de-DE", name="de-DE-Neural2-D")
voice_ES = texttospeech.VoiceSelectionParams(language_code="es-US", name="es-US-Wavenet-A")

lang_voices = {
    "Japanese": voice_JP,
    "Chinese": voice_CN,
    "French": voice_FR,
    "German": voice_DE,
    "Spanish": voice_ES,
}

Say “translate to Japanese” or “say in French” to switch languages.
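
One hedged sketch of how such a request could be mapped to a voice (pick_voice and DEFAULT_VOICE are hypothetical names, and voice names are shown as plain strings to keep the sketch self-contained; the repo's actual parsing may differ):

```python
# Language name -> TTS voice name, mirroring the lang_voices table above.
lang_voices = {
    "Japanese": "ja-JP-Neural2-B",
    "Chinese": "cmn-CN-Wavenet-A",
    "French": "fr-FR-Standard-C",
    "German": "de-DE-Neural2-D",
    "Spanish": "es-US-Wavenet-A",
}
DEFAULT_VOICE = "en-US-Standard-E"

def pick_voice(user_input):
    """Return the voice for the language mentioned in the request,
    falling back to the default English voice."""
    text = user_input.lower()
    for language, voice in lang_voices.items():
        if language.lower() in text:
            return voice
    return DEFAULT_VOICE
```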


Part 6: Core Tasks

Speech-to-Text Task

def stt_task():
    """Listen for voice input and convert to text"""
    py_audio = google_api.init_pyaudio()
    speech_client = google_api.init_speech_to_text()
    
    while True:
        # Wait for signal to start listening
        should_stt = stt_queue.get()
        stt_queue.task_done()
        
        if not should_stt:
            continue
            
        # Listen and transcribe
        user_input, stream = google_api.start_speech_to_text(speech_client, py_audio)
        
        # Process command
        move_key = get_move_cmd(user_input, move_cmd_functions)
        
        if move_key:
            movement_queue.put(move_key)
            output_text_queue.put(f"OK, {move_key}.")
        else:
            input_text_queue.put(user_input)
        
        google_api.stop_speech_to_text(stream)

Gemini AI Task

def gemini_task():
    """Process text input with Gemini AI"""
    conversation = google_api.create_conversation()
    
    # Initialize with system prompt
    init_prompt = """Answer concisely in a conversational tone. 
    Keep responses brief, like natural speech."""
    google_api.ai_text_response(conversation, init_prompt)
    
    while True:
        input_text = input_text_queue.get()
        input_text_queue.task_done()
        
        if "photo" in input_text or "picture" in input_text:
            # Use the vision model ("model" here is the vision-capable
            # Gemini model, initialized elsewhere in ai_app.py)
            image = media_api.take_photo()
            response = google_api.ai_image_response(model, image, input_text)
        else:
            # Use the text model for plain conversation
            response = google_api.ai_text_response(conversation, input_text)
        
        output_text_queue.put(response)

Text-to-Speech Task

def tts_task():
    """Convert text responses to speech"""
    os.system("amixer -c 0 sset 'Headphone' 100%")  # set speaker volume to maximum
    tts_client, voice, audio_config = google_api.init_text_to_speech()
    
    while True:
        out_text = output_text_queue.get()
        output_text_queue.task_done()
        
        # Remove emojis and special characters
        out_text = remove_emojis(out_text).replace('*', '')
        
        if out_text:
            google_api.text_to_speech(out_text, tts_client, voice, audio_config)
        
        # Signal STT to start listening again
        stt_queue.put(True)
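
The remove_emojis helper called in tts_task is defined in the repo; one plausible implementation strips common emoji code-point ranges with a regular expression so the TTS engine only receives speakable text (the exact ranges below are an assumption):

```python
import re

# Common emoji/symbol code-point ranges (an illustrative selection).
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # symbols, pictographs, emoji
    "\U00002600-\U000026FF"  # miscellaneous symbols
    "\U00002700-\U000027BF"  # dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) letters
    "]+"
)

def remove_emojis(text):
    """Strip emoji characters that a TTS engine cannot pronounce."""
    return EMOJI_PATTERN.sub("", text)
```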

Movement Task

def move_task():
    """Execute robot movement commands"""
    while True:
        move_command = movement_queue.get()
        movement_queue.task_done()
        
        if move_command in move_cmd_functions:
            move_cmd_functions[move_command]()
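
Each of the tasks above runs forever in its own thread, communicating only through the shared queues. The sketch below demonstrates that wiring pattern with a stand-in worker (echo_gemini_task is hypothetical, not from the repo):

```python
import queue
import threading

# Stand-in pipeline: a worker thread consumes one queue and feeds the
# next, mirroring the STT -> Gemini -> TTS hand-offs in ai_app.py.
input_text_queue = queue.Queue()
output_text_queue = queue.Queue()

def echo_gemini_task():
    """Hypothetical stand-in for gemini_task: move one item across."""
    text = input_text_queue.get()
    input_text_queue.task_done()
    output_text_queue.put(f"echo: {text}")

worker = threading.Thread(target=echo_gemini_task, daemon=True)
worker.start()
input_text_queue.put("hello")
worker.join(timeout=5)
result = output_text_queue.get()
```

In the real app, main() would start one daemon thread per task (stt_task, gemini_task, tts_task, move_task) and seed stt_queue with True to begin listening.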

Part 7: Running the Application

Clone the Repository

cd ~
git clone https://github.com/lbaitemple/apps-md-robots.git
cd apps-md-robots

Configure Environment

cp env.sample .env
nano .env
# Add your API_KEY_PATH

Run the Application

python ai_app/ai_app.py

Part 8: Example Interactions

Basic Conversation

You: "Hello, how are you?"
Robot: "I'm doing great! How can I help you today?"

You: "What's the weather like?"
Robot: "I don't have access to weather data, but I can help with other questions!"

Movement Commands

You: "Move forwards"
Robot: "OK, move forwards." [Robot moves forward]

You: "Look up"
Robot: "OK, look up." [Robot looks up]

You: "Sit"
Robot: "OK, sit." [Robot sits]

Photo Analysis

You: "Take a picture and tell me what you see"
Robot: [Takes photo] "I can see a living room with a couch and a table..."

Language Switching

You: "Say hello in Japanese"
Robot: "こんにちは!" [Spoken in Japanese]

You: "Translate 'thank you' to French"
Robot: "Merci!" [Spoken in French]

Part 9: Customization

Add Custom Commands

# Add to move_cmd_functions
move_cmd_functions["dance"] = move_api.dance
move_cmd_functions["wave"] = move_api.wave_paw

# Add to sys_cmds_functions
sys_cmds_functions["tell me a joke"] = tell_joke

Change Default Voice

Edit .env:

LANGUAGE_CODE=en-GB
LANGUAGE_NAME=en-GB-Neural2-B

Adjust Response Style

Modify the initialization prompt in gemini_task():

init_prompt = """You are a friendly robot dog named Pupper. 
Be playful and enthusiastic in your responses. 
Keep answers short and fun!"""

Exercises

Exercise 1: Add Weather Command

Integrate a weather API to respond to weather queries.

Exercise 2: Custom Wake Word

Implement a wake word (like “Hey Pupper”) before processing commands.

Exercise 3: Emotion Display

Show different images on the LCD based on the conversation mood.

Exercise 4: Music Player

Add commands to play music files from the robot.


Troubleshooting

Issue                     Solution
-----------------------   --------------------------------------
No audio input            Check microphone: arecord -l
No audio output           Check speaker: aplay -l, verify volume
API errors                Verify credentials path in .env
Slow response             Check internet connection
Commands not recognized   Speak clearly, check language settings

Debug Mode

Enable detailed logging:

import logging
logging.basicConfig(level=logging.DEBUG)

Summary

In this lab, you learned:

  • How to set up Google Speech-to-Text and Text-to-Speech APIs
  • How to build a multi-threaded voice assistant
  • How to integrate voice commands with robot movement
  • How to support multiple languages
  • How to combine vision and voice for interactive AI

Reference