Lab3: AI Voice Assistant with Speech-to-Text and Text-to-Speech

Introduction

In this lab, you will build a complete AI voice assistant for the Mini Pupper that combines:

  • Speech-to-Text (STT): Convert spoken commands to text using Google Speech API
  • Gemini AI: Process natural language and generate responses
  • Text-to-Speech (TTS): Convert AI responses to spoken audio
  • Movement Control: Execute robot movements based on voice commands

Prerequisites

  • Completed Lab1 (Gemini API Setup)
  • Completed Lab2 (AI Food App)
  • Google Cloud account with Speech-to-Text and Text-to-Speech APIs enabled
  • Microphone and speaker connected to Mini Pupper

Part 1: Setup Google Cloud APIs

Enable Required APIs

  1. Go to Google Cloud Console
  2. Enable the following APIs:
    • Cloud Speech-to-Text API
    • Cloud Text-to-Speech API
    • Vertex AI API

Create Service Account

  1. Go to IAM & Admin > Service Accounts
  2. Create a new service account
  3. Grant roles:
    • Cloud Speech Client
    • Cloud Text-to-Speech User
    • Vertex AI User
  4. Create and download JSON key file

Configure Credentials

# Create .env file
nano .env

Add:

API_KEY_PATH=/path/to/your/service-account-key.json
LANGUAGE_CODE=en-US
LANGUAGE_NAME=en-US-Standard-E
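
python-dotenv (installed in Part 2) loads these KEY=VALUE pairs into the environment at startup via load_dotenv(). To make the file format concrete, here is a minimal parser that mimics what load_dotenv does for simple files (parse_env is an illustrative helper, not part of the repo):

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blanks and comments
    (a simplified version of what python-dotenv does)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            settings[key.strip()] = value.strip()
    return settings

sample = """API_KEY_PATH=/path/to/your/service-account-key.json
LANGUAGE_CODE=en-US
LANGUAGE_NAME=en-US-Standard-E
"""
config = parse_env(sample)
```

In the real app you simply call `from dotenv import load_dotenv; load_dotenv()` and read the values with `os.getenv("API_KEY_PATH")`.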

Part 2: Install Dependencies

pip install google-cloud-speech google-cloud-texttospeech
pip install langchain-google-vertexai
pip install pyaudio sounddevice soundfile
pip install python-dotenv pillow

Install System Dependencies

sudo apt install portaudio19-dev python3-pyaudio

Part 3: Understanding the Architecture

System Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        AI Voice Assistant                           │
│                                                                     │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐ │
│  │ Microphone  │───►│ STT Task    │───►│ Input Text Queue        │ │
│  │             │    │ (Google     │    │                         │ │
│  │             │    │  Speech)    │    └───────────┬─────────────┘ │
│  └─────────────┘    └─────────────┘                │               │
│                                                     ▼               │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐ │
│  │ Speaker     │◄───│ TTS Task    │◄───│ Gemini Task             │ │
│  │             │    │ (Google     │    │ (AI Processing)         │ │
│  │             │    │  TTS)       │    └───────────┬─────────────┘ │
│  └─────────────┘    └─────────────┘                │               │
│                                                     ▼               │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────────┐ │
│  │ LCD Display │◄───│ Image Task  │    │ Movement Task           │ │
│  │             │    │             │    │ (Robot Control)         │ │
│  └─────────────┘    └─────────────┘    └─────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘

Task Queues

Queue               Purpose
-----------------   -----------------------------
input_text_queue    Voice input text for Gemini
output_text_queue   AI response text for TTS
stt_queue           Control STT listening
movement_queue      Robot movement commands
image_queue         Images to display on LCD
gif_queue           GIFs to display
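
These queues are plain queue.Queue objects from Python's standard library: each consumer thread blocks on get() and calls task_done() after handling an item. A minimal sketch of the hand-off pattern:

```python
import queue

# The shared queues that connect the tasks (one producer, one consumer each).
input_text_queue = queue.Queue()   # transcribed speech -> Gemini task
output_text_queue = queue.Queue()  # AI response text -> TTS task
stt_queue = queue.Queue()          # True/False signals to (re)start listening
movement_queue = queue.Queue()     # movement command keys -> movement task
image_queue = queue.Queue()        # images -> LCD task
gif_queue = queue.Queue()          # GIFs -> LCD task

# A producer puts an item; a consumer gets it and marks it done.
output_text_queue.put("OK, move forwards.")
item = output_text_queue.get()
output_text_queue.task_done()
```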

Part 4: Voice Commands

Movement Commands

move_cmd_functions = {
    "action": move_api.init_movement,
    "sit": move_api.squat,
    "move forwards": move_api.move_forward,
    "move backwards": move_api.move_backward,
    "move left": move_api.move_left,
    "move right": move_api.move_right,
    "look up": move_api.look_up,
    "look down": move_api.look_down,
    "look left": move_api.look_left,
    "look right": move_api.look_right,
    "look upper left": move_api.look_upperleft,
    "look lower left": move_api.look_leftlower,
    "look upper right": move_api.look_upperright,
    "look lower right": move_api.look_rightlower,
}

System Commands

sys_cmds_functions = {
    "shut up": close_ai,      # Disable AI responses
    "speak please": open_ai,  # Enable AI responses
    "reboot": reboot,         # Reboot the system
    "power off": power_off,   # Shutdown the system
}
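
Both tables are keyed by spoken phrases, and the get_move_cmd helper used later in the STT task matches a transcript against those keys. The repo ships its own matcher; here is a minimal sketch assuming whole-word matching (the cmds dict below is a stand-in for move_cmd_functions):

```python
import re

def get_move_cmd(user_input, cmd_functions):
    """Return the first command key spoken in the transcript, else None.
    Whole-word matching avoids false hits like "sit" inside "visit"."""
    text = user_input.lower()
    for key in cmd_functions:
        if re.search(r"\b" + re.escape(key) + r"\b", text):
            return key
    return None

# Stand-in command table for illustration.
cmds = {"move forwards": None, "look up": None, "sit": None}
```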

Part 5: Multi-Language Support

The assistant supports multiple languages for TTS:

from google.cloud import texttospeech

# Voice configurations
voice_EN = texttospeech.VoiceSelectionParams(language_code="en-US", name="en-US-Standard-E")
voice_JP = texttospeech.VoiceSelectionParams(language_code="ja-JP", name="ja-JP-Neural2-B")
voice_CN = texttospeech.VoiceSelectionParams(language_code="cmn-CN", name="cmn-CN-Wavenet-A")
voice_FR = texttospeech.VoiceSelectionParams(language_code="fr-FR", name="fr-FR-Standard-C")
voice_DE = texttospeech.VoiceSelectionParams(language_code="de-DE", name="de-DE-Neural2-D")
voice_ES = texttospeech.VoiceSelectionParams(language_code="es-US", name="es-US-Wavenet-A")

lang_voices = {
    "Japanese": voice_JP,
    "Chinese": voice_CN,
    "French": voice_FR,
    "German": voice_DE,
    "Spanish": voice_ES,
}

Say “translate to Japanese” or “say in French” to switch languages.
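
One hedged sketch of how such a request could be mapped to a voice (pick_voice and DEFAULT_VOICE are hypothetical names, and voice names are shown as plain strings to keep the sketch self-contained; the repo's actual parsing may differ):

```python
# Language name -> TTS voice name, mirroring the lang_voices table above.
lang_voices = {
    "Japanese": "ja-JP-Neural2-B",
    "Chinese": "cmn-CN-Wavenet-A",
    "French": "fr-FR-Standard-C",
    "German": "de-DE-Neural2-D",
    "Spanish": "es-US-Wavenet-A",
}
DEFAULT_VOICE = "en-US-Standard-E"

def pick_voice(user_input):
    """Return the voice for the language mentioned in the request,
    falling back to the default English voice."""
    text = user_input.lower()
    for language, voice in lang_voices.items():
        if language.lower() in text:
            return voice
    return DEFAULT_VOICE
```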


Part 6: Core Tasks

Speech-to-Text Task

def stt_task():
    """Listen for voice input and convert to text"""
    py_audio = google_api.init_pyaudio()
    speech_client = google_api.init_speech_to_text()
    
    while True:
        # Wait for signal to start listening
        should_stt = stt_queue.get()
        stt_queue.task_done()
        
        if not should_stt:
            continue
            
        # Listen and transcribe
        user_input, stream = google_api.start_speech_to_text(speech_client, py_audio)
        
        # Process command
        move_key = get_move_cmd(user_input, move_cmd_functions)
        
        if move_key:
            movement_queue.put(move_key)
            output_text_queue.put(f"OK, {move_key}.")
        else:
            input_text_queue.put(user_input)
        
        google_api.stop_speech_to_text(stream)

Gemini AI Task

def gemini_task():
    """Process text input with Gemini AI"""
    conversation = google_api.create_conversation()
    
    # Initialize with system prompt
    init_prompt = """Answer concisely in a conversational tone. 
    Keep responses brief, like natural speech."""
    google_api.ai_text_response(conversation, init_prompt)
    
    while True:
        input_text = input_text_queue.get()
        input_text_queue.task_done()
        
        if "photo" in input_text or "picture" in input_text:
            # Use the vision model ("model" here is the vision-capable
            # Gemini model, initialized elsewhere in ai_app.py)
            image = media_api.take_photo()
            response = google_api.ai_image_response(model, image, input_text)
        else:
            # Use the text model for plain conversation
            response = google_api.ai_text_response(conversation, input_text)
        
        output_text_queue.put(response)

Text-to-Speech Task

def tts_task():
    """Convert text responses to speech"""
    os.system("amixer -c 0 sset 'Headphone' 100%")  # set speaker volume to maximum
    tts_client, voice, audio_config = google_api.init_text_to_speech()
    
    while True:
        out_text = output_text_queue.get()
        output_text_queue.task_done()
        
        # Remove emojis and special characters
        out_text = remove_emojis(out_text).replace('*', '')
        
        if out_text:
            google_api.text_to_speech(out_text, tts_client, voice, audio_config)
        
        # Signal STT to start listening again
        stt_queue.put(True)
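
The remove_emojis helper called in tts_task is defined in the repo; one plausible implementation strips common emoji code-point ranges with a regular expression so the TTS engine only receives speakable text (the exact ranges below are an assumption):

```python
import re

# Common emoji/symbol code-point ranges (an illustrative selection).
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # symbols, pictographs, emoji
    "\U00002600-\U000026FF"  # miscellaneous symbols
    "\U00002700-\U000027BF"  # dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) letters
    "]+"
)

def remove_emojis(text):
    """Strip emoji characters that a TTS engine cannot pronounce."""
    return EMOJI_PATTERN.sub("", text)
```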

Movement Task

def move_task():
    """Execute robot movement commands"""
    while True:
        move_command = movement_queue.get()
        movement_queue.task_done()
        
        if move_command in move_cmd_functions:
            move_cmd_functions[move_command]()
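
Each of the tasks above runs forever in its own thread, communicating only through the shared queues. The sketch below demonstrates that wiring pattern with a stand-in worker (echo_gemini_task is hypothetical, not from the repo):

```python
import queue
import threading

# Stand-in pipeline: a worker thread consumes one queue and feeds the
# next, mirroring the STT -> Gemini -> TTS hand-offs in ai_app.py.
input_text_queue = queue.Queue()
output_text_queue = queue.Queue()

def echo_gemini_task():
    """Hypothetical stand-in for gemini_task: move one item across."""
    text = input_text_queue.get()
    input_text_queue.task_done()
    output_text_queue.put(f"echo: {text}")

worker = threading.Thread(target=echo_gemini_task, daemon=True)
worker.start()
input_text_queue.put("hello")
worker.join(timeout=5)
result = output_text_queue.get()
```

In the real app, main() would start one daemon thread per task (stt_task, gemini_task, tts_task, move_task) and seed stt_queue with True to begin listening.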

Part 7: Running the Application

Clone the Repository

cd ~
git clone https://github.com/lbaitemple/apps-md-robots.git
cd apps-md-robots

Configure Environment

cp env.sample .env
nano .env
# Add your API_KEY_PATH

Run the Application

python ai_app/ai_app.py

Part 8: Example Interactions

Basic Conversation

You: "Hello, how are you?"
Robot: "I'm doing great! How can I help you today?"

You: "What's the weather like?"
Robot: "I don't have access to weather data, but I can help with other questions!"

Movement Commands

You: "Move forwards"
Robot: "OK, move forwards." [Robot moves forward]

You: "Look up"
Robot: "OK, look up." [Robot looks up]

You: "Sit"
Robot: "OK, sit." [Robot sits]

Photo Analysis

You: "Take a picture and tell me what you see"
Robot: [Takes photo] "I can see a living room with a couch and a table..."

Language Switching

You: "Say hello in Japanese"
Robot: "こんにちは!" [Spoken in Japanese]

You: "Translate 'thank you' to French"
Robot: "Merci!" [Spoken in French]

Part 9: Customization

Add Custom Commands

# Add to move_cmd_functions
move_cmd_functions["dance"] = move_api.dance
move_cmd_functions["wave"] = move_api.wave_paw

# Add to sys_cmds_functions
sys_cmds_functions["tell me a joke"] = tell_joke

Change Default Voice

Edit .env:

LANGUAGE_CODE=en-GB
LANGUAGE_NAME=en-GB-Neural2-B

Adjust Response Style

Modify the initialization prompt in gemini_task():

init_prompt = """You are a friendly robot dog named Pupper. 
Be playful and enthusiastic in your responses. 
Keep answers short and fun!"""

Exercises

Exercise 1: Add Weather Command

Integrate a weather API to respond to weather queries.

Exercise 2: Custom Wake Word

Implement a wake word (like “Hey Pupper”) before processing commands.

Exercise 3: Emotion Display

Show different images on the LCD based on the conversation mood.

Exercise 4: Music Player

Add commands to play music files from the robot.


Troubleshooting

Issue                     Solution
-----------------------   --------------------------------------
No audio input            Check microphone: arecord -l
No audio output           Check speaker: aplay -l, verify volume
API errors                Verify credentials path in .env
Slow response             Check internet connection
Commands not recognized   Speak clearly, check language settings

Debug Mode

Enable detailed logging:

import logging
logging.basicConfig(level=logging.DEBUG)

Summary

In this lab, you learned:

  • How to set up Google Speech-to-Text and Text-to-Speech APIs
  • How to build a multi-threaded voice assistant
  • How to integrate voice commands with robot movement
  • How to support multiple languages
  • How to combine vision and voice for interactive AI

Reference