Aziza AI Assistant

Status History

May 29, 04:18 PM Pending

May 31, 12:55 PM In Progress

Intake Form

budget 1000

timeline 1

description

## Overview

This document defines the full scope, deliverables, technical approach, and validation criteria for the Aziza real-time voice AI system based on NVIDIA PersonaPlex (Moshi).

## Core Objectives

- Real-time full-duplex voice AI
- Persona + voice control
- 8–10 concurrent sessions
- <120ms latency
- Production deployment readiness

## Phases

### Phase 1: Performance & Scaling

- FP8 optimization
- Dynamic batching
- KV-cache tuning
- Load testing

**Deliverables**

- ≥8 sessions stable
- Benchmark report

### Phase 2: Multilingual Persona

- RU + UZ support
- LoRA/QLoRA fine-tuning
- Dataset + tokenizer work

**Deliverables**

- Fine-tuned model
- Training scripts

### Phase 3: Production Integration

- Asterisk integration
- API + orchestration
- Docker deployment

**Deliverables**

- Fully integrated system
- Documentation

## Acceptance Criteria

- Stable real-time voice interaction
- Meets latency + concurrency targets

Technical Specification

Frontend

React SPA (session monitor dashboard + persona config UI)

Backend

FastAPI (Python) REST + WebSocket API with asyncio concurrency

Database

PostgreSQL via Neon (session logs, persona configs, metrics) + Redis (session state, KV-cache coordination)

Hosting

Docker Compose on bare-metal / GPU server (NVIDIA A100/H100) with Nginx reverse proxy

Summary

Aziza is a production-grade, real-time, full-duplex voice AI system built on NVIDIA PersonaPlex (Moshi), designed to handle 8–10 concurrent telephone and WebSocket voice sessions with sub-120ms latency. The system features FP8-quantized inference with dynamic batching, per-session KV-cache management, multilingual support for Russian and Uzbek via QLoRA fine-tuned adapters, and deep Asterisk PBX integration for PSTN/SIP call routing. Its technical distinction lies in combining GPU-level inference optimization, language-specific persona control, and a complete telephony bridge, all orchestrated through a Dockerized microservice architecture with real-time observability.

File Structure

docker-compose.yml Orchestrates all services: FastAPI, Moshi inference engine, Redis, PostgreSQL, Nginx, Asterisk

Dockerfile.api FastAPI application container with Python 3.11, CUDA dependencies, and model client libraries

Dockerfile.inference NVIDIA PersonaPlex/Moshi inference server with FP8 TensorRT-LLM optimization and GPU passthrough

Dockerfile.asterisk Asterisk PBX container with AGI/ARI bridge configuration for PSTN/SIP integration

nginx/nginx.conf Reverse proxy config with WebSocket upgrade support, load balancing, and SSL termination

server/main.py FastAPI app entrypoint: registers routers, initializes session manager, connects to Redis and PostgreSQL

server/config.py Pydantic settings: model paths, GPU config, concurrency limits, latency thresholds, language settings

server/routers/sessions.py WebSocket endpoint for full-duplex voice streaming; manages session lifecycle, audio chunking, and response piping

server/routers/personas.py REST CRUD endpoints for persona definitions (name, voice profile, language, LoRA adapter reference)

server/routers/admin.py Admin endpoints for system health, active session count, GPU utilization, and benchmark trigger

server/routers/asterisk.py ARI webhook receiver and AGI handler for bridging Asterisk calls into the voice AI session pipeline

server/services/session_manager.py Manages up to 10 concurrent sessions with slot allocation, timeout handling, and graceful teardown

server/services/inference_client.py gRPC/HTTP2 client to the Moshi inference server; handles streaming audio in/out and dynamic batching requests

server/services/audio_pipeline.py Audio preprocessing: resampling, VAD (voice activity detection), codec handling (OPUS/PCM), and chunk buffering

server/services/persona_service.py Loads persona config and maps to correct LoRA adapter, voice embedding, and language tokenizer at session init

server/services/metrics_service.py Collects per-session latency, TTFB, concurrency stats; writes to PostgreSQL and exposes Prometheus metrics

server/models/session.py SQLAlchemy ORM model for session records: id, persona_id, language, start_time, end_time, avg_latency_ms

server/models/persona.py SQLAlchemy ORM model for persona: id, name, language, voice_profile_path, lora_adapter_path, system_prompt

server/db/database.py Async SQLAlchemy engine setup, session factory, and Alembic migration integration

server/db/redis_client.py Redis async client wrapper for session slot tracking, KV-cache state keys, and pub/sub coordination

inference/server.py Moshi inference server entrypoint: loads FP8-quantized model, initializes TensorRT engine, starts gRPC streaming service

inference/batching.py Dynamic batching logic: groups incoming audio frames across sessions to maximize GPU throughput within latency budget

inference/kv_cache.py KV-cache management: per-session cache allocation, eviction policy, and cache-hit optimization for streaming inference

inference/fp8_optimizer.py FP8 quantization utilities using TensorRT-LLM: model export, calibration dataset runner, and precision validation

training/finetune_lora.py QLoRA fine-tuning script for RU/UZ language adaptation using HuggingFace PEFT; configures rank, alpha, target modules

training/dataset_builder.py Builds multilingual voice dataset: audio segmentation, transcript alignment, language tagging for RU and UZ corpora

training/tokenizer_extension.py Extends base tokenizer with RU/UZ vocabulary, Cyrillic/Latin script handling, and phoneme mappings

training/train_config.yaml Training hyperparameters: learning rate, batch size, LoRA rank, warmup steps, evaluation strategy, checkpoint paths

asterisk/extensions.conf Dialplan configuration routing inbound calls to AGI script that bridges audio to the FastAPI session endpoint

asterisk/agi/voice_bridge.py AGI script: captures RTP audio from Asterisk, streams to FastAPI WebSocket, and plays back AI response audio

asterisk/ari_config.conf ARI (Asterisk REST Interface) configuration for programmatic call control and media streaming

benchmarks/load_test.py Locust-based load test simulating 8-10 concurrent voice sessions; measures latency percentiles and session stability

benchmarks/latency_report.py Parses load test results and generates benchmark report: p50/p95/p99 latency, concurrency ceiling, GPU memory usage

dashboard/src/App.tsx React app root: sets up routing between session monitor, persona manager, and system metrics views

dashboard/src/pages/SessionMonitor.tsx Live dashboard showing active sessions, per-session latency gauge, language/persona label, and session controls

dashboard/src/pages/PersonaManager.tsx CRUD UI for persona configurations: name, language selector (EN/RU/UZ), voice profile upload, LoRA adapter assignment

dashboard/src/pages/SystemMetrics.tsx GPU utilization charts, concurrency history, latency trends, and benchmark report viewer using Recharts

dashboard/src/hooks/useSessionSocket.ts Custom React hook managing WebSocket connection for real-time session audio streaming and status updates

dashboard/src/api/client.ts Axios client with base URL config, auth headers, and typed request/response interfaces for all REST endpoints

migrations/ Alembic migration scripts for PostgreSQL schema versioning across sessions, personas, and metrics tables

docs/architecture.md System architecture overview: component diagram, data flow, GPU pipeline, Asterisk integration, and scaling notes

docs/deployment.md Step-by-step production deployment guide: GPU driver setup, Docker Compose startup, Asterisk SIP trunk config, SSL setup

docs/api_reference.md Full API reference for WebSocket session protocol, REST endpoints, audio format specs, and error codes

docs/training_guide.md Guide for running LoRA fine-tuning: dataset preparation, training execution, adapter export, and model evaluation

.env.example Template for all required environment variables: DB URLs, Redis URL, GPU device IDs, model paths, Asterisk credentials

Features (12)

Full-Duplex Real-Time Voice Streaming P1

WebSocket-based bidirectional audio pipeline enabling simultaneous speech input and AI voice output with sub-120ms end-to-end latency.

WebSocket endpoint accepts PCM/OPUS audio chunks at 16kHz or 24kHz sample rate
AI-generated audio response begins streaming back within 120ms of last input voice activity
Full-duplex operation confirmed: AI can be interrupted mid-response by new user speech
Voice Activity Detection correctly segments speech turns without manual push-to-talk
Audio pipeline handles packet loss and jitter without session crash
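
The chunk-size and voice-activity criteria above can be illustrated with a minimal sketch. It assumes 16-bit little-endian mono PCM and a simple RMS energy threshold; the spec does not say which VAD method is used, and the names `frame_bytes`, `frame_rms`, `is_speech`, the 20 ms frame length, and the threshold of 500 are all hypothetical.

```python
import math
import struct


def frame_bytes(sample_rate_hz: int, frame_ms: int = 20) -> int:
    """Byte length of one mono 16-bit PCM frame of the given duration."""
    samples = sample_rate_hz * frame_ms // 1000
    return samples * 2  # 2 bytes per 16-bit sample


def frame_rms(pcm: bytes) -> float:
    """Root-mean-square amplitude of little-endian 16-bit mono PCM."""
    n = len(pcm) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", pcm[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


def is_speech(pcm: bytes, threshold: float = 500.0) -> bool:
    """Energy-based VAD: treat a frame as speech if its RMS reaches the threshold.

    The threshold is an illustrative assumption and would need tuning
    against real call audio.
    """
    return frame_rms(pcm) >= threshold
```

Under these assumptions a 20 ms frame is 640 bytes at 16kHz and 960 bytes at 24kHz. A production pipeline would more likely use a trained VAD model, since a fixed energy threshold is sensitive to background noise.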

Concurrent Session Management (8–10 Sessions) P1

Session slot manager that supports 8–10 simultaneous independent voice AI sessions with isolated state and graceful capacity enforcement.

System accepts and maintains up to 10 concurrent WebSocket voice sessions without degradation
11th connection attempt receives a 503 with queue position or rejection message
Each session maintains fully isolated KV-cache, persona context, and audio buffer
Session cleanup on disconnect frees GPU memory and Redis slot within 2 seconds
Load test confirms p95 latency remains under 120ms at 10 concurrent sessions
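
The slot allocation and capacity enforcement described above might look roughly like the following sketch. It is an in-process illustration only: the spec places slot tracking in Redis, and the names `SessionSlotManager` and `CapacityExceeded` are hypothetical.

```python
import asyncio


class CapacityExceeded(Exception):
    """Raised when every session slot is already in use."""


class SessionSlotManager:
    """Hands out a fixed number of session slots and reclaims them on release."""

    def __init__(self, max_sessions: int = 10) -> None:
        self.max_sessions = max_sessions
        self._active: dict[str, int] = {}
        # Reversed so that pop() hands out slot 0 first.
        self._free = list(range(max_sessions - 1, -1, -1))
        self._lock = asyncio.Lock()

    @property
    def active_count(self) -> int:
        return len(self._active)

    async def acquire(self, session_id: str) -> int:
        async with self._lock:
            if session_id in self._active:
                return self._active[session_id]  # idempotent for reconnects
            if not self._free:
                raise CapacityExceeded(f"all {self.max_sessions} slots are in use")
            slot = self._free.pop()
            self._active[session_id] = slot
            return slot

    async def release(self, session_id: str) -> None:
        async with self._lock:
            slot = self._active.pop(session_id, None)
            if slot is not None:
                self._free.append(slot)
```

A caller at the API layer could catch `CapacityExceeded` and translate it into the 503 response the criteria call for.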

FP8 Inference Optimization P1

TensorRT-LLM FP8 quantization of the Moshi model to maximize GPU throughput and minimize per-token latency for real-time voice generation.

FP8-quantized model loads successfully on target NVIDIA GPU (A100/H100)
Benchmark shows ≥40% throughput improvement over FP16 baseline
Voice output quality degradation is imperceptible in blind A/B listening test
FP8 model supports dynamic batching across concurrent sessions
Calibration and export scripts run end-to-end without manual intervention

Dynamic Batching Engine P1

Inference batching layer that groups audio frames from multiple concurrent sessions to maximize GPU utilization while respecting per-session latency budgets.

Batching window is configurable (default 20ms) and auto-tunes based on session count
No single session waits more than one batch window beyond its natural latency
GPU utilization increases by ≥30% compared to per-session sequential inference
Batch formation logic handles variable audio chunk sizes across sessions
Metrics endpoint reports current batch sizes and queue depths in real time
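
One way to read the batching-window criteria is sketched below: wait for the first frame, then keep collecting frames from other sessions until the window closes or the batch is full. This is an assumption about the design, not the actual `inference/batching.py`; `AudioFrame` and `collect_batch` are hypothetical names, and auto-tuning of the window is omitted.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class AudioFrame:
    session_id: str
    payload: bytes


async def collect_batch(
    queue: asyncio.Queue,
    window_s: float = 0.020,
    max_batch: int = 10,
) -> list:
    """Collect frames for at most `window_s` seconds after the first one arrives."""
    loop = asyncio.get_running_loop()
    batch = [await queue.get()]  # block until there is work to do
    deadline = loop.time() + window_s
    while len(batch) < max_batch:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break  # window closed; run inference with what we have
    return batch
```

Because the window starts at the first frame, no frame in a batch waits longer than one window before inference starts, which matches the per-session bound stated in the criteria.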

KV-Cache Per-Session Management P2

Persistent KV-cache allocation per session enabling coherent multi-turn conversation context without recomputation overhead.

Each session is allocated a dedicated KV-cache slot at connection time
Cache persists conversation context for the full session duration (up to 30 minutes)
Cache eviction policy prevents OOM under maximum concurrency load
Cache hit rate ≥85% for mid-conversation turns reported in metrics
Cache state is fully cleared and GPU memory reclaimed on session end
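
The spec does not name an eviction policy. As one possible reading, the sketch below tracks per-session cache sizes against a byte budget and evicts the least recently used sessions when the budget is exceeded. `KVCacheRegistry` and its methods are hypothetical, and it only does bookkeeping; actual GPU memory handling is out of scope here.

```python
from collections import OrderedDict


class KVCacheRegistry:
    """Tracks per-session KV-cache sizes and picks LRU sessions to evict."""

    def __init__(self, budget_bytes: int) -> None:
        self.budget_bytes = budget_bytes
        self._sizes: "OrderedDict[str, int]" = OrderedDict()  # oldest first

    @property
    def used_bytes(self) -> int:
        return sum(self._sizes.values())

    def touch(self, session_id: str, size_bytes: int) -> list:
        """Record the session's current cache size and mark it most recently used.

        Returns the ids of sessions evicted to get back under budget. The
        session being touched is never evicted.
        """
        self._sizes[session_id] = size_bytes
        self._sizes.move_to_end(session_id)
        evicted = []
        while self.used_bytes > self.budget_bytes and len(self._sizes) > 1:
            victim, _ = self._sizes.popitem(last=False)
            evicted.append(victim)
        return evicted

    def release(self, session_id: str) -> None:
        """Forget a session's cache when the session ends."""
        self._sizes.pop(session_id, None)
```

Evicting a live session conflicts with the full-duration persistence criterion, so in practice the budget would need to be sized so that eviction is a last resort, with the evicted session's context recomputed or the session ended.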

Multilingual Support (Russian + Uzbek) P2

LoRA/QLoRA fine-tuned model adapters enabling natural voice AI interaction in Russian and Uzbek languages alongside English.

Model correctly identifies and responds in Russian when spoken to in Russian
Model correctly identifies and responds in Uzbek when spoken to in Uzbek
Word Error Rate (WER) for RU ≤15% on held-out test set
Word Error Rate (WER) for UZ ≤20% on held-out test set
Language switching mid-session is handled gracefully without session restart
LoRA adapters load in under 500ms at session initialization
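
For reference, the WER targets above are conventionally computed as word-level edit distance divided by the number of reference words. The sketch below shows that standard calculation; it assumes whitespace tokenization with no text normalization, which a real RU/UZ evaluation would likely need to add.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Rolling-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(
                min(
                    prev[j] + 1,             # deletion
                    cur[j - 1] + 1,          # insertion
                    prev[j - 1] + (r != h),  # substitution or match
                )
            )
        prev = cur
    return prev[-1] / len(ref)
```

For example, a four-word reference with one substituted and one dropped word gives a WER of 0.5, well above the ≤15% and ≤20% thresholds.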

Persona Configuration System P2

CRUD system for defining AI personas with distinct names, voice profiles, system prompts, language preferences, and associated LoRA adapters.

Admin can create, update, and delete personas via REST API and dashboard UI
Each persona stores: name, language, voice embedding path, LoRA adapter reference, system prompt
Persona is applied at session initialization and governs all AI responses in that session
Voice profile produces perceptibly distinct voice characteristics between different personas
Default persona is used when no persona_id is specified at session start
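
The persona fields and default-fallback rule above can be sketched as a plain data structure plus a lookup. The field names follow the `server/models/persona.py` description; the `resolve_persona` helper and the `"default"` id are assumptions for illustration, and the real model is an SQLAlchemy ORM class rather than a dataclass.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Persona:
    id: str
    name: str
    language: str
    voice_profile_path: str
    lora_adapter_path: Optional[str]
    system_prompt: str


def resolve_persona(
    personas: dict,
    persona_id: Optional[str],
    default_id: str = "default",
) -> Persona:
    """Return the requested persona, or the default when no id is given."""
    if persona_id is None:
        return personas[default_id]
    try:
        return personas[persona_id]
    except KeyError:
        raise LookupError(f"unknown persona_id: {persona_id}") from None
```

An unknown id raises instead of silently falling back, so a typo in a client request does not start a session with the wrong voice. Whether the real system behaves this way is not stated in the spec.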

Asterisk PBX Integration P2

Full integration with Asterisk via AGI and ARI to route inbound telephone calls into the real-time voice AI session pipeline.

Inbound SIP/PSTN call to configured DID triggers AGI script and creates a voice AI session
RTP audio from Asterisk is correctly transcoded and streamed to the FastAPI WebSocket endpoint
AI response audio is played back to caller with no audible artifacts or sync issues
Call hangup triggers clean session teardown within 3 seconds
System handles 8 simultaneous inbound calls without audio quality degradation
ARI-based call control supports transfer, hold, and DTMF passthrough
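
The transcoding criterion does not say which codec the SIP trunk delivers. Assuming G.711 µ-law, a common telephony default, the decode step to 16-bit PCM could be sketched as follows. The function names are hypothetical, and resampling from 8kHz to the 16kHz or 24kHz the WebSocket endpoint accepts is not shown.

```python
import struct

_BIAS = 0x84  # 132, the bias G.711 mu-law adds before encoding


def ulaw_byte_to_pcm16(byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~byte & 0xFF  # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + _BIAS) << exponent) - _BIAS
    return -magnitude if sign else magnitude


def ulaw_to_pcm16(payload: bytes) -> bytes:
    """Decode a mu-law RTP payload to little-endian 16-bit PCM bytes."""
    samples = [ulaw_byte_to_pcm16(b) for b in payload]
    return struct.pack(f"<{len(samples)}h", *samples)
```

A 20 ms µ-law packet at 8kHz carries 160 bytes and decodes to 320 bytes of PCM. If the trunk uses A-law or Opus instead, this step would differ.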

QLoRA Fine-Tuning Pipeline P2

End-to-end training pipeline for fine-tuning the Moshi base model on RU/UZ voice datasets using QLoRA with configurable hyperparameters.

Training script runs on single A100 80GB GPU without OOM errors
Dataset builder produces correctly formatted and language-tagged training samples
Tokenizer extension adds RU/UZ vocabulary without corrupting base model tokenization
Training run produces checkpoint every N steps with automatic best-model selection
Exported LoRA adapter integrates with inference server without format conversion
Training guide documentation enables reproduction of results by a new engineer

Observability & Metrics Dashboard P3

Real-time monitoring dashboard and Prometheus metrics endpoint tracking session health, latency, GPU utilization, and concurrency.

Dashboard displays all active sessions with per-session latency, language, and persona in real time
Prometheus /metrics endpoint exposes: active_sessions, p50/p95/p99 latency, gpu_utilization, batch_size
Historical latency and concurrency charts available for last 24 hours
Alert threshold configurable for latency breaches above 120ms
Benchmark report page renders load test results with pass/fail against acceptance criteria

Load Testing & Benchmark Reporting P3

Automated load testing suite that validates the system meets concurrency and latency acceptance criteria and generates a structured benchmark report.

Load test simulates 10 concurrent voice sessions with realistic audio input streams
Test runs for minimum 10 minutes to validate session stability
Report captures: max stable concurrency, p50/p95/p99 TTFB, GPU memory peak, error rate
System passes if: ≥8 stable sessions AND p95 latency ≤120ms AND error rate ≤0.1%
Report is auto-generated as JSON and HTML after each test run
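
The pass/fail rule above can be expressed directly in code. The sketch below assumes nearest-rank percentiles and a flat list of latency samples; `BenchmarkResult` and `evaluate` are hypothetical names, and the actual report format produced by `benchmarks/latency_report.py` is not specified.

```python
import math
from dataclasses import dataclass


def percentile(values: list, pct: int) -> float:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


@dataclass
class BenchmarkResult:
    stable_sessions: int
    latencies_ms: list
    requests: int
    errors: int


def evaluate(result: BenchmarkResult) -> dict:
    """Apply the stated criteria: >=8 sessions, p95 <= 120 ms, error rate <= 0.1%."""
    p95 = percentile(result.latencies_ms, 95)
    error_rate = result.errors / result.requests if result.requests else 0.0
    checks = {
        "concurrency": result.stable_sessions >= 8,
        "p95_latency": p95 <= 120.0,
        "error_rate": error_rate <= 0.001,
    }
    return {
        "p50_ms": percentile(result.latencies_ms, 50),
        "p95_ms": p95,
        "p99_ms": percentile(result.latencies_ms, 99),
        "error_rate": error_rate,
        "checks": checks,
        "passed": all(checks.values()),
    }
```

The returned dictionary serializes cleanly with `json.dumps`, which fits the JSON half of the auto-generated report requirement.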

Docker Compose Production Deployment P3

Fully containerized deployment stack with GPU passthrough, service health checks, restart policies, and environment-based configuration.

docker-compose up --build starts all services including inference server with GPU access
All services have health checks and restart: unless-stopped policies
Environment variables fully control deployment without code changes
Nginx correctly terminates SSL and proxies WebSocket connections
Deployment guide enables a new engineer to stand up the full system in under 2 hours
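
Environment-driven configuration of the kind described above could be loaded as in the following sketch. The spec says `server/config.py` uses Pydantic settings; this version uses only the standard library to stay self-contained, and the variable names (`DATABASE_URL`, `REDIS_URL`, `MAX_SESSIONS`, `LATENCY_BUDGET_MS`) are assumptions rather than the contents of `.env.example`.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    database_url: str
    redis_url: str
    max_sessions: int
    latency_budget_ms: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from environment variables, failing fast on missing ones."""
    required = ("DATABASE_URL", "REDIS_URL")
    missing = [name for name in required if name not in env]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return Settings(
        database_url=env["DATABASE_URL"],
        redis_url=env["REDIS_URL"],
        max_sessions=int(env.get("MAX_SESSIONS", "10")),
        latency_budget_ms=int(env.get("LATENCY_BUDGET_MS", "120")),
    )
```

Failing at startup on a missing variable lets the container health check and restart policy surface a misconfiguration immediately rather than at the first request.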

Build Log

scoping Starting AI-powered tech spec generation

scoping Tech spec generated successfully

start Build orchestration started for project 7

attempt Build attempt 1/3

generate Attempt 1 failed: AI generation failed: Unexpected end of JSON input

retry Retrying (2/3)...

attempt Build attempt 2/3

generate Attempt 2 failed: AI generation failed: 429 Daily token limit reached (100,000 tokens). Resets at midnight UTC.

retry Retrying (3/3)...

attempt Build attempt 3/3

generate Attempt 3 failed: AI generation failed: 429 Daily token limit reached (100,000 tokens). Resets at midnight UTC.

complete Build failed after 3 attempts

status Project status updated to Build Failed

Deliverables

Deliverables become available once project reaches Review status.