Aziza Ai Assistant

Status History

May 29, 04:37 PM Pending

May 31, 12:55 PM In Progress

Intake Form

budget 1000

timeline 1 week

description

# Aziza Project – Scope of Work

## Overview

This document defines the full scope, deliverables, technical approach, and validation criteria for the Aziza real-time voice AI system based on NVIDIA PersonaPlex (Moshi).

## Core Objectives

- Real-time full-duplex voice AI
- Persona + voice control
- 8–10 concurrent sessions
- <120ms latency
- Production deployment readiness

## Phases

### Phase 1: Performance & Scaling

- FP8 optimization
- Dynamic batching
- KV-cache tuning
- Load testing

**Deliverables**

- ≥8 sessions stable
- Benchmark report

### Phase 2: Multilingual Persona

- RU + UZ support
- LoRA/QLoRA fine-tuning
- Dataset + tokenizer work

**Deliverables**

- Fine-tuned model
- Training scripts

### Phase 3: Production Integration

- Asterisk integration
- API + orchestration
- Docker deployment

**Deliverables**

- Fully integrated system
- Documentation

## Acceptance Criteria

- Stable real-time voice interaction
- Meets latency + concurrency targets

Technical Specification

Frontend

React SPA (session monitor dashboard + persona config UI)

Backend

FastAPI (Python) — WebSocket + REST API for voice session orchestration

Database

PostgreSQL via Neon (session logs, persona configs, benchmark results)

Hosting

Docker Compose on bare-metal / GPU server (FP8 requires an FP8-capable NVIDIA GPU such as the H100; the A100 has no native FP8 support)

Summary

Aziza is a real-time, full-duplex voice AI system built on NVIDIA PersonaPlex (Moshi) and designed to handle 8–10 simultaneous voice conversations with sub-120ms end-to-end latency. It combines FP8 model quantization, dynamic cross-session batching, and KV-cache tuning to maximize GPU utilization, and uses LoRA fine-tuning to add Russian and Uzbek personas alongside English. Beyond the model work, the scope covers Asterisk PBX integration for routing real telephone calls, a live operator monitoring dashboard, and a fully containerized GPU deployment stack, so the target is a deployable conversational AI platform rather than a research prototype.

File Structure

docker-compose.yml Orchestrates all services: FastAPI, Moshi inference engine, Asterisk, Redis, PostgreSQL, and dashboard

Dockerfile.api FastAPI application container with Python 3.11, CUDA dependencies, and voice processing libraries

Dockerfile.inference NVIDIA PersonaPlex/Moshi inference server container with FP8 support and TensorRT-LLM

Dockerfile.asterisk Asterisk PBX container with AGI/ARI bridge configured for real-time audio streaming

api/main.py FastAPI entrypoint — registers routers, initializes WebSocket manager, connects to DB and Redis

api/routers/sessions.py REST + WebSocket endpoints for creating, managing, and terminating voice AI sessions

api/routers/personas.py CRUD endpoints for persona definitions including voice profile, language, and behavioral parameters

api/routers/benchmarks.py Endpoints to trigger load tests, retrieve benchmark reports, and export latency/concurrency metrics

api/routers/health.py Health check and readiness probe endpoints for Docker/orchestration liveness checks

api/services/session_manager.py Core session orchestration logic — manages up to 10 concurrent full-duplex WebSocket voice sessions with slot allocation

api/services/inference_client.py gRPC/HTTP client to communicate with the Moshi inference server, handles audio chunk streaming and response buffering

api/services/asterisk_bridge.py ARI (Asterisk REST Interface) bridge — connects inbound SIP calls to active AI sessions via audio relay

api/services/latency_tracker.py Real-time latency measurement service — tracks end-to-end audio round-trip and logs P50/P95/P99 metrics

api/services/persona_loader.py Loads persona configs from DB, applies LoRA adapter weights, and configures voice synthesis parameters per session

api/models/session.py SQLAlchemy ORM model for voice sessions including state, persona_id, latency stats, and timestamps

api/models/persona.py SQLAlchemy ORM model for persona definitions — name, language (en/ru/uz), voice embedding, system prompt

api/models/benchmark.py SQLAlchemy ORM model for storing benchmark run results including concurrency level, avg latency, and pass/fail status

api/websocket/audio_handler.py WebSocket handler for bidirectional raw audio streaming — receives mic input, forwards to inference, streams back synthesized audio

api/websocket/connection_manager.py Manages WebSocket connection pool, enforces max 10 concurrent session limit, handles graceful disconnects

inference/server.py Moshi/PersonaPlex inference server — loads FP8-quantized model, exposes gRPC streaming endpoint for audio chunk processing

inference/optimization/fp8_loader.py Loads and applies FP8 quantization to Moshi model weights using TensorRT-LLM or transformer-engine

inference/optimization/dynamic_batcher.py Dynamic batching logic — accumulates audio chunks across concurrent sessions and batches inference calls to maximize GPU utilization

inference/optimization/kv_cache_config.py KV-cache configuration and tuning — sets cache size, eviction policy, and per-session memory allocation for 8-10 concurrent contexts

training/finetune_lora.py LoRA/QLoRA fine-tuning script for adding Russian and Uzbek language support to the base Moshi model

training/dataset_prep.py Dataset preparation pipeline — processes RU/UZ audio datasets, generates transcripts, and formats for Moshi training

training/tokenizer_extension.py Extends base tokenizer with Russian and Uzbek vocabulary, handles Cyrillic and Latin Uzbek script variants

training/configs/lora_ru.yaml LoRA hyperparameter config for Russian language fine-tuning — rank, alpha, target modules, learning rate schedule

training/configs/lora_uz.yaml LoRA hyperparameter config for Uzbek language fine-tuning — rank, alpha, target modules, learning rate schedule

benchmarks/load_test.py Load testing script using asyncio — simulates 8-10 concurrent voice sessions, measures latency distribution and session stability

benchmarks/report_generator.py Generates structured benchmark reports in JSON and PDF formats with latency charts and pass/fail assessment against targets

asterisk/extensions.conf Asterisk dialplan configuration routing inbound SIP calls to the ARI application for AI session handoff

asterisk/ari_app.py Asterisk ARI application — answers calls, establishes audio bridge to FastAPI session manager via WebSocket relay

asterisk/sip.conf SIP trunk and endpoint configuration for Asterisk PBX integration

dashboard/src/App.jsx Root React component — renders session monitor, persona manager, and benchmark dashboard views

dashboard/src/components/SessionMonitor.jsx Real-time display of active voice sessions — shows session ID, persona, language, current latency, and duration

dashboard/src/components/PersonaEditor.jsx Form UI for creating and editing AI personas — name, language selector (EN/RU/UZ), voice profile, system prompt

dashboard/src/components/BenchmarkPanel.jsx Benchmark control panel — trigger load tests, display live progress, render latency charts and final report

dashboard/src/components/LatencyGauge.jsx Real-time latency gauge component showing current P95 latency vs 120ms target threshold with color-coded status

dashboard/src/hooks/useSessionWebSocket.js Custom React hook for subscribing to session state updates via WebSocket for live dashboard refresh

dashboard/src/api/client.js Axios-based API client with base URL config and request interceptors for dashboard-to-backend communication

migrations/001_initial_schema.sql Initial PostgreSQL schema — creates sessions, personas, and benchmark_results tables with indexes

config/settings.py Pydantic settings model — loads all environment variables for DB URL, Redis, inference server address, Asterisk ARI credentials

config/logging.yaml Structured logging configuration with JSON output, log levels per module, and session_id correlation field

scripts/start_inference.sh Shell script to launch inference server with correct CUDA device flags, FP8 mode, and memory allocation parameters

scripts/run_benchmark.sh Convenience script to execute full benchmark suite and output report to /reports directory

docs/architecture.md System architecture documentation — component diagram, data flow, latency budget breakdown, and scaling notes

docs/deployment.md Step-by-step production deployment guide — GPU requirements, Docker Compose setup, Asterisk config, and environment variables

docs/training_guide.md Guide for running LoRA fine-tuning — dataset requirements, training commands, adapter merging, and evaluation procedure

tests/test_session_manager.py Unit and integration tests for session manager — concurrency limits, slot allocation, and graceful teardown

tests/test_latency.py Latency measurement tests — validates end-to-end audio round-trip stays under 120ms threshold under simulated load

tests/test_persona_api.py API tests for persona CRUD endpoints — create, update, delete, and language validation

.env.example Example environment file documenting all required variables — DB URL, Redis URL, inference host, Asterisk ARI URL, GPU device IDs

Features (10)

Full-Duplex Real-Time Voice Streaming P1

Bidirectional WebSocket audio pipeline enabling simultaneous speech input and AI voice output with sub-120ms end-to-end latency.

WebSocket endpoint accepts raw PCM audio chunks at 16kHz/16-bit from client
Synthesized AI audio response begins streaming back within 120ms of final input chunk
P95 latency measured across 100 consecutive exchanges remains below 120ms
Full-duplex operation confirmed — AI can be interrupted mid-utterance by new user speech
No audio artifacts or buffer underruns detected during 10-minute continuous session
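
The P95 criterion above depends on how percentiles are computed from the recorded round-trip times. A minimal sketch, assuming latency samples are collected in milliseconds and that the nearest-rank method is acceptable; the function names `percentile` and `summarize` are illustrative and not taken from the project:

```python
import math

TARGET_MS = 120.0  # latency target from the acceptance criteria


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latency samples (ms)."""
    if not samples:
        raise ValueError("at least one sample is required")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Nearest rank: smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(samples):
    """Return P50/P95/P99 and whether P95 stays below the target."""
    p95 = percentile(samples, 95)
    return {
        "p50": percentile(samples, 50),
        "p95": p95,
        "p99": percentile(samples, 99),
        "meets_target": p95 < TARGET_MS,
    }
```

With 100 exchanges, the nearest-rank P95 is the 95th smallest sample, so up to five outliers above 120ms would still pass. Whether that tolerance is acceptable is a decision for the benchmark design.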

Concurrent Session Management (8–10 Sessions) P1

Session orchestration layer that maintains up to 10 simultaneous independent voice AI sessions with isolated context and persona state.

System accepts and maintains up to 10 concurrent WebSocket voice sessions without degradation
11th connection attempt returns HTTP 503 with clear capacity error message
Each session maintains fully isolated KV-cache context — no cross-session bleed
All 10 sessions simultaneously sustain <120ms latency under load test
Session cleanup releases GPU memory within 2 seconds of disconnect
Load test benchmark report confirms ≥8 stable sessions as deliverable threshold
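
The slot-limit behaviour described above (10 sessions accepted, the 11th rejected, slots freed on disconnect) can be sketched as follows. This is an illustration under assumptions, not the project's `session_manager.py`: the names `SlotAllocator` and `CapacityError` are hypothetical, and the sketch assumes all calls happen on a single asyncio event loop, so no lock is needed because nothing awaits between the capacity check and the assignment.

```python
import heapq

MAX_SESSIONS = 10  # upper bound from the acceptance criteria


class CapacityError(Exception):
    """Raised when every slot is taken; the API layer could map this to HTTP 503."""


class SlotAllocator:
    """Hands out numbered session slots and reclaims them on release."""

    def __init__(self, max_sessions=MAX_SESSIONS):
        self._free = list(range(max_sessions))  # min-heap of free slot numbers
        heapq.heapify(self._free)
        self._owner = {}  # session_id -> slot number

    @property
    def active(self):
        return len(self._owner)

    def acquire(self, session_id):
        if session_id in self._owner:
            raise ValueError(f"session {session_id!r} already holds a slot")
        if not self._free:
            raise CapacityError("all session slots are in use")
        slot = heapq.heappop(self._free)  # lowest free slot first
        self._owner[session_id] = slot
        return slot

    def release(self, session_id):
        # Releasing an unknown session is a no-op so that teardown is idempotent.
        slot = self._owner.pop(session_id, None)
        if slot is not None:
            heapq.heappush(self._free, slot)
```

In a FastAPI handler, one option is to call `acquire` before accepting the WebSocket and `release` in a `finally` block, so that a dropped connection always returns its slot.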

FP8 Model Optimization & Dynamic Batching P1

GPU inference optimization using FP8 quantization and dynamic cross-session batching to achieve required concurrency and latency targets on a single GPU.

Moshi model loads successfully in FP8 precision using TensorRT-LLM or transformer-engine
FP8 model VRAM footprint is ≤50% of BF16 baseline measured via nvidia-smi
Dynamic batcher aggregates audio chunks from multiple sessions into single GPU kernel calls
Benchmark shows ≥40% throughput improvement vs non-batched baseline
Model output quality (MOS score or human eval) does not degrade more than 5% vs BF16 baseline
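
The dynamic batching idea above, collecting chunks from several sessions and running them as one call, can be sketched with asyncio. This is a simplified illustration rather than the project's `dynamic_batcher.py`: `DynamicBatcher`, `max_batch`, and `max_wait_s` are hypothetical names, and `run_batch` is a stand-in for a batched model forward pass. The batcher should be constructed inside a running event loop.

```python
import asyncio


class DynamicBatcher:
    """Groups per-session chunks into batches bounded by size and wait time."""

    def __init__(self, run_batch, max_batch=10, max_wait_s=0.005):
        self._run_batch = run_batch    # callable: list of (session_id, chunk) -> list of outputs
        self._max_batch = max_batch    # flush once this many chunks are queued
        self._max_wait_s = max_wait_s  # or once the oldest chunk has waited this long
        self._queue = asyncio.Queue()

    async def submit(self, session_id, chunk):
        """Queue one chunk and wait for its result from the next batch."""
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((session_id, chunk, future))
        return await future

    async def run(self):
        """Worker loop: build a batch, run it, and hand results back."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]  # block until there is work
            deadline = loop.time() + self._max_wait_s
            while len(batch) < self._max_batch:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    item = await asyncio.wait_for(self._queue.get(), remaining)
                except asyncio.TimeoutError:
                    break
                batch.append(item)
            # Assumption: run_batch returns one output per input, in order.
            outputs = self._run_batch([(sid, chunk) for sid, chunk, _ in batch])
            for (_, _, future), output in zip(batch, outputs):
                if not future.done():
                    future.set_result(output)
```

The `max_wait_s` value trades throughput against latency: every millisecond spent waiting for more chunks comes out of the 120ms budget. Because `run_batch` is called synchronously here, a real GPU call would block the event loop and would more likely be dispatched to an executor or a separate process.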

KV-Cache Tuning for Multi-Session Contexts P2

Per-session KV-cache configuration ensuring each of the 10 concurrent sessions retains sufficient conversational context without OOM errors.

KV-cache size per session is configurable via environment variable
10 concurrent sessions each maintain minimum 2048-token context window
No CUDA OOM errors observed during 30-minute 10-session load test
Cache eviction policy (sliding window or LRU) is documented and configurable
Memory allocation report included in benchmark output
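
Whether 10 sessions with 2048-token contexts fit in GPU memory can be estimated before any load test. The sketch below uses the standard per-token KV-cache size for a transformer (keys plus values, per layer, per KV head). The model dimensions in the usage note are placeholders chosen for round numbers, not the actual Moshi/PersonaPlex configuration, and the environment variable name `KV_CACHE_TOKENS` is hypothetical.

```python
import os

DEFAULT_TOKENS = 2048  # minimum per-session context from the criteria above


def tokens_from_env(env=os.environ):
    """Read the per-session context length; KV_CACHE_TOKENS is a hypothetical name."""
    value = int(env.get("KV_CACHE_TOKENS", DEFAULT_TOKENS))
    if value <= 0:
        raise ValueError("KV_CACHE_TOKENS must be positive")
    return value


def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    """Bytes for one session: 2 tensors (K and V) per layer, per head, per token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem


def fits_budget(sessions, per_session_bytes, budget_bytes):
    """True if all sessions' caches fit within the memory set aside for them."""
    return sessions * per_session_bytes <= budget_bytes
```

As a worked example with placeholder dimensions (32 layers, 32 KV heads, head dimension 128, 2-byte elements), `kv_cache_bytes(2048, 32, 32, 128, 2)` is 2^30 bytes, or 1 GiB per session, so 10 sessions would need 10 GiB of cache on top of the model weights. The real figure depends on the actual model dimensions and cache precision.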

Multilingual Persona Support (EN / RU / UZ) P2

LoRA/QLoRA fine-tuned model adapters enabling natural Russian and Uzbek language voice interaction alongside the base English capability.

Fine-tuned LoRA adapters for Russian and Uzbek are produced and loadable at runtime
Model correctly responds in Russian when session language is set to 'ru'
Model correctly responds in Uzbek (Latin script) when session language is set to 'uz'
Word Error Rate (WER) on RU/UZ held-out test set is ≤20%
Language switching between sessions does not require inference server restart
Training scripts are reproducible with documented dataset sources and preprocessing steps
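
The WER criterion above is the word-level edit distance between the reference transcript and the recognized text, divided by the number of reference words. A minimal sketch, assuming whitespace tokenization and no text normalization (case folding, punctuation stripping, and Cyrillic/Latin handling for Uzbek would need to be decided separately):

```python
def wer(reference, hypothesis):
    """Word Error Rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)
```

For example, `wer("привет как дела", "привет дела")` counts one deletion against three reference words. Corpus-level WER is usually computed by summing edit counts and reference lengths across all utterances rather than averaging per-utterance rates.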

Persona Configuration & Voice Profile Management P2

API and UI for defining named AI personas with distinct voice embeddings, system prompts, and language assignments that persist across sessions.

REST API supports full CRUD for persona objects with name, language, system_prompt, and voice_embedding fields
Persona is applied to a session at creation time and cannot be changed mid-session
Two distinct personas produce audibly different voice characteristics in A/B test
Persona configs persist in PostgreSQL and survive service restart
Dashboard UI allows creating, editing, and deleting personas with form validation
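
The persona fields and the "fixed at session creation" rule above can be sketched with frozen dataclasses. This is an illustration using only the standard library; the spec lists Pydantic and SQLAlchemy for the real models, and the `Session` class here is a hypothetical stand-in.

```python
from dataclasses import dataclass
from typing import Tuple

SUPPORTED_LANGUAGES = frozenset({"en", "ru", "uz"})


@dataclass(frozen=True)
class Persona:
    """Persona fields named in the acceptance criteria."""
    name: str
    language: str
    system_prompt: str
    voice_embedding: Tuple[float, ...]  # assumption: a flat vector of floats

    def __post_init__(self):
        if not self.name.strip():
            raise ValueError("name must not be empty")
        if self.language not in SUPPORTED_LANGUAGES:
            raise ValueError(f"unsupported language: {self.language!r}")


@dataclass(frozen=True)
class Session:
    """Hypothetical session record; frozen so the persona cannot be swapped mid-session."""
    session_id: str
    persona: Persona
```

Assigning to `session.persona` after creation raises `dataclasses.FrozenInstanceError`, which is one way to enforce the rule at the object level; the API would still need to reject update requests for active sessions.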

Asterisk PBX Integration P2

SIP telephony integration via Asterisk ARI that routes inbound phone calls to available AI voice sessions transparently.

Inbound SIP call to configured DID triggers ARI application and claims an available AI session slot
Audio from phone call is relayed bidirectionally to/from the AI session with <150ms added latency
Call hangup triggers clean session termination and slot release
Busy signal or rejection message played when all 10 session slots are occupied
Asterisk integration tested with at least 3 simultaneous SIP calls in staging environment

Real-Time Monitoring Dashboard P3

React-based operator dashboard providing live visibility into active sessions, per-session latency, persona assignments, and system health.

Dashboard displays all active sessions with session ID, persona name, language, duration, and current latency
Latency gauge updates in real-time (≤1s refresh) and turns red when P95 exceeds 120ms
Session list updates within 2 seconds of a new session connecting or disconnecting
Dashboard is accessible at port 3000 and requires no authentication for MVP
Works correctly in Chrome and Firefox on desktop

Benchmark & Load Testing Suite P3

Automated load testing tooling that simulates maximum concurrent sessions and generates a structured benchmark report validating acceptance criteria.

Load test script simulates 8 and 10 concurrent voice sessions using pre-recorded audio fixtures
Report captures P50, P95, P99 latency, session success rate, and GPU memory utilization
Report is output as both JSON and human-readable PDF/HTML
Test run is executable via single shell command: ./scripts/run_benchmark.sh
Report clearly indicates PASS or FAIL against the <120ms P95 and ≥8 sessions targets
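
The PASS/FAIL decision above reduces to two checks against the stated targets. A minimal sketch of the JSON side of the report, assuming latency percentiles have already been computed; the field names and the functions `build_report` and `to_json` are illustrative, not the project's report schema:

```python
import json

P95_TARGET_MS = 120.0    # P95 latency target from the criteria above
MIN_STABLE_SESSIONS = 8  # deliverable threshold from the criteria above


def build_report(concurrency, stable_sessions, latency_ms):
    """Assemble a report dict; latency_ms holds 'p50', 'p95', and 'p99' in ms."""
    if concurrency <= 0:
        raise ValueError("concurrency must be positive")
    checks = {
        "p95_below_target": latency_ms["p95"] < P95_TARGET_MS,
        "min_stable_sessions": stable_sessions >= MIN_STABLE_SESSIONS,
    }
    return {
        "concurrency": concurrency,
        "stable_sessions": stable_sessions,
        "session_success_rate": stable_sessions / concurrency,
        "latency_ms": latency_ms,
        "checks": checks,
        "verdict": "PASS" if all(checks.values()) else "FAIL",
    }


def to_json(report):
    """Serialize with stable key order so reports diff cleanly between runs."""
    return json.dumps(report, indent=2, sort_keys=True)
```

Keeping the individual checks in the output, rather than only the verdict, shows which target a failing run missed. GPU memory utilization, which the criteria also list, is omitted here because how it is sampled is not specified.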

Dockerized Production Deployment P3

Fully containerized multi-service deployment using Docker Compose with GPU passthrough, health checks, and documented runbook for production launch.

docker-compose up brings all services (API, inference, Asterisk, Redis, PostgreSQL, dashboard) to healthy state
NVIDIA GPU is correctly passed through to the inference container via the NVIDIA Container Toolkit runtime
All services have Docker health checks with appropriate intervals and failure thresholds
System recovers automatically if inference server crashes (restart: unless-stopped policy)
Deployment documentation covers prerequisites, env var setup, and first-run validation steps
Cold start from docker-compose up to first successful voice session completes within 3 minutes

Build Log

scoping Starting AI-powered tech spec generation

scoping Tech spec generated successfully

start Build orchestration started for project 8

attempt Build attempt 1/3

generate Attempt 1 failed: AI generation failed: Unterminated string in JSON at position 49759 failed

retry Retrying (2/3)...

attempt Build attempt 2/3

generate Attempt 2 failed: AI generation failed: 429 Daily token limit reached (100,000 tokens). Resets at midnight UTC. failed

retry Retrying (3/3)...

attempt Build attempt 3/3

generate Attempt 3 failed: AI generation failed: 429 Daily token limit reached (100,000 tokens). Resets at midnight UTC. failed

complete Build failed after 3 attempts failed

status Project status updated to Build Failed

Deliverables

Deliverables become available once project reaches Review status.