Aziza AI Assistant

Status History

May 29, 04:16 PM Pending

May 31, 12:56 PM In Progress

Intake Form

budget 1000

timeline 1 week

description MASTER BUILD PROMPT: AZIZA REAL-TIME MULTILINGUAL VOICE AI PLATFORM WITH PERSONAPLEX

You are a senior distributed systems architect, AI infrastructure engineer, real-time voice systems engineer, GPU optimization engineer, and conversational AI orchestration engineer. Your task is to build a production-grade multilingual real-time AI voice platform named “Aziza”. The frontend already exists. You must build the ENTIRE backend, orchestration layer, streaming pipeline, inference infrastructure, PersonaPlex memory system, and production deployment architecture in the CORRECT IMPLEMENTATION ORDER.

The platform MUST prioritize: ultra-low latency, streaming-first architecture, concurrent session stability, GPU optimization, modular microservices, telephony support, persona persistence, multilingual conversations, long-term memory, production scalability, Dockerized deployment

DO NOT: start with fine-tuning, start with personas before infrastructure, embed persona logic inside inference runtime, build synchronous pipelines, create monolithic architecture

The system MUST be modular and production-grade.

PRIMARY OBJECTIVES

The system must support: real-time duplex AI voice conversations, multilingual streaming conversations, persistent personas, long-term conversational memory, frontend voice interaction, telephony-based AI calls, concurrent GPU-backed sessions, streaming token generation, low-latency orchestration, Redis-backed session routing, Dockerized deployment, horizontal scalability

REQUIRED TECH STACK

Frontend: Existing frontend already provided (Next.js, React, WebSocket client)

BACKEND STACK
Core Backend: NestJS, FastAPI, Python, Socket.IO, WebSocket, Redis, MongoDB, NGINX, Docker, Docker Compose

AI STACK
Runtime: Moshi, Hugging Face Transformers, PyTorch, CUDA, transformer-engine, vLLM or TGI
Fine-Tuning: PEFT, LoRA, QLoRA, bitsandbytes, datasets, accelerate

AUDIO / STREAMING STACK
WebRTC, FFmpeg, RTP, Asterisk ARI, SIP integration

GPU REQUIREMENTS
Target GPUs: L40S, A100, H100, RTX 4090 minimum
Must support: FP8 inference, CUDA streams, dynamic batching, concurrent streaming inference

COMPLETE SYSTEM ARCHITECTURE

Use this architecture EXACTLY:

Frontend
↓
NGINX Gateway
↓
NestJS API Gateway
↓
WebSocket Session Layer
↓
PersonaPlex Service
↓
AI Orchestrator
↓
Redis Queue Layer
↓
Moshi Streaming Runtime
↓
LLM Inference Runtime
↓
Audio Response Stream
↓
Frontend + Telephony Clients

CORE SERVICES

Build the following isolated services:

1. API Gateway Service
Responsibilities: authentication, session management, websocket handling, API routing, rate limiting, external client communication
Tech: NestJS, Socket.IO, JWT

2. PersonaPlex Service
Responsibilities: persona orchestration, memory persistence, prompt assembly, contextual recall, multilingual personality adaptation, emotional state tracking, session continuity
Tech: FastAPI, Redis, MongoDB

3. AI Orchestrator
Responsibilities: stream routing, session lifecycle, GPU scheduling, worker allocation, inference coordination, Redis pub/sub coordination
Tech: Python, Redis, asyncio

4. Moshi Streaming Runtime
Responsibilities: streaming inference, token generation, audio streaming, real-time response generation, async inference execution
Tech: PyTorch, CUDA, Hugging Face, Moshi

5. Redis Layer
Responsibilities: active session cache, pub/sub messaging, queue coordination, stream buffering, hot memory cache

6. MongoDB Layer
Responsibilities: long-term memory, conversation history, persona persistence, analytics, summarized memory storage

7. Telephony Service
Responsibilities: Asterisk integration, RTP bridging, SIP handling, call lifecycle management

IMPLEMENTATION ORDER (MANDATORY)

DO NOT skip phases. DO NOT start future phases before previous phases are validated.

PHASE 1: INFRASTRUCTURE FOUNDATION
Goal: Create stable GPU-enabled backend infrastructure.
Tasks:
Create Dockerized architecture
Configure Docker Compose
Configure NVIDIA Container Toolkit
Verify GPU visibility inside containers
Setup Redis
Setup MongoDB
Setup NGINX reverse proxy
Setup environment configuration
Setup centralized logging
Setup monitoring foundation
Deliverables: operational Docker environment, GPU-enabled containers, Redis operational, MongoDB operational, NGINX routing operational
Validation: containers restart safely, GPU accessible in containers, Redis persistence operational, Mongo persistence operational

PHASE 2: API GATEWAY
Goal: Build backend communication foundation.
Tasks:
Build NestJS API gateway
Add JWT authentication
Add refresh tokens
Add session management
Add rate limiting
Add websocket gateway
Add request validation
Add health endpoints
Add structured logging
Endpoints: /auth, /session, /stream/start, /stream/stop, /persona, /health
Deliverables: stable API gateway, websocket communication, authentication system
Validation: websocket stability, JWT auth operational, session tracking operational

PHASE 3: REAL-TIME STREAMING FOUNDATION
Goal: Establish real-time duplex audio streaming.
Tasks:
Build websocket audio pipeline
Build audio chunk streaming
Add FFmpeg transcoding
Add async audio queues
Add reconnect handling
Add interruption handling
Add stream synchronization
Add buffering controls
Requirements: async architecture, low jitter, low latency, non-blocking streaming
Deliverables: real-time audio streaming pipeline
Validation: no desync, stable streaming, low latency maintained

PHASE 4: MOSHI INFERENCE RUNTIME
Goal: Deploy streaming AI inference runtime.
Tasks:
Deploy Moshi runtime
Load Hugging Face models
Configure CUDA inference
Add token streaming
Add async generation
Add tokenizer management
Add KV-cache optimization
Add inference memory management
Requirements: streaming-first inference, async execution, token-level streaming
Deliverables: functional streaming inference runtime
Validation: stable token generation, no memory leaks, real-time inference operational

PHASE 5: GPU OPTIMIZATION
Goal: Scale concurrent AI sessions.
Tasks:
Enable FP8 support
Install transformer-engine
Add CUDA stream optimization
Add dynamic batching
Add GPU scheduler
Add memory pooling
Add inference queue balancing
Add latency monitoring
Performance Targets: 8–10 concurrent sessions, <120ms latency target
Deliverables: optimized GPU inference runtime
Validation: stable concurrent sessions, no CUDA OOM, no inference starvation

PHASE 6: PERSONAPLEX INTEGRATION
Goal: Build persistent persona orchestration and long-term memory layer.
PersonaPlex MUST exist as an isolated microservice.
DO NOT embed persona logic inside: frontend, Moshi inference runtime, websocket layer
Tasks:
Deploy PersonaPlex service
Add persona profile management
Add long-term memory persistence
Add Redis-backed hot memory cache
Add contextual prompt assembly
Add multilingual persona adaptation
Add emotional state tracking
Add session continuity
Add memory summarization workers
Add prompt injection middleware
PersonaPlex APIs:
POST /persona/load
POST /persona/switch
GET /persona/state
POST /memory/store
POST /memory/retrieve
POST /prompt/build
Requirements: async architecture, Redis-first retrieval, low-latency prompt assembly, streaming-safe context windows
Performance Target: <30ms prompt assembly overhead
Deliverables: persistent persona system, long-term memory system, contextual orchestration layer
Validation: persona consistency maintained, memory persists across sessions, multilingual memory operational, no inference blocking

PHASE 7: AI ORCHESTRATOR
Goal: Coordinate all runtime services.
Tasks:
Build orchestration service
Add session routing
Add stream lifecycle management
Add GPU worker allocation
Add Redis pub/sub coordination
Add inference scheduling
Add service discovery
Add failover handling
Deliverables: centralized orchestration system
Validation: sessions isolated correctly, stable routing, no cross-session leakage

PHASE 8: FRONTEND INTEGRATION [COMPLETED]
Goal: Connect backend to existing frontend.
Tasks:
[x] Connect websocket streams
[x] Connect auth system
[x] Connect streaming controls
[x] Add reconnect logic
[x] Add event synchronization
[x] Add session state syncing
Deliverables: [x] full frontend/backend communication
Validation: [x] stable streaming interaction, [x] frontend synchronization operational

PHASE 9: TELEPHONY INTEGRATION
Goal: Enable AI voice calls through telephony.
Tasks:
Integrate Asterisk ARI
Build RTP bridge
Add SIP session handling
Add telephony stream routing
Add reconnect handling
Add call lifecycle management
Add recording controls
Deliverables: telephony AI runtime
Validation: stable phone calls, low-latency RTP streaming, no telephony desync

PHASE 10: MONITORING & OBSERVABILITY
Goal: Production-grade monitoring stack.
Tasks:
Add Prometheus
Add Grafana
Add Loki
Add GPU monitoring
Add Redis monitoring
Add websocket monitoring
Add latency dashboards
Add error tracking
Monitor: GPU memory, latency, queue delays, websocket failures, CUDA errors, token throughput
Deliverables: observability stack
Validation: live dashboards operational, actionable alerts available

PHASE 11: MULTILINGUAL FINE-TUNING
ONLY START AFTER ALL PREVIOUS PHASES ARE STABLE.
Goal: Create multilingual conversational personas.
Languages: Russian, Uzbek Latin, Uzbek Cyrillic, English
Tasks:
Prepare datasets
Audit tokenizer
Extend tokenizer if necessary
Train LoRA adapters
Optimize inference compatibility
Validate streaming quality
Validate multilingual response quality
Requirements: maintain low latency, maintain streaming stability, small adapter sizes
Deliverables: multilingual fine-tuned adapters
Validation: natural multilingual responses, stable streaming maintained

REDIS MEMORY STRATEGY

Hot Cache
Store: active context, current conversation window, emotional state, persona modifiers, stream-safe summaries

Cold Storage
Persist: summarized memories, conversation history, user interaction history, persona evolution

PERSONAPLEX SESSION MODEL

{
  "session_id": "",
  "user_id": "",
  "persona_id": "",
  "language": "",
  "emotion_state": "",
  "conversation_summary": "",
  "active_context_window": [],
  "memory_refs": []
}

SECURITY REQUIREMENTS

Implement: JWT authentication, HTTPS, Redis authentication, Mongo authentication, WebSocket authentication, container isolation, environment-based secrets, internal API authentication
NEVER hardcode: API keys, Hugging Face tokens, database credentials
PersonaPlex MUST NOT be publicly exposed. Only API Gateway may communicate externally.

PERFORMANCE REQUIREMENTS

The platform MUST: support concurrent streaming sessions, support token-level streaming, avoid synchronous pipelines, avoid blocking inference, support async orchestration, support horizontal scaling
Use: asyncio, queue-based architecture, Redis pub/sub, GPU-aware scheduling, dynamic batching

DEPLOYMENT REQUIREMENTS

Everything MUST be: Dockerized, reproducible, production-ready, environment-configurable, horizontally scalable
Provide: Dockerfiles, docker-compose.yml, .env.example, deployment scripts, health checks

REQUIRED OUTPUT AFTER EACH PHASE

At the end of every phase provide: architecture summary, folder structure, source code, Docker configuration, deployment instructions, testing steps, validation checklist, known limitations, performance metrics
DO NOT continue to the next phase until the current phase is validated successfully.
Optimize for: stability, low latency, streaming reliability, scalability, production readiness, multilingual conversational quality
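
The PersonaPlex session model listed in the brief can be mirrored as a typed record that round-trips through JSON, which is the form it would take in a Redis value. This is a minimal sketch using only the standard library; the class name `PersonaSession` comes from the file listing, while the `to_json`/`from_json` helpers are assumptions added for illustration (the planned service uses Pydantic models instead).

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class PersonaSession:
    """Field names follow the PERSONAPLEX SESSION MODEL in the brief."""
    session_id: str = ""
    user_id: str = ""
    persona_id: str = ""
    language: str = ""
    emotion_state: str = ""
    conversation_summary: str = ""
    active_context_window: List[str] = field(default_factory=list)
    memory_refs: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        # Hypothetical helper: serialize for storage as a Redis string value.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "PersonaSession":
        # Hypothetical helper: rebuild the record from its stored JSON form.
        return cls(**json.loads(raw))
```

A session created with `PersonaSession(session_id="s1", language="uz-Latn")` should compare equal to the result of `PersonaSession.from_json(session.to_json())`.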

Technical Specification

Frontend

Next.js + React (existing) with WebSocket/Socket.IO client, WebRTC audio capture

Backend

NestJS API Gateway (Node.js) + FastAPI PersonaPlex Service (Python) + Python AI Orchestrator + Python Moshi Streaming Runtime + Python Telephony Service

Database

MongoDB (long-term memory, personas, conversation history) + Redis (hot session cache, pub/sub, stream buffering)

Hosting

Docker + Docker Compose on GPU-enabled bare-metal or cloud VM (L40S / A100 / H100 / RTX 4090), NGINX reverse proxy, NVIDIA Container Toolkit

Summary

Aziza is a GPU-accelerated, real-time multilingual AI voice platform built as a set of isolated microservices. An NGINX-fronted NestJS API Gateway handles all external WebSocket and REST traffic. Behind it, three internal services coordinate persona persistence, GPU scheduling, and streaming inference across concurrent sessions, with a latency target below 120 ms: PersonaPlex (FastAPI), the AI Orchestrator (Python asyncio), and the Moshi Streaming Runtime (PyTorch CUDA). A two-tier memory design, with Redis for hot session state and MongoDB for long-term storage, backs the PersonaPlex persona engine so that emotional state and language adaptation persist across sessions. Telephony clients connect through Asterisk ARI with RTP bridging. Everything is deployed via Docker Compose with Prometheus/Grafana/Loki observability, and optional LoRA fine-tuning targets Russian, Uzbek, and English quality.

File Structure

docker-compose.yml Orchestrates all services: nginx, nestjs-gateway, personaplex, ai-orchestrator, moshi-runtime, telephony, redis, mongodb, prometheus, grafana, loki

.env.example All environment variable templates: JWT secrets, Redis/Mongo credentials, HuggingFace tokens, GPU config, service URLs — never hardcoded

nginx/nginx.conf NGINX reverse proxy config: routes /api to NestJS gateway, /ws to websocket layer, /metrics to Prometheus, enforces HTTPS, rate limiting headers

nginx/Dockerfile NGINX container build with SSL cert mounting and custom config

services/gateway/Dockerfile NestJS API Gateway container build, Node.js 20 LTS base, production build step

services/gateway/package.json NestJS dependencies: @nestjs/core, @nestjs/websockets, socket.io, passport-jwt, @nestjs/throttler, class-validator, winston

services/gateway/src/main.ts NestJS bootstrap: enables CORS, Socket.IO adapter, global validation pipe, Helmet security, starts on port 3001

services/gateway/src/app.module.ts Root module importing AuthModule, SessionModule, StreamModule, PersonaModule, HealthModule, WebSocketGatewayModule

services/gateway/src/auth/auth.module.ts Auth module wiring: JwtModule, PassportModule, AuthService, AuthController, JwtStrategy, RefreshTokenStrategy

services/gateway/src/auth/auth.service.ts JWT issuance, refresh token rotation, bcrypt password hashing, token blacklist via Redis

services/gateway/src/auth/auth.controller.ts POST /auth/login, POST /auth/register, POST /auth/refresh, POST /auth/logout endpoints

services/gateway/src/auth/jwt.strategy.ts Passport JWT strategy validating Bearer tokens against Redis blacklist

services/gateway/src/session/session.module.ts Session module: SessionService, SessionController, Redis-backed session store

services/gateway/src/session/session.service.ts Creates, retrieves, updates, destroys sessions in Redis with TTL; maps session_id to user_id and persona_id

services/gateway/src/session/session.controller.ts GET /session/:id, POST /session/create, DELETE /session/:id with JWT guard

services/gateway/src/stream/stream.module.ts Stream module: StreamController for HTTP stream lifecycle endpoints

services/gateway/src/stream/stream.controller.ts POST /stream/start, POST /stream/stop — publishes events to Redis pub/sub for AI Orchestrator consumption

services/gateway/src/persona/persona.module.ts Persona proxy module: forwards requests to PersonaPlex internal service via HTTP

services/gateway/src/persona/persona.controller.ts GET /persona/:id, POST /persona/load, POST /persona/switch — proxied to PersonaPlex with internal auth header

services/gateway/src/websocket/voice.gateway.ts Socket.IO gateway: handles audio chunk events (audio:chunk, audio:start, audio:stop), authenticates via JWT handshake, routes chunks to Redis stream queue, emits AI audio responses back to client

services/gateway/src/websocket/websocket.module.ts WebSocket module wiring VoiceGateway with Redis pub/sub subscriber for AI response events

services/gateway/src/health/health.controller.ts GET /health — returns service status, Redis ping, MongoDB ping, upstream service reachability

services/gateway/src/common/redis.service.ts Shared Redis client wrapper using ioredis: pub/sub, get/set/del, stream operations, connection pooling

services/gateway/src/common/logger.service.ts Winston structured logger with JSON output, log levels from env, request correlation IDs

services/personaplex/Dockerfile FastAPI PersonaPlex container: Python 3.11 slim base, installs requirements, runs with uvicorn workers

services/personaplex/requirements.txt FastAPI, uvicorn, motor (async MongoDB), redis[asyncio], pydantic, httpx, python-jose, langchain-core (for summarization utilities)

services/personaplex/main.py FastAPI app factory: mounts persona router, memory router, prompt router, adds internal auth middleware, startup/shutdown Redis and MongoDB connections

services/personaplex/routers/persona.py POST /persona/load, POST /persona/switch, GET /persona/state — loads persona from MongoDB, caches in Redis, returns assembled persona profile

services/personaplex/routers/memory.py POST /memory/store (persist turn to MongoDB + update Redis context window), POST /memory/retrieve (Redis-first with MongoDB fallback, returns ranked memory refs)

services/personaplex/routers/prompt.py POST /prompt/build — assembles final system prompt from persona profile + emotional state + active context window + memory refs in <30ms, returns prompt string

services/personaplex/services/persona_service.py Business logic: load persona by ID from MongoDB, apply language adaptation, track emotional state transitions, persist updates

services/personaplex/services/memory_service.py Hot/cold memory strategy: writes to Redis active_context_window (capped sliding window), async writes to MongoDB, triggers summarization worker when window exceeds threshold

services/personaplex/services/prompt_service.py Prompt assembly engine: injects persona system prompt, emotional modifiers, language instructions, summarized memory, recent context window into final prompt dict
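
The prompt assembly step described for prompt_service.py can be sketched as a pure function that joins the persona prompt, emotional modifier, language instruction, and summary into one system string and keeps only the most recent turns. The function name `build_prompt`, its parameters, and the wording of the injected sentences are illustrative assumptions, not the service's actual API.

```python
from typing import Dict, List


def build_prompt(
    persona_prompt: str,
    emotion_state: str,
    language: str,
    summary: str,
    recent_turns: List[Dict[str, str]],
    max_turns: int = 6,
) -> Dict[str, object]:
    """Assemble a prompt dict; empty inputs are skipped rather than injected."""
    system_parts = [persona_prompt.strip()]
    if emotion_state:
        system_parts.append(f"Current emotional tone: {emotion_state}.")
    if language:
        system_parts.append(f"Respond in: {language}.")
    if summary:
        system_parts.append(f"Summary of earlier conversation: {summary}")
    # Guard max_turns <= 0, because a [-0:] slice would return every turn.
    window = recent_turns[-max_turns:] if max_turns > 0 else []
    return {"system": "\n".join(system_parts), "messages": window}
```

Because the function only does string joins and a list slice on data already held in memory, its cost should be small next to the Redis reads that feed it; the 30 ms budget would still need to be measured end to end.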

services/personaplex/workers/summarization_worker.py Async background worker: polls Redis for sessions needing summarization, calls LLM summarization endpoint, stores summary to MongoDB, updates memory_refs in session

services/personaplex/models/persona.py Pydantic models: PersonaProfile, EmotionState, LanguageConfig, PersonaSession matching the PersonaPlex session model schema

services/personaplex/models/memory.py Pydantic models: MemoryEntry, ContextWindow, MemoryRef, SummarizedMemory

services/personaplex/middleware/internal_auth.py FastAPI middleware validating X-Internal-Token header on all requests — PersonaPlex is never publicly exposed
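
The header check behind internal_auth.py can be sketched as a small function that compares the supplied X-Internal-Token against the expected secret in constant time. The helper names `is_internal_request` and `status_for` are hypothetical; a real FastAPI middleware would wrap the same comparison and read the secret from the environment.

```python
import hmac
from typing import Mapping


def is_internal_request(headers: Mapping[str, str], expected_token: str) -> bool:
    """Return True only when X-Internal-Token matches the configured secret."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    supplied = normalized.get("x-internal-token", "")
    if not supplied or not expected_token:
        return False  # a missing header or unset secret never authenticates
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(supplied.encode(), expected_token.encode())


def status_for(headers: Mapping[str, str], expected_token: str) -> int:
    """Map the check to the status codes named in the acceptance criteria."""
    return 200 if is_internal_request(headers, expected_token) else 403
```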

services/personaplex/db/mongo.py Motor async MongoDB client: connection pool, collections for personas, memories, conversation_history, analytics

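services/personaplex/db/redis_client.py redis.asyncio async client (the successor to the standalone aioredis package, matching the redis dependency in requirements.txt): session hot cache, pub/sub, context window storage with TTL management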

services/ai-orchestrator/Dockerfile Python 3.11 orchestrator container with redis and httpx dependencies (asyncio ships with the standard library)

services/ai-orchestrator/requirements.txt redis[asyncio], httpx, pydantic, structlog, prometheus-client

services/ai-orchestrator/main.py Orchestrator entrypoint: starts Redis pub/sub listeners, GPU worker pool, session router, inference scheduler event loops

services/ai-orchestrator/orchestrator/session_router.py Routes incoming session stream requests to available GPU workers; maintains session-to-worker mapping in Redis; handles worker failover

services/ai-orchestrator/orchestrator/gpu_scheduler.py GPU worker pool manager: tracks worker availability, GPU memory headroom, assigns sessions to least-loaded workers, implements backpressure

services/ai-orchestrator/orchestrator/stream_lifecycle.py Manages full stream lifecycle: start → PersonaPlex prompt build → Moshi inference → audio stream → stop; handles interruptions and reconnects

services/ai-orchestrator/orchestrator/pubsub_coordinator.py Redis pub/sub coordinator: subscribes to session:start, session:stop, audio:chunk channels; publishes to inference:queue and audio:response channels

services/ai-orchestrator/orchestrator/service_discovery.py Internal service registry: resolves Moshi runtime, PersonaPlex, telephony service URLs from env/Redis; health-checks upstreams

services/ai-orchestrator/orchestrator/failover.py Failover handler: detects worker crashes via heartbeat, reassigns sessions, publishes reconnect events to gateway

services/moshi-runtime/Dockerfile GPU-enabled container: FROM nvcr.io/nvidia/pytorch:24.01-py3, installs moshi, transformers, vllm/TGI, transformer-engine, bitsandbytes, CUDA toolkit

services/moshi-runtime/requirements.txt torch, moshi, transformers, accelerate, bitsandbytes, transformer-engine, vllm, redis[asyncio], numpy, soundfile

services/moshi-runtime/main.py Moshi runtime entrypoint: loads model to GPU, starts Redis queue consumer, starts async inference loop, starts audio stream publisher

services/moshi-runtime/runtime/model_loader.py Loads Moshi model from HuggingFace Hub with auth token, configures FP8/BF16 precision, moves to CUDA device, initializes KV-cache

services/moshi-runtime/runtime/inference_engine.py Async streaming inference: consumes audio chunks from Redis queue, runs Moshi forward pass with CUDA streams, yields token stream, publishes audio response chunks to Redis

services/moshi-runtime/runtime/audio_processor.py FFmpeg-backed audio transcoding: converts incoming WebRTC audio to model input format, converts model output to client-compatible format (opus/pcm)

services/moshi-runtime/runtime/kv_cache_manager.py KV-cache lifecycle: allocates per-session cache slots, evicts on session end, implements memory pooling to avoid CUDA OOM

services/moshi-runtime/runtime/cuda_stream_manager.py CUDA stream pool: assigns dedicated CUDA streams per session, enables concurrent inference without serialization

services/moshi-runtime/runtime/dynamic_batcher.py Dynamic batching: collects inference requests within a configurable time window, batches compatible requests, dispatches to GPU, unbatches responses

services/moshi-runtime/runtime/fp8_optimizer.py transformer-engine FP8 quantization wrapper: wraps model layers for FP8 forward pass on GPUs with FP8 tensor cores (H100, L40S); A100 has no FP8 support and stays on BF16/FP16

services/moshi-runtime/runtime/memory_pool.py GPU memory pool: pre-allocates tensor buffers for expected concurrent sessions, recycles buffers on session end to avoid fragmentation

services/moshi-runtime/runtime/latency_monitor.py Per-request latency tracking: measures time-to-first-token, total generation time, publishes metrics to Prometheus pushgateway
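
The p50/p95/p99 figures that the latency monitor reports can be computed from raw samples with a nearest-rank percentile. This sketch covers only the arithmetic; the function names are assumptions, and publishing to Prometheus is left out.

```python
from typing import Dict, List


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("no samples recorded")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, which avoids float rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize_latency(samples_ms: List[float]) -> Dict[str, float]:
    """Summarize time-to-first-token samples (milliseconds) as p50/p95/p99."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}
```

In production a Prometheus histogram would usually replace the raw sample list, since quantiles can then be derived at query time without keeping every measurement in memory.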

services/telephony/Dockerfile Python 3.11 telephony service container with Asterisk ARI client, RTP libraries

services/telephony/requirements.txt an ARI-capable Asterisk client library (note that panoramisk speaks AMI and FastAGI, not ARI), redis[asyncio], httpx, pydantic, structlog

services/telephony/main.py Telephony service entrypoint: connects to Asterisk ARI websocket, starts call event listener, starts RTP bridge manager

services/telephony/telephony/ari_client.py Asterisk ARI client: handles StasisStart/StasisEnd events, answers calls, bridges channels, controls call lifecycle

services/telephony/telephony/rtp_bridge.py RTP bridge: receives RTP audio from Asterisk, converts to PCM chunks, publishes to Redis audio queue; subscribes to AI audio responses, sends RTP back to Asterisk

services/telephony/telephony/sip_handler.py SIP session management: maps SIP call IDs to Aziza session IDs, creates sessions in gateway via internal API, handles call routing

services/telephony/telephony/call_lifecycle.py Call lifecycle manager: start call → create session → stream audio → end call → cleanup session; handles reconnects and call drops

services/telephony/telephony/recording.py Optional call recording: streams audio to file or object storage, manages recording lifecycle per call

monitoring/prometheus/prometheus.yml Prometheus scrape config: targets for NestJS metrics, FastAPI metrics, Moshi runtime metrics, Redis exporter, MongoDB exporter, NVIDIA GPU exporter

monitoring/grafana/dashboards/aziza-overview.json Grafana dashboard: GPU memory utilization, concurrent sessions, token throughput, WebSocket connections, latency percentiles, error rates

monitoring/grafana/dashboards/gpu-performance.json GPU-specific dashboard: per-GPU memory, CUDA utilization, FP8 throughput, KV-cache hit rate, OOM events

monitoring/grafana/dashboards/session-health.json Session health dashboard: active sessions, session creation rate, session errors, persona load times, prompt assembly latency

monitoring/loki/loki-config.yml Loki log aggregation config: ingests structured JSON logs from all services via Promtail, retention policies

monitoring/promtail/promtail-config.yml Promtail config: scrapes Docker container logs, adds service labels, ships to Loki

monitoring/alertmanager/alerts.yml Alert rules: GPU OOM, latency >200ms, session failure rate >5%, Redis connection loss, MongoDB write failures

fine-tuning/Dockerfile Fine-tuning container: GPU-enabled, installs PEFT (which provides LoRA and QLoRA), bitsandbytes, datasets, accelerate, transformers

fine-tuning/requirements.txt peft, transformers, datasets, accelerate, bitsandbytes, torch, sentencepiece, evaluate

fine-tuning/scripts/prepare_dataset.py Dataset preparation: loads Russian, Uzbek Latin, Uzbek Cyrillic, English conversation datasets, formats for instruction tuning, splits train/eval

fine-tuning/scripts/audit_tokenizer.py Tokenizer audit: checks coverage for all 4 languages, identifies missing tokens, generates extension vocabulary if needed

fine-tuning/scripts/train_lora.py LoRA/QLoRA training script: loads base model, applies PEFT LoRA config, trains on multilingual dataset, saves adapter weights

fine-tuning/scripts/validate_streaming.py Post-training validation: loads adapter, runs streaming inference, measures latency impact, validates multilingual response quality

fine-tuning/configs/lora_config.json LoRA hyperparameters: rank, alpha, target modules, dropout, task type — tuned for minimal latency impact

fine-tuning/configs/training_args.json Training arguments: batch size, gradient accumulation, learning rate, warmup, FP16/BF16, output dir

scripts/deploy.sh Production deployment script: pulls latest images, runs docker-compose up with GPU flags, waits for health checks, runs smoke tests

scripts/health_check.sh Validates all services are healthy: curls /health endpoints, checks Redis ping, MongoDB connection, GPU visibility

scripts/gpu_check.sh Verifies NVIDIA Container Toolkit setup, GPU visibility in containers, CUDA version compatibility

scripts/seed_personas.py Seeds MongoDB with initial persona profiles for testing: multilingual personas with emotional state configs

shared/schemas/session.schema.json Shared PersonaPlex session model JSON schema used across services for validation

shared/schemas/persona.schema.json Shared persona profile schema: name, language, emotional_range, system_prompt_template, memory_config

shared/proto/aziza.proto Optional gRPC proto definitions for high-performance internal service communication (Orchestrator ↔ Moshi runtime)

Features (12)

Phase 1: Dockerized GPU Infrastructure Foundation P1

Establishes the complete containerized infrastructure with GPU support, Redis, MongoDB, and NGINX as the stable foundation for all subsequent services.

docker-compose up brings all infrastructure containers online without errors
nvidia-smi is accessible inside GPU-enabled containers confirming CUDA visibility
Redis responds to PING with PONG and persists data across container restarts
MongoDB accepts authenticated connections and persists data across restarts
NGINX routes /api and /ws traffic correctly to upstream services
All containers have health checks defined and pass within 30 seconds of startup
Environment variables are loaded from .env file with no hardcoded secrets
Centralized logging aggregates container logs to Loki via Promtail

Phase 2: NestJS API Gateway with JWT Auth and WebSocket P2

Builds the sole external-facing backend service handling authentication, session management, rate limiting, and WebSocket audio event routing.

POST /auth/login returns signed JWT access token and refresh token
POST /auth/refresh rotates refresh token and issues new access token
Expired or blacklisted tokens return 401 on all protected endpoints
WebSocket connections require valid JWT in handshake query or header
Rate limiting blocks requests exceeding configured thresholds per IP
GET /health returns 200 with Redis and MongoDB connectivity status
POST /session/create returns a unique session_id stored in Redis
Socket.IO audio:chunk events are received and published to Redis stream queue
Structured JSON logs include correlation IDs for every request

Phase 3: Real-Time Duplex Audio Streaming Pipeline P3

Establishes the end-to-end async audio streaming pipeline from WebSocket client through FFmpeg transcoding to Redis queues and back.

Audio chunks received via WebSocket are published to Redis stream within 10ms
FFmpeg transcodes incoming WebRTC audio (opus) to model-compatible PCM format without blocking
Audio response chunks are streamed back to client as they are generated, not buffered
Client disconnection triggers graceful stream cleanup without orphaned Redis entries
Reconnecting client resumes session state from Redis within 500ms
Interruption signal (user speaks while AI responds) halts current generation within 50ms
Stream jitter remains below 20ms under normal network conditions
Async queue depth stays below 50 chunks under sustained load
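
The interruption criterion above (user speech halting the current AI response) can be sketched with asyncio task cancellation. Everything here is a stand-in: `generate_audio` fakes the response stream, and the `interrupt` event represents whatever signal the gateway would raise on barge-in. The sketch shows the control flow only and says nothing about meeting the 50 ms target.

```python
import asyncio
from typing import Optional


async def generate_audio(out: asyncio.Queue, n_chunks: int) -> None:
    """Stand-in for the AI response stream: emits one chunk every 10 ms."""
    for i in range(n_chunks):
        await out.put(f"chunk-{i}")
        await asyncio.sleep(0.01)


async def run_with_barge_in(interrupt: asyncio.Event, n_chunks: int) -> str:
    """Run generation until it finishes or the interrupt event fires."""
    out: asyncio.Queue = asyncio.Queue()
    gen = asyncio.create_task(generate_audio(out, n_chunks))
    stop = asyncio.create_task(interrupt.wait())
    done, _ = await asyncio.wait({gen, stop}, return_when=asyncio.FIRST_COMPLETED)
    if gen in done:
        stop.cancel()  # generation finished first; drop the watcher
        return "completed"
    gen.cancel()  # user spoke: stop producing audio immediately
    try:
        await gen
    except asyncio.CancelledError:
        pass
    return "interrupted"


async def demo(n_chunks: int, interrupt_after_s: Optional[float] = None) -> str:
    interrupt = asyncio.Event()
    if interrupt_after_s is not None:
        asyncio.get_running_loop().call_later(interrupt_after_s, interrupt.set)
    return await run_with_barge_in(interrupt, n_chunks)
```

Running `asyncio.run(demo(1000, 0.03))` exercises the interrupted path, while `asyncio.run(demo(2))` lets the short response finish on its own.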

Phase 4: Moshi Streaming Inference Runtime P4

Deploys the GPU-backed Moshi model with async token streaming, KV-cache management, and Redis queue integration for real-time AI voice generation.

Moshi model loads successfully from HuggingFace Hub using authenticated token
Model is loaded to a CUDA device, confirmed by checking that its parameters report a cuda device (torch.cuda.is_available() alone only shows that a GPU is visible)
First token is generated within 120ms of receiving audio input chunk
Tokens are streamed to Redis audio:response channel as they are generated
KV-cache is allocated per session and released on session end
No GPU memory leaks detected after 100 consecutive session cycles
Inference runs fully async without blocking the event loop
Model handles empty or malformed audio input gracefully without crashing
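
The per-session KV-cache criterion (allocate on session start, release on session end, no leaks over repeated cycles) can be sketched as a fixed pool of slot indices. The class `KVCacheSlots` is hypothetical and tracks bookkeeping only; the real manager would map each slot to preallocated GPU tensors.

```python
from typing import Dict, List, Optional


class KVCacheSlots:
    """Fixed pool of cache slots, one per active session."""

    def __init__(self, num_slots: int) -> None:
        self._free: List[int] = list(range(num_slots))
        self._by_session: Dict[str, int] = {}

    def acquire(self, session_id: str) -> Optional[int]:
        """Return the session's slot, or None when the pool is exhausted."""
        if session_id in self._by_session:
            return self._by_session[session_id]  # idempotent for reconnects
        if not self._free:
            return None  # caller should reject or queue the session
        slot = self._free.pop()
        self._by_session[session_id] = slot
        return slot

    def release(self, session_id: str) -> None:
        """Return the slot to the pool; unknown sessions are ignored."""
        slot = self._by_session.pop(session_id, None)
        if slot is not None:
            self._free.append(slot)

    @property
    def free_slots(self) -> int:
        return len(self._free)
```

Returning None on exhaustion, instead of raising, lets the scheduler decide whether to reject the session or queue it, which matches the requirement to refuse new sessions when headroom is insufficient.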

Phase 5: GPU Optimization for Concurrent Sessions P5

Optimizes the inference runtime for FP8 precision, CUDA stream parallelism, dynamic batching, and memory pooling to support 8-10 concurrent sessions at sub-120ms latency.

FP8 inference enabled via transformer-engine on H100/L40S GPUs with measurable throughput improvement
8 concurrent streaming sessions run simultaneously without CUDA OOM errors
Per-session CUDA streams prevent serialization between concurrent inference calls
Dynamic batcher groups compatible requests within 5ms window before GPU dispatch
GPU memory pool pre-allocates buffers for max_sessions and recycles without fragmentation
Time-to-first-token remains below 120ms at 8 concurrent sessions
Latency monitor publishes p50/p95/p99 metrics to Prometheus every 10 seconds
GPU scheduler rejects new sessions when memory headroom is insufficient rather than crashing
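
The dynamic batching criterion (group requests that arrive within a short window before GPU dispatch) can be sketched with an asyncio queue and a deadline. The function name `collect_batch` and the `max_batch` limit are assumptions; the 5 ms window comes from the criterion above, and the GPU dispatch itself is out of scope here.

```python
import asyncio
from typing import Any, List, Tuple


async def collect_batch(queue: asyncio.Queue, max_batch: int, window_s: float) -> List[Any]:
    """Wait for one request, then gather more until the window closes or the batch fills."""
    batch = [await queue.get()]
    loop = asyncio.get_running_loop()
    deadline = loop.time() + window_s
    while len(batch) < max_batch:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break  # window elapsed with no further requests
    return batch


async def demo() -> Tuple[List[Any], List[Any]]:
    queue: asyncio.Queue = asyncio.Queue()
    for request_id in range(5):
        queue.put_nowait(request_id)
    first = await collect_batch(queue, max_batch=4, window_s=0.005)
    second = await collect_batch(queue, max_batch=4, window_s=0.005)
    return first, second
```

With five queued requests and `max_batch=4`, the first call returns a full batch at once and the second returns the single leftover request after the window expires. The window trades a few milliseconds of added latency for better GPU utilization, so it has to stay well inside the 120 ms budget.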

Phase 6: PersonaPlex Microservice — Persona and Memory Orchestration P6

Deploys the isolated PersonaPlex FastAPI service providing persona profile management, Redis-backed hot memory, MongoDB cold storage, multilingual adaptation, emotional state tracking, and sub-30ms prompt assembly.

POST /persona/load retrieves persona from MongoDB, caches in Redis, returns within 50ms
POST /persona/switch atomically updates session persona in Redis and MongoDB
GET /persona/state returns current emotional state and active context window from Redis
POST /memory/store writes conversation turn to Redis context window and async to MongoDB
POST /memory/retrieve returns ranked memory refs from Redis-first with MongoDB fallback
POST /prompt/build assembles complete system prompt in under 30ms
Summarization worker triggers when context window exceeds 20 turns and stores summary to MongoDB
Persona state persists correctly across session reconnects
Multilingual persona correctly adapts system prompt language based on session language field
PersonaPlex rejects with 403 any request that lacks a valid X-Internal-Token header
Emotional state transitions are tracked and stored per session turn
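
As a rough sketch of what /prompt/build and the 20-turn summarization trigger might do, assuming the persona is already cached in memory: the Persona fields, build_prompt, and needs_summarization are hypothetical, and the prompt layout is illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class Persona:
    name: str
    style: str
    # Hypothetical per-language instruction lines, keyed by language code.
    language_instructions: dict = field(default_factory=dict)


def build_prompt(persona, language, emotion_state, summary,
                 context_window, max_turns=20):
    """Assemble a system prompt from persona, language, emotion, and memory."""
    lang_line = persona.language_instructions.get(
        language, f"Respond in the language with code '{language}'.")
    lines = [
        f"You are {persona.name}. Style: {persona.style}.",
        lang_line,
        f"Current emotional state: {emotion_state}.",
    ]
    if summary:
        lines.append(f"Conversation summary: {summary}")
    # Only the most recent turns are inlined; older ones live in the summary.
    for turn in context_window[-max_turns:]:
        lines.append(f"{turn['role']}: {turn['text']}")
    return "\n".join(lines)


def needs_summarization(context_window, threshold=20):
    """True once the window exceeds the 20-turn threshold from the criteria."""
    return len(context_window) > threshold
```

Pure string assembly over cached data like this involves no I/O, which is what makes a sub-30ms budget plausible; the Redis and MongoDB reads would have to happen before this step.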

Phase 7: AI Orchestrator — Centralized Session and GPU Coordination P7

Builds the Python asyncio orchestration service that coordinates session routing, GPU worker allocation, stream lifecycle, Redis pub/sub, and failover across all runtime services.

Orchestrator subscribes to session:start Redis channel and routes to available GPU worker within 20ms
Session-to-worker mapping stored in Redis prevents cross-session data leakage
GPU scheduler assigns sessions to workers with sufficient memory headroom
Worker heartbeat failure triggers session reassignment within 5 seconds
Stream lifecycle correctly sequences: prompt build → inference start → audio stream → session end
Redis pub/sub coordinator handles 50 concurrent channel subscriptions without message loss
Service discovery resolves all upstream URLs from environment and validates health on startup
Orchestrator exposes /health endpoint confirming all downstream service connectivity
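
The headroom-based assignment and heartbeat failover above might look like the sketch below. It keeps state in a plain dict rather than Redis, and the Worker fields, session_mem_mb, and the pick-the-most-free-memory policy are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    worker_id: str
    free_mem_mb: int
    last_heartbeat: float
    sessions: set = field(default_factory=set)


class Scheduler:
    def __init__(self, session_mem_mb=2048, heartbeat_timeout_s=5.0):
        self.session_mem_mb = session_mem_mb          # assumed per-session cost
        self.heartbeat_timeout_s = heartbeat_timeout_s
        self.workers = {}                             # worker_id -> Worker

    def assign(self, session_id, now):
        """Return a worker id, or None when no live worker has headroom."""
        live = [w for w in self.workers.values()
                if now - w.last_heartbeat <= self.heartbeat_timeout_s
                and w.free_mem_mb >= self.session_mem_mb]
        if not live:
            return None  # reject the session instead of risking an OOM
        worker = max(live, key=lambda w: w.free_mem_mb)
        worker.sessions.add(session_id)
        worker.free_mem_mb -= self.session_mem_mb
        return worker.worker_id

    def reap(self, now):
        """Drop workers with stale heartbeats and reassign their sessions."""
        moved = {}
        for worker in list(self.workers.values()):
            if now - worker.last_heartbeat > self.heartbeat_timeout_s:
                del self.workers[worker.worker_id]
                for session_id in sorted(worker.sessions):
                    moved[session_id] = self.assign(session_id, now)
        return moved
```

Calling reap on a short interval (for example once per second) keeps reassignment close to the 5-second target; a None value in the returned mapping means the session could not be placed.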

Phase 9: Telephony Integration via Asterisk ARI P8

Enables AI voice conversations over traditional telephony by integrating Asterisk ARI for SIP call handling and RTP audio bridging to the Aziza streaming pipeline.

Asterisk ARI WebSocket connection established and StasisStart events received on incoming calls
Incoming SIP call creates a corresponding Aziza session via internal API
RTP audio from caller is converted to PCM and published to Redis audio queue
AI audio responses are received from Redis and sent back via RTP to caller
Call end (StasisEnd) triggers graceful session cleanup
RTP round-trip latency stays below 200ms under normal conditions
Call recording optionally captures both sides of conversation to file
Telephony service reconnects to Asterisk ARI automatically after connection loss
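
The RTP-to-PCM step can be sketched as below, assuming callers negotiate G.711 mu-law (RTP payload type 0), which this spec does not state. parse_rtp and ulaw_to_pcm16 are hypothetical helper names; the header layout follows RFC 3550 and the decode follows the standard G.711 expansion.

```python
import struct


def parse_rtp(packet: bytes):
    """Return (payload_type, sequence, timestamp, payload) from an RTP packet."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the fixed 12-byte header")
    b0, b1, seq, ts, _ssrc = struct.unpack("!BBHII", packet[:12])
    if b0 >> 6 != 2:
        raise ValueError("unsupported RTP version")
    offset = 12 + 4 * (b0 & 0x0F)          # skip CSRC identifiers
    if b0 & 0x10:                          # header extension present
        if len(packet) < offset + 4:
            raise ValueError("truncated RTP header extension")
        ext_words = struct.unpack("!H", packet[offset + 2:offset + 4])[0]
        offset += 4 + 4 * ext_words
    end = len(packet)
    if b0 & 0x20:                          # padding: last byte holds its length
        end -= packet[-1]
    if offset > end:
        raise ValueError("malformed RTP packet")
    return b1 & 0x7F, seq, ts, packet[offset:end]


def _ulaw_sample(u: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~u & 0xFF
    magnitude = (((u & 0x0F) << 3) + 0x84) << ((u >> 4) & 0x07)
    magnitude -= 0x84
    return -magnitude if u & 0x80 else magnitude


def ulaw_to_pcm16(payload: bytes) -> bytes:
    """Decode a mu-law payload to little-endian 16-bit PCM."""
    samples = [_ulaw_sample(b) for b in payload]
    return struct.pack(f"<{len(samples)}h", *samples)
```

The decoded bytes are what would be published to the Redis audio queue; an A-law or Opus leg would need a different decoder.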

Phase 10: Monitoring and Observability Stack P9

Deploys Prometheus, Grafana, Loki, and alerting to provide full production observability across GPU performance, session health, latency, and error rates.

Prometheus scrapes metrics from all services every 15 seconds without gaps
Grafana overview dashboard shows GPU memory, concurrent sessions, token throughput, and WebSocket connections in real time
GPU performance dashboard shows per-GPU CUDA utilization, FP8 throughput, and KV-cache hit rate
Loki receives structured JSON logs from all containers with correct service labels
Alert fires within 2 minutes when GPU memory exceeds 90% utilization
Alert fires when p95 latency exceeds 200ms for 5 consecutive minutes
Alert fires when session error rate exceeds 5% over a 1-minute window
NVIDIA GPU exporter exposes DCGM metrics for L40S/A100/H100 GPU health
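
The arithmetic behind the session error-rate alert can be illustrated with a sliding window. In this stack the check would normally live in a Prometheus alerting rule; the ErrorRateWindow class below is a hypothetical sketch of the condition itself, not of how Prometheus evaluates it.

```python
from collections import deque


class ErrorRateWindow:
    """Track session outcomes and report when the error rate exceeds a threshold."""

    def __init__(self, window_s=60.0, threshold=0.05):
        self.window_s = window_s      # 1-minute window from the criteria
        self.threshold = threshold    # 5% error rate from the criteria
        self.events = deque()         # (timestamp, is_error) in arrival order

    def record(self, ts, is_error):
        self.events.append((ts, is_error))

    def firing(self, now):
        # Drop events that have aged out of the window.
        while self.events and self.events[0][0] <= now - self.window_s:
            self.events.popleft()
        if not self.events:
            return False
        errors = sum(1 for _, is_error in self.events if is_error)
        return errors / len(self.events) > self.threshold
```

With 6 errors in 100 sessions the rate is 6%, so the alert fires; once those errors age out of the window it clears.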

Phase 11: Multilingual Fine-Tuning with LoRA Adapters P10

Creates language-specific LoRA adapters for Russian, Uzbek Latin, Uzbek Cyrillic, and English that integrate with the streaming runtime without degrading latency.

Dataset preparation script produces formatted train/eval splits for all 4 languages
Tokenizer audit confirms adequate coverage for Uzbek Cyrillic and Latin scripts
LoRA adapter training completes without OOM on the target GPU, falling back to QLoRA if needed
Adapter weights are under 500MB per language for practical deployment
Loading LoRA adapter adds less than 15ms to model initialization time
Streaming inference with adapter maintains time-to-first-token below 150ms
Multilingual response quality validated by native speaker evaluation on 50-turn test set
Adapter can be hot-swapped per session based on PersonaPlex language field
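
Whether an adapter fits the 500MB budget can be estimated before training. For a linear layer with input width d_in and output width d_out, a rank-r LoRA adapter adds r * (d_in + d_out) parameters. The helper below is a sketch; the layer shapes, rank, and 2-byte (FP16/BF16) storage in the example are illustrative assumptions, not the actual Moshi configuration.

```python
def lora_adapter_bytes(layer_shapes, rank, bytes_per_param=2):
    """Estimate adapter size: each adapted (d_in, d_out) layer adds
    rank * (d_in + d_out) parameters (matrices A and B)."""
    params = sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes)
    return params * bytes_per_param


# Hypothetical model: 32 layers, two adapted 4096x4096 projections per layer.
example_shapes = [(4096, 4096)] * 64
example_size = lora_adapter_bytes(example_shapes, rank=16)
```

For this hypothetical configuration the estimate is 16 MiB per language, far below the limit, which leaves room for higher ranks or more adapted projections.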

JWT Authentication and WebSocket Security P1

Ensures all external communication is authenticated and all internal services are isolated from public access.

All REST endpoints except /auth/login and /auth/register require valid JWT
WebSocket handshake validates JWT before accepting connection
Refresh token rotation invalidates previous refresh token on use
Internal services (PersonaPlex, Orchestrator, Moshi) are not exposed via NGINX
X-Internal-Token header required for all inter-service communication
Redis and MongoDB require authentication credentials from environment variables
HTTPS enforced at NGINX layer with TLS termination
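
Refresh token rotation, as required above, can be sketched in a few lines. The gateway itself is specified as NestJS; Python is used here only to illustrate the logic, and RefreshTokenStore is a hypothetical in-memory stand-in for whatever persistent store the gateway uses.

```python
import secrets


class RefreshTokenStore:
    """Single-use refresh tokens: rotating a token invalidates it."""

    def __init__(self):
        self._active = {}  # token -> user_id

    def issue(self, user_id):
        token = secrets.token_urlsafe(32)
        self._active[token] = user_id
        return token

    def rotate(self, token):
        # pop() removes the old token, so a replayed token is rejected.
        user_id = self._active.pop(token, None)
        if user_id is None:
            raise PermissionError("refresh token is invalid or already used")
        return self.issue(user_id)
```

A production store would typically keep only a hash of each token and attach an expiry, neither of which is shown here.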

PersonaPlex Session Model and Hot/Cold Memory Strategy P2

Implements the defined session schema and two-tier memory architecture ensuring sub-30ms context retrieval for active sessions.

Session model matches defined schema: session_id, user_id, persona_id, language, emotion_state, conversation_summary, active_context_window, memory_refs
Active context window stored in Redis with configurable TTL and sliding window eviction
Emotional state updated per turn and reflected in next prompt assembly
Summarized memories stored in MongoDB and referenced via memory_refs array
Redis-first retrieval returns context in under 5ms for active sessions
MongoDB fallback retrieves cold memory in under 100ms
Memory persists correctly when session reconnects after disconnect
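
The session schema and sliding-window eviction above could be modeled as follows. This is an in-memory sketch: the field names come from the schema, while the default values, the 20-turn window size, and the add_turn helper are assumptions. In the real service the window would live in Redis with a TTL.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Session:
    session_id: str
    user_id: str
    persona_id: str
    language: str
    emotion_state: str = "neutral"
    conversation_summary: str = ""
    # Assumed window size of 20 turns; deque drops the oldest turn when full.
    active_context_window: deque = field(
        default_factory=lambda: deque(maxlen=20))
    memory_refs: list = field(default_factory=list)

    def add_turn(self, role, text, emotion_state=None):
        """Append a turn; return the evicted turn (if any) for summarization."""
        window = self.active_context_window
        evicted = window[0] if len(window) == window.maxlen else None
        window.append({"role": role, "text": text})
        if emotion_state is not None:
            self.emotion_state = emotion_state
        return evicted
```

Returning the evicted turn gives the summarization worker a natural hook: anything that falls out of the hot window can be folded into conversation_summary and referenced from memory_refs.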

Build Log

scoping Starting AI-powered tech spec generation

scoping Tech spec generated successfully

start Build orchestration started for project 2

attempt Build attempt 1/3

generate Attempt 1 failed: AI generation failed: Unterminated string in JSON at position 45824 failed

retry Retrying (2/3)...

attempt Build attempt 2/3

generate Attempt 2 failed: AI generation failed: 429 Daily token limit reached (100,000 tokens). Resets at midnight UTC. failed

retry Retrying (3/3)...

attempt Build attempt 3/3

generate Attempt 3 failed: AI generation failed: 429 Daily token limit reached (100,000 tokens). Resets at midnight UTC. failed

complete Build failed after 3 attempts failed

status Project status updated to Build Failed

Deliverables

Deliverables become available once project reaches Review status.