docker-compose.yml
Orchestrates all services: nginx, nestjs-gateway, personaplex, ai-orchestrator, moshi-runtime, telephony, redis, mongodb, prometheus, grafana, loki
.env.example
All environment variable templates: JWT secrets, Redis/Mongo credentials, HuggingFace tokens, GPU config, service URLs — never hardcoded
nginx/nginx.conf
NGINX reverse proxy config: routes /api to the NestJS gateway, /ws to the WebSocket layer, and /metrics to Prometheus; enforces HTTPS and sets rate-limiting headers
nginx/Dockerfile
NGINX container build with SSL cert mounting and custom config
services/gateway/Dockerfile
NestJS API Gateway container build, Node.js 20 LTS base, production build step
services/gateway/package.json
NestJS dependencies: @nestjs/core, @nestjs/websockets, socket.io, passport-jwt, @nestjs/throttler, class-validator, winston
services/gateway/src/main.ts
NestJS bootstrap: enables CORS, Socket.IO adapter, global validation pipe, Helmet security, starts on port 3001
services/gateway/src/app.module.ts
Root module importing AuthModule, SessionModule, StreamModule, PersonaModule, HealthModule, WebSocketGatewayModule
services/gateway/src/auth/auth.module.ts
Auth module wiring: JwtModule, PassportModule, AuthService, AuthController, JwtStrategy, RefreshTokenStrategy
services/gateway/src/auth/auth.service.ts
JWT issuance, refresh token rotation, bcrypt password hashing, token blacklist via Redis
services/gateway/src/auth/auth.controller.ts
POST /auth/login, POST /auth/register, POST /auth/refresh, POST /auth/logout endpoints
services/gateway/src/auth/jwt.strategy.ts
Passport JWT strategy validating Bearer tokens against Redis blacklist
services/gateway/src/session/session.module.ts
Session module: SessionService, SessionController, Redis-backed session store
services/gateway/src/session/session.service.ts
Creates, retrieves, updates, destroys sessions in Redis with TTL; maps session_id to user_id and persona_id
services/gateway/src/session/session.controller.ts
GET /session/:id, POST /session/create, DELETE /session/:id with JWT guard
services/gateway/src/stream/stream.module.ts
Stream module: StreamController for HTTP stream lifecycle endpoints
services/gateway/src/stream/stream.controller.ts
POST /stream/start, POST /stream/stop — publishes events to Redis pub/sub for AI Orchestrator consumption
services/gateway/src/persona/persona.module.ts
Persona proxy module: forwards requests to PersonaPlex internal service via HTTP
services/gateway/src/persona/persona.controller.ts
GET /persona/:id, POST /persona/load, POST /persona/switch — proxied to PersonaPlex with internal auth header
services/gateway/src/websocket/voice.gateway.ts
Socket.IO gateway: handles audio chunk events (audio:chunk, audio:start, audio:stop), authenticates via JWT handshake, routes chunks to Redis stream queue, emits AI audio responses back to client
services/gateway/src/websocket/websocket.module.ts
WebSocket module wiring VoiceGateway with Redis pub/sub subscriber for AI response events
services/gateway/src/health/health.controller.ts
GET /health — returns service status, Redis ping, MongoDB ping, upstream service reachability
services/gateway/src/common/redis.service.ts
Shared Redis client wrapper using ioredis: pub/sub, get/set/del, stream operations, connection pooling
services/gateway/src/common/logger.service.ts
Winston structured logger with JSON output, log levels from env, request correlation IDs
services/personaplex/Dockerfile
FastAPI PersonaPlex container: Python 3.11 slim base, installs requirements, runs with uvicorn workers
services/personaplex/requirements.txt
FastAPI, uvicorn, motor (async MongoDB), redis[asyncio], pydantic, httpx, python-jose, langchain-core (for summarization utilities)
services/personaplex/main.py
FastAPI app factory: mounts persona router, memory router, prompt router, adds internal auth middleware, startup/shutdown Redis and MongoDB connections
services/personaplex/routers/persona.py
POST /persona/load, POST /persona/switch, GET /persona/state — loads persona from MongoDB, caches in Redis, returns assembled persona profile
services/personaplex/routers/memory.py
POST /memory/store (persist turn to MongoDB + update Redis context window), POST /memory/retrieve (Redis-first with MongoDB fallback, returns ranked memory refs)
services/personaplex/routers/prompt.py
POST /prompt/build: assembles the final system prompt from persona profile + emotional state + active context window + memory refs, with a target assembly latency under 30 ms; returns the prompt string
services/personaplex/services/persona_service.py
Business logic: load persona by ID from MongoDB, apply language adaptation, track emotional state transitions, persist updates
services/personaplex/services/memory_service.py
Hot/cold memory strategy: writes to Redis active_context_window (capped sliding window), async writes to MongoDB, triggers summarization worker when window exceeds threshold
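The capped sliding window described for memory_service.py could look roughly like the sketch below. It is an in-memory stand-in only: the class name ContextWindow, the max_turns parameter, and the idea of returning evicted turns to the caller are assumptions for illustration, and the Redis and MongoDB writes are left out.

```python
class ContextWindow:
    """Hypothetical in-memory stand-in for the Redis active_context_window."""

    def __init__(self, max_turns: int = 4):
        self.max_turns = max_turns
        self.turns: list[str] = []

    def add_turn(self, turn: str) -> list[str]:
        """Append a turn and return the evicted turns that need summarizing."""
        self.turns.append(turn)
        overflow = len(self.turns) - self.max_turns
        if overflow <= 0:
            return []  # window still within its cap, nothing to summarize
        evicted = self.turns[:overflow]
        self.turns = self.turns[overflow:]
        return evicted
```

A non-empty return value is the point at which a real service would hand the evicted turns to the summarization worker.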
services/personaplex/services/prompt_service.py
Prompt assembly engine: injects persona system prompt, emotional modifiers, language instructions, summarized memory, recent context window into final prompt dict
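As a rough illustration of the assembly order listed for prompt_service.py, the sketch below joins the sections in a fixed sequence. The PromptInputs fields, the section wording, and the choice to return a plain string are assumptions, not the actual PersonaPlex schema.

```python
from dataclasses import dataclass, field


@dataclass
class PromptInputs:
    # Field names are illustrative assumptions, not the real session model.
    system_prompt: str
    emotion: str = "neutral"
    language: str = "en"
    summary: str = ""
    recent_turns: list[str] = field(default_factory=list)


def build_prompt(inputs: PromptInputs) -> str:
    """Join the non-empty prompt sections in a fixed order."""
    sections = [
        inputs.system_prompt,
        f"Current emotional tone: {inputs.emotion}.",
        f"Respond in language: {inputs.language}.",
    ]
    if inputs.summary:
        sections.append(f"Summary of earlier conversation: {inputs.summary}")
    if inputs.recent_turns:
        sections.append("Recent turns:\n" + "\n".join(inputs.recent_turns))
    return "\n\n".join(sections)
```

Keeping assembly to string joins over already-loaded data is one way to stay inside a tight latency budget, since no I/O happens in this step.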
services/personaplex/workers/summarization_worker.py
Async background worker: polls Redis for sessions needing summarization, calls LLM summarization endpoint, stores summary to MongoDB, updates memory_refs in session
services/personaplex/models/persona.py
Pydantic models: PersonaProfile, EmotionState, LanguageConfig, PersonaSession matching the PersonaPlex session model schema
services/personaplex/models/memory.py
Pydantic models: MemoryEntry, ContextWindow, MemoryRef, SummarizedMemory
services/personaplex/middleware/internal_auth.py
FastAPI middleware validating X-Internal-Token header on all requests — PersonaPlex is never publicly exposed
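The header check at the core of such a middleware might be sketched as below, using only the standard library. The function name is_internal_request and the way the expected token is passed in are assumptions; wiring it into FastAPI middleware is not shown.

```python
import hmac


def is_internal_request(headers: dict[str, str], expected_token: str) -> bool:
    """Return True only if X-Internal-Token matches the expected secret."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    supplied = normalised.get("x-internal-token", "")
    if not expected_token:
        return False  # refuse everything if no secret is configured
    # compare_digest avoids leaking the token through timing differences.
    return hmac.compare_digest(supplied.encode(), expected_token.encode())
```

Rejecting all requests when the secret is empty guards against a misconfigured deployment silently accepting an empty header.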
services/personaplex/db/mongo.py
Motor async MongoDB client: connection pool, collections for personas, memories, conversation_history, analytics
services/personaplex/db/redis_client.py
redis.asyncio async client (from redis-py, matching requirements.txt): session hot cache, pub/sub, context window storage with TTL management
services/ai-orchestrator/Dockerfile
Python 3.11 container for the asyncio-based orchestrator, with redis and httpx dependencies
services/ai-orchestrator/requirements.txt
redis[asyncio], httpx, pydantic, structlog, prometheus-client
services/ai-orchestrator/main.py
Orchestrator entrypoint: starts Redis pub/sub listeners, GPU worker pool, session router, inference scheduler event loops
services/ai-orchestrator/orchestrator/session_router.py
Routes incoming session stream requests to available GPU workers; maintains session-to-worker mapping in Redis; handles worker failover
services/ai-orchestrator/orchestrator/gpu_scheduler.py
GPU worker pool manager: tracks worker availability, GPU memory headroom, assigns sessions to least-loaded workers, implements backpressure
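Least-loaded assignment with backpressure could be sketched as follows. The Worker fields, the per-worker session cap, and the tie-break on free memory are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Worker:
    # Hypothetical worker record; the real scheduler state may differ.
    worker_id: str
    free_memory_mb: int
    active_sessions: int = 0
    max_sessions: int = 4


def pick_worker(workers: list[Worker], required_mb: int) -> Optional[Worker]:
    """Pick the least-loaded worker with enough headroom, or None."""
    candidates = [
        w for w in workers
        if w.active_sessions < w.max_sessions and w.free_memory_mb >= required_mb
    ]
    if not candidates:
        return None  # backpressure: the caller should queue or reject
    # Fewest sessions first; more free memory breaks ties.
    return min(candidates, key=lambda w: (w.active_sessions, -w.free_memory_mb))
```

Returning None instead of raising lets the session router decide whether to queue the request or signal the gateway to retry.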
services/ai-orchestrator/orchestrator/stream_lifecycle.py
Manages full stream lifecycle: start → PersonaPlex prompt build → Moshi inference → audio stream → stop; handles interruptions and reconnects
services/ai-orchestrator/orchestrator/pubsub_coordinator.py
Redis pub/sub coordinator: subscribes to session:start, session:stop, audio:chunk channels; publishes to inference:queue and audio:response channels
services/ai-orchestrator/orchestrator/service_discovery.py
Internal service registry: resolves Moshi runtime, PersonaPlex, telephony service URLs from env/Redis; health-checks upstreams
services/ai-orchestrator/orchestrator/failover.py
Failover handler: detects worker crashes via heartbeat, reassigns sessions, publishes reconnect events to gateway
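Heartbeat-based crash detection and session reassignment might be sketched like this. The 5-second timeout, the function names, and the round-robin reassignment policy are assumptions; publishing reconnect events is omitted.

```python
def find_dead_workers(
    last_heartbeat: dict[str, float], now: float, timeout_s: float = 5.0
) -> list[str]:
    """Return the workers whose last heartbeat is older than the timeout."""
    return sorted(w for w, ts in last_heartbeat.items() if now - ts > timeout_s)


def reassign_sessions(
    session_to_worker: dict[str, str], dead: list[str], healthy: list[str]
) -> dict[str, str]:
    """Map each session on a dead worker to a healthy one, round-robin."""
    if not healthy:
        return {}  # nothing to fail over to
    orphaned = sorted(s for s, w in session_to_worker.items() if w in dead)
    return {s: healthy[i % len(healthy)] for i, s in enumerate(orphaned)}
```

Passing the current time in as an argument keeps the detection logic deterministic and easy to test.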
services/moshi-runtime/Dockerfile
GPU-enabled container: FROM nvcr.io/nvidia/pytorch:24.01-py3, installs moshi, transformers, vllm/TGI, transformer-engine, bitsandbytes, CUDA toolkit
services/moshi-runtime/requirements.txt
torch, moshi, transformers, accelerate, bitsandbytes, transformer-engine, vllm, redis[asyncio], numpy, soundfile
services/moshi-runtime/main.py
Moshi runtime entrypoint: loads model to GPU, starts Redis queue consumer, starts async inference loop, starts audio stream publisher
services/moshi-runtime/runtime/model_loader.py
Loads Moshi model from HuggingFace Hub with auth token, configures FP8/BF16 precision, moves to CUDA device, initializes KV-cache
services/moshi-runtime/runtime/inference_engine.py
Async streaming inference: consumes audio chunks from Redis queue, runs Moshi forward pass with CUDA streams, yields token stream, publishes audio response chunks to Redis
services/moshi-runtime/runtime/audio_processor.py
FFmpeg-backed audio transcoding: converts incoming WebRTC audio to model input format, converts model output to client-compatible format (opus/pcm)
services/moshi-runtime/runtime/kv_cache_manager.py
KV-cache lifecycle: allocates per-session cache slots, evicts on session end, implements memory pooling to avoid CUDA OOM
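The per-session slot bookkeeping could be sketched without any GPU code, as below. SlotPool and its methods are hypothetical names; a real manager would pair each slot index with a pre-allocated cache tensor.

```python
from typing import Optional


class SlotPool:
    """Hypothetical fixed-size pool of KV-cache slot indices."""

    def __init__(self, num_slots: int):
        self._free = list(range(num_slots))
        self._owner: dict[str, int] = {}

    def acquire(self, session_id: str) -> Optional[int]:
        """Return the session's slot, allocating one if needed; None if full."""
        if session_id in self._owner:
            return self._owner[session_id]
        if not self._free:
            return None  # pool exhausted: refuse rather than risk a CUDA OOM
        slot = self._free.pop()
        self._owner[session_id] = slot
        return slot

    def release(self, session_id: str) -> None:
        """Return the session's slot to the pool on session end."""
        slot = self._owner.pop(session_id, None)
        if slot is not None:
            self._free.append(slot)
```

Because the slot count is fixed up front, memory use is bounded by construction instead of growing with the number of sessions.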
services/moshi-runtime/runtime/cuda_stream_manager.py
CUDA stream pool: assigns dedicated CUDA streams per session, enables concurrent inference without serialization
services/moshi-runtime/runtime/dynamic_batcher.py
Dynamic batching: collects inference requests within a configurable time window, batches compatible requests, dispatches to GPU, unbatches responses
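The time-window collection step might look like the asyncio sketch below. The function name collect_batch and its parameters are assumptions, and the compatibility check, GPU dispatch, and unbatching are left out.

```python
import asyncio


async def collect_batch(queue: asyncio.Queue, max_batch: int, window_s: float) -> list:
    """Wait for one request, then gather more until the window closes or the batch fills."""
    batch = [await queue.get()]
    loop = asyncio.get_running_loop()
    deadline = loop.time() + window_s
    while len(batch) < max_batch:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break  # window elapsed with no further requests
    return batch
```

The window only starts once the first request arrives, so an idle runtime does not spin on empty batches.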
services/moshi-runtime/runtime/fp8_optimizer.py
transformer-engine FP8 quantization wrapper: wraps model layers for FP8 forward pass on FP8-capable GPUs (H100, L40S); A100 has no FP8 tensor cores and stays on BF16
services/moshi-runtime/runtime/memory_pool.py
GPU memory pool: pre-allocates tensor buffers for expected concurrent sessions, recycles buffers on session end to avoid fragmentation
services/moshi-runtime/runtime/latency_monitor.py
Per-request latency tracking: measures time-to-first-token, total generation time, publishes metrics to Prometheus pushgateway
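Time-to-first-token and total generation time could be captured with a small timer like the one below. RequestTimer and the injectable clock are assumptions for illustration; pushing the values to Prometheus is not shown.

```python
import time


class RequestTimer:
    """Hypothetical per-request timer for TTFT and total generation time."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.start = clock()
        self.first_token_at = None

    def on_token(self) -> None:
        # Only the first token matters for time-to-first-token.
        if self.first_token_at is None:
            self.first_token_at = self._clock()

    def finish(self) -> dict:
        end = self._clock()
        ttft = None if self.first_token_at is None else self.first_token_at - self.start
        return {"ttft_s": ttft, "total_s": end - self.start}
```

The clock is passed in so that tests can substitute a deterministic sequence of timestamps.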
services/telephony/Dockerfile
Python 3.11 telephony service container with Asterisk ARI client, RTP libraries
services/telephony/requirements.txt
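panoramisk (an Asterisk AMI client; ARI itself is reached over HTTP and WebSocket, so it needs an HTTP client such as httpx plus a WebSocket client), redis[asyncio], httpx, pydantic, structlog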
services/telephony/main.py
Telephony service entrypoint: connects to Asterisk ARI websocket, starts call event listener, starts RTP bridge manager
services/telephony/telephony/ari_client.py
Asterisk ARI client: handles StasisStart/StasisEnd events, answers calls, bridges channels, controls call lifecycle
services/telephony/telephony/rtp_bridge.py
RTP bridge: receives RTP audio from Asterisk, converts to PCM chunks, publishes to Redis audio queue; subscribes to AI audio responses, sends RTP back to Asterisk
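Before audio can be converted to PCM chunks, the bridge has to strip the RTP header from each packet. A minimal parser for the standard 12-byte fixed header might look like this; the function name parse_rtp and the returned dict shape are assumptions, and codec decoding is not shown.

```python
import struct


def parse_rtp(packet: bytes) -> dict:
    """Split an RTP packet into header fields and payload bytes."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    if b0 >> 6 != 2:
        raise ValueError("unsupported RTP version")
    offset = 12 + 4 * (b0 & 0x0F)  # skip the CSRC list, 4 bytes per entry
    if b0 & 0x10:  # extension header present
        if len(packet) < offset + 4:
            raise ValueError("truncated RTP extension header")
        (ext_words,) = struct.unpack("!H", packet[offset + 2:offset + 4])
        offset += 4 + 4 * ext_words
    end = len(packet)
    if b0 & 0x20:  # padding flag: last byte holds the padding length
        end -= packet[-1]
    return {
        "payload_type": b1 & 0x7F,
        "marker": bool(b1 & 0x80),
        "sequence": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "payload": packet[offset:end],
    }
```

The sequence number and timestamp are kept in the result because a bridge typically needs them to reorder packets and detect gaps.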
services/telephony/telephony/sip_handler.py
SIP session management: maps SIP call IDs to Aziza session IDs, creates sessions in gateway via internal API, handles call routing
services/telephony/telephony/call_lifecycle.py
Call lifecycle manager: start call → create session → stream audio → end call → cleanup session; handles reconnects and call drops
services/telephony/telephony/recording.py
Optional call recording: streams audio to file or object storage, manages recording lifecycle per call
monitoring/prometheus/prometheus.yml
Prometheus scrape config: targets for NestJS metrics, FastAPI metrics, Moshi runtime metrics, Redis exporter, MongoDB exporter, NVIDIA GPU exporter
monitoring/grafana/dashboards/aziza-overview.json
Grafana dashboard: GPU memory utilization, concurrent sessions, token throughput, WebSocket connections, latency percentiles, error rates
monitoring/grafana/dashboards/gpu-performance.json
GPU-specific dashboard: per-GPU memory, CUDA utilization, FP8 throughput, KV-cache hit rate, OOM events
monitoring/grafana/dashboards/session-health.json
Session health dashboard: active sessions, session creation rate, session errors, persona load times, prompt assembly latency
monitoring/loki/loki-config.yml
Loki log aggregation config: ingests structured JSON logs from all services via Promtail, retention policies
monitoring/promtail/promtail-config.yml
Promtail config: scrapes Docker container logs, adds service labels, ships to Loki
monitoring/alertmanager/alerts.yml
Prometheus alerting rules (notifications routed through Alertmanager): GPU OOM, latency > 200 ms, session failure rate > 5%, Redis connection loss, MongoDB write failures
fine-tuning/Dockerfile
Fine-tuning container: GPU-enabled, installs peft (provides LoRA; QLoRA via bitsandbytes quantization), bitsandbytes, datasets, accelerate, transformers
fine-tuning/requirements.txt
peft, transformers, datasets, accelerate, bitsandbytes, torch, sentencepiece, evaluate
fine-tuning/scripts/prepare_dataset.py
Dataset preparation: loads Russian, Uzbek Latin, Uzbek Cyrillic, English conversation datasets, formats for instruction tuning, splits train/eval
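The formatting and train/eval split steps might be sketched as below. The record keys (instruction, output, language), the 10% eval fraction, and the fixed seed are assumptions; loading the actual datasets is not shown.

```python
import random


def to_instruction_example(user_text: str, assistant_text: str, language: str) -> dict:
    """Format one conversation turn pair as an instruction-tuning record."""
    return {"instruction": user_text, "output": assistant_text, "language": language}


def split_train_eval(examples: list, eval_fraction: float = 0.1, seed: int = 42):
    """Shuffle deterministically and split into (train, eval) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, round(len(shuffled) * eval_fraction)) if shuffled else 0
    return shuffled[n_eval:], shuffled[:n_eval]
```

A fixed seed keeps the eval set stable across runs, which matters when comparing adapters trained at different times.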
fine-tuning/scripts/audit_tokenizer.py
Tokenizer audit: checks coverage for all 4 languages, identifies missing tokens, generates extension vocabulary if needed
fine-tuning/scripts/train_lora.py
LoRA/QLoRA training script: loads base model, applies PEFT LoRA config, trains on multilingual dataset, saves adapter weights
fine-tuning/scripts/validate_streaming.py
Post-training validation: loads adapter, runs streaming inference, measures latency impact, validates multilingual response quality
fine-tuning/configs/lora_config.json
LoRA hyperparameters: rank, alpha, target modules, dropout, task type — tuned for minimal latency impact
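For orientation, the hyperparameters named above correspond to fields of PEFT's LoraConfig. The snippet below writes one possible lora_config.json; every value is a placeholder assumption, and the target module names in particular depend on the base model's layer naming.

```python
import json

# Placeholder values for illustration only; none are tuned for this project.
lora_config = {
    "r": 16,                                 # adapter rank
    "lora_alpha": 32,                        # scaling factor
    "target_modules": ["q_proj", "v_proj"],  # assumed attention projection names
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
}

lora_config_json = json.dumps(lora_config, indent=2)
```

A smaller rank adds fewer parameters to each adapted layer, which is the usual lever when an unmerged adapter must not slow down inference.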
fine-tuning/configs/training_args.json
Training arguments: batch size, gradient accumulation, learning rate, warmup, FP16/BF16, output dir
scripts/deploy.sh
Production deployment script: pulls latest images, runs docker-compose up with GPU flags, waits for health checks, runs smoke tests
scripts/health_check.sh
Validates all services are healthy: curls /health endpoints, checks Redis ping, MongoDB connection, GPU visibility
scripts/gpu_check.sh
Verifies NVIDIA Container Toolkit setup, GPU visibility in containers, CUDA version compatibility
scripts/seed_personas.py
Seeds MongoDB with initial persona profiles for testing: multilingual personas with emotional state configs
shared/schemas/session.schema.json
Shared PersonaPlex session model JSON schema used across services for validation
shared/schemas/persona.schema.json
Shared persona profile schema: name, language, emotional_range, system_prompt_template, memory_config
shared/proto/aziza.proto
Optional gRPC proto definitions for high-performance internal service communication (Orchestrator ↔ Moshi runtime)