An execution-ready engineering breakdown of the AI Security & Governance Platform — every tradeoff named, every failure mode mapped, every shortcut flagged.
Zorianto Astral is a real-time AI interaction governance platform that wraps an organization's entire AI surface — browsers, API endpoints, desktop processes, agentic pipelines, and MCP server connections — with an inline enforcement layer that can block, redact, or modify interactions in under 50ms. In parallel, it runs asynchronous compliance gap analysis against seven regulatory frameworks (HIPAA, GDPR, SOC 2, PCI-DSS, EU AI Act, NIST AI RMF, and internal AI Governance). It is not a SIEM add-on or a passive log aggregator: it is an active enforcement proxy that intercepts traffic and applies policy before damage occurs, combined with a governance layer for agentic AI workflows and machine identities at a 100:1 NHI-to-human ratio.
The PRD states the L3 AI Judge runs in <100ms AND that inline enforcement completes in <50ms. These targets are contradictory: calling an LLM API (even Haiku/Gemini Flash) for every interaction takes 200–800ms on a warm connection. Mitigation: L3 must be triggered asynchronously — the inline path uses only L1+L2 (<15ms), with L3 escalating to a HOLD state for high-confidence heuristic hits. This is a fundamental architecture change from what the PRD implies.
OQ-04 asks shared vs dedicated. This is not a product question — it's an architectural fork. Shared infra with RLS and tenant namespacing takes 6 weeks to get right. Dedicated per-customer means 10x infra costs at low customer counts. Mitigation: MVP ships shared with strict Postgres Row-Level Security. Dedicated is an enterprise tier add-on at Phase 3+. Decision must be locked before week 2.
A5 assumes "ARMOR framework integration for unauthorized fine-tuning detection is available as a dependency." ARMOR is not a widely known open framework as of April 2026. This is likely an internal or partner dependency. If unavailable, the fine-tuning detection feature (Luxion L4) is blocked. Mitigation: Document this as a P2 feature with a clear go/no-go gate at week 20.
Auto-classifying systems as Prohibited/High-Risk/Limited/Minimal per Annex III requires legal interpretation, not just ML. A model misclassifying a High-Risk system as Minimal creates legal liability. Mitigation: Stellix auto-classification is an advisory score + human confirmation workflow, never automated gate-keeping without human sign-off.
Spectral audio analysis of a 30-second clip + behavioral baseline comparison is a 1–4 second operation with GPU, longer without. <500ms requires dedicated GPU inference infrastructure or aggressive audio chunking. Mitigation: L4 is never in the inline path. It's always async post-hoc analysis triggered by L2/L3 signals. Wire fraud authorization should go through a separate multi-factor CHALLENGE flow regardless.
Vigil (agent governance) — cut because it requires a separate proxy infrastructure, agent registration model, and kill-switch mechanism. Adding this to the MVP doubles scope and delays the sellable core by 6 weeks.
Sentinel NHI — cut because NHI behavioral anomaly detection requires 30+ days of baseline data before it produces signal. Shipping it empty is confusing.
L3 AI Judge + L4 Deepfake — cut because the async inference service is a separate platform. Ship L1+L2 with 90%+ coverage of common threats.
Blast Radius Simulator — cut because it requires agent topology data that doesn't exist until Vigil ships.
EU AI Act countdown + all 7 frameworks — reduce to HIPAA + GDPR in MVP; the clock-based EU feature is powerful in demos but requires Stellix inventory to be populated first.
| Layer | Choice | Why | Rejected Alternatives |
|---|---|---|---|
| Enforcement Service | Go 1.22+ | Goroutine-per-connection model handles 10k RPS with sub-10ms routing overhead. net/http with HTTP/2 + context cancellation fits the <50ms pipeline perfectly. Compiled binary with no JVM warmup or GC pauses. | Rust: better performance ceiling but 2x development time; team ramp-up is a real cost. Node.js: event loop blocks during DLP regex at high concurrency; GC pressure becomes significant at 10k RPS. |
| Policy Engine | OPA (Rego) | Sub-millisecond policy evaluation; battle-tested in Kubernetes admission controllers; native bundle compilation; supports "policy as code" git workflows. The partial evaluation feature enables policy simulation cheaply. | Custom rule engine: faster initially, but becomes a maintenance nightmare at 48+ active policies. Cedar (AWS): newer, less community tooling. |
| Primary DB | PostgreSQL 16 + TimescaleDB | TimescaleDB's hypertable chunking gives O(log n) time-series queries on audit events without changing the SQL interface. Row-Level Security handles multi-tenant isolation at the DB layer. JSONB columns for flexible event metadata. | DynamoDB: no ad-hoc queries for compliance reports; expensive at audit log volumes. ClickHouse: move analytics here at Phase 3 when query patterns stabilize (see Oraxis). |
| Session State | Redis Cluster 7.x | Sub-millisecond counter increments for rate limiting; SETEX for session TTLs; pub/sub for real-time alert streaming. Redis Cluster avoids single-node SPOF. Persistence with AOF for crash recovery. | DragonflyDB: API compatible but less production-proven at enterprise scale. Memcached: no pub/sub, no persistence. |
| Event Pipeline | Apache Kafka (MSK) | At 10k RPS sustained, Kafka's 100k+ msg/sec throughput with 7-day retention gives replay for forensics. Topic partitioning by tenant_id enables isolation. SQS FIFO throughput caps at ~3,000 msg/sec, and standard queues offer no ordering or replay. | SQS: PRD specifies it but the throughput/replay profile is insufficient. Kinesis: viable alternative, less community tooling for consumer groups. |
| Frontend | Next.js 14 (App Router) | Server Components for compliance report rendering without client hydration. API routes co-located for BFF pattern. Built-in ISR for slowly-changing compliance scores. Vercel or self-hosted on ECS. | Remix: better for highly interactive UIs but compliance dashboards are read-heavy. SPA React: no SSR for SEO/initial load. |
| Auth | Cognito + custom JWT | Cognito handles SAML/SSO for enterprise (required for >500 employee orgs). Custom JWT service for API keys with scoped claims and per-key rotation tracking. Cognito User Pools give MFA without building it. | Auth0: higher cost at scale, vendor lock-in. Keycloak: requires ops overhead. DIY: never for a security product. |
| ML Inference | Python FastAPI + Triton | Separate Python service for L3 AI Judge (async only) and L4 deepfake models. Triton Inference Server for GPU model serving with batching. Decoupled from Go enforcement service — ML failures don't affect inline path. | Inline in Go: CGo bindings for Python ML libraries are fragile and hard to deploy. Lambda: cold start latency incompatible with any latency target. |
| Orchestration | EKS (Kubernetes) | ECS Fargate is fine at low scale but lacks the pod autoscaling granularity needed for burst traffic at the enforcement gateway. EKS Karpenter for node auto-provisioning; HPA on CPU + request latency custom metrics. | ECS Fargate: PRD specifies it, but at 10k RPS target you hit Fargate task scaling lag (>60s). Kubernetes HPA reacts in <15s. |
Verdict: Modular Monolith → Selective Microservices. Start with a modular monolith (single deployable, domain-separated packages) for Phases 0–2. Extract to separate services only at proven pain points, not ahead of time.
Enforcement Gateway: extracted as a separate service from Day 1. Reason: must be independently scalable to 10k RPS, independently deployable (no big-bang redeploy for policy updates), and isolated so a bug in Chronix can't affect inline enforcement uptime.
ML Inference Service: extracted from Day 1. Reason: the Python/GPU runtime environment is incompatible with the Go enforcement binary. GPU node pools are expensive — co-locate all ML workloads in one service.
Extract Oraxis onto ClickHouse when audit log exceeds 50M rows/month. OLAP queries for blast radius and cost attribution will kill the PostgreSQL primary if not separated.
Chronix, Nexion, Sentinel, Stellix, Vigil, and the Dashboard live in a single Go monolith until team size or throughput demands extraction. Premature microservices create distributed systems overhead without the team to manage it.
| Metric | Target | Strategy |
|---|---|---|
| Throughput | 10k RPS | Horizontal EKS pod scaling; Redis rate limiting per tenant |
| P95 Latency | <50ms | L1+L2 only inline; OPA bundle cached in-process; Redis local replica |
| Audit Events | ~864M/day at 10k RPS | Kafka buffering; TimescaleDB chunk compression; S3 tiering after 90 days |
| NHI Inventory | 10k identities | 4-hour scan interval; background worker pool; Redis cache for posture scores |
| Agent Sessions | 1k concurrent | Vigil Proxy stateful connections via Redis sorted sets; kill-switch via pub/sub |
99.9% = 8.77 hours downtime/year. For an inline enforcement path, this means every outage lets unscanned AI traffic through. The fail-open/fail-closed config is therefore not a convenience feature — it's a core SLA definition. Production should default fail-open for low-risk, fail-closed for high-risk (PHI tenants). Document this contractually.
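A minimal sketch of how that contractual default could be encoded in the gateway. The `TenantRisk` tiers, the `override` parameter, and the function name are illustrative assumptions, not PRD terms:

```go
package main

import "fmt"

// FailureMode is what the gateway does when the enforcement path itself is down.
type FailureMode int

const (
	FailOpen   FailureMode = iota // let traffic through unscanned
	FailClosed                    // block traffic until enforcement recovers
)

// TenantRisk is a hypothetical per-tenant risk tier; PHI-handling tenants
// would be provisioned as HighRisk.
type TenantRisk int

const (
	LowRisk TenantRisk = iota
	HighRisk
)

// defaultFailureMode encodes the contractual default: fail-open for low-risk
// tenants, fail-closed for high-risk (PHI) tenants. An explicit per-tenant
// override takes precedence over the default.
func defaultFailureMode(risk TenantRisk, override *FailureMode) FailureMode {
	if override != nil {
		return *override
	}
	if risk == HighRisk {
		return FailClosed
	}
	return FailOpen
}

func main() {
	forced := FailClosed
	fmt.Println(defaultFailureMode(LowRisk, nil) == FailOpen)      // true
	fmt.Println(defaultFailureMode(HighRisk, nil) == FailClosed)   // true
	fmt.Println(defaultFailureMode(LowRisk, &forced) == FailClosed) // override wins
}
```

Whatever shape this takes, the chosen default per tenant belongs in the contract, not buried in a config file.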
Blast radius: All inline enforcement stops. Browser extensions fall through to allow-mode. API gateway returns 200 without DLP scan.
Detection: Heartbeat alert within 15 seconds; CloudWatch alarm on 5xx rate.
Recovery: EKS pod restart SLA: <45 seconds. Multi-AZ deployment ensures AZ failure doesn't cause full outage.
LLM API calls (even flash/haiku) average 200–600ms on a warm connection. The PRD's stated "<50ms inline enforcement" and "<100ms L3" are physically contradictory. The inline path uses L1 (signature, <5ms) + L2 (heuristic, <15ms) + OPA policy evaluation (<1ms). L3 is triggered only when L2 confidence is 0.65–0.94 (ambiguous zone), and the interaction enters HOLD state pending async L3 resolution — like a credit card soft decline requiring additional verification.
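The routing rule above can be sketched as a pure function. The 0.65–0.94 HOLD zone is from this section; treating scores of 0.95 and above as an inline BLOCK is an assumption consistent with that upper bound, and all names are illustrative:

```go
package main

import "fmt"

type Verdict string

const (
	Allow Verdict = "ALLOW" // L1+L2 clean: pass inline
	Block Verdict = "BLOCK" // high-confidence hit: block inline
	Hold  Verdict = "HOLD"  // ambiguous zone: park pending async L3 resolution
)

// routeL2 maps the L2 heuristic confidence score to an inline verdict.
// Only the 0.65–0.94 ambiguous zone ever waits on the L3 AI Judge, so the
// LLM call stays off the inline path entirely.
func routeL2(confidence float64) Verdict {
	switch {
	case confidence >= 0.95:
		return Block
	case confidence >= 0.65:
		return Hold
	default:
		return Allow
	}
}

func main() {
	for _, c := range []float64{0.20, 0.70, 0.97} {
		fmt.Printf("confidence %.2f -> %s\n", c, routeL2(c))
	}
}
```

The useful property is that the inline decision is a branch on a precomputed score, not a network call; L3 only ever resolves interactions already parked in HOLD.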
L3 in inline path: Adds 200–600ms to every interaction. Unacceptable. Users would bypass the system. L3 via edge LLM (local model): Viable at Phase 4 — a quantized Mistral-7B on GPU nodes could do <80ms. Deferred due to GPU infra cost and complexity in MVP.
The HOLD state creates a user-visible delay for ~8% of interactions (estimated L2 ambiguous zone). This may frustrate users if hold time exceeds 2–3 seconds. Mitigation: CHALLENGE workflow gives users a reason ("Policy review: 2s") rather than an opaque pause.
If LLM inference latency drops to <50ms (plausible with edge models), revisit inlining L3. ONNX-quantized security-tuned models may enable this at Phase 4.
Compliance reports require JOIN-heavy queries across events, policies, NHI, and agents. DynamoDB's single-table design makes these queries painful and expensive. TimescaleDB's hypertable chunking gives O(log n) time-series performance without leaving the SQL ecosystem. PostgreSQL RLS provides multi-tenant data isolation with zero application-layer complexity.
DynamoDB: PRD specifies RDS but let's be explicit. DynamoDB cannot support the ad-hoc query patterns needed for Chronix gap analysis without a massive GSI footprint. ClickHouse: Ideal for Oraxis analytics at 50M+ events/month. Add as a read-replica analytics store at Phase 3, not Day 1 — schema is not stable enough yet.
At 864M events/day sustained (10k RPS), TimescaleDB chunk compression and S3 tiering after 90 days is critical. Without it, storage costs and query times degrade materially by month 4. Chunk compression must be tested during Phase 0 load testing.
At Phase 3, mirror enforcement_events to ClickHouse via Kafka consumer. Oraxis queries hit ClickHouse; all other modules stay on PostgreSQL. No disruptive migration required.
The PRD specifies SQS/EventBridge. However, at 10k RPS × 1 event/request, that's 864M messages/day. SQS FIFO queues cap at ~3,000 msg/sec even with batching, and standard queues — while they scale further — provide no ordering and no replay. Kafka on MSK handles 1M+ msg/sec natively. More critically, Kafka's 7-day log retention enables forensic replay — if a new Luxion signature is deployed, you can reprocess last week's events to catch previously-missed attacks.
SQS Standard: Throughput ceiling too low; no replay. Kinesis: Valid alternative. Less Go ecosystem support than Kafka; shard management more manual. Switch to Kinesis if AWS-native tooling is a hard requirement. EventBridge: Keep for routing Nexion alerts to external systems (PagerDuty, Slack, SIEM) but not for high-throughput event ingestion.
Kafka on MSK adds operational complexity vs SQS. Partition count decisions made at cluster creation are hard to change. Set 64 partitions per topic (over-provision; cheaper than under-provisioning).
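Stable tenant-to-partition assignment (effectively what a Kafka client's default keyed partitioner does) can be shown standalone so the isolation property is explicit; the 64-partition count is the over-provisioned value above, and the helper name is illustrative:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const partitions = 64 // fixed at topic creation; over-provisioned on purpose

// partitionFor maps a tenant to a Kafka partition. Because the hash is
// stable, all of a tenant's events land on the same partition, preserving
// per-tenant ordering and keeping tenants' streams isolated from each other.
func partitionFor(tenantID string) int32 {
	h := fnv.New32a()
	h.Write([]byte(tenantID))
	return int32(h.Sum32() % partitions)
}

func main() {
	// Same tenant always maps to the same partition.
	fmt.Println(partitionFor("tenant-a") == partitionFor("tenant-a")) // true
}
```

This is also why the partition count is hard to change later: resizing the modulus remaps tenants across partitions and breaks per-tenant ordering during the transition.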
MSK Serverless removes partition management complexity at a cost premium (~2x per GB). Use MSK Serverless for MVP, evaluate dedicated MSK at Phase 2 when throughput patterns are known.
ECS Fargate task scale-out takes 45–90 seconds (new task launch + health check). At 10k RPS burst traffic, that's too slow. EKS HPA with KEDA (Kubernetes Event-Driven Autoscaling) can scale pods in <15 seconds based on Kafka consumer lag metrics — which is exactly the right signal for the event pipeline. Karpenter handles node provisioning in <60 seconds.
ECS Fargate (PRD spec): Simpler ops, but the autoscaling SLA doesn't match the product's performance requirements. Reconsider for non-critical services (Chronix report generation). ECS on EC2: Self-managed instances add overhead that Kubernetes already abstracts better.
Use ArgoCD for GitOps deployments. Helm charts for all services. Cluster creation via Terraform. Budget 1 dedicated SRE/platform engineer starting Phase 1.
Deploy a dedicated EKS cluster in eu-west-1 for EU tenants. Enforce tenant routing at API Gateway (CloudFront → regional ALB). Postgres replicated to eu-west-1 Multi-AZ. No EU data touches us-east-1 workers.
What it solves: Compliance evidence requires immutable, ordered history of every policy evaluation with full context. Traditional CRUD updates destroy this audit chain.
When worth it: From Day 1. The enforcement_events table IS an event store — append-only, never updated. This is event sourcing without the overhead of rebuilding projections for every read.
Adoption cost: Low. Design the data model correctly from the start (no UPDATE on events). The "projection" is Chronix reading the event stream to compute compliance scores.
✓ USE NOW
What it solves: Browser extension currently sends to API Gateway for policy evaluation — adds ~50ms round-trip. Edge deployment of OPA WASM bundles in the extension itself could reduce this to <5ms local evaluation.
When worth it: Phase 4, when customer latency complaints emerge or when enterprises deploy Astral for latency-sensitive real-time coding assistants.
Overkill for: MVP. Policy bundle sync in the extension adds deployment complexity; WASM bundles are 2-5MB which slows extension updates through Chrome Web Store review.
⟳ EVALUATE AT PHASE 4
What it solves: The L3 AI Judge uses a general-purpose LLM (GPT-4o/Claude Sonnet) for semantic threat analysis. A security-domain fine-tuned model (Mistral-7B + security corpus) could run on dedicated GPU nodes at <80ms — enabling inline L3.
When worth it: When monthly L3 API costs exceed $50k/month, or when customers require data-resident AI judgment (no prompts leaving tenant network).
Adoption cost: High. Requires curated training dataset, RLHF pipeline, model evaluation framework, and GPU serving infra. 3–4 months of ML team time.
⟳ EVALUATE AT $50K/MO THRESHOLD
What it solves: The policy simulator shows retroactive impact but can't predict future edge cases. Progressive policy rollout (10% of users → 25% → 100%) catches false positives before full deployment.
When worth it: Phase 2. When enterprise customers start deploying blocking policies and false positives become a customer success issue.
Implementation: LaunchDarkly SDK or self-hosted Unleash. Policy activation has a "canary" mode that routes N% of tenant traffic through the new policy.
⟳ EVALUATE PHASE 2
What it solves: Compliance report generation requires "find all events related to PHI exposure" — a semantic query, not a keyword match. pgvector on PostgreSQL enables this without a separate vector database.
When worth it: Phase 3. Add pgvector to the existing PostgreSQL instance; embed enforcement events with a lightweight model (all-MiniLM-L6-v2). Enables natural language compliance queries.
⟳ PHASE 3 WITH pgvector
What it solves: Compliance report generation is a read-heavy, long-running operation that competes with high-frequency audit event writes on the same PostgreSQL primary.
When worth it: When report generation p95 exceeds 10 seconds, or when write throughput on the primary exceeds 50k events/second. Route all Chronix reads to a PostgreSQL read replica.
Overkill for: MVP. PostgreSQL Multi-AZ read replica is sufficient for Phase 0–2.
⟳ WHEN REPORTS SLOW DOWN
Go gateway + browser extension + API proxy. Can run fully independently from Track B after shared event schema is locked (end of Week 2). Team: 2 Go engineers + 1 frontend (extension).
Next.js dashboard + policy UI + Chronix gap analysis. Depends only on API contract (OpenAPI spec locked Week 2). Mock API for development. Team: 1 fullstack + 1 frontend.
EKS, Terraform, CI/CD, MSK, RDS, Redis. Runs in parallel Week 1 onwards. Must deliver baseline environment by end of Week 2. Team: 1 platform/SRE engineer.
L3 AI Judge async service + L4 deepfake models. Starts Week 12, after core enforcement is stable. Depends on API contract for async challenge protocol. Team: 1 ML engineer.
The single thread that delays everything: The enforcement event schema and the OPA policy bundle format. Every other component depends on these two contracts — the browser extension, the API gateway, the Nexion alert feed, the Chronix compliance mapping, and the TimescaleDB schema all wire to this. If the event schema changes after Week 3, it cascades across 5 teams.
Lock the event schema and OpenAPI spec by end of Week 2. No exceptions. Use schema versioning (v1, v2) to allow additive changes without breaking existing consumers.
| Tool | Purpose | Why This | Status |
|---|---|---|---|
| OPA + Rego | Policy engine | Sub-ms eval; git-native policy-as-code; WASM compilation for edge | Stable ✓ |
| Hyperscan | Multi-pattern DLP regex | Network-speed regex; single DB compile for all PII patterns | Stable ✓ |
| TimescaleDB | Time-series audit events | PostgreSQL-compatible; chunk compression; continuous aggregates | Stable ✓ |
| Triton Inference Server | GPU model serving (L3/L4) | Dynamic batching; multi-model serving; gRPC interface for Go | Active — evaluate |
| Presidio (Microsoft) | PII/PHI entity recognition | Pretrained NER for 50+ PII types; augments regex for ML-based detection | Active — evaluate |
| Karpenter | EKS node auto-provisioning | Spot instance provisioning in <60s; GPU node pool management | Stable ✓ |
| KEDA | Kafka-driven pod autoscaling | Scale pods based on Kafka consumer lag — ideal trigger for enforcement burst | Stable ✓ |
| pgvector | Semantic search on events (Phase 3) | No separate vector DB; works in existing PostgreSQL instance | Active — evaluate |
Ship L1 signature DLP without the ML-backed L2 heuristic layer. Covers 90%+ of common PII patterns. Cleanup trigger: first false-negative customer complaint or when ML infra is available.
Agents poll for policy bundles every 30s rather than push. Adds up to 30s lag on policy activation. Cleanup trigger: customer needs sub-5s policy propagation (implement Redis pub/sub push).
Launch us-east-1 only. EU region required only for first EU enterprise customer. Cleanup trigger: first EU customer signed. Allow 6 weeks for eu-west-1 deployment.
Tempting to add tenant_id to WHERE clauses in application code. This is a trap. One missing WHERE clause leaks all tenants' data. PostgreSQL RLS enforces isolation at the DB layer — no application bug can bypass it. Must be right from Day 1.
No UPDATE or DELETE on enforcement_events. Ever. Compliance evidence requires immutable, append-only records. Even "corrections" must be new events with a reference to the corrected event.
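Corrections as new events rather than mutations can be sketched as follows; the `CorrectsEventID` field name and the in-memory log are illustrative stand-ins for the real table:

```go
package main

import "fmt"

// Event is an append-only enforcement record. A correction never mutates the
// original row; it is a new event whose CorrectsEventID points back at it.
type Event struct {
	ID              string
	Action          string
	CorrectsEventID string // empty unless this event corrects an earlier one
}

// Log is append-only by construction: the only write operation is Append.
type Log struct{ events []Event }

func (l *Log) Append(e Event) { l.events = append(l.events, e) }

// Effective resolves the current view of an event by walking its chain of
// corrections, leaving the full audit history intact underneath.
func (l *Log) Effective(id string) Event {
	result := Event{}
	for _, e := range l.events {
		if e.ID == id || e.CorrectsEventID == id {
			result = e
			id = e.ID // follow the chain if this correction was itself corrected
		}
	}
	return result
}

func main() {
	var log Log
	log.Append(Event{ID: "e1", Action: "BLOCK"})
	log.Append(Event{ID: "e2", Action: "ALLOW", CorrectsEventID: "e1"})
	fmt.Println(log.Effective("e1").Action) // ALLOW, yet e1 is still in the log
	fmt.Println(len(log.events))            // 2
}
```

The audit chain survives because reads resolve corrections at query time; nothing is ever rewritten in place.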
If the team is tempted to put the LLM call in the hot path "just for now" — don't. The latency regression will be immediate, the removal will be deprioritized forever, and customers will build workflows around the slow behavior.
The enforcement_events schema is the contract between 5 services. Changing it after Week 3 is a coordination disaster. Lock it. Version it. Add fields additively only.
All inter-service communication: mTLS. All client connections: TLS 1.3 minimum. Retrofitting this into a security product is not just technical debt; it's a marketing disaster.
No Phase 2 start until Phase 1 passes: P95 <50ms @ 5k RPS under k6 load test. This isn't optional. Ship a product that can't hit its stated SLA and the sales team is lying to customers.
OpenTelemetry tracing on every enforcement hop. Prometheus metrics exported. Grafana dashboards. Without this, debugging latency regressions or finding the "3am incident" is impossible.
The <50ms target for inline enforcement requires Hyperscan DLP + OPA eval + Redis rate check + async Kafka publish — all under 50ms including network overhead to the enforcement gateway. Under cold-cache conditions on a freshly scheduled EKS pod, this won't be met. The first time a customer sees a >50ms enforcement decision and screenshots their Network tab, it's a sales problem. Mitigation: define the SLA as P95 <50ms (not mean), pre-warm new pods (load OPA bundles and compile Hyperscan databases before they join the load balancer), and instrument every component with a latency breakdown in the Oraxis dashboard.
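Defining the SLA as P95 rather than mean matters because a handful of cold-cache outliers inflate the mean while nearly all users see fast responses. A minimal nearest-rank percentile sketch:

```go
package main

import (
	"fmt"
	"sort"
)

// p95 returns the 95th-percentile latency (nearest-rank method) from a set
// of per-request samples in milliseconds.
func p95(samplesMs []float64) float64 {
	s := append([]float64(nil), samplesMs...) // don't mutate the caller's slice
	sort.Float64s(s)
	rank := int(0.95*float64(len(s))+0.999999) - 1 // ceil, then to 0-indexed
	if rank < 0 {
		rank = 0
	}
	return s[rank]
}

func main() {
	// 95 warm requests at 12ms and 5 cold-cache outliers at 400ms:
	// the mean is ~31ms, but P95 still reports the warm-path latency.
	samples := make([]float64, 0, 100)
	for i := 0; i < 95; i++ {
		samples = append(samples, 12)
	}
	for i := 0; i < 5; i++ {
		samples = append(samples, 400)
	}
	fmt.Println(p95(samples)) // 12
}
```

In production this would come from an OpenTelemetry/Prometheus histogram rather than raw samples, but the SLA arithmetic is the same.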
OQ-04 (shared vs dedicated) is marked "open question" but the architecture decision must be made in Week 1 — it affects the Postgres schema, the Kafka topic structure, the Cognito pool design, and the AWS account strategy. If it's answered wrong (shared when a customer needs dedicated), migration is a 6-month effort. Recommend: ship shared multi-tenant, sell dedicated as Enterprise tier at 2x price, budget the migration cost into the enterprise deal.
Stellix auto-classifying AI systems as "Minimal Risk" when they should be "High Risk" creates legal liability for the customer — and potentially for Zorianto as the tool enabling that misjudgment. The PRD frames this as a discovery feature, but it will be used by compliance officers as a compliance decision. Every auto-classification must have a "pending legal review" state and a prominent disclaimer. Build the human-confirmation workflow into the Stellix UI before any customer uses the EU AI Act features in production.