Keeping AI Personalization Steady Across Clouds

Practical notes on building a recommendation engine that stays reliable while teams iterate.

Mustafa DwaikatTechnical Lead & Solution ArchitectJanuary 15, 2025

Leads cross-functional teams to ship AI-enabled platforms, resilient cloud systems, and scalable developer experience programs.

When we set out to build a personalization engine for a news and video platform that spans three clouds, success was not defined by a shiny demo. We needed measurable lifts in engagement, zero regression to availability, and a delivery process the organisation could sustain once the core team rolled off. Here is what worked, what hurt, and what I would do again.

Start with the operational contract

The earliest sketch was not a diagram; it was a one-pager outlining latency budgets, fallback behaviour, governance guardrails, and operational ownership. Shipping an ML system without those agreements is an invitation to scope creep and midnight incidents.

Latency: sub-150 ms P95 for recommendations, including feature enrichment, so we defined a cache hierarchy (edge + service-level) before writing the first line of code.
Fallbacks: degrade gracefully to editorially curated collections whenever upstream data freshness drops below SLA or scoring services misbehave.
Ownership: product analytics owned instrumentation, platform engineering owned orchestration and SRE, editorial leadership signed off on the human-in-the-loop workflow.

Locking this down gave every stakeholder clarity: any idea that violated the contract was either refined or parked.

Architect for change, not for a single model

We chose an event-driven spine so behavioural signals and content metadata remained decoupled. Core building blocks:

Unified event bus (Kafka on AWS) ingesting user interactions, video completion rates, and newsroom signals.
Feature pipelines orchestrated in GCP with Dataflow and BigQuery, which made experimentation accessible for data scientists already living in that ecosystem.
Model serving through Azure Machine Learning endpoints, wrapped in a lightweight Nest.js service providing AB-testing, rollout toggles, and observability hooks.
Feature store abstraction so a model swap becomes a configuration change, not a rewrite.

Cross-cloud was a deliberate bet: each team leveraged the tooling they knew best, and we stitched them together with strict interface contracts and zero trust networking. It added complexity, but also resilience-no single vendor outage could sink the system.

Make telemetry non-negotiable

We funded observability as a first-class feature:

Tracing (OpenTelemetry + Tempo) across ingestion, feature computation, and inference so we could follow a recommendation from click to response.
Structured business metrics (engagement lift, time-on-site, CPM impact) in a dedicated Looker dashboard refreshed hourly.
Model health reports covering drift, fairness, and opt-out rates. These fed a weekly governance review with product, legal, and editorial.

Because telemetry shipped with the first iteration, we caught a misbehaving feature pipeline within hours-the cache was flooding with stale scores after a schema change. Rollback was instant, customer impact negligible.

Delivery patterns that scaled the team

Cross-disciplinary work falls apart without rituals that keep people aligned. A few that stuck:

Integration spikes every sprint: rather than waiting for a "hardening" phase, we pulled one cross-service integration into each sprint review. Friction surfaced early and was fixed while context was fresh.
Decision logs: short, timestamped notes in Notion capturing context, options, and the chosen path. When teams rotated, onboarding time fell by ~40%.
Playbooks for deploy, rollback, and runbooks for the support desk. No on-call engineer had to guess how the system behaved under failure.

Guardrails for responsible AI

Personalization without governance is a reputational risk. We added human checkpoints and auditability from day one:

Editorial teams could pin or blacklist content, with full traceability.
Models were retrained only after bias evaluation against geography and language cohorts.
Every API call logged the decision factors (top features) to a tamper-proof store, making compliance reviews straightforward.

What I would repeat

Define the contract before the architecture. It keeps everyone honest and accelerates trade-off decisions.
Invest in shared tooling. Cross-cloud works if you make integration effortless and observability universal.
Let governance live with the product. AI needs product, legal, and editorial viewpoints, not just engineering enthusiasm.

If you are about to operationalise your own recommendation system, start by writing the operational contract. Everything else-architecture, tooling, headcount-flows more cleanly from there.