
AI Career Intelligence Hub

Complete node-mapped reference for AI leadership careers: docs, GitHub repos, MCP servers, agent frameworks, certifications, and implementation paths — aligned to Product, Project, Program & Delivery roles.

AWS · Azure · GCP · Target: $180K – $260K+ · U.S. Secret Clearance ★ · DevSecOps · LLMOps · MITRE ATT&CK · Houston, TX
12 Roles w/ 5W Intel · 285+ Resource Links · 60+ GitHub Repos · 15 Cert Tracks · $280K Top Ceiling · 59 YT Videos
🎯 PRIORITY

Your Role Targeting Map

// 8 priority roles · ranked by background match · salary data 2026
★ PRIORITY #1 · AI Technical Program Manager · $155K – $235K · match 98
LLM pipelines · Delivery · Architecture bridge · Eng interface
★ PRIORITY #2 · GenAI / LLMOps Program Manager · $160K – $245K · match 96
RAG · Vector DBs · Prompt lifecycle · Eval frameworks
★ PRIORITY #3 · AI Security Program Manager · $165K – $260K+ · match 95
Prompt injection · MITRE ATLAS · Secret Clearance ★
★ PRIORITY #4 · Cloud AI Program Manager · $150K – $230K · match 94
AWS + Azure + GCP · Migration · Multi-cloud governance
★ PRIORITY #5 · AI Platform Manager · $150K – $225K · match 93
LLM gateway · Vector DB · Inference pipelines
AI Product Manager · $140K – $210K
Roadmap · Monetization · UX · AI capability alignment
MLOps Program Manager · $145K – $220K
Model lifecycle · CI/CD ML · Production monitoring
AI Transformation Director · $170K – $280K
Enterprise AI adoption · Change management · P&L
01

Core AI Delivery & Program Leadership

3 roles · 20+ links
📋 AI Project Manager

AI Project Manager

Owns end-to-end delivery of AI/ML initiatives — scope, timeline, budget, stakeholder management. Core PM competency extended into ML workflows and data science teams.

WHAT IT IS

An AI Project Manager owns the delivery of machine learning and AI initiatives from inception to production. This encompasses defining scope, managing timelines and budgets, mitigating risks, and communicating progress to stakeholders — applying software PM discipline to the unique unpredictability of data-driven systems.

Scope Management · Budget Control · Risk Registers · Agile / Scrum · Stakeholder Comms
WHY IT EXISTS

Without dedicated project management, AI initiatives routinely miss deadlines by 2–3x, exceed budgets, and fail to ship to production. ML projects introduce unique risks — data availability, model performance thresholds, computational costs — that general PMs miss and engineers don't communicate proactively.

  • 87% of ML models never reach production without structured delivery management
  • AI projects fail most often due to unclear success metrics and scope creep, not technical limitations
  • Organizations with dedicated AI PMs deploy 3× faster than those relying on engineer-led delivery
WHO YOU WORK WITH
  • Data Scientists & ML Engineers — your primary delivery team
  • Product Owners — define the business requirements and acceptance criteria
  • DevOps / MLOps Engineers — own the deployment pipeline you're scheduling
  • Legal / Compliance — increasingly critical for AI governance requirements
  • C-Suite Sponsors — budget holders who need regular ROI updates

Hired by: Tech firms, healthcare systems, financial services, defense contractors, government agencies.

HOW TO EXECUTE
  • Adapt Agile for ML: 2-week sprints with experiment-based acceptance criteria, not just feature demos
  • Milestone-based planning: Data acquisition → EDA → model v1 → eval → staging → production gating
  • Risk registers: Track data drift risk, model performance SLAs, vendor API dependencies
  • Dependency mapping: Data pipelines, labeling teams, compute quotas — all are on your critical path
  • Retrospectives: Run after each model iteration, not just each sprint
BEST PRACTICES
  • Define "done" for ML: "Model achieves 92% F1 on holdout set AND latency <200ms," not "model is trained"
  • Involve MLOps from day 1 — retrofitting deployment is the #1 cause of AI project delays
  • Version everything: data, models, configs — treat them like code artifacts
  • Plan for retraining: Every AI project needs a post-launch maintenance budget (20–30% of build cost annually)
  • Document assumptions about data quality, label accuracy, and model generalization before kickoff
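The "define done" practice above can be wired into the delivery pipeline as an automated release gate. A minimal sketch in Python; the thresholds (92% F1, 200ms p95) mirror the example above and are illustrative, not prescriptive:

```python
# Hypothetical ML release gate: ship only when every "definition of done"
# criterion holds. Metric names and thresholds are illustrative.

def release_gate(metrics: dict) -> tuple[bool, list[str]]:
    """Return (ship?, failed criteria) for a candidate model."""
    criteria = {
        "f1_holdout >= 0.92": metrics["f1_holdout"] >= 0.92,
        "latency_p95_ms < 200": metrics["latency_p95_ms"] < 200,
        "bias_audit_passed": metrics["bias_audit_passed"],
    }
    failed = [name for name, ok in criteria.items() if not ok]
    return (not failed, failed)

ok, failed = release_gate(
    {"f1_holdout": 0.94, "latency_p95_ms": 180, "bias_audit_passed": True}
)
```

Running the gate on every candidate makes "done" a pipeline decision instead of a judgment call in a status meeting.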
Cloud Platform Docs
GitHub Repos
Frameworks & Standards
📺 Watch & Learn
🗂️ AI Program Manager

AI Program Manager

Oversees multiple parallel AI initiatives across business units. Focus on governance, ROI measurement, portfolio-level alignment, and C-suite reporting.

WHAT IT IS

An AI Program Manager oversees a portfolio of multiple concurrent AI initiatives, aligning them to enterprise strategy, tracking cross-program ROI, and ensuring governance. Where a PM runs a single project, a Program Manager orchestrates the full ecosystem of AI investments across an organization.

Portfolio Alignment · AI Governance · ROI Tracking · OKR Frameworks · Executive Reporting
WHY IT EXISTS

Without program-level coordination, organizations end up with 20 disconnected AI POCs, duplicated infrastructure, and zero shared learning. The AI Program Manager creates coherence — turning isolated experiments into a compounding strategic advantage and preventing expensive capability redundancy.

  • Fragmented AI portfolios waste an estimated 40% of AI budget on duplication
  • Program managers reduce time-to-scale by establishing reusable platforms and shared governance
  • Critical for regulatory compliance — someone must own the full AI inventory
WHO YOU WORK WITH
  • VP/SVP of AI or CTO — your primary executive sponsor
  • Business Unit Leaders — stakeholders whose teams consume AI capabilities
  • AI PMs and TPMs — manage the individual projects within your program
  • Legal, Compliance, Finance — governance and budget accountability
  • External Vendors / Cloud Partners — AWS, Azure, GCP account teams
HOW TO EXECUTE
  • Build a portfolio roadmap: Prioritize initiatives by strategic impact × feasibility
  • OKR frameworks: Tie AI investment to measurable business outcomes, not model metrics
  • Governance councils: Monthly cross-functional AI review with legal, compliance, engineering
  • Executive dashboards: Real-time ROI, deployment velocity, risk heat maps
  • Phase gate reviews: Go/No-go decisions at data readiness, model validation, and production milestones
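The portfolio-roadmap step above can be made concrete with a simple scoring model. A sketch, assuming 1–5 scales for impact and feasibility; the initiative names and scores are made up:

```python
# Illustrative portfolio prioritization: score = strategic impact x
# feasibility, both on a 1-5 scale. Initiatives are hypothetical.

def priority_score(impact: int, feasibility: int) -> int:
    return impact * feasibility

portfolio = [
    ("Support-ticket triage LLM", 5, 4),   # high impact, proven tech
    ("Churn prediction model", 4, 4),      # solid, well-understood
    ("Autonomous pricing agent", 5, 2),    # high impact, risky today
]
ranked = sorted(portfolio, key=lambda r: priority_score(r[1], r[2]), reverse=True)
```

Even a crude multiplicative score forces the prioritization conversation onto explicit, comparable numbers rather than advocacy.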
BEST PRACTICES
  • Measure outcomes, not outputs: "Revenue uplift from recommendation model" beats "model accuracy 94%"
  • Build a Center of Excellence (CoE) to centralize MLOps tooling, shared infrastructure, and best practices
  • Maintain an AI inventory: Every model, dataset, and algorithm in production needs an owner and review schedule
  • Phase AI investment: Quick wins in Year 1, platform investment in Year 2, autonomous AI in Year 3
  • Kill projects ruthlessly: 60% of AI POCs should never scale — have clear kill criteria defined upfront
AWS Resources
Azure + GCP
📺 Watch & Learn
⚙️ Technical Program Manager

Technical Program Manager (AI / ML / GenAI)

Deep technical + delivery ownership. LLMs, pipelines, APIs, microservices. Bridges engineering and product. Strongest target given DevSecOps + multi-cloud + 11yr background.

WHAT IT IS

A Technical Program Manager for AI combines deep engineering knowledge with program delivery. You own LLM pipelines, API integrations, cloud infrastructure delivery, and system architecture — while managing timelines, dependencies, and engineering teams. You are the person who can draw the architecture diagram AND run the sprint planning session.

LLM Pipeline Ownership · Architecture Reviews · API Management · CI/CD for ML · DevSecOps · Multi-Cloud
WHY IT EXISTS

AI systems are complex enough that business-only PMs miss critical technical risks — a model that performs well offline but fails in production, a RAG pipeline with unacceptable latency, a cloud bill 10× over budget. The AI TPM bridges the gap so engineering velocity stays high while delivery accountability is maintained.

  • TPMs catch architecture risks before they become sprint-blocking bugs
  • Maps directly to your background: DevSecOps + AWS/Azure/GCP + 11 years of Fortune 500 delivery = textbook TPM profile
  • Commands $155K–$235K — premium over non-technical PMs due to engineering credibility
WHO YOU WORK WITH
  • ML / LLM Engineers — your delivery team, working at the model and pipeline level
  • DevOps / Platform Engineering — who deploy the infrastructure you're scheduling
  • Product Managers — translating business requirements into technical acceptance criteria
  • Security / Compliance — especially critical with your clearance background
  • Engineering Managers — resource planning and performance context
HOW TO EXECUTE
  • Technical design reviews: Lead architecture reviews to catch risks early — your DevSecOps background is a force multiplier here
  • Architecture Decision Records (ADRs): Document every significant technical decision with context and tradeoffs
  • Dependency graphs: Map all technical dependencies — data pipelines, model serving infra, API contracts
  • Engineering sprint ownership: Run sprint planning with enough technical depth to catch mis-estimates
  • Production readiness reviews: Gate every deployment with security, performance, and monitoring checklists
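Dependency mapping like the above is naturally a graph problem: model the tasks as nodes and "depends on" as edges, and a topological sort gives a valid delivery order. A minimal sketch using Python's stdlib graphlib, with hypothetical task names:

```python
# Sketch of technical dependency mapping. Each key lists the tasks it
# depends on; task names are made up for illustration.
from graphlib import TopologicalSorter

deps = {
    "model_v1": {"data_pipeline", "labeling"},
    "eval": {"model_v1"},
    "staging_deploy": {"eval", "serving_infra"},
    "production": {"staging_deploy", "security_review"},
}
# static_order() emits every prerequisite before the task that needs it
order = list(TopologicalSorter(deps).static_order())
```

The same structure surfaces the critical path: anything with no slack between its finish and its dependent's start is where schedule risk concentrates.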
BEST PRACTICES
  • Maintain technical depth: Stay hands-on enough to code-review critical path items — "T-shaped" knowledge is your superpower
  • Build blameless postmortems into your culture — ML incidents are learning opportunities, not blame events
  • Track tech debt explicitly: ML tech debt compounds faster than software tech debt — maintain a visible backlog
  • Automate your own reporting: Build dashboards, not slide decks — real-time status over weekly status calls
  • Clearance as differentiator: Lead with U.S. Secret Clearance on every application — defense TPM roles pay 20–35% premium
Core Platform Docs
GitHub Repos
📺 Watch & Learn
02

AI Product & Strategy Roles

3 roles · 18+ links
🚀 AI Product Manager

AI Product Manager

Defines AI product vision, roadmap, and monetization strategy. Owns user experience + AI capability alignment. Interfaces with engineering, design, and business stakeholders.

WHAT IT IS

An AI Product Manager defines the product vision, roadmap, and success metrics for AI-powered products. Unlike a traditional PM, they must also understand model behavior, training data requirements, evaluation criteria, and how AI capability gaps translate into user experience failures.

Product Roadmap · Monetization Strategy · User Research · Model Performance Specs · Go-to-Market
WHY IT EXISTS

AI features fail when engineers build what they can instead of what users need. The AI PM translates ambiguous business goals into precise model requirements, defines the acceptable failure rate for AI decisions, and ensures the user experience degrades gracefully when the model is uncertain.

  • AI products without a PM ship features users don't trust or use — regardless of model accuracy
  • AI PMs define when "good enough" is good enough — preventing infinite model tuning cycles
  • Critically: someone must own the feedback loop from user behavior back to model retraining
WHO YOU WORK WITH
  • UX/Design — translating AI capability into trust-building user interfaces
  • Data Scientists — defining model requirements and evaluation criteria
  • Engineering — scoping feasibility and managing tradeoffs
  • Sales & Marketing — positioning AI features and managing customer expectations
  • Legal / Privacy — AI-specific data usage and disclosure requirements
HOW TO EXECUTE
  • AI-specific PRDs: Include model performance thresholds, acceptable error rates, fallback behavior, and data requirements
  • User research for AI: Test not just usability but trust — how do users react when the model is wrong?
  • Feature prioritization: MoSCoW adapted for ML — separate "model must achieve X" from "feature ships when X"
  • Beta frameworks: Staged rollouts with human-in-the-loop for high-stakes AI decisions
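An AI-specific PRD fragment can be captured as structured data so the thresholds above are machine-checkable rather than buried in a doc. A sketch with hypothetical field values:

```python
# Sketch of an AI-specific PRD fragment as structured data. The fields
# mirror the bullets above; the example values are hypothetical.
from dataclasses import dataclass, field

@dataclass
class AIFeatureSpec:
    feature: str
    min_precision: float          # model must achieve before launch
    max_error_rate: float         # acceptable in production
    fallback_behavior: str        # what happens when the model abstains
    data_requirements: list[str] = field(default_factory=list)

spec = AIFeatureSpec(
    feature="fraud-flagging",
    min_precision=0.90,
    max_error_rate=0.02,
    fallback_behavior="route to manual review queue",
    data_requirements=["12 months labeled transactions", "chargeback outcomes"],
)
```

Storing the spec this way lets the eval pipeline assert against it directly, so "model must achieve X" and "feature ships when X" stay linked.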
BEST PRACTICES
  • Define success before building: "Model achieves 90% precision on fraud detection" must be written before training starts
  • A/B test AI outputs, not just UX — compare model versions on real traffic, not just offline benchmarks
  • Build feedback loops early: Thumbs up/down, corrections, and implicit signals feed your next retraining cycle
  • Communicate uncertainty to users: "AI-generated, may be inaccurate" builds more trust than pretending certainty
  • Treat AI latency as a product requirement — 2-second response time is a feature, not a nice-to-have
Cloud AI Services
Learning Repos
Product Strategy Frameworks
📺 Watch & Learn
🤖 GenAI Product Manager

GenAI Product Manager

Focus on LLM apps, copilots, RAG systems. Drives prompt strategy, evaluation pipelines, cost optimization, and LLM quality frameworks at enterprise scale.

WHAT IT IS

A GenAI Product Manager specializes in products built on foundation models — copilots, agents, RAG-powered search, and LLM pipelines. The role combines classic product management with deep LLMOps knowledge: prompt strategy, evaluation frameworks, token cost management, and hallucination mitigation.

LLM App Strategy · Prompt Engineering · RAG Architecture · Eval Pipelines · Cost Governance · Agent Design
WHY IT EXISTS

GenAI products fail in uniquely dangerous ways — hallucination, prompt injection, runaway token costs, and model behavior drift. A dedicated GenAI PM exists to manage these failure modes systematically, ensuring products are reliable, cost-efficient, and safe enough to deploy at enterprise scale.

  • Without cost governance, LLM apps can exceed compute budgets by 10–50× at scale
  • Prompt drift — model updates silently break product behavior — requires PM-owned eval suites
  • GenAI PM is the fastest-growing PM specialization in 2025–2026 with 3× salary premium over traditional PM
WHO YOU WORK WITH
  • LLM / Prompt Engineers — your core technical team building and tuning the AI layer
  • Data Engineers — building the RAG knowledge bases and vector pipelines
  • UX Researchers — studying how users interact with generative outputs and come to trust them
  • Legal / Privacy — GenAI introduces copyright, hallucination liability, and data residency risks
  • Finance — token cost per user is a unit economics concern requiring PM ownership
HOW TO EXECUTE
  • Treat prompts as code: Version in Git, review in PRs, test in CI/CD — prompt changes are product changes
  • Build eval suites before launch: RAGAS, DeepEval, or custom evals that test factuality, safety, and task completion
  • Token budget management: Set per-user, per-feature cost targets — track in real-time dashboards
  • Human-in-the-loop design: Identify where model confidence is low and route to human review automatically
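The prompt-as-code and eval-suite steps above can be combined into a CI regression test that runs on every prompt change. A self-contained sketch: `call_llm` is a stub standing in for a real provider client, and the tested refusal behavior is illustrative:

```python
# Sketch of "prompts as code": a versioned prompt template plus a
# regression check suitable for CI. call_llm is a stub so the example
# is self-contained; a real version would call an LLM provider.

PROMPT_V2 = (
    "You are a support assistant. Answer ONLY from the provided context. "
    "If the answer is not in the context, say \"I don't know.\"\n"
    "Context: {context}\nQuestion: {question}"
)

def call_llm(prompt: str) -> str:
    # Stub: refuses when the context section is empty, answers otherwise.
    return "I don't know." if "Context: \n" in prompt else "Paris"

def regression_check(context: str, question: str, must_contain: str) -> bool:
    answer = call_llm(PROMPT_V2.format(context=context, question=question))
    return must_contain.lower() in answer.lower()
```

A change to PROMPT_V2 that breaks the refusal behavior fails the build before it reaches users, which is exactly the guard against prompt drift.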
BEST PRACTICES
  • Never ship without evals: Every GenAI feature needs an automated test suite that runs on every deploy
  • Monitor hallucination rate in production — not just in your eval set. Real user queries will find model gaps your test set didn't
  • Cost per query is a product metric — track it alongside engagement and satisfaction metrics
  • Build system prompts with least privilege: Only give the model the context it needs — smaller context = lower cost + lower injection risk
  • Fallback gracefully: Define what happens when the model is unavailable or returns low-confidence responses
Cloud GenAI Platforms
LLM Frameworks
Evaluation Tools
📺 Watch & Learn
📊 AI Strategy Lead

AI Strategy Lead / Head of AI Programs

Enterprise AI roadmap ownership, transformation strategy, budget control, and C-suite/board alignment. Often involves P&L ownership and hiring authority.

WHAT IT IS

The AI Strategy Lead sets enterprise-wide AI direction, manages the AI investment portfolio, defines AI governance policy, and aligns AI initiatives to business strategy. This is a senior leadership role — often VP-level — with direct budget authority and board-level accountability for the organization's AI future.

Enterprise AI Roadmap · AI Governance · P&L Ownership · Vendor Strategy · Build vs Buy · Board Reporting
WHY IT EXISTS

Without strategic AI leadership, organizations waste millions on disconnected experiments, miss regulatory deadlines, and get disrupted by competitors who move faster. The AI Strategy Lead ensures AI investment is coherent, governed, and compounding — not scattered across 50 disconnected POCs that never scale.

  • Companies with a dedicated AI strategy function deploy 4× more AI at scale than those without
  • Regulatory pressure (EU AI Act, NIST RMF) demands enterprise-level AI accountability that no single team can provide
  • AI transformation failures are almost always strategic (wrong priorities, no governance), not technical
WHO YOU WORK WITH
  • CEO / CTO / CDO — your primary stakeholders and budget holders
  • Board of Directors — quarterly AI risk and opportunity briefings
  • Business Unit Presidents — AI use case identification and ROI accountability
  • Chief Risk / Legal / Compliance Officers — AI governance and regulatory alignment
  • Cloud Partners (AWS, Azure, GCP) — enterprise agreements and strategic partnerships
HOW TO EXECUTE
  • AI maturity assessment: Baseline where the org is — data quality, talent, infrastructure, governance — before building strategy
  • Portfolio prioritization: Score use cases on business impact × technical feasibility × data availability
  • Governance council: Cross-functional AI ethics and risk council meeting monthly
  • Build-vs-buy framework: Systematic evaluation criteria for AI vendors vs. custom builds vs. open source
  • Regulatory mapping: Map all AI systems to EU AI Act risk tiers, NIST RMF functions, and sector-specific requirements
BEST PRACTICES
  • Start with 3 high-ROI use cases — deep wins build credibility and fund the platform investment for scale
  • Establish a Center of Excellence (CoE) in Year 1 — centralize MLOps tooling, shared infrastructure, and talent development
  • AI risk register: Every AI system in production must have a named owner, risk classification, and review date
  • Measure AI maturity quarterly using a consistent framework (CMMI for AI or equivalent) — progress is your budget justification
  • 60% of AI POCs should not scale — define kill criteria before starting and enforce them ruthlessly
Adoption Frameworks
Standards & Regulations
📺 Watch & Learn
03

MLOps, LLMOps & AI Operations Leadership

3 roles · 25+ links
⚙️ MLOps Program Manager

MLOps Program Manager

Owns model lifecycle management, CI/CD for machine learning, production monitoring, and drift detection. Bridges data science output with platform engineering delivery.

WHAT IT IS

The MLOps Program Manager operationalizes machine learning — building and managing the CI/CD pipelines, model registries, monitoring systems, and retraining workflows that keep ML models reliable in production. This role applies DevOps discipline to the unique complexity of ML systems, bridging the gap between data science experimentation and production engineering.

CI/CD for ML · Model Registry · Drift Detection · Retraining Automation · Feature Stores · SageMaker / Vertex AI
WHY IT EXISTS

87% of ML models never reach production. Of those that do, most degrade silently within months due to data drift, concept drift, or infrastructure failures. The MLOps Program Manager exists to close this gap — systematizing the path from notebook to production and keeping models reliable after deployment.

  • Model drift is invisible without monitoring — models can fail silently for weeks before humans notice
  • Without MLOps, every redeployment is a manual, error-prone 2–4 week process
  • MLOps teams deploy models 50× more frequently than manual counterparts
WHO YOU WORK WITH
  • Data Scientists — consuming their models and making them production-ready
  • ML Engineers — building the serving infrastructure and pipelines you manage
  • Platform / DevOps Engineers — providing the underlying K8s and cloud infrastructure
  • Data Engineers — ensuring clean, consistent features reach models in production
  • Security / Compliance — model governance and audit trails for regulated industries
HOW TO EXECUTE
  • Model CI/CD pipeline: Automated training → evaluation → staging → production gates triggered by data or code changes
  • Model registry: Central catalog of all models with version history, performance metrics, and owner info (MLflow, SageMaker Model Registry)
  • Monitoring suite: Data drift (Evidently AI), model quality (Arize, Fiddler), infrastructure (Prometheus/Grafana)
  • Retraining triggers: Automated retraining when drift score exceeds threshold — no manual intervention required
  • Rollback procedures: One-click rollback to previous model version with automatic traffic cutover
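The retraining-trigger step above can be sketched with the population stability index (PSI), a common drift score over binned feature distributions. The 0.2 threshold is a widely used rule of thumb, not a standard:

```python
# Sketch of an automated retraining trigger using the population
# stability index (PSI). Distributions are pre-binned fractions that
# sum to 1; the 0.2 threshold is a rule of thumb.
import math

def psi(expected: list[float], actual: list[float]) -> float:
    eps = 1e-6  # guard against empty bins
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))
        for e, a in zip(expected, actual)
    )

def should_retrain(train_dist, live_dist, threshold: float = 0.2) -> bool:
    return psi(train_dist, live_dist) > threshold

train = [0.25, 0.25, 0.25, 0.25]      # distribution at training time
drifted = [0.10, 0.20, 0.30, 0.40]    # distribution observed in production
```

In practice a tool like Evidently AI computes these scores per feature; the point is that the trigger is a threshold check a pipeline can act on, not a human judgment.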
BEST PRACTICES
  • Treat models like microservices: Same deployment discipline — versioning, health checks, canary releases, circuit breakers
  • Version data AND models together — a model is only reproducible if you can re-create the exact training dataset (use DVC)
  • Test for bias in CI/CD: Fairness checks should be automated gates, not one-time audits
  • Shadow mode deployment: Run new models in shadow mode before serving real traffic — compare outputs without user impact
  • SLA-driven monitoring: Define model performance SLAs (latency P99, accuracy floor, uptime) and alert before they breach
AWS MLOps Stack
Azure MLOps Stack
GCP MLOps Stack
Open Source Tools
📺 Watch & Learn
🧠 LLMOps Program Manager

LLMOps / GenAI Program Manager

RAG pipeline ownership, vector DB strategy, prompt lifecycle management, LLM gateway design, evaluation frameworks, and cost governance at enterprise scale. 🔥 HOT ROLE 2026

WHAT IT IS

LLMOps is the operational discipline for large language models in production. The LLMOps Program Manager owns the infrastructure and processes that keep LLM applications reliable, cost-efficient, and continuously improving: RAG pipelines, vector databases, prompt versioning, LLM gateway management, evaluation automation, and token cost governance.

RAG Pipelines · Vector DBs · Prompt Versioning · LLM Gateway · Eval Automation · Token Cost · Agent Orchestration
WHY IT EXISTS

LLMs fail in production differently from classical ML. Hallucination rates, prompt injection vulnerabilities, token cost overruns, and silent model version changes create new operational risks. LLMOps exists to systematize this complexity — giving engineers clear processes for deploying, monitoring, and improving LLM systems without manual heroics.

  • LLM production costs can spike 100× overnight due to prompt inefficiency or unexpected traffic patterns
  • Model provider updates (GPT-4 → GPT-4o) can silently break product behavior without LLMOps monitoring
  • RAG pipeline hallucination rates are a product quality metric that requires continuous measurement infrastructure
WHO YOU WORK WITH
  • LLM / Prompt Engineers — building and tuning the models and prompts you operate
  • Data Engineers — building the ingestion pipelines that feed RAG knowledge bases
  • Security — prompt injection monitoring and guardrail architecture
  • Finance / FinOps — token cost management and LLM cost allocation
  • Product Managers — translating LLM performance metrics into product requirements
HOW TO EXECUTE
  • LLM gateway first: Deploy LiteLLM or similar to centralize provider routing, cost tracking, and rate limiting before anything else
  • Prompt-as-code: All prompts in Git with versioning, semantic diff reviews, and automated regression tests on every change
  • RAG eval pipeline: RAGAS metrics (faithfulness, context precision, answer relevancy) running on every RAG pipeline deploy
  • Observability stack: Langfuse or Arize for trace-level visibility into every LLM call in production
  • Token budget enforcement: Per-user and per-feature token limits with automatic alerts at 80% of budget
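The token-budget enforcement step above, including the 80% alert, can be sketched in a few lines; limits and user IDs are illustrative:

```python
# Sketch of per-user token budget enforcement with an alert at 80% of
# the daily limit. Budget values and user IDs are hypothetical.

class TokenBudget:
    def __init__(self, daily_limit: int, alert_fraction: float = 0.8):
        self.daily_limit = daily_limit
        self.alert_fraction = alert_fraction
        self.used: dict[str, int] = {}

    def record(self, user: str, tokens: int) -> str:
        """Return 'ok', 'alert' (past 80% used), or 'blocked' (over limit)."""
        total = self.used.get(user, 0) + tokens
        if total > self.daily_limit:
            return "blocked"          # reject before spending the tokens
        self.used[user] = total
        if total >= self.daily_limit * self.alert_fraction:
            return "alert"
        return "ok"

budget = TokenBudget(daily_limit=10_000)
```

In production this check lives in the LLM gateway so every provider call passes through it; the same counter feeds the cost dashboards.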
BEST PRACTICES
  • Monitor every LLM call in production — trace inputs, outputs, latency, cost, and model version. No visibility = no reliability
  • Test new model versions in shadow mode before routing production traffic — model provider updates break things silently
  • Chunk size matters in RAG: 512 tokens is not always optimal — test retrieval quality against your actual query distribution
  • Implement input and output guardrails from day 1 — not as an afterthought after an incident
  • Build a retrieval feedback loop: Track which retrieved chunks are actually used — remove noise from your vector DB over time
LLM Orchestration Frameworks
Vector Databases
GitHub Repos — LLMOps
Observability
📺 Watch & Learn
🏗️ AI Platform Manager

AI Platform Manager

Owns AI infrastructure: LLM gateways, vector databases, inference pipelines, compute management, and developer tooling. Works alongside platform engineering teams.

WHAT IT IS

The AI Platform Manager owns the internal developer platform that AI and ML teams build on — LLM serving infrastructure, vector databases, compute scheduling, model serving APIs, experiment tracking, and developer tooling. The goal: reduce time-to-production for AI teams from weeks to hours through self-service platform capabilities.

LLM Serving Infra · GPU Scheduling · Developer Platform · Service Catalog · Cost Allocation · SLA Management
WHY IT EXISTS

Without a managed AI platform, every ML team spends 60–70% of their time on infrastructure instead of models. They reinvent the wheel — setting up the same Kubernetes clusters, monitoring stacks, and serving infrastructure repeatedly. The AI Platform Manager creates the shared foundation that multiplies engineering velocity across the entire AI organization.

  • Platform teams reduce ML infrastructure cost 40–60% through shared compute and standardized tooling
  • Self-service platforms cut time-to-first-model-in-production from 6 weeks to 3 days
  • Critical in your profile: your DevSecOps background maps directly to secure-by-default platform design
WHO YOU WORK WITH
  • ML / LLM Engineers — your primary customers; build for their velocity and trust
  • Platform / Infrastructure Engineers — who build the underlying K8s, networking, and storage layers
  • Security / CISO — embedding security controls into the platform (your clearance is a differentiator here)
  • FinOps / Finance — GPU compute is expensive; you own cost allocation and optimization
  • Data Scientists — consume your platform via notebooks, pipelines, and experiment tracking
HOW TO EXECUTE
  • Service catalog: Catalog every platform capability (LLM endpoints, vector DBs, training clusters) with self-service provisioning
  • Kubernetes-first: All AI workloads containerized and orchestrated — enables portability and scaling
  • GPU cost management: Spot instances for training, reserved for inference — automated scaling policies
  • Developer experience: Time-to-first-deployment under 30 minutes is your primary platform KPI
  • Observability by default: Every workload auto-instrumented with Prometheus + Grafana on provisioning
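Cost attribution starts with a tagging policy enforced at provisioning time. A minimal sketch of such a gate; the required tag keys (team, project, model) follow the convention described here, not any cloud provider's requirement:

```python
# Sketch of a cost-attribution policy check: a resource may only be
# provisioned if it carries all required tags. Tag keys are this
# document's convention; resource data is hypothetical.

REQUIRED_TAGS = {"team", "project", "model"}

def missing_tags(resource: dict) -> set[str]:
    """Return the set of required tags absent from the resource."""
    return REQUIRED_TAGS - set(resource.get("tags", {}))

gpu_node = {
    "name": "trn-a100-01",
    "tags": {"team": "nlp", "project": "rag-search"},  # "model" missing
}
```

Wired into the provisioning API, `missing_tags` turns FinOps visibility from an audit exercise into a hard gate.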
BEST PRACTICES
  • Build for self-service: If teams need to file a ticket to get infrastructure, your platform isn't done yet
  • Cost attribution tags on everything: Every compute resource tagged to team, project, and model — FinOps visibility is non-negotiable
  • Platform SLAs: Define and publish uptime, latency, and support SLAs — treat platform engineering like a product
  • Security by default: Zero-trust network policies, RBAC, secrets management (HashiCorp Vault), and audit logging baked into every template
  • Multi-cloud portability: Avoid deep vendor lock-in — abstract cloud-specific services behind platform APIs
Cloud AI Architecture Centers
Platform Tools
📺 Watch & Learn
04

AI Governance, Risk & Security Roles

3 roles · 30+ links UPDATED 2026
🛡️ AI Security Program Manager

AI Security Program Manager

Protects AI systems from prompt injection, data leakage, model inversion, and supply chain attacks. Your Secret Clearance = extreme premium in defense/gov sectors. MITRE ATLAS now covers 15 tactics, 66 techniques (Oct 2025 update).

WHAT IT IS

The AI Security Program Manager owns the security posture of an organization's AI and ML systems — protecting against prompt injection (OWASP LLM01), data leakage (LLM02), model inversion, training data poisoning, supply chain attacks, and adversarial examples. This role applies MITRE ATT&CK discipline to AI-specific threat vectors using the MITRE ATLAS framework (15 tactics, 66 techniques as of Oct 2025).

MITRE ATLAS · OWASP LLM Top 10 · Prompt Injection · Red Teaming · Secret Clearance ★ · Detection Engineering
WHY IT EXISTS

In 2026, 97% of organizations reported GenAI security incidents. AI-specific attacks — prompt injection, training data poisoning, model extraction — don't appear in traditional security playbooks. The AI Security PM bridges the gap between the SOC team that knows security and the ML team that knows AI, creating defenses tailored to the unique attack surface of intelligent systems.

  • Prompt injection is #1 OWASP LLM risk — just 5 crafted documents can manipulate AI responses 90% of the time via RAG poisoning
  • The DeepSeek database exposure (Jan 2025) underscored the stakes: AI-related breaches carry a $670K average added cost
  • Your Secret Clearance makes you immediately eligible for defense/IC AI security roles paying $200K–$280K+
WHO YOU WORK WITH
  • SOC / Detection Engineers — extending SIEM rules to cover AI-specific TTPs from MITRE ATLAS
  • Red Teams — running adversarial tests against LLMs using PyRIT, Garak, and custom prompts
  • ML / LLM Engineers — building guardrails, input sanitization, and output filtering into the pipeline
  • CISO / GRC Teams — mapping AI risks to regulatory frameworks (NIST AI RMF, EU AI Act)
  • DoD / IC Agencies — if cleared, you'll interface directly with government security stakeholders
HOW TO EXECUTE
  • MITRE ATLAS threat modeling: Map every AI system against the 15 ATLAS tactics — identify which techniques are unmitigated
  • OWASP LLM Top 10 assessment: Run structured assessment of all LLM-facing surfaces against the 2025 list
  • Red team exercises: PyRIT for automated adversarial testing, Garak for LLM vulnerability scanning
  • Guardrail architecture: Input validation + output filtering + content moderation at every LLM boundary
  • Incident response playbooks: AI-specific runbooks for prompt injection, model theft, and training data poisoning incidents
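The guardrail-architecture step begins with input screening. A deliberately minimal heuristic sketch follows; real deployments layer ML-based classifiers and output filters on top, and these regex patterns are illustrative and easy to evade on their own:

```python
# Minimal heuristic input guardrail for prompt injection. The patterns
# are illustrative; production systems combine them with ML classifiers,
# output filtering, and least-privilege tool access.
import re

INJECTION_PATTERNS = [
    r"ignore (all|any|previous|prior) instructions",
    r"disregard (the|your) (system|previous) prompt",
    r"reveal (the|your) system prompt",
]

def screen_input(user_text: str) -> bool:
    """True if the input looks safe to forward to the model."""
    lowered = user_text.lower()
    return not any(re.search(p, lowered) for p in INJECTION_PATTERNS)
```

The same check belongs on retrieved RAG documents, not just user text, since indirect injection rides in through the knowledge base.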
BEST PRACTICES
  • Treat every prompt as untrusted input — apply input sanitization before reaching the model, no exceptions
  • Least privilege for AI agents: Agents should only access tools and data required for the specific task — no ambient authority
  • Scan the model supply chain: Audit every pre-trained model, fine-tuning dataset, and third-party plugin for backdoors
  • Integrate AI security into CI/CD: Automated security scans (garak, custom injection tests) as pipeline gates before deployment
  • Build a detection layer for ATLAS TTPs: Map AI-specific attack techniques to SIEM detection rules — extend your existing Detection-as-Code practice
Core Security Frameworks UPDATED OCT 2025
Cloud AI Security
AI Security GitHub Repos 2026
Detection Engineering
📺 Watch & Learn
📜 AI Governance Manager

AI Governance Manager

Develops and enforces policies for ethical AI, regulatory compliance, model risk management, and AI transparency reporting across the enterprise AI lifecycle.

WHAT IT IS

The AI Governance Manager owns the policies, processes, and accountability structures that ensure AI systems are developed and deployed responsibly. This includes ethics frameworks, model risk management, bias auditing, explainability standards, AI inventory management, and compliance with emerging AI regulations including the EU AI Act and ISO/IEC 42001.

AI Ethics Policy · Model Risk Management · Bias Auditing · AI Inventory · EU AI Act · ISO 42001
WHY IT EXISTS

Unmanaged AI creates legal liability, regulatory fines up to €35M under the EU AI Act, and reputational damage that erases years of brand equity. The AI Governance Manager exists to ensure every AI decision is documented, accountable, and defensible — protecting the organization from the growing wave of AI-specific regulation.

  • Most EU AI Act obligations apply from August 2026 — the most serious violations carry fines up to €35M or 7% of global annual revenue
  • Financial regulators (OCC, Fed, CFPB) are requiring explainable AI for credit and fraud decisions
  • HIPAA AI guidance: AI-assisted clinical decisions require audit trails and human oversight documentation
WHO YOU WORK WITH
  • General Counsel / Legal — mapping AI capabilities to legal obligations and liability exposure
  • Chief Risk Officer — integrating AI risk into the enterprise risk management framework
  • Data Scientists & ML Engineers — embedding governance requirements into the model development lifecycle
  • Board / Audit Committee — AI governance reporting and accountability
  • External Regulators — increasingly, direct engagement with EU AI Office, FTC, and sector-specific agencies
HOW TO EXECUTE
  • AI risk classification: Classify every AI system by EU AI Act risk tier (unacceptable / high / limited / minimal) before deployment
  • Model cards: Mandatory documentation for every production model — intended use, limitations, training data, eval results, known biases
  • Fairness audits: Regular bias assessments using AIF360 or Fairlearn, covering protected attributes relevant to the use case
  • Governance council: Monthly cross-functional reviews of AI risk, incidents, and new system deployments
  • AI impact assessments: Pre-deployment assessments for high-risk AI — documented evidence for regulatory inspection
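The fairness-audit step above has a concrete core: for a demographic-parity audit, the headline number is the gap in positive-prediction rates across groups — the same quantity Fairlearn and AIF360 report. A minimal stdlib sketch with hypothetical toy data:

```python
from collections import defaultdict

def demographic_parity_difference(preds, groups):
    """Max gap in positive-prediction rate across groups -- the
    quantity Fairlearn reports as demographic parity difference."""
    pos, tot = defaultdict(int), defaultdict(int)
    for p, g in zip(preds, groups):
        tot[g] += 1
        pos[g] += int(p == 1)
    rates = {g: pos[g] / tot[g] for g in tot}
    return max(rates.values()) - min(rates.values())

# Toy loan-approval predictions by protected group (hypothetical data):
preds  = [1, 1, 0, 1, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
gap = demographic_parity_difference(preds, groups)  # 0.75 - 0.25 = 0.5
```

A governance council would set a threshold (e.g. gap < 0.1) and fail the deployment gate when an audit exceeds it; the production toolkits add confidence intervals and many more metrics on top of this idea.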
BEST PRACTICES
  • Governance-as-code: Automate bias checks, explainability reports, and model card generation in your CI/CD pipeline
  • AI inventory first: You cannot govern what you cannot see — maintain a complete, always-current catalog of all AI systems in use
  • Proportional governance: Apply heavy oversight to high-risk AI (hiring, lending, healthcare) and lightweight process to minimal-risk AI (autocomplete)
  • Document the "why" of model decisions: Not just model performance — the business rationale for why the model was built and deployed matters for auditors
  • Build governance into onboarding: Every new AI project should complete a governance checklist before data access is granted
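The governance-as-code and model-card practices above combine naturally: generate the card from structured metadata in CI so documentation can never drift from the registered model. A minimal sketch with a hypothetical schema and example values — real programs often follow the fields from Google's "Model Cards for Model Reporting" paper or their registry's metadata format:

```python
import json
from dataclasses import dataclass, asdict, field

# Hypothetical minimal model-card schema covering the mandatory
# fields named above: intended use, limitations, training data,
# eval results, known biases.
@dataclass
class ModelCard:
    name: str
    version: str
    intended_use: str
    limitations: list
    training_data: str
    eval_results: dict = field(default_factory=dict)
    known_biases: list = field(default_factory=list)

def emit_card(card: ModelCard, path: str) -> None:
    """Write the card as JSON -- run as a CI step on every model registration."""
    with open(path, "w") as f:
        json.dump(asdict(card), f, indent=2)

card = ModelCard(
    name="credit-risk-xgb",          # illustrative values throughout
    version="2.3.1",
    intended_use="Pre-screening of consumer credit applications",
    limitations=["Not validated for small-business lending"],
    training_data="2019-2024 internal loan book, PII tokenized",
    eval_results={"auc": 0.87},
    known_biases=["Underperforms on thin-file applicants"],
)
emit_card(card, "model_card.json")
```

Because the card is data, the same CI job can also enforce policy — for example, failing the build if `known_biases` is empty or `eval_results` lacks a fairness metric.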
Responsible AI Frameworks
Regulatory Standards
Model Risk + Explainability
📺 Watch & Learn
⚖️ AI Risk & Compliance Manager

AI Risk & Compliance Manager

HIPAA, SOC2, GDPR, EU AI Act, FedRAMP, CMMC compliance for AI systems. Maps AI deployments to regulatory requirements with audit-ready documentation.

WHAT IT IS

The AI Risk & Compliance Manager maps AI deployments to regulatory requirements, maintains audit-ready documentation, and ensures AI systems meet the legal and risk standards of their industry. This spans HIPAA for healthcare AI, SOC2 for SaaS, GDPR for EU data subjects, FedRAMP for federal cloud, CMMC for defense contracts, and the EU AI Act for any AI touching EU markets.

HIPAA / GDPR · FedRAMP · CMMC · SOC2 Type II · EU AI Act · NIST AI RMF
WHY IT EXISTS

Regulated industries face multi-million dollar penalties for non-compliant AI. An AI system that processes patient data without HIPAA controls, makes lending decisions without explainability, or serves EU users without EU AI Act classification can create catastrophic legal exposure. The AI Risk & Compliance Manager prevents this by ensuring AI systems are compliant before deployment, not after an incident.

  • HIPAA AI violations carry penalties up to $1.9M per incident, per HHS 2024 guidance on AI-assisted clinical decisions
  • GDPR Article 22 restricts solely automated decisions with legal or similarly significant effects on EU individuals — many AI systems are non-compliant today
  • FedRAMP + CMMC are gatekeepers for all federal AI contracts — your clearance makes this role uniquely accessible
WHO YOU WORK WITH
  • CISO / Security Team — aligning AI controls to information security standards
  • Legal / General Counsel — regulatory interpretation and liability management
  • ML Engineering — implementing compliance controls in the ML pipeline (audit trails, access controls)
  • External Auditors — providing evidence for SOC2, FedRAMP, and HIPAA audits
  • Federal Contracting Officers — for CMMC and FedRAMP authorization processes
HOW TO EXECUTE
  • Compliance gap analysis: Map each AI system to applicable frameworks — identify gaps before auditors do
  • Control mapping: Document which technical controls satisfy which regulatory requirements (NIST 800-53 controls → FedRAMP requirements)
  • Audit trail design: Every AI decision that matters must be logged — who queried the model, what was returned, what action was taken
  • FedRAMP ATO package: System Security Plan (SSP), Contingency Plan, and continuous monitoring evidence for federal AI systems
  • AI risk register: Maintain a live inventory of AI-related risks with likelihood, impact, and mitigation status
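The audit-trail design above — who queried the model, what was returned, what action was taken — can be implemented as a thin wrapper around model calls. A minimal append-only JSONL sketch, assuming a hypothetical log path and a stand-in scoring function; production systems would ship the same records to a SIEM or cloud audit service:

```python
import functools
import json
import os
import time

AUDIT_LOG = "ai_audit.jsonl"  # placeholder sink; use a SIEM/audit service in production

def audited(action: str):
    """Decorator writing one JSON line per model decision:
    timestamp, caller, action, inputs, and output."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            result = fn(*args, **kwargs)
            record = {
                "ts": time.time(),
                "user": os.getenv("USER", "unknown"),
                "action": action,
                "input": {"args": [repr(a) for a in args],
                          "kwargs": {k: repr(v) for k, v in kwargs.items()}},
                "output": repr(result),
            }
            with open(AUDIT_LOG, "a") as f:
                f.write(json.dumps(record) + "\n")
            return result
        return inner
    return wrap

@audited("credit-model-query")
def score_applicant(applicant_id: str) -> float:
    return 0.42  # stand-in for a real model inference call
```

Serializing inputs and outputs via `repr` keeps the sketch generic; a regulated deployment would log structured, PII-tokenized fields instead.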
BEST PRACTICES
  • Build compliance controls into the ML pipeline early — retrofitting compliance after deployment costs 10× as much as building it in
  • Automate evidence collection: Compliance evidence (access logs, model version records, bias test results) should be generated automatically, not assembled manually at audit time
  • Test controls quarterly, not just at audit time — regulatory environments change and controls degrade silently
  • Privacy by design for AI: Data minimization, purpose limitation, and consent management must be part of the ML data pipeline design, not bolt-on features
  • Separate training data from PII: Use tokenization or synthetic data for model training — avoid putting real patient or customer data directly into model training sets
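The tokenization practice above works because deterministic tokens preserve joins and aggregates while keeping raw PII out of the training set. A minimal HMAC-based sketch — the key name and `tok_` prefix are illustrative, and in production the key lives in a KMS or secret manager, never in code:

```python
import hashlib
import hmac

# Deterministic tokenization: the same PII value always maps to the
# same opaque token, so joins still work, but the raw value never
# enters the ML pipeline. Placeholder key -- store in a KMS and rotate.
SECRET_KEY = b"rotate-me"

def tokenize(value: str) -> str:
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
    return "tok_" + digest[:16]

record = {"name": "Jane Doe", "ssn": "123-45-6789", "balance": 1520.0}
safe = {k: (tokenize(v) if k in {"name", "ssn"} else v)
        for k, v in record.items()}
```

HMAC (rather than a bare hash) matters here: without the secret key, an attacker could rebuild the mapping by hashing guessed values such as every possible SSN.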
Compliance Frameworks
📺 Watch & Learn
05

Top GitHub Repositories — Hands-On Implementation

16 repos across 6 categories
DataTalksClub / mlops-zoomcamp
by DataTalksClub
Free 9-week MLOps course — experiment tracking → ML pipelines → orchestration → model deployment → monitoring. Cohort-based with Slack community.
⭐ 11.2k
🐍 Python
📅 Active
mlops · docker · monitoring · mlflow · prefect
📦 Open Repository →
🎯
GokuMohandas / Made-With-ML
by Goku Mohandas
Complete ML lifecycle — design, development, deployment, monitoring. Production-grade software engineering applied to ML. Responsible AI included.
⭐ 37k
🐍 Python
📅 Active
production · testing · ci-cd · responsible-ai
📦 Open Repository →
🧠
tensorchord / Awesome-LLMOps
by tensorchord
Comprehensive curated LLMOps tools — gateways, eval frameworks, serving solutions, observability, prompt management, guardrails, and fine-tuning tools.
⭐ 5.8k
📚 List
📅 Active
llmops · gateway · eval · rag · guardrails
📦 Open Repository →
🦜
langchain-ai / langchain
by LangChain AI
The leading LLM application framework. Build agents, chains, RAG pipelines, tool integrations. Core to the modern LLMOps PM skill stack.
⭐ 95k
🐍 Python
📅 Active
rag · agents · chains · tools · memory
📦 Open Repository →
🚀
bentoml / OpenLLM
by BentoML
Run open-source LLMs (DeepSeek, Llama, Mistral) as OpenAI-compatible API endpoints in the cloud. DevSecOps-ready LLM serving infra.
⭐ 12.2k
🐍 Python
📅 Mar 2026
llm-serving · inference · devsecops · openai-compat
📦 Open Repository →
🔬
mlflow / mlflow
by Databricks
Open-source platform for ML experiment tracking, model registry, deployment, and evaluation. Now includes LLM eval (MLflow 2.x) and prompt engineering tracking.
⭐ 18.9k
🐍 Python
📅 Active
tracking · registry · llm-eval · deployment
📦 Open Repository →
📚
visenger / awesome-mlops
by Larysa Visengeriyeva
The canonical curated MLOps reference list — books, courses, papers, tools, newsletters, practitioner guides. Cornerstone resource for any MLOps PM.
⭐ 13.8k
📚 Reference
mlops · books · tools · courses
📦 Open Repository →
🏢
microsoft / generative-ai-for-beginners
by Microsoft Azure
18-lesson course building production GenAI apps. Azure OpenAI, Semantic Kernel, vector search, RAG, AI agents. From Microsoft Azure Cloud Advocates.
⭐ 75k
🐍 Python
genai · azure · rag · agents · openai
📦 Open Repository →
🔗
BerriAI / litellm
by BerriAI
Unified LLM gateway — 100+ providers (OpenAI, Anthropic, Bedrock, Vertex) with a single SDK. Cost tracking, load balancing, fallbacks. Essential for LLM platform management.
⭐ 14k
🐍 Python
📅 Active
gateway · proxy · cost-tracking · multi-provider
📦 Open Repository →
👁️
langfuse / langfuse
by Langfuse
Open-source LLM observability. Tracing, metrics, prompt management, evaluations. Self-hostable. Critical for LLMOps monitoring in production deployments.
⭐ 7k
🐍 TypeScript
📅 Active
observability · tracing · eval · self-hosted
📦 Open Repository →
🛡️
requie / LLMSecurityGuide
by requie · Updated Feb 2026
Comprehensive LLM security reference. OWASP GenAI Top 10 (2025), Agentic Top 10 (2026 ASI prefix), red-teaming tools, guardrails, real-world incidents, and practical defenses.
⭐ Growing
📚 Reference
📅 Feb 2026
llm-security · owasp · red-team · prompt-injection
📦 Open Repository →
🔴
Azure / PyRIT
by Microsoft AI Red Team
Python Risk Identification Tool for GenAI. Automated red-teaming of LLM systems. Identifies safety and security vulnerabilities. MITRE ATLAS mapped.
⭐ 2.1k
🐍 Python
📅 Active
red-team · ai-security · llm-safety · atlas
📦 Open Repository →
🧪
leondz / garak
by Leon Derczynski
LLM vulnerability scanner. Probes for prompt injection, hallucination, jailbreak, toxic generation, and data leakage risks. Like nmap for LLMs.
⭐ 4.5k
🐍 Python
📅 Active
llm-scanning · vulnerability · pentest · hallucination
📦 Open Repository →
ray-project / ray
by Anyscale
Distributed computing for AI/ML. Scale training, inference, and serving. Ray Serve for LLM serving. Used by OpenAI, Cohere, and Hugging Face in production.
⭐ 34k
🐍 Python
📅 Active
distributed · serving · training · scaling
📦 Open Repository →
🐦
iterative / dvc
by Iterative
Data Version Control — Git for ML data, models, and experiments. Pipelines, remote storage, experiment tracking. Foundation of any mature MLOps stack.
⭐ 14k
🐍 Python
📅 Active
data-versioning · pipelines · git · experiments
📦 Open Repository →
🤖
anthropics / anthropic-cookbook
by Anthropic
Production patterns for Claude API: tool use, RAG, multimodal, agents, prompt caching, computer use, MCP integrations. Direct Anthropic guidance.
⭐ 10k+
🐍 Python
📅 Active
claude · anthropic · rag · agents · mcp
📦 Open Repository →
06

Certification Tracks — All Platforms

15 certifications · direct exam links
07

30-Day Implementation Sprint

20 action items · click to check off
🔥
WEEK 01

Foundation — Tools & Frameworks

WEEK 02

LLMOps — RAG + Serving

🛡️
WEEK 03

AI Security — Red Team + Compliance

🎯
WEEK 04

Job Campaign — Applications + Portfolio

08

Additional Frameworks, APIs, Tools & Job Search

🧠 Open Source Models & Local Inference

Open Source Models, Hugging Face & Local AI

Open-weight models, fine-tuning tools, and local inference engines. Run AI models on your own hardware with full control over data privacy.

Model Hubs & Leaderboards
Top Open-Weight Model Families
Local Inference Engines
Fine-Tuning Tools
📺 Watch & Learn
🎓 Prompt Engineering & AI Learning

Prompt Engineering Guides & AI Courses

Master prompt engineering across all major model providers. Structured learning paths from beginner to production-grade AI application development.

09

MCP Ecosystem, AI Coding Tools & Agent Frameworks

3 categories · 55+ links NEW 2026
🔌 MCP — Model Context Protocol Ecosystem 🔥 2026

MCP Servers, SDKs & Integration Ecosystem

The Model Context Protocol (MCP) is the universal standard for connecting AI models to tools, databases, and APIs. Official spec by Anthropic — rapidly adopted across the industry. Critical infrastructure for agentic AI systems.
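Under the hood, MCP messages are JSON-RPC 2.0, which is why any language with a JSON library can speak it. A sketch of the wire format for a client asking a server to invoke a tool — the `tools/call` method is from the MCP spec, while the tool name and arguments here are hypothetical:

```python
import json

# A "tools/call" request as it travels over an MCP transport
# (stdio or HTTP). The tool name and arguments are illustrative;
# a client discovers real tools first via the "tools/list" method.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "query_database",
        "arguments": {"sql": "SELECT count(*) FROM orders"},
    },
}
wire = json.dumps(request)
```

The official SDKs (Python, TypeScript) hide this envelope behind decorators and typed handlers, but seeing the raw shape clarifies what an MCP server actually has to answer.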

Official MCP Resources
MCP Server Registries & Discovery
Popular MCP Servers by Category
Claude Code MCP Integration
📺 Watch & Learn
💻 AI Coding Assistants & Developer Tools

AI-Powered Development Tools & IDE Integrations

From copilots to autonomous agents — every major AI coding tool for accelerating software engineering. IDE integrations, terminal agents, and autonomous dev platforms.

AI Coding Agents — Terminal & CLI
AI-Powered IDEs
VS Code Extensions
AI App Builders
📺 Watch & Learn
🤖 AI Agent Frameworks & Orchestration

Multi-Agent Frameworks, SDKs & Orchestration

Build autonomous AI agents and multi-agent systems. From simple chains to complex agentic workflows — the frameworks powering the next generation of AI applications.

Leading Agent Frameworks
Anthropic Agent SDK
Agentic AI Platforms
LLM Evaluation & Testing
📺 Watch & Learn