It was a humid October morning in Bengaluru when I first stumbled across Microsoft’s new open-source benchmarking tool—ExCyTIn-Bench. I had just wrapped up a messy Sentinel alert triage session and was knee-deep in trying to validate a Copilot workflow when this popped up in my feed. Not gonna lie, I almost skipped it. Another benchmark? Great. But this one’s different.

Why I dove into ExCyTIn-Bench
I’ve been burned by AI benchmarks before. Most of them feel like glorified trivia quizzes—static datasets, multiple-choice questions, and zero context for how real SOCs operate. So when I read that ExCyTIn-Bench simulates multistage cyberattack scenarios using live queries across 57 log tables in Microsoft Sentinel, I perked up.
Microsoft built this to reflect the chaos we deal with daily: noisy data, alert fatigue, and the need to stitch together evidence across Defender, Sentinel, and now Security Copilot. That’s my world. And if this tool can help me validate AI workflows before they hit production, I’m all in.
My setup and first run (Dev environment only!)
Before anyone asks—no, this isn’t GA-ready for full production use. I tested it in a dev tenant with Sentinel and Defender integrations, running on a modest Hyper-V lab setup (ThinkPad X1 Extreme Gen 4, 32GB RAM, nested virtualization, Azure Arc connected). I pulled the repo from GitHub, spun up the SOC simulation, and started feeding it test alerts.
The install wasn’t plug-and-play. The documentation assumes you’re familiar with incident graphs and bipartite alert-entity mapping. I had to wing it a bit—started with Sentinel’s workbook templates, then jumped into KQL to align with the benchmark’s query structure.
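If you're poking at the same thing, here's roughly the starting query I used to get my head around the incident-to-alert structure the benchmark reasons over. It's a minimal sketch against the standard SecurityIncident and SecurityAlert tables; the seven-day window and the projected columns are my own choices, not anything ExCyTIn-Bench prescribes.
```kusto
// Minimal sketch: rebuild the incident -> alert structure from standard
// Sentinel tables. Time window and projected columns are my own choices.
SecurityIncident
| where TimeGenerated > ago(7d)
| summarize arg_max(TimeGenerated, *) by IncidentNumber   // keep only the latest state of each incident
| mv-expand AlertId = AlertIds                            // one row per linked alert
| extend AlertId = tostring(AlertId)
| join kind=inner (
    SecurityAlert
    | project SystemAlertId, AlertName, AlertSeverity
  ) on $left.AlertId == $right.SystemAlertId
| project IncidentNumber, Title, AlertName, AlertSeverity
```
The arg_max matters because SecurityIncident writes a new row on every incident update; without it you end up joining against stale copies of the same incident.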
What caught me off guard
- The feedback granularity: Instead of a binary pass/fail, ExCyTIn-Bench gives stepwise feedback on each investigative action. That’s gold. I could see where my AI agent hesitated, skipped a lookup, or misinterpreted an entity relationship.
- Alert-entity graphs: These are built by human analysts and feel eerily close to how I map incidents manually. It's not just "did the AI get the answer?"—it's "did it ask the right questions?" There's a rough KQL sketch of that alert-to-entity mapping right after this list.
- Operational cost tracking: I didn’t expect this. The tool logs how much compute and API usage each AI decision incurs. That’s a game-changer for budgeting Copilot workflows.
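For the curious, here's roughly what that alert-entity mapping looks like when you pull it straight from the telemetry. This is my own KQL sketch, not the benchmark's graph builder: it explodes the Entities payload on each alert into alert-to-entity pairs, and the property names (Type, Name, HostName, Address) are just the common Sentinel entity fields, so your connectors may emit slightly different shapes.
```kusto
// Rough sketch: turn each alert's Entities payload into alert -> entity pairs,
// i.e. the raw edges of a bipartite alert-entity graph.
SecurityAlert
| where TimeGenerated > ago(7d)
| project SystemAlertId, AlertName, Entities
| mv-expand Entity = todynamic(Entities)   // one row per entity attached to the alert
| extend EntityType = tostring(Entity.Type)
| extend EntityName = coalesce(tostring(Entity.Name), tostring(Entity.HostName), tostring(Entity.Address))
| where isnotempty(EntityName)
| distinct SystemAlertId, AlertName, EntityType, EntityName
```
Those pairs are essentially the edge list of a bipartite graph, alerts on one side and entities on the other, which is the shape the bipartite alert-entity mapping in the docs is describing.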
Bugs, quirks, and lessons learned
- Latency spikes: When querying across all 57 tables, I hit some nasty delays. Turns out, my Sentinel workspace wasn’t optimized for cross-table joins. Lesson: pre-index your high-volume tables before running full simulations.
- Entity resolution: My AI agent kept confusing device IDs with user aliases. I had to tweak the enrichment logic manually. Most guides say the default mapping works, but I found custom entity tagging far more reliable (there's a stripped-down sketch of what I mean after this list).
- Copilot integration: It’s early days, but the benchmark does support Security Copilot. I had to manually map the investigative goals to Copilot prompts—no native UI yet.
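Since "custom entity tagging" is doing a lot of work in that second bullet, here's a stripped-down version of the idea. It assumes the Defender device tables (DeviceInfo) are flowing into the workspace, and the entity field names reflect what I saw in my tenant's alerts; treat it as a sketch of the approach, not the enrichment logic the benchmark ships with.
```kusto
// Resolve raw device IDs to canonical device names via DeviceInfo, then tag
// the result explicitly so the agent can't mistake a device GUID for a user alias.
let DeviceMap = DeviceInfo
    | summarize arg_max(TimeGenerated, DeviceName) by DeviceId
    | project DeviceId, DeviceName;
SecurityAlert
| where TimeGenerated > ago(7d)
| mv-expand Entity = todynamic(Entities)
| extend EntityType = tostring(Entity.Type)
// MdatpDeviceId is where the Defender device GUID showed up on host entities in my alerts.
| extend RawValue = coalesce(tostring(Entity.MdatpDeviceId), tostring(Entity.Name), tostring(Entity.HostName), tostring(Entity.Address))
| lookup kind=leftouter DeviceMap on $left.RawValue == $right.DeviceId
| extend ResolvedName = coalesce(DeviceName, RawValue)
| extend EntityTag = iff(isnotempty(DeviceName), "device", EntityType)
| project SystemAlertId, AlertName, EntityTag, ResolvedName
```
Once device GUIDs resolve to real hostnames and carry an explicit tag, the agent has a much harder time conflating them with user aliases.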
Final thoughts
Microsoft’s ExCyTIn-Bench isn’t just another benchmark—it’s a sandbox for stress-testing AI in real SOC conditions. If you’re a CISO, SecOps lead, or just a curious admin like me, this tool gives you a lens into how your AI agents think, act, and sometimes fumble.
It’s open-source, actively maintained, and already being used internally by Microsoft to validate their own models. Future updates promise tenant-specific threat scenarios, which I’m itching to try once they drop.
What about you?
Ever tried simulating a full incident response flow with AI? Did it feel like magic or madness? I’d love to hear how others are using ExCyTIn-Bench—or if you’ve built your own benchmarks. Drop your stories, setups, or even rants below. Let’s make AI in cybersecurity less of a black box and more of a shared craft.