Position Overview
We are looking for a Senior Agentic AI Developer to join our Agentic Infrastructure Observability Team, a greenfield engineering team in Raleigh, NC, building a platform that investigates large datacenter and telecom infrastructure problems the way a senior network engineer would. This role is ideal for a hands-on Python developer who has built real LLM-powered systems and wants to work on something that runs against live network and server telemetry, makes decisions with a human in the loop, and keeps a permanent audit trail of every action it takes.
This is a hybrid role located in Raleigh, NC.
The ideal candidate has solid Python and async engineering skills, practical experience with LLM orchestration frameworks, and enough curiosity about network operations to care why a BGP session flapping every 20 minutes matters differently than one that dropped once. You will write agent programs, build MCP tool integrations, wire up telemetry pipelines, apply ML to time-series metrics, implement notification-based human-in-the-loop workflows, and build the evaluation harnesses that prove the platform stays trustworthy as it grows.
What We're Building
A platform that autonomously investigates infrastructure incidents for large datacenters — tracing faults through a live topology graph, identifying every downstream service and customer affected, and surfacing a clear explanation with a proposed fix waiting for operator approval. The agent reasons over real telemetry: Prometheus metrics, Loki logs, gNMI streams from SONiC fabric switches, and Redfish events from Dell iDRAC servers. All of its data access happens through a typed MCP tool interface — so every conclusion is traceable, every action is logged, and nothing changes on the network without a human saying go.
This is a new platform. A separate frontend, built with the main GUI engineering team at BE Networks headquarters in Dallas, TX, consumes the platform's APIs and surfaces agent findings, investigation status, and operator approval workflows through the BE Networks GUI product. Your work ends at the API boundary; what happens on the other side of it is someone else's responsibility.
The same MCP tool endpoints and REST APIs that support the BE Networks GUI are published as an integration surface for third-party systems. Partner network management platforms, independent NOC dashboards, and ecosystem tooling can connect directly to the platform's MCP server and API endpoints to trigger investigations, retrieve findings, consume telemetry, and push approvals — without going through the BE Networks GUI. If you have experience designing APIs for external consumers and partners — versioned contracts, well-documented schemas, stable authentication patterns — that experience applies directly here.
The infrastructure topology lives in Neo4j. The investigation pipeline is built in DSPy. The whole stack runs on Docker and Kubernetes, and the team ships on the BE Networks SONiC-based platform, backed by a Dell Technologies OEM partnership.
Responsibilities
- Design and maintain the REST/async API surface that the BE Networks Dallas GUI team and third-party partner integrations consume — versioned endpoints, stable contracts, documented schemas, and authentication patterns that work for both internal and external callers
- Collaborate with the Dallas GUI engineering team on API contract definition, investigation result payloads, audit ledger access, and approval workflow endpoints — you own the backend contract; they own how it renders
- Build and optimize DSPy agent programs that drive the investigation pipeline end-to-end: interpret alert → gather telemetry and graph context → hypothesize root cause → summarize evidence → propose remediation
- Design and implement the MCP tool server that exposes the Neo4j infrastructure graph to the agent — service lookup, resource and dependency retrieval, notes, events, impact discovery, and result publication
- Implement the two-phase propose→approve→apply governance flow: approver identity enforcement, opaque rollback tokens, and an append-only audit ledger that treats every proposal, application, and rollback as an immutable record
- Write PromQL that goes beyond dashboards — recording rules for ML feature pipelines, alerting rules that feed the investigation queue, and range-vector expressions used directly as agent tool inputs
- Apply ML to network and server telemetry: anomaly detection, trend forecasting, capacity prediction, and grey/brown failure detection with temporal reasoning so resolved incidents don't get flagged as active root causes
- Build the RAG layer over operational logs and vendor documentation — chunking, embedding, retrieval, and grounding so agents can cite the runbook they pulled during an investigation
- Build telemetry ingestion pipelines from SONiC fabric devices (gNMI subscriptions) and Dell iDRAC servers (Redfish SSE/polling) into Prometheus, Loki, Grafana, and Neo4j
- Build evaluation harnesses and behavioral regression tests that measure hallucination rate, protocol compliance, and tool misuse — and run them in CI so a reliability regression blocks the merge
- Containerize and deploy services with Docker and Kubernetes, partnering with SRE on CI/CD, health probes, and scaling
- Mentor junior engineers through code reviews, and help establish engineering best practices around testing, observability, and AI governance
Requirements
- 3+ years of Python development experience, including async patterns (asyncio, httpx, FastAPI) and Pydantic v2 for schema definition
- 1+ years designing and building LLM-powered or agentic systems that shipped and ran against real data — not tutorials or internal demos that never left a laptop
- Hands-on experience with DSPy, LangGraph, LangChain, or a comparable LLM orchestration framework, including tool/function calling and multi-step agent behavior
- Experience implementing RAG systems — document processing, embeddings, vector or hybrid retrieval, and grounding discipline
- PromQL and Prometheus: rate(), increase(), histogram_quantile(), multi-label aggregations; Grafana dashboard and alert configuration
- Working knowledge of ML applied to time-series data — anomaly detection, forecasting, or feature engineering on operational metrics
- Familiarity with Neo4j or another graph database and comfort writing multi-hop traversal queries in Cypher
- Experience deploying and operating services on Docker and Kubernetes
- Understanding of why auditability, reproducibility, and reversibility matter in operational systems — and a habit of building toward them from the start
Preferred Qualifications
- Experience designing APIs for external consumers — versioned REST contracts, OpenAPI/Swagger documentation, partner authentication patterns (OAuth2, API keys, scoped tokens), and stable schemas that third-party teams can build against reliably
- Familiarity with network operations: BGP session lifecycle, interface error counters (CRC, drops, runts), gNMI streaming telemetry, or SONiC/Cumulus/Arista EOS
- Experience with Redfish or iDRAC for server BMC telemetry
- Familiarity with Loki and LogQL for structured log ingestion and retrieval
- MCP (Model Context Protocol) integrations and agentic authentication — scoped tokens, identity management for agents calling external tools
- Exposure to NVIDIA NIM, NetQ, or Air for LLM inference or fabric telemetry
- Experience with LLM evaluation frameworks (dspy.Evaluate, LangSmith, or similar) and awareness of faithfulness and hallucination rate as measurable metrics
- Experience contributing to a new platform in its early stages — comfortable shipping under ambiguity and making pragmatic tradeoffs
Pay: $150,000.00 - $205,000.00 per year
Benefits:
- 401(k)
- Flexible schedule
- Flexible spending account
- Health insurance
- Health savings account
- Life insurance
- Paid time off
- Professional development assistance
- Vision insurance
Experience:
- Python: 3 years (Required)
- Designing LLM/agentic systems: 1 year (Required)
Work Location: Hybrid remote in Morrisville, NC 27513