Founding AI Engineer
Own Matterhaul's production AI stack end-to-end — agents, retrieval, evals, safety, cost, and latency for quoting, ordering, procurement, and dispatch workflows. Founding AI seat for a senior/staff engineer who has operated real LLM systems in production.
About Matterhaul
Matterhaul is building the AI-native operating system for the physical goods supply chain.
Distributors and manufacturers run on disconnected systems, manual re-entry, and tribal knowledge that never gets captured. The software that was supposed to fix that never did. We're changing that.
Matterhaul sits above the systems these businesses already run — unifying their data, capturing the operational context legacy software misses, and deploying AI agents across quoting, order entry, procurement, dispatch, and customer updates. No rip and replace. Teams go live fast, and Matterhaul expands until it becomes the system the business runs on.
That's the wedge. The vision is bigger: a purpose-built, AI-native platform that doesn't just automate what ERPs do today — it does what they were never capable of.
We're a small team with deep roots in this space. Our founders grew up in the trades and spent careers building products for the physical world at Stripe, Verkada, and Cisco Meraki. We're based in San Francisco's SOMA/Transbay neighborhood, in-office four days a week, and we spend real time with the distributors and operators we build for.
We move fast, ship often, and build for the people who actually do the work.
Why now
Three things are true at once and rarely line up:
- AI has stopped being an experiment. A year ago this was a pilot conversation. Today, executives are being asked why they aren't deploying AI in operations.
- Legacy ERPs are no longer defensible. The customers know it. Their leadership teams are openly looking for a way out.
- The supply-chain trauma of the last five years is fresh. The companies that move physical goods lived through it in a way software companies didn't. They want visibility and automation, and they aren't going back.
The window is open. Windows like this don't stay open.
Why this role exists
The AI is the product. Quoting, order intake, procurement triage, dispatch — every workflow we ship leans on a model doing real work, against messy distributor data, with real money on the line. Today we run multi-LLM (OpenAI, Anthropic, Google) via the Vercel AI SDK and LangChain, embeddings via Voyage, pgvector for retrieval, a fact store with a custom ontology, and an Effect.ts workflow cluster orchestrating long-running agent runs. It works. It is also early.
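The multi-LLM setup above implies an explicit routing layer rather than one hardcoded model. A minimal sketch of what that can look like — the step names, model IDs, and latency budgets here are illustrative, not Matterhaul's actual configuration:

```typescript
// Hypothetical routing table: pick a model per workflow step, trading off
// quality, latency, and cost explicitly. All model IDs are examples only.
type Step = "quote_draft" | "order_extract" | "dispatch_update";

interface Route {
  provider: "anthropic" | "openai" | "google";
  model: string;
  maxLatencyMs: number; // latency budget the step must meet
}

const ROUTES: Record<Step, Route> = {
  // High-stakes drafting: strongest model, generous latency budget
  quote_draft: { provider: "anthropic", model: "opus-example", maxLatencyMs: 20_000 },
  // Structured extraction: a cheap, fast model is usually enough
  order_extract: { provider: "anthropic", model: "haiku-example", maxLatencyMs: 3_000 },
  // Templated status updates: smallest viable model
  dispatch_update: { provider: "google", model: "flash-example", maxLatencyMs: 2_000 },
};

function routeFor(step: Step): Route {
  return ROUTES[step];
}
```

The point of the table is that the latency/quality/cost frontier is written down and reviewable, not scattered across call sites.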
We need a senior/staff AI engineer who has already operated AI systems in production for 18+ months — through the eval debt, the silent regressions, the prompt drift, the 3am "why is the model suddenly hallucinating SKUs" pages — and can bring that scar tissue here. You'll own the agent and retrieval stack end-to-end: architecture, evals, safety, cost, latency, and the path from "demo magic" to "system of record."
This is a high-leverage seat. The agents you build are the company's product surface.
What you'll own
Year one, concretely:
- Agent architecture — Tool-using agents for quoting, order intake, procurement, and dispatch. Plan/execute loops with checkpointing, deterministic replays, human-in-the-loop gates, and graceful degradation when a step fails. Decide where models belong, where deterministic code belongs, and where the seam goes.
- Retrieval & memory — We have pgvector, a fact store with an ontology layer, and per-organization knowledge. You'll evolve retrieval (hybrid search, reranking, query rewriting), make the call on dedicated vector store vs. pgvector at scale, and design the long-term memory model an agent needs to act inside a customer's business.
- Evals & quality — Build the evaluation harness this company runs on. Offline evals, online evals, regression suites tied to CI, golden sets per workflow, LLM-as-judge where it earns its keep, human review where it doesn't. Define what "good" means per agent and make regressions impossible to merge silently.
- Observability for AI — Token-level tracing, prompt/response capture, cost per workflow, latency budgets per step, drift detection. Sentry + OpenTelemetry are wired for app code; the AI side needs the same rigor.
- Model routing & cost — Pick the right model for the right step. Route between Opus/Sonnet/Haiku, GPT-class, Gemini-class, and open-weights where appropriate. Negotiate the latency/quality/cost frontier explicitly, not by accident.
- Safety & guardrails — Prompt-injection defense (we ingest customer documents, emails, and call transcripts), tool-use authorization (every tool call respects OpenFGA), PII handling, jailbreak resistance, and refusal behavior tuned for a B2B context where "I can't help with that" is itself a product failure.
- Voice & document pipelines — ElevenLabs for outbound calls, Plaud for inbound transcripts, multi-modal extraction from supplier PDFs and scanned quotes. You'll own the model side of these pipelines, not just the plumbing.
- Research → production loop — Triage what's worth trying from the frontier (new models, new techniques, fine-tuning, distillation, structured decoding) and ship the ones that move a metric. Kill the ones that don't, fast.
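"Make regressions impossible to merge silently" means a hard CI gate on eval scores, not a dashboard someone might check. A minimal sketch of that gate — the result schema and tolerance are illustrative assumptions, not our actual harness:

```typescript
// Hypothetical CI gate: fail the merge if any workflow's eval score
// regresses beyond a tolerance against the stored golden baseline.
interface EvalResult {
  workflow: string; // e.g. "quoting", "order_intake"
  score: number;    // 0..1, fraction of golden-set cases passed
}

const TOLERANCE = 0.01; // allowed dip before we call it a regression

function findRegressions(
  baseline: EvalResult[],
  candidate: EvalResult[],
): string[] {
  const base = new Map(baseline.map((r) => [r.workflow, r.score]));
  return candidate
    .filter((r) => {
      const prev = base.get(r.workflow);
      return prev !== undefined && r.score < prev - TOLERANCE;
    })
    .map((r) => r.workflow);
}

// In CI: throw (exit non-zero) so the merge is blocked, never silent.
function gate(baseline: EvalResult[], candidate: EvalResult[]): void {
  const bad = findRegressions(baseline, candidate);
  if (bad.length > 0) {
    throw new Error(`Eval regression in: ${bad.join(", ")}`);
  }
}
```

The interesting engineering is upstream of this gate — golden-set curation, judge calibration, flake handling — but the gate itself is what keeps regressions out of main.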
You will write code. You will also make the architectural calls and document the "why" so the team can keep building after each decision.
What we're looking for
Must have:
- 18+ months running AI systems in production — not prototypes, not internal demos. Real users, real failure modes, real on-call. You have stories about regressions you caught (and ones you didn't).
- 5+ years of software engineering overall. The AI work sits on top of a real engineering foundation; it doesn't replace one.
- Deep familiarity with the modern LLM stack — at least two of the Anthropic, OpenAI, and Google APIs, plus tool use, structured outputs, streaming, prompt caching, and the cost/latency tradeoffs of each.
- Evals as a first-class discipline. You've built an eval harness from scratch. You can argue about LLM-as-judge calibration, golden-set rot, and the difference between "the eval went up" and "the product got better."
- Retrieval at production scale — chunking strategy, embedding model selection, hybrid search, reranking, query rewriting. You know why naive RAG breaks and how to fix it.
- Agent architecture experience — multi-step tool-using agents, not single-turn prompt apps. Plan/execute, ReAct, or your own variant; you've debugged the loops at step 14 of 30 when something went sideways.
- TypeScript or Python at depth. Our AI surface is largely TypeScript (Vercel AI SDK, LangChain, Effect.ts); Python is welcome where it pays off (eval tooling, model experimentation, fine-tuning).
- Clear writing. Prompts are writing. Specs are writing. Postmortems are writing.
Nice to have:
- Fine-tuning, distillation, or post-training experience (LoRA, RLHF/DPO, SFT on domain data). We'll need this eventually; we don't need it day one.
- Voice / telephony AI (ElevenLabs, Twilio, Deepgram, Whisper) in production.
- Structured information extraction from messy documents (POs, invoices, supplier spec sheets).
- Effect.ts familiarity — we're heavy users in the workflow cluster.
- Knowledge-graph / ontology / semantic-fact-store experience. We have one and it matters.
- ReBAC / authorization-aware tool use (we use OpenFGA — every agent action is authorized).
- Published work, OSS contributions, or prior AI systems you can point at.
How we work
- Small founding team — under ten people — building the system distributor sales, procurement, and dispatch teams will run on. You will be the most senior AI voice in the room.
- San Francisco — 4 days a week in the office. Agent design goes faster at a whiteboard with the team in the room, and the tightest prompt iteration loops happen in person.
- We ship to production frequently and trust each other to do it.
- We write specs (/specs) and architecture docs (AGENTS.md per directory) before big changes. We expect the same of you.
Compensation
- Base: $200,000 – $260,000, depending on level and experience.
- Equity: 0.5% – 1.5%. This is a founding-engineer grant; the range reflects the spread between senior and staff/principal.
- Health / dental / vision; 401(k); commuter benefits.
- Hardware budget, AI coding / API budget, and a real office in SF.
Apply
Email hiring@matterhaul.com with a short note on an AI system you've shipped to production and one thing about it you'd build differently today — a prompt, an eval, a retrieval choice, an agent loop. Resume optional, story not.