Resources
This is a deliberately opinionated set of references I come back to when building systems and when designing AI-heavy platforms. If a link feels "boring," it is usually because it survived contact with production.
Architecture and systems references
- Designing Data-Intensive Applications - The baseline vocabulary for data systems: storage engines, replication, partitioning, stream processing, and the tradeoffs you will otherwise rediscover painfully.
- The Datacenter as a Computer - Warehouse-scale computing fundamentals; clarifies why latency, tail behavior, and utilization are always in tension.
- Google SRE Book - Reliability principles and mechanisms (SLOs, error budgets, on-call hygiene) described by people who operate what they build.
- Google SRE Workbook - The more actionable companion; especially useful when you need to translate reliability goals into engineering tasks.
- AWS Builders' Library - High-signal essays on operational architecture, resilience, and failure economics; vendor-hosted, but broadly applicable.
- Release It! (Michael Nygard) - Stability patterns that show up everywhere: timeouts, bulkheads, back-pressure, circuit breakers, and capacity as a first-class design constraint.
- The Architecture of Open Source Applications - Deep dives into real systems by the people who built them; a great antidote to architecture-by-diagram.
- ACM Queue - Practitioners writing for practitioners; excellent for calibrating your mental model of modern systems.
- USENIX (OSDI/NSDI/FAST) - Primary sources for systems research (operating systems, networked systems, storage); where to go when you want actual data instead of "best practice" folklore.
- Papers We Love - A well-curated on-ramp into classic systems papers; useful for building a personal reference spine.
- Architecture Decision Records (ADR) - A lightweight, durable way to make architectural intent legible over time (and to capture why you said "no" to the other options).
- Martin Fowler - Delivery and architecture essays; treat as lenses, not scripture.
AI engineering toolchain
- OpenAI Platform Docs - Core API patterns (tool use, structured outputs, streaming), safety guidance, and the constraints that matter when you move beyond demos.
- OpenAI Cookbook - Practical integration patterns; best used as a menu of implementation sketches to adapt, not copy.
- Anthropic Docs - A good comparative reference for tool use and message design; cross-checking vendors is a fast way to reveal hidden assumptions.
- LangGraph - Agent orchestration as explicit state machines; helpful when "a chain" becomes a workflow with retries, branches, and audits.
- LlamaIndex - Retrieval plumbing and connectors; useful when your system boundary is "knowledge," not just prompts.
- LiteLLM - Provider abstraction and gateway patterns; reduces lock-in and makes multi-model routing and cost control less bespoke.
- vLLM - High-throughput inference for open models; a practical default when you care about batching, latency, and GPU utilization.
- Ollama - Local model runner for fast iteration; great for prototyping agent behavior without burning cloud cycles.
- Transformers (Hugging Face) - The reference implementation ecosystem for open models; also a reality check on what "just run it" actually entails.
- pgvector - Vector search inside Postgres; a strong option when operational simplicity beats specialized infra.
- Qdrant - Vector database with clear operational docs; good when you need filtering, payloads, and predictable retrieval behavior.
- OpenTelemetry - If it matters, instrument it: unify traces/metrics/logs across orchestration, retrieval, and model calls.
- LangSmith - Tracing, dataset-driven evaluation, and regression testing for agent workflows.
- Phoenix (Arize) - Open-source observability/evals for LLM apps; useful when you want transparency and local control.
- Weights & Biases - Experiments, artifacts, and operational visibility; handy when you treat prompts/config as versioned assets.
- promptfoo - A pragmatic evaluation harness; makes prompt/model changes behave more like real software changes (diffs, baselines, regressions).
- Ragas - RAG evaluation metrics and pipelines; a starting point for measuring retrieval quality beyond vibes.
- OpenAI Evals - An eval harness you can read end-to-end; useful for designing your own regression suite and scorer patterns.
- FAISS - Core vector search library; worth knowing even if you never run it directly, because many stacks embed its assumptions.
- Introduction to Information Retrieval - Retrieval fundamentals (scoring, ranking, evaluation) that make RAG systems more measurable and less mystical.
- Retrieval-Augmented Generation (Lewis et al., 2020) - The canonical framing for RAG; useful for vocabulary, baselines, and conceptual boundaries.
- ReAct (Yao et al., 2022) - A clean description of reasoning + tool-use loops; the ancestor of many agent orchestration patterns.
- OWASP Top 10 for LLM Applications - Threat modeling and common failure modes; use it to design guardrails that survive adversarial input.
- NIST AI Risk Management Framework - A structured way to talk about AI risk with grown-ups (security, safety, governance) without collapsing into theater.
Writing and research workflow
- Obsidian - Local-first knowledge base for long-form drafting; excellent when you treat notes as a system (links, maps, refactoring).
- Readwise - Capture highlights and resurface them on a spaced-repetition schedule; good for turning "I read it" into "I can reuse it."
- Zotero - PDF/library management for papers and specs; keeps citations, tags, and search sane once your library stops fitting in your head.
- Better BibTeX for Zotero - Stable citation keys and export workflows; useful if you ever want your references to be reproducible.
- Pandoc - The universal format converter; helpful for publishing pipelines and for keeping content portable over time.
- Mermaid - Diagrams as text; works well when architectural diagrams should be versioned alongside code.
- Raycast - Keyboard-driven launcher and automation surface for repeatable workflows; small time savings that compound if you actually use them.
If you have a resource recommendation that consistently improves technical outcomes, send it through the Contact page.