Building an Autonomous Data Analyst Agent with FastAPI, LangChain, and Gemini
- Henil Diwan

- 1 day ago
- 10 min read
A walkthrough of a small but opinionated open-source project that turns a plain-text list of questions into JSON answers — chart images included.

.
The problem nobody seems to enjoy solving
Most data analysts have, at some point, opened a CSV someone emailed them, written ten lines of pandas to answer two questions, drawn a chart, screenshotted it, pasted it into a reply, and then watched the whole sequence repeat the next week with a slightly different file. The pattern is unmistakable: a small, mostly mechanical analysis hiding inside a thin layer of natural-language back-and-forth.
This is precisely the kind of work that benefits from an agentic LLM wrapper. The reasoning is not difficult; the plumbing is. You need to accept an arbitrary dataset, infer what to do with it, write correct Python against a schema you have only just seen, run that Python somewhere safe, and return a structured answer that includes images. None of that is conceptually hard — but each step has a thousand small ways to break in production.
The Data Analyst Agent is a compact open-source project that tackles exactly this loop. It is a FastAPI service that accepts a questions.txt file and an optional dataset, asks Google Gemini (via LangChain) to plan and emit Python, executes that Python in a subprocess sandbox, and returns a JSON object mapping each original question to its answer. Plots come back as base64-encoded PNGs under 100 KB. The frontend is a single, drag-and-drop HTML page.
In this post I want to walk through how the agent is built, why a few of its design decisions matter, and what generalises to other LLM-backed services you might build yourself.
A 30-second tour of the system
Before getting into specifics, here is the mental model:
Browser → FastAPI → Orchestrator → LangChain agent → Gemini → (optional scrape) → Sandbox subprocess → JSON response.
The web layer is intentionally thin. It serves a static HTML SPA at GET /, accepts multipart uploads at POST /api, and exposes a substantial diagnostics endpoint at GET /summary. There is no database, no queue, no session store. The agent runs synchronously inside the request handler with a wall-clock timeout, which is appropriate for a tool of this scale.
What makes the project interesting is not any single component — none of them are novel — but how they compose. A LangChain tool-calling agent emits a JSON blob containing both the original questions and a Python snippet; an orchestrator extracts the snippet, optionally pre-runs a scrape, then executes the code inside a fresh Python subprocess with a curated preamble. The subprocess is a deliberate isolation boundary: it has a timeout, its stdout is the JSON protocol, and the LLM's code never touches the FastAPI process's memory.
There is, in other words, a clear architectural seam between language model output and execution. That seam is the single most important design decision in the entire repository.

.
The agent loop, in plain English
The orchestrator's job is small but precise. When a request arrives:
Parse the questions file. A short utility called parse_keys_and_types looks for an optional schema spec at the top of the file (lines that look like - `total_revenue`: number) and produces an ordered list of keys plus a per-key casting function. This is what lets a consumer of the API say "return me back a {total_revenue: float, top_category: str} payload" rather than echoing the question strings.
Decide whether the agent may scrape. If the user uploaded a dataset, the prompt rules forbid the scrape tool and instruct the model to use the in-memory DataFrame df. If they did not, the rules permit calls to scrape_url_to_dataframe(url). This branching is the only meaningful prompt customisation in the entire pipeline.
Invoke the agent. LangChain's AgentExecutor is given a ChatPromptTemplate, a single @tool-decorated function, and the wrapped LLM. The agent is capped at three iterations — agents that ramble on more than that almost never recover, in my experience.
Clean and parse the LLM output. A small clean_llm_output helper strips triple-backtick fences, walks the string for the outermost balanced JSON braces, and falls back to a backwards scan if json.loads chokes on the first candidate. This is the kind of code you write twice and then copy into every LLM project for the rest of your life.
Pre-run the scrape, if any. If the generated code contains a scrape_url_to_dataframe("...") call, the orchestrator extracts the first URL, runs the scrape in the parent process, pickles the resulting DataFrame to a temp file, and tells the sandbox to load it.
Execute the code. The sandbox returns a {"status": "success", "result": {...}} JSON blob on stdout. The orchestrator maps the result dict back to the original questions and applies type casting if a schema was present.
It is a short loop, and yet every step is doing something load-bearing. The schema parsing makes the API consumable. The dataset-vs-scrape branching keeps prompts terse and predictable. The output cleaner accepts the small messes that real models produce. The subprocess sandbox is the wall between generated code and production process.

Designing for flaky LLMs: the fallback layer
One of the realities of building on a hosted LLM is that you do not own the rate limits, the model availability, or the per-key quota. The project's answer to this is a class called LLMWithFallback that wraps ChatGoogleGenerativeAI and implements a deliberately boring algorithm.
The class reads up to ten environment variables (gemini_api_1 through gemini_api_10) and a hardcoded model hierarchy that goes from gemini-2.5-pro down to gemini-2.0-flash-lite. When the agent asks for an LLM instance, the wrapper walks the model axis first and the key axis second. It instantiates a ChatGoogleGenerativeAI for each pair, and if construction succeeds, returns it immediately. If the construction throws and the exception message contains words like "quota", "rate limit", "403", or "too many requests", the wrapper records the key in a slow_keys_log. After a half-second sleep, it moves on.
A few details are worth flagging:
The model order is preserved. The wrapper does not try every key on the cheapest model first; it tries every key on the best available model first, only stepping down when no key works there. This is the opposite of cost-minimising behaviour, but it is the right call for a user-facing tool where output quality matters more than per-call price.
The slow-key tracking is informational, not blocking. A "slow" key is still attempted on the next request. This avoids the pathological case where a transient 429 permanently quarantines an otherwise healthy key, but it does mean retries are not skipped.
The fallback runs inside bind_tools and invoke. That is how the wrapper plugs into LangChain's create_tool_calling_agent without the agent needing to know anything about the fallback logic.
In practice, this is enough to make the service feel resilient. The cost is roughly one network round-trip per failed key, which is well within the request budget. The reward is that a single revoked key, or a Gemini partial outage on the 2.5 family, no longer takes the service down.

Executing untrusted-ish code: the subprocess sandbox
The most consequential design choice in the project is that the LLM's output is never executed in the FastAPI process. The orchestrator writes a temp .py file, assembles a curated preamble in front of the model's snippet, and invokes subprocess.run([sys.executable, tmp_path], capture_output=True, text=True, timeout=...).
The preamble does three things:
Imports the libraries the model is allowed to assume exist — pandas, numpy, matplotlib (with the headless Agg backend), base64, json, and PIL.Image if Pillow is installed.
Loads the injected DataFrame from a pickle path if the orchestrator decided one was needed, exposing both df and data = df.to_dict(orient='records').
Defines two helpers that the model is told it can call: a self-contained scrape_url_to_dataframe (so the model can scrape inside the sandbox without a network round-trip via LangChain) and a plot_to_base64(max_bytes=100_000) helper for charts.
The model's job is to populate a top-level results dict where the keys are the exact question strings. The sandbox ends with print(json.dumps({"status":"success","result":results}, default=str), flush=True), which makes stdout an unambiguous JSON protocol that the orchestrator can json.loads() directly.
Calling this a "sandbox" with a straight face requires a caveat. It is process isolation, not security isolation. The subprocess can still write to disk, hit the network, or shell out. What it cannot do is corrupt FastAPI's memory, hang the event loop indefinitely (the timeout enforces that), or exfiltrate Python objects the parent did not explicitly serialise. For a self-hosted tool, that is usually the right point on the curve. For a multi-tenant SaaS, you would want to put a container, a seccomp profile, or a WASM runtime around this.
What I particularly like is the protocol. The orchestrator does not import the LLM's code; it does not even read its locals. The sandbox produces a single JSON line on stdout, and that is the entire contract. Everything else — stack traces, deprecation warnings, matplotlib's noise — is stderr and gets relayed in error messages but never trusted.
A small helper that does a surprising amount of work: plot_to_base64
Models are very good at producing matplotlib code and very bad at producing small matplotlib code. They will happily emit a 1600×1200 figure at 200 dpi that base64-encodes to 800 KB, which is comfortably too large for most JSON consumers to be happy about. The project's answer is a small helper bundled into the sandbox preamble.
The helper first tries the user's chart at dpi=100. If the resulting PNG fits under the byte budget, it is returned. Otherwise, the helper walks down a small dpi ladder — 80, 60, 50, 40, 30 — saving and remeasuring each time. If the PNG still does not fit and Pillow is available, the helper re-saves at dpi=40, opens the buffer with PIL, and converts to WEBP at quality 80, then quality 60. WEBP is typically a third the size of PNG for the same visual quality, which is enough to rescue the long tail. As a last resort, the helper returns a dpi=20 PNG even if it exceeds the limit — because returning a too-large image is still better than returning nothing.
There is no machine learning here, no perceptual quality measurement, no clever subsampling. It is twenty lines of ifs. It works because the failure modes of large matplotlib figures are well-understood, and because the cost of an unnecessary round of savefig is negligible compared to a failed request.
This is, frankly, the helper I find myself happiest with. It is one of those bits of code where the right amount of pragmatism does the work that a more clever solution would have done badly.

The diagnostics endpoint
Most internal tools have a /health that returns {"ok": true} and stops there. This project has a GET /summary that runs about ten checks concurrently and returns a structured report. It probes the environment for the expected variables, fingerprints the system (host, OS, Python version, CPU, RAM, free disk, optional CUDA), tries to write to both the temp directory and the working directory, runs a tiny pandas pipeline, lists the first twenty installed distributions, fires HEAD requests at Gemini / AI Studio / OpenAI / GitHub, and — most usefully — pings every (key, model) pair using the same model hierarchy as the runtime fallback.
The key-by-model probe is the part you actually want when debugging a production incident. It tells you which keys are alive, which models accept them, and what the failure mode is for the rest. When the service goes down because of a Gemini regional outage, the diagnostics report tells you that within seconds, and you stop wondering whether your code is broken.
A full=true query parameter opts you into two heavier checks — a DuckDB connect-and-query, and a Playwright browser launch — that are useful when you suspect dependency rot but that are too slow to run on every health check.
The pattern I want to draw out here is that diagnostics endpoints are an excellent first place to put new operational knowledge. Every time you debug an incident manually, ask whether the check you just ran would be valuable as a permanent diagnostic. Most of the time the answer is yes, and GET /summary is where it should live.
[Image: components.png] Figure 6 — A module-level component view, grouping the codebase into Web, Agent, LLM, Tools, Sandbox, and Diagnostics layers. The same LLMWithFallback is shared between the runtime agent and the diagnostics probe.
What I would do next
The project is small enough that its next steps are easy to enumerate. The most valuable additions, in roughly the order I would build them:
Multi-URL scraping. The orchestrator currently extracts only the first URL it finds in the generated code. Walking the full list and pickling each as df_1, df_2, … would unlock a class of multi-source questions the agent cannot answer today.
Cached LLM responses. Identical (questions_file, dataset) pairs should not require a fresh LLM call. A content-hash cache, even an in-memory one, would dramatically reduce both latency and cost during repeated experimentation.
A real sandbox. Switching from subprocess to a Docker exec, Firecracker microVM, or Pyodide runtime would make the project genuinely safe to expose to the public internet.
Streaming output. The current API blocks until the entire result dict is ready. For multi-question files, the user would benefit enormously from an SSE or WebSocket variant that streams each answer as the sandbox emits it.
None of these are difficult; all of them would make the service more useful.
Closing thoughts
The interesting thing about building agentic LLM services in 2026 is that the model is increasingly the easy part. Gemini is excellent at writing pandas. What separates a working tool from a flaky one is the surrounding plumbing: the prompt rules that keep the model focused, the output parser that survives small malformations, the fallback layer that smooths over transient provider failures, the sandbox that makes generated code safe to execute, and the diagnostics endpoint that tells you what is happening when it isn't.
The TDS Data Analyst Agent is a small, readable instance of all of those patterns. If you are building something similar — whether for analytics, document processing, or any other "natural language in, structured data out" workflow — the project is worth fifteen minutes of source reading. The full code is in app.py; the architecture, sequence, fallback, and component diagrams in the docs/ folder will tell you the rest.




Comments