Rearchitecting a Shopping List Chatbot: Infrastructure, Bugs, and Lessons

By RJ Militante · May 2026

Building an AI-powered shopping assistant sounds straightforward until you get into the details: which model, which vector database, how to handle embeddings, where chat history lives, how streaming works across a RAG pipeline. Each decision has tradeoffs, and several of them bit us during implementation.

This post covers the architecture decisions and the specific technical problems we ran into building shoppinglist-chat — a FastAPI backend with RAG over a product catalog and a Next.js frontend with streaming chat.

Starting point

An earlier version of this service had been running in an Azure development environment. When we moved off that environment, it became an opportunity to revisit the entire codebase rather than do a straight lift-and-shift — tighten the architecture, replace components that were underperforming, and get it running on infrastructure we actually control.

The infrastructure decisions

Dropping Ollama for DO Serverless Inference

The VPS is a small DigitalOcean droplet — no GPU, constrained RAM. Running a local LLM was off the table. The options were to upgrade the droplet, rent a GPU droplet, or use a hosted inference API.

DigitalOcean Serverless Inference provides OpenAI-compatible endpoints for a range of models, billed per token, with no infrastructure to manage. The model we landed on: llama3.3-70b-instruct. The cost for development-level usage is negligible, and the API is a drop-in for the OpenAI Python SDK:

from openai import AsyncOpenAI

client = AsyncOpenAI(
    api_key=settings.do_inference_api_key,
    base_url="https://inference.do-ai.run/v1",
)

One note on model naming: the DO dashboard lists models with different capitalisation and separators than the API accepts. We discovered this at runtime — llama3.3-instruct-70b returns a 404, llama3.3-70b-instruct works. Always verify by listing models via the API, not by copying the dashboard label.

Qdrant Cloud for vector storage

The original vector search was pointed at a local instance. We replaced it with Qdrant Cloud — a managed vector database with a free tier, accessible from anywhere.

Configuration: cosine similarity, 384 dimensions (matching the embedding model), score threshold of 0.3 for RAG retrieval:

from qdrant_client import AsyncQdrantClient

_qdrant = AsyncQdrantClient(url=settings.qdrant_url, api_key=settings.qdrant_api_key)

results = await _qdrant.query_points(
    collection_name="products",
    query=embedding_vector,
    limit=5,
    score_threshold=0.3,
)

The collection holds the product catalog — ingested once via script, not modified at runtime.

DO Inference for embeddings too

Rather than running a local embedding model or spinning up a separate service, we used DO Inference's all-mini-lm-l6-v2 endpoint for embeddings. Same API key, same base URL, no additional infrastructure:

response = await client.embeddings.create(
    model="all-mini-lm-l6-v2",
    input=query,
)
vector = response.data[0].embedding  # 384 dimensions

The tradeoff is an extra network hop per RAG query — acceptable at this scale.

SQLite instead of PostgreSQL (for now)

The original Postgres dependency was replaced with SQLite via aiosqlite. For a POC with a single deployment, this is the right call: zero configuration, file-based, async-compatible, trivially backed up:

import aiosqlite

async def init_db() -> None:
    db = await aiosqlite.connect("chat_history.db")
    await db.execute("""
        CREATE TABLE IF NOT EXISTS messages (
            id       INTEGER PRIMARY KEY AUTOINCREMENT,
            session  TEXT NOT NULL,
            role     TEXT NOT NULL CHECK(role IN ('user', 'assistant', 'system')),
            content  TEXT NOT NULL,
            created  DATETIME DEFAULT CURRENT_TIMESTAMP
        )
    """)
    await db.execute(
        "CREATE INDEX IF NOT EXISTS idx_messages_session ON messages(session)"
    )
    await db.commit()

The migration path to PostgreSQL is straightforward when auth and multi-tenancy become requirements.

Technical hurdles

Dependency incompatibility on first run

The first actual error wasn't a logic problem — it was openai==1.51.0 conflicting with the version of httpx already on the system. The proxies argument to the HTTPX client was removed in a newer version, and the old OpenAI SDK still used it. Fix: pin openai>=1.57.0 in requirements.txt.

AsyncQdrantClient.search() doesn't exist

The vector search code from the old POC called client.search(). In qdrant-client 1.18.0, that method no longer exists on the async client. The replacement is query_points(), which has a different signature. This kind of silent API breakage is exactly what happens when a codebase sits untouched while its dependencies move on.

# Old — breaks on qdrant-client >= 1.9
results = await client.search(collection_name="products", query_vector=vector, limit=5)

# New
results = await client.query_points(collection_name="products", query=vector, limit=5)

Streaming: ChunkEvent vs. delta chunks

The old chat streaming used client.chat.completions.stream(), which returns ChunkEvent objects — not the standard delta chunks you get from create(stream=True). Switching to create(stream=True) and iterating chunk.choices[0].delta.content is the correct pattern for OpenAI-compatible APIs:

stream = await client.chat.completions.create(
    model=settings.do_model,
    messages=messages,
    stream=True,
)

async for chunk in stream:
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        yield delta

The RAG hallucination bug

This one cost the most time. The LLM was recommending products that didn't exist in the catalog — hallucinating freely despite the system prompt explicitly prohibiting it. The root cause wasn't the prompt.

The RAG skip threshold was set to 8 words: messages shorter than 8 words would bypass vector search entirely:

_RAG_SKIP_THRESHOLD = 8  # bug: too aggressive

def _should_skip_rag(message: str) -> bool:
    return len(message.split()) < _RAG_SKIP_THRESHOLD

"What wound care products do you have?" is 6 words. RAG was never running. The model was answering from its training data with no catalog context, and doing it confidently.

Two fixes: lower the threshold to 3 (skips only single-word greetings and acks), and strengthen the system prompt:

_RAG_SKIP_THRESHOLD = 3

SYSTEM_PROMPT = (
    "You are a helpful medical supply assistant. "
    "You ONLY recommend products from the catalog provided in the context. "
    "NEVER invent, hallucinate, or reference products not explicitly listed in the context. "
    "If no relevant products are found, say so honestly rather than suggesting products from memory. "
    "Always cite the exact product name and ID from the catalog when making recommendations."
)

The lesson: when an LLM ignores grounding instructions, check whether the grounding is actually being provided before debugging the instructions.

Circular import between services

vector.py needed to call the OpenAI client for embeddings (defined in llm.py). llm.py needed to call vector search (defined in vector.py). Standard circular import.

Fix: lazy import — get_openai_client() is imported inside the search_products() function body rather than at module level. Python resolves it at call time, after both modules are fully loaded:

# vector.py
async def search_products(query: str) -> list[dict]:
    from app.services.llm import get_openai_client  # lazy — avoids circular import
    client = get_openai_client()
    ...

What the new architecture looks like

Backend: shoppinglist-chat (FastAPI)

app/
  config.py        — pydantic-settings, all constants centralised
  main.py          — FastAPI app, lifespan initialises all services
  routes/
    chat.py        — POST /chat/{session_id}, GET /history/{session_id}
    search.py      — POST /search (direct vector search, no LLM)
  services/
    llm.py         — OpenAI client, RAG integration, streaming
    vector.py      — Qdrant client, embeddings, product search
    history.py     — SQLite chat history via aiosqlite
  models/
    schemas.py     — Pydantic request/response models

RAG flow: embed user query → Qdrant similarity search → format top-K products as context → prepend to user message → stream LLM response. Sources are emitted to the frontend as a sentinel line before the text stream, so the UI can display which catalog products were used to generate each response:

if products:
    sources = [{"id": p["product_id"], "name": p["name"]} for p in products]
    yield "__SOURCES__:" + json.dumps({"sources": sources}) + "\n"

async for chunk in stream:
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        yield delta

Frontend: shoppinglist-chat-ui (Next.js 16)

Two tabs: Chat (streaming with source attribution) and Product Search (direct vector search results). Clerk handles auth — the Clerk user ID replaces the random session ID, so chat history is tied to a real user account. API calls are proxied through Next.js's proxy.ts so only port 3000 needs to be exposed, not the FastAPI port.

The two repos deploy independently: FastAPI on the VPS behind nginx, Next.js on Vercel. Different deployment targets, different change cadences — keeping them separate was the right call.

What's still ahead

PostgreSQL migration. When auth is fully in place, SQLite becomes a liability. Supabase gives us Postgres and auth in one.
Stripe. Subscription tiers — free tier with a daily message limit, paid tier for unlimited access. Webhooks handled on the FastAPI side.
Production domain. nginx config is written and sitting in deploy/ waiting for a domain decision.
Real product data. The current catalog is synthetic — 90 generated medical supply items for development. Production needs real SKUs.

The actual insight

The rearchitecture didn't require rewriting the logic. The chat routing, the RAG pattern, the streaming approach — all of it was worth keeping. What changed was the infrastructure layer: every local dependency replaced with a hosted equivalent, every assumption made explicit in config, every service initialised at startup and shared cleanly.

The result is something that runs anywhere with a .env file and a python run.py.

The backend repo is at github.com/rjthegreatxx/shoppinglist-chat. The Next.js frontend is at github.com/rjthegreatxx/shoppinglist-chat-ui.