Arnav Amal Ray
Enterprise AI architecture · April 2026 · 11 min read

Why SAP API Policy v4 Makes Clean Core a Practical Prerequisite for Enterprise AI

A new SAP policy reads like developer governance. Read against the agentic AI shifts of the last eighteen months, and against what SAP Business AI actually ships, it is something else.

Update — May 2026. SAP has published a Frequently Asked Questions document (version 1.1, May 2026) clarifying several questions raised by enterprise customers since the original April announcement. The substance of the policy has not changed. What has changed is the framing.

Reading the FAQ, what comes through is that SAP is drawing a sharper line between three things the market has been conflating. The API Policy is contractual governance over how SAP APIs are consumed. Clean Core is an architectural discipline customers asked for, aimed at keeping custom code upgrade-safe. The endorsed architectures are the SAP-sanctioned pathways for AI, integration, and data, built on open standards. The April announcement was the policy itself. The May FAQ separates the policy from Clean Core and from the endorsed architectures, and clarifies what each one does.

Two implications of that separation matter for the argument that follows.

The policy does not dictate platform choice. Customers retain independent choice on three layers: the AI orchestration platform, the data platform, and the integration platform. Any AI platform can connect to SAP through endorsed pathways like A2A via Joule and Agent Gateway. Any data platform can receive SAP data through BDC Connect. The constraint is on how the connection happens, not which vendor sits on the other side.

Clean Core remains separate from the policy in formal terms. SAP frames Clean Core as customer-driven architectural hygiene, not as a contractual obligation. The argument here is that Clean Core is nonetheless the practical prerequisite for in-system agentic AI, because an upgrade-safe core is what makes Joule agents and MCP-connected external agents reliable on customer data. The formal route to that conclusion runs through architecture, not through the policy.

In April 2026, SAP published API Policy v4/2026. On the surface it reads as a developer governance update. Only APIs published on the SAP Business Accelerator Hub or in product documentation can be consumed. Internal, private, and non-published APIs are explicitly off-limits. No global grace period. No carve-out for systems on Public Cloud, Private Cloud, or BTP.

It is being read as a policy update. Read against what has happened in agentic AI over the last eighteen months, it is also one of the more concrete answers SAP has given to a question every CIO is asking: what does enterprise-grade agentic AI actually look like when the abstractions hit the ground?


A reading frame: Pull and Push

I think about the agentic AI market through a frame I call Pull and Push. It is a personal lens, not industry vocabulary.

On the Pull side sit the model providers. Anthropic, OpenAI, Google. Their strategy is to put the frontier model at the centre and pull apps, data, and tools inward through connectors, agents, and the Model Context Protocol. The user’s home is the chat surface, the API, the agent loop.

On the Push side sit the application incumbents. SAP, Salesforce, Adobe. Their strategy is to keep the application at the centre and push frontier models into the platform as embedded copilots and domain agents. The user’s home stays inside the system of record. Data gravity is the moat.

Microsoft is the most interesting case the frame does not fully capture. Through its OpenAI investment it is a Pull-side participant. Through Office, Dynamics, and Teams it is a massive Push-side incumbent. Most of the strategic moves you will see from Microsoft in 2026 read as both, depending on which surface you are standing on. The frame is a lens, not a taxonomy.

API Policy v4 is the most disciplined Push move I have seen in 2026. It does not announce a new feature. It defines what is allowed inside the system.

Pull and Push: how AI platform power is splitting in 2026

Pull — model as platform
Who: Anthropic · OpenAI · Google.
Strategy: Model at centre. Apps, data, and tools pulled in via MCP and agents.
User home: Chat surface, API, agent loop.

Push — app as platform
Who: SAP · Salesforce · Adobe.
Strategy: Application at centre. Models pushed inside as embedded copilots.
User home: System of record. Data gravity is the moat.

Where they meet: governed protocols
MCP: Open wire protocol for agents.
SAP API Policy v4: The Push side's door policy.
Governance: Audit trail and data contracts.

What the policy actually says

The policy formalises something SAP has been signalling for two years. New developments and extensions consume only Published APIs. Non-compliant integrations are now considered at-risk, because SAP reserves the right to change or decommission internal APIs in any release cycle without notice.

Where a published API does not exist, the recommendation is to wrap the gap on BTP, request the API through the SAP Customer Influence portal, or wait.

This is not a surprise to anyone who has been tracking the Clean Core message since 2024. What is new is that the language has hardened. Clean Core is no longer aspirational guidance. It is contractual.


Why the timing matters

Read in isolation, the policy is a maintenance announcement. Read against what happened on the Pull side between November 2024 and April 2026, it is something else.

In that window, the Model Context Protocol moved from a single-vendor specification to a widely adopted standard for how AI agents connect to enterprise systems. OpenAI adopted it in early 2025. Google DeepMind followed. Microsoft and AWS shipped MCP support across their stacks. The protocol was donated to the Linux Foundation’s Agentic AI Foundation in late 2025. Thousands of public MCP servers exist as of early 2026. Competing protocols and vendor-specific extensions still exist, but MCP is the closest thing the agentic ecosystem has to a common wire protocol today.

The practical consequence is that any agent can now knock on any door. The cost of integrating an AI workflow with a business system has collapsed. What used to take a custom connector now takes a server registration.

That is precisely why SAP’s policy matters. When the cost of integration falls, the question of which integrations are sanctioned becomes the question. Policy v4 is the Push side’s door policy for the agentic era.


Clean Core in four levels

SAP has been refining the extensibility model alongside the policy work. The current framing classifies extensions across four levels.

Level A uses only released APIs and is fully upgrade-safe. It is the target for all new development. Level B uses classic APIs such as BAPIs, permitted with governance approval. Level C reaches into internal objects, carries upgrade risk, and requires a remediation plan. Level D covers direct modifications and is a retirement candidate.

Read this carefully and the AI implication is unambiguous. A customer running mostly Level A is ready for Joule agents and for Pull-side agents that connect through MCP, because the surface those agents touch is contractually stable. A customer running mostly Level C and D is not ready, regardless of how good the model is, because the agent’s behaviour cannot be guaranteed across an upgrade.

One qualification matters here. AI on enterprise data does not only run through transactional APIs. Real production AI also runs through data lakes and replicas, event streams, semantic layers, and embeddings sitting alongside the system of record. A company with a messy core but a strong data platform can deploy useful AI workflows on its data without ever calling an SAP transaction. Clean Core is the prerequisite for in-system agentic action specifically, not for AI value broadly. The two are easy to conflate and worth keeping separate.

For in-system action, though, Clean Core is no longer a code quality conversation. It is the prerequisite.


Where Joule stands in 2026

The Push strategy is not theoretical. SAP has been building it for two years. Joule is the umbrella brand spanning a conversational assistant, separate workspaces for developers and consultants, a library of pre-built skills, and an orchestration layer for multi-step agents. Joule Studio lets enterprises and partners build custom agents grounded in their own SAP context.

The architectural pieces are credible. Adoption is the part that is lagging. Joule is generally available on RISE with SAP and GROW with SAP, with limited or no parity on on-premise installations. Pricing runs through SAP AI Units with a minimum package that puts a floor on entry cost regardless of usage. Mainstream maintenance for ECC 6.0 ends in 2027, with extended maintenance available through 2030. Migration timelines are concrete, but they are still migration timelines. Customers who have not moved by now are unlikely to be the earliest Joule adopters once they do.

What this means for the argument so far is straightforward. The Push strategy is well-architected and well-funded. Push adoption is mostly forward-looking. When this post argues that API Policy v4 makes Clean Core a procurement criterion, it is describing the architectural decision the next eighteen months will turn on, not the state of the installed base today.


Collaboration through governed protocols

The lazy framing for the rest of 2026 will be that Pull and Push are at war. That framing misses what is actually happening.

SAP itself is shipping MCP-compatible interfaces. An ABAP MCP server is in the 2026 roadmap for Joule for Developers. The agent-to-agent protocol that underpins SAP AI Agent Hub is explicitly designed to work across SAP and non-SAP agents. Joule is being positioned to operate alongside, not in opposition to, agents built on the Pull side. The generative AI hub in SAP AI Foundation now offers Claude Opus, Claude Sonnet, GPT, and Gemini as model choices for grounding Joule on customer-specific contexts.

The pattern that is emerging is not generalist orchestrators against sovereign systems. It is generalist orchestrators talking to sovereign systems through standards that both sides accept.

API Policy v4 is the Push side of that conversation. MCP is the Pull side. Read together, they describe a future in which the agent does not need to understand SAP. It needs to understand which doors are open and what governance applies on the other side. The work of designing that interface, deciding which agent owns which decision, where the audit trail lives, and how data leaves and returns, is not going away. It is the work.


What it means in practice

For finance and operations leaders evaluating AI vendors, the first question is no longer about the model. It is whether your S/4HANA system is on a Clean Core trajectory. If it is not, the in-system AI roadmap is constrained before the procurement conversation starts.

For SAP architects, Level A has graduated from preference to procurement criterion. Custom code assessments and remediation plans are now AI readiness plans by another name.

For startups and product teams building on top of SAP, the policy is both a constraint and a clarification. The published API surface defines the addressable problem. What that means in practice is harder than it sounds.

Getting an API published is not a shortcut. The official path runs through the SAP Customer Influence portal, where requests need to clear a voting threshold to be considered. Pass rates are low, delivery is not guaranteed even when a request clears, and the cycle from accepted request to released API can run several quarters. Building a roadmap around a future API is a slow bet.

The pragmatic options for a startup automating SAP S/4HANA with proprietary tools are roughly four.

Map first, build second. The SAP Business Accelerator Hub catalogues thousands of released APIs across Finance, Procurement, Supply Chain, and HR. Many capabilities that look missing already exist under unfamiliar names. The first move is a serious mapping exercise against this catalogue. It costs a week and saves quarters.

Wrap the gap on BTP. Where no released API covers the use case, deploy your logic as a BTP-native extension using ABAP Cloud or CAP. The wrapper consumes whatever SAP exposes and presents your product with a stable contract. This moves non-compliance out of the SAP core into a managed layer that survives upgrades. It is the architecture SAP itself recommends for this exact gap.

Become a partner. The SAP PartnerEdge program and the SAP Integration and Certification Center give certified partners a defensible path to integrate, distribute through SAP Store, and co-sell. Partner status is also how a startup gets product management attention on missing APIs faster than the public voting queue.

Engage co-innovation. The SAP Customer Engagement Initiative offers a slower but genuine route into product roadmaps. Useful for startups with deep domain expertise SAP does not currently cover, where the conversation is about strategic fit rather than feature voting.
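The "wrap the gap" pattern in option two comes down to one discipline: only the wrapper knows SAP's field names, and everything downstream sees a contract you control. A minimal sketch of that mapping layer, in Python for illustration — the field names and sample payload are hypothetical, not taken from a real SAP API:

```python
# Sketch of the "wrap the gap" pattern: the SAP-facing side of the wrapper
# is the only code that knows SAP field names; consumers see a stable schema.
# All field names and the sample payload are hypothetical, for illustration.

STABLE_FIELDS = {
    # stable contract field  <-  SAP-side field (may change across releases)
    "supplier_id": "Supplier",
    "supplier_name": "SupplierName",
    "payment_terms": "PaymentTerms",
}

def to_stable_contract(sap_record: dict) -> dict:
    """Map one SAP API record onto the wrapper's stable contract.

    If SAP renames or restructures a field, only STABLE_FIELDS changes;
    every consumer of the wrapper keeps working unchanged.
    """
    missing = [v for v in STABLE_FIELDS.values() if v not in sap_record]
    if missing:
        raise ValueError(f"SAP payload missing expected fields: {missing}")
    return {ours: sap_record[theirs] for ours, theirs in STABLE_FIELDS.items()}

# Illustrative payload shaped like an OData record, including a field
# the wrapper deliberately does not expose.
record = {"Supplier": "0000100001", "SupplierName": "ACME GmbH",
          "PaymentTerms": "0004", "InternalOnlyField": "ignored"}
row = to_stable_contract(record)
```

The point of the sketch is the failure mode it removes: when SAP changes something, the break surfaces in one mapping table on BTP, not across every consumer of your product.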

The SAP-as-platform reality has not changed. What has changed is that fighting it is now explicitly unsupported, while working with it is more clearly documented than at any point in the last decade.

If you are running an AI strategy review in 2026 and Clean Core has not entered the conversation, you are reviewing the wrong question.

© 2026 Arnav Amal Ray. All rights reserved.

AI / RAG · March 2026 · 12 min read

When AI gets immigration law wrong, the cost is not a bad essay

How I built a local AI legal assistant for German immigration law, what it took to get it right, and why “good enough” was never an option.

There is a moment most foreigners living in Germany will recognise. You are trying to understand a statutory condition for your visa or residence permit. You have the official government page open in one tab. You have a forum thread from 2021 open in another. You have asked a general-purpose AI assistant, which gave you a confident, well-structured answer citing a section of law that was amended two years ago.

None of these three sources agrees with the others.

I came from India, did my MBA in Germany, and stayed to build a career in Frankfurt. The immigration system I have navigated is genuinely complex. German residence law spans multiple statutes, each with distinct scopes and cross-referencing conditions. It is updated by legislation. And it is misrepresented constantly, both by outdated internet sources and by AI systems trained on data that predates the most recent reforms.

I built this project because I needed it. I also built it because the problem it addresses does not stop with my personal situation.


The obvious question

The first thing anyone asks when I describe this project is why I did not just upload the law to Claude or ChatGPT and ask my question.

It is a fair question. A frontier model with a large context window can ingest a legal document and answer questions about it reasonably well. For a one-off query, that is a perfectly rational approach.

But it is not the right architecture for anything that needs to be trusted, repeated, or used at scale. Here is why.

Frontier models hallucinate legal citations. Not occasionally. When legal text is complex, cross-referenced, and recently amended, a general-purpose model will sometimes generate section numbers that look plausible, cite provisions that have been replaced, or conflate two different legal regimes. The model is confident. The output looks correct. The error only surfaces when someone checks the actual statute.

A wrong answer in an immigration query is not a minor inconvenience. It can mean a rejected application, a missed condition, or a misunderstood right. A rejected application does not just cost time. For an employer, it means legal fees, delayed start dates, and direct financial liability. For the individual, it can mean months lost and a disrupted life plan. The stakes are simply not the same as getting a recipe slightly wrong.

Routing personal data to a cloud API is a choice with consequences. An immigration question carries nationality, employment status, family situation, and income details. These are sensitive personal data under GDPR. Sending them to a US-based cloud API in a casual query is a data governance decision that most people make without realising they are making it.

A tool you cannot explain is a tool you cannot trust. If the system gives you an answer, you need to be able to see exactly which section of law it drew from, read the German text yourself, and verify it. A general-purpose model gives you a response. This system shows you its work.

Those four constraints together defined the architecture before a single line of code was written.


The design brief

The requirements that came out of that thinking were straightforward.

The system had to run entirely on local hardware. No query, no document, and no personal detail should ever leave the machine. This is not a performance optimisation. It is a data architecture decision grounded in GDPR, professional liability, and the basic principle that a legal query deserves the same confidentiality as a legal consultation.

The system had to be grounded in current statute text rather than trained on historical summaries. The actual law was downloaded directly from the German government’s statutory publication source, indexed, and made retrievable at the section level. When the law changes, you rebuild the index. The system’s knowledge does not drift.

Every answer had to show the source. I did not want a footnote or a general reference to the relevant act. The exact German text of the section it drew from needed to be visible alongside the answer, so the user could read it themselves.

The scope also had to be honest. German immigration law covers three statutes: the general residence act for third-country nationals, the free movement act for EU citizens, and the employment ordinance for work permit approvals. These are the three laws in the index. Questions about citizenship, asylum, or social benefits are outside scope. The system explicitly states its limitations when asked, rather than inventing an answer from adjacent knowledge.

That last point matters more than it sounds. A system that knows its limits and communicates them clearly is much more trustworthy than one that attempts to answer everything.


Architecture decisions

The core architecture is a retrieval-augmented generation pipeline. The concept is straightforward. Instead of relying on a model’s training memory, you retrieve the relevant text from a controlled corpus first. You then generate a response grounded only in what was retrieved. The model is not guessing. It is reading.

Chunking at the section boundary. Legal text cannot be chunked the way prose is chunked. A statutory condition often spans multiple subsections within a single paragraph. Requirements for a given permit type might state the qualification condition in one subsection, the salary threshold in another, and the exceptions in a third. If a chunk boundary falls between them, any retrieval that returns only one subsection is retrieving half a legal condition. The chunking strategy here splits the corpus at section headers. This guarantees one complete statutory section per chunk. The integrity of the legal unit is non-negotiable.
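The mechanics of section-boundary chunking are simple once the rule is stated: split only where a § header begins a line, never inside a section. A minimal sketch, with an invented statute fragment standing in for the real corpus:

```python
import re

# Minimal sketch of section-boundary chunking: split a German statute at
# "§ n" headers so each chunk is one complete statutory section.
# The sample text below is invented for illustration.

SECTION_HEADER = re.compile(r"(?m)^(?=§\s*\d+[a-z]?\b)")

def chunk_by_section(statute_text: str) -> list[str]:
    """Return one chunk per § section; never split inside a section."""
    parts = [p.strip() for p in SECTION_HEADER.split(statute_text)]
    return [p for p in parts if p.startswith("§")]

sample = """Allgemeine Hinweise.
§ 18a Fachkräfte mit Berufsausbildung
(1) Einer Fachkraft kann ein Aufenthaltstitel erteilt werden ...
(2) Die Erteilung setzt voraus ...
§ 18b Fachkräfte mit akademischer Ausbildung
(1) Einer Fachkraft mit akademischer Ausbildung ...
"""

chunks = chunk_by_section(sample)
```

Note that subsections (1) and (2) stay inside the same chunk as their header: the qualification condition and the exception travel together, which is the whole point.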

Cross-lingual retrieval. The corpus is in German. A user asking in English needs to retrieve chunks using German statutory vocabulary. The solution is a two-step approach. The query is first expanded into the German legal terms, section numbers, and statutory vocabulary that correspond to the question. That expanded query is what gets embedded and matched against the corpus.
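The shape of that expansion step can be sketched without the model. In the real pipeline a local LLM produces the German vocabulary; here a tiny glossary stands in for it so the two-step structure is visible. The glossary entries are illustrative, not the system's actual mappings:

```python
# Sketch of the query-expansion step. A local LLM does this in the real
# pipeline; a small glossary stands in for it here. Entries are illustrative.

GLOSSARY = {
    "blue card": ["Blaue Karte EU", "Aufenthaltstitel"],
    "work permit": ["Beschäftigung", "Zustimmung der Bundesagentur für Arbeit"],
    "residence permit": ["Aufenthaltserlaubnis", "Aufenthaltstitel"],
}

def expand_query(english_query: str) -> str:
    """Append German statutory vocabulary so the embedded query can
    match a German-language corpus."""
    lowered = english_query.lower()
    terms = []
    for phrase, german in GLOSSARY.items():
        if phrase in lowered:
            terms.extend(german)
    if not terms:
        return english_query
    return english_query + " | " + " ".join(terms)

expanded = expand_query("What are the salary requirements for a Blue Card?")
```

It is the expanded string, not the raw English query, that gets embedded and matched against the German corpus.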

Two-stage retrieval. The first stage is vector similarity, which is fast but approximate. It retrieves candidates based on embedding proximity, not necessarily on actual legal relevance. The second stage is a cross-encoder reranker. It takes the query and each retrieved candidate together and scores the genuine relevance of the pair. The top five sections from that second stage are what get passed to the language model. This two-stage approach was the single biggest driver of answer quality improvement in the project. It also adds seconds to the response time. This is a trade-off made deliberately. In a legal context, a user will wait fifteen seconds for a correct, private answer. They will not wait at all for a fast, wrong one.
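The two-stage shape is easy to show with stub scorers. Stage one ranks by embedding proximity and keeps the top k; stage two rescores only those survivors with the cross-encoder. The embeddings and scores below are toy values, and `rerank_score` stands in for the actual reranker model:

```python
import math

# Sketch of two-stage retrieval with stub scorers. Stage 1 is cheap cosine
# similarity over precomputed vectors; stage 2 rescores the survivors with
# a (here, stubbed) cross-encoder. All data is illustrative.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def two_stage_retrieve(query_vec, corpus, rerank_score, k=10, top_n=5):
    """corpus: list of (chunk_text, embedding). rerank_score(text) -> float."""
    # Stage 1: approximate — rank by embedding proximity, keep top-k.
    stage1 = sorted(corpus, key=lambda c: cosine(query_vec, c[1]), reverse=True)[:k]
    # Stage 2: precise — rescore each survivor with the cross-encoder.
    stage2 = sorted(stage1, key=lambda c: rerank_score(c[0]), reverse=True)
    return [text for text, _ in stage2[:top_n]]
```

The cost profile falls out of the structure: the expensive reranker only ever sees k candidates, never the full corpus, which is why the latency stays in seconds rather than minutes.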

Model choice is infrastructure, not identity. The language model is not the product here. It is just a component. The system is designed to run any locally-served model. The corpus coverage and retrieval quality remain constant regardless of which model generates the final response. The true intelligence of the system lives in the retrieval pipeline, not in the model weights.


How it works

Pipeline: offline build and per-query flow

Offline (build once):
Law corpus: 3 German statutes.
Ingest & chunk: one § per chunk.
bge-m3 embedder: 1024-dim, multilingual.
Vector store: ~370 chunks indexed.

The offline index feeds the per-query flow.

Per query, on local hardware, no data leaves the machine:
User query: English or German.
Query expansion: English → German legal terms and statutory vocabulary (Ollama · qwen2.5:14b).
Vector retrieval: cosine similarity retrieves the top-k candidate chunks (bge-m3).
§ priority boost: sections named explicitly in the query are surfaced first.
Cross-encoder reranking: true query–answer relevance scored, top 5 selected (bge-reranker-v2-m3). If confidence is low, retrieval retries with expanded terms before proceeding.
LLM generation: top-5 chunks as context, 6-turn conversation memory (Ollama · qwen2.5:14b).
Cited answer: source panel with exact German text, plus RAG insight scores.

The German Immigration Law Assistant showing a query, the retrieved German statute text, and the RAG insight panel with retrieval scores.
The RAG insight panel (right) shows cosine similarity and reranker relevance scores for every retrieved section. USED and NOT USED markers show exactly which chunks reached the language model. The source panel below displays the exact German statute text the answer drew from.
A short walkthrough: query in English, retrieval from German statute, answer with cited source text.

The RAG insight panel exists for one reason: transparency. It shows similarity scores and reranking scores side by side for every retrieved section. A user should be able to see exactly why a section was or was not included in the answer. That visibility is a core part of the trust contract.


The quality gate

Building a system is one thing. Knowing whether it works is another. Testing it requires stepping outside the builder’s perspective entirely.

I ran the system through a structured cross-evaluation using a set of questions a real user would ask. I checked these against the actual statute text to look for where the answers failed. This was not a test of latency or token throughput. It was a test of legal accuracy. Does the answer cite the right provision? Does it state the conditions correctly? Does it avoid inventing requirements that do not exist in the retrieved text?

The results were instructive. For straightforward queries, the system performed well. Eligibility conditions, the distinction between different legal regimes for EU versus non-EU nationals, and the routing between different statutes all came back with accurate citations and appropriate legal framing.

The failures were also clear. In some cases, the retrieval returned the right general area of law but the wrong section. In a small number of cases, the system cited a reference that did not correspond to what it actually retrieved. These were retrieval failures more than generation failures, but the output was wrong either way.

Those findings shaped the next round of architectural decisions. I implemented tighter reranking thresholds, added a second retrieval pass when confidence was low, and placed more explicit guardrails in the generation prompt around citation discipline.
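The retry logic among those fixes has a simple control-flow shape: if the best reranker score falls below a threshold, retrieve once more with an expanded query and keep whichever pass produced the stronger top match. A sketch with stub functions standing in for the real retrieval and expansion steps; the threshold value is illustrative:

```python
# Sketch of the low-confidence retry gate. retrieve() and expand() stand
# in for the real pipeline stages; the threshold is illustrative.

CONFIDENCE_THRESHOLD = 0.5

def retrieve_with_retry(query, retrieve, expand):
    """retrieve(q) -> list of (chunk, score), best first; expand(q) -> str."""
    results = retrieve(query)
    if not results or results[0][1] < CONFIDENCE_THRESHOLD:
        retried = retrieve(expand(query))
        # Keep whichever pass produced the stronger top match.
        if retried and (not results or retried[0][1] > results[0][1]):
            return retried
    return results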

The principle this established is worth naming clearly. Accuracy evaluation in a regulated domain cannot be done by the builder alone. The test questions need to come from someone who knows the domain well enough to catch a plausible but wrong answer. Technical benchmarks measure technical performance. They do not measure whether the answer is legally correct.


What comes next

Retrieval-augmented generation with vector similarity search is effective for straightforward queries. It is not sufficient for complex legal reasoning.

The limitation is structural. A flat vector index retrieves sections based on semantic proximity to the query. It does not model how sections relate to each other. One provision might establish a general condition that another provision qualifies. A right established in one statute may be constrained by a cross-reference to another. A correct answer to a complex question sometimes requires reasoning across multiple sections in a specific sequence. Vector search alone cannot do that.

The next architectural layer will address this. A graph-based retrieval approach can explicitly represent the relationships between statutory sections. It tracks which sections cross-reference which, which provisions depend on conditions established elsewhere, and which legal regimes take precedence. By combining this with a reasoning agent that plans a multi-step retrieval strategy before generating an answer, the system can begin to handle the kind of multi-condition questions that currently expose the ceiling of the vector approach.
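The retrieval side of that layer can be sketched as a breadth-first expansion over a cross-reference graph: start from the sections vector search found, then pull in the provisions they depend on. The graph below is a toy example, not real statute structure:

```python
from collections import deque

# Sketch of the planned graph layer: starting from the sections vector
# search found, follow cross-references so dependent provisions are
# retrieved together. The graph is a toy example, not real statute data.

CROSS_REFS = {  # section -> sections it cross-references
    "§ 18a": ["§ 2", "§ 39"],
    "§ 39": ["§ 40"],
    "§ 2": [],
    "§ 40": [],
}

def expand_via_graph(seed_sections, max_hops=2):
    """Breadth-first expansion over the cross-reference graph."""
    seen = set(seed_sections)
    queue = deque((s, 0) for s in seed_sections)
    while queue:
        section, hops = queue.popleft()
        if hops == max_hops:
            continue
        for ref in CROSS_REFS.get(section, []):
            if ref not in seen:
                seen.add(ref)
                queue.append((ref, hops + 1))
    return sorted(seen)
```

The `max_hops` bound is the interesting design knob: it keeps a densely cross-referenced statute from dragging the whole corpus into every answer.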

This is not a complete redesign. The existing pipeline remains intact: the chunking strategy, the embeddings, and the reranking logic. The graph layer will sit alongside the vector store, and a reasoning layer will decide which retrieval path is appropriate for each query. When you build the foundation right, extending it becomes a product decision rather than a total rebuild.


What this taught me about AI in regulated domains

Coming from a background in business and finance rather than software engineering, I had to approach this architecture from first principles. I asked not how to build each component, but what each component needed to achieve. That constraint turned out to be useful. A few things are worth stating directly, because they get lost in most AI project write-ups.

The model is not the product. The retrieval architecture, the chunking strategy, the evaluation framework, and the honesty about scope: those are the product. The language model is one swappable component.

Demo quality and production quality are not the same thing. A system that produces plausible-looking answers in a demo can fail systematic accuracy review. The only way to know is to test it against the domain. There is no shortcut.

Scope discipline is a feature. A system that answers only what it knows, and says so clearly when it does not know, is more valuable than one that attempts to answer everything. This is true for AI systems and for consulting advice.

Privacy-preserving architecture is not a constraint. It is a competitive position. In regulated industries, the ability to deploy AI that keeps sensitive data on-premise is becoming a prerequisite, not a differentiator. Building with that assumption from the start is the right approach.

This project started as a personal tool for a personal problem. It became a useful exercise in what it actually takes to build an AI system that can be trusted in a domain where being wrong has consequences.

One thing worth being transparent about: the system design, architecture decisions, and testing were mine. The Python implementation was built using Claude Code and Gemini as coding assistants, working from the same architectural brief described in this post. That is also part of the point. You do not need to write every line yourself to understand what the system is doing and why.

I am still learning and exploring this space. If you have built something similar, spotted a better approach, or have thoughts on the architecture or the direction, I would genuinely like to hear it.

View on GitHub →


AI / Productivity · March 2026 · 10 min read

Building the finance bot our family actually uses every day

How a simple frustration with household expense tracking led to a serverless AI bot, and what it took to make something genuinely reliable rather than just technically interesting.

Most finance tracking systems have the same failure mode. They work beautifully for the person who set them up and nobody else. The categories are unfamiliar. The interface requires deliberate effort. The habit breaks within a week. The spreadsheet becomes a monument to good intentions.

My household runs on two people with different spending patterns, different contexts for each purchase, and a shared but loosely managed sense of where money goes each month. We had tried apps. We had tried shared Google Sheets. We had tried a period of determined discipline that lasted about three weeks before both of us stopped updating things.

The problem was never motivation. It was friction. Logging an expense needed to feel like sending a text, not filing a form. If it takes more than five seconds, it does not get done.

That is the constraint this project was built around. Not analytics richness. Not financial insight. Friction, first. Everything else second.


The obvious question

The obvious answer to household expense tracking in 2026 is to use one of the many apps built for exactly this purpose. They are polished, well-supported, and have years of iteration behind them. So why build something instead?

The short answer is that the apps solved a slightly different problem than the one I had. The long answer involves a few specific objections.

Shared apps require shared buy-in. Getting a second person to install something, create an account, and learn a new interface is a non-trivial ask. Telegram is already on both of our phones. The bot lives inside a conversation thread. There is nothing to install and nothing to learn. That removes the largest single barrier to consistent use.

Most apps require structured input. You tap a category, enter an amount, select a merchant from a dropdown, confirm. That is still a form, just a mobile one. The bot accepts a text message. "45 Rewe" is the entirety of what you need to type. The AI handles everything else. The interface is already a language model, which means the input can be as casual and unstructured as actual human speech.

The data is ours and it stays ours. Commercial apps aggregate usage data. They have terms of service. Some connect directly to bank accounts, which introduces a category of risk that I am not comfortable with for a household finance tool. Google Sheets is a place I control. The bot writes to it. Nobody else reads it.

The spreadsheet is still the interface for the people who need it. My family members who are not developers can open the Google Sheet, filter by month, and see exactly what was recorded. There is no proprietary format, no export step, and no lock-in. The underlying data is always plain and accessible.

Those four things together made building more attractive than buying, even accounting for the time it took.


The design brief

Before writing any code, I wrote down what the system had to do and, just as importantly, what it did not need to do.

It had to accept natural language expense input without any required structure. "15 lunch," "12,50 pizza," "655 ETF investment," and a photo of a supermarket receipt all had to produce the same clean, structured row in the spreadsheet. The AI had to make every categorisation and parsing decision without asking the user for clarification.

It had to be truly serverless. Not because of cost, though that mattered, but because a tool that requires a running server is a tool with a maintenance burden. Vercel functions cost nothing at household scale and require no operational attention. The bot either responds instantly or it does not respond at all. There is no degraded state to manage.

It had to work across family members without any synchronisation mechanism beyond the shared spreadsheet. Two people logging expenses from different phones at the same time needed to produce two separate rows without collision. Race condition handling was a requirement, not an afterthought.

It had to be honest about what it could not do. A bot that confidently parses a receipt incorrectly and saves wrong data is worse than one that asks you to try again. Validation checks on parsed amounts, categories, and dates were built into the pipeline before any data reaches the spreadsheet.

The goals feature came later, but it followed the same brief. Create financial and task goals using a single natural language message. View them without navigating away from the conversation. Mark them complete when done. No special syntax. No form filling.


Architecture decisions

The stack is deliberately minimal. Telegram handles the interface. Vercel handles the runtime. Google Sheets handles the database. Groq handles the AI inference. Each component does one thing well and passes its output to the next.

Why Telegram over a dedicated web interface. The framing that drove this decision was simple: the best interface is the one people already have open. Telegram is a conversation. Adding a finance bot to an existing conversation thread is zero marginal effort. A dedicated web app, even a well-designed one, is a tab you have to remember to open.

Why Groq over a local model. This system handles real-time conversational input. A user sends a message and expects a response within a second or two. Local inference at that latency would require dedicated hardware, which a household setup cannot justify for a task as lightweight as expense parsing. Groq’s LPU inference returns in under a second even for vision tasks like receipt scanning. The data involved is a purchase amount and a merchant name. Sending that to a cloud API is a different data governance decision than sending an immigration law query. I made the trade-off consciously.

Why Google Sheets over a proper database. The “database” for a household finance tracker is used by non-technical family members who want to open a spreadsheet, scroll through January, and understand what they spent on groceries. A PostgreSQL instance is not accessible to those users without a purpose-built frontend. Google Sheets is. The entire value of the persistent store is its accessibility, and Sheets provides that natively. The trade-offs, including limited query expressiveness and race conditions on concurrent writes, were all solvable at household scale.

Why a single serverless file. A single-file serverless handler eliminates the overhead of module management, import resolution across deployment environments, and the cognitive load of navigating a project structure for what is fundamentally a straightforward routing and persistence task. The entire bot fits in one file. Every component of the system is visible at once.
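A minimal sketch of what that single file can look like on Vercel's Python runtime, which expects a class named `handler` extending `BaseHTTPRequestHandler`. The `route` function and branch names here are illustrative, not the bot's actual code:

```python
import json
from http.server import BaseHTTPRequestHandler

def route(update: dict) -> str:
    """Decide which branch handles a Telegram update (simplified sketch).
    Note: /goals must be checked before /goal, since the latter is a prefix."""
    if update.get("callback_query"):
        return "button"
    msg = update.get("message", {})
    if msg.get("photo"):
        return "receipt"
    text = msg.get("text", "")
    if text.startswith("/goals"):
        return "goals"
    if text.startswith("/goal"):
        return "goal"
    if text.startswith("/summary"):
        return "summary"
    return "expense"

class handler(BaseHTTPRequestHandler):
    """Vercel-style entry point: one class, one file, one webhook."""
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        update = json.loads(self.rfile.read(length) or b"{}")
        branch = route(update)  # dispatch to the matching branch (omitted here)
        self.send_response(200)
        self.end_headers()
        self.wfile.write(branch.encode())
```

Everything else — parsing, validation, persistence — hangs off that one dispatch, which is what keeps the whole system visible in a single read.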


How it works

Pipeline: inputs and per-message flow

Every interaction enters through one of three trigger sources, all routed through the Telegram webhook:

- Text message — "45 Rewe" or "/goal Trip to Italy 2000"
- Receipt photo — vision model scans the total and merchant
- Button tap — inline keyboard callback

The per-message handler is a single Vercel serverless function:

1. Webhook entry — secret token validated; user ID checked against the allowlist
2. Route and dispatch — expense text, receipt photo, /goal, /goals, /summary, or button callback
3. AI parsing (Groq, Llama 4 Vision) — structured system prompt; user input wrapped in delimiters; JSON response format enforced
4. Validation — amount range, category allowlist, date future-check, formula injection strip
5. Persist to Google Sheets (gspread) — Expenses tab or Goals tab; row append; in-process cache invalidated
6. /summary branch: analytics (pandas) — DataFrame with a two-minute in-memory cache; category, user, and merchant breakdowns
7. Telegram response — confirmation with amount and category, an interactive dashboard with drill-down, or the goals list with edit buttons
Left: expense confirmation after sending "45 Rewe". Centre: the /summary dashboard with category breakdown and user drill-down. Right: the goals list with inline edit and complete buttons.

The expense parsing prompt is where most of the system behaviour lives. It handles German decimal notation ("12,50" means 12.50 EUR, not 1250), local merchant shorthand ("DM" means dm-drogerie markt, not Deutsche Mark), and the full range of input formats a real user would actually type. The AI is instructed to return only valid JSON, to trust the exact number given without scaling it, and to default to "Other" rather than invent a category that does not exist. These rules are not clever. They are the result of actual failures in early testing that produced wrong data in the spreadsheet.
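Condensed into code, rules of that kind look something like the sketch below. The prompt wording, key names, and `parse_model_reply` helper are illustrative, not the bot's actual prompt or code:

```python
import json

# A condensed sketch of the hardening rules described above; the bot's real
# prompt wording differs.
SYSTEM_PROMPT = (
    "You parse household expense messages into JSON with keys "
    '"amount", "category", "merchant". Rules: '
    "1) A comma is a German decimal separator: '12,50' means 12.50, never 1250. "
    "2) Trust the number exactly as given; never rescale it. "
    "3) If no listed category fits, use 'Other'; never invent a category. "
    "Return only valid JSON, with no surrounding text."
)

def parse_model_reply(reply: str) -> dict:
    """Reject anything that is not a bare JSON object with the expected keys."""
    parsed = json.loads(reply)  # raises on non-JSON, by design
    missing = {"amount", "category", "merchant"} - parsed.keys()
    if missing:
        raise ValueError(f"model reply missing keys: {missing}")
    return parsed
```

Enforcing the contract on the way out, not just in the prompt, is what keeps a misbehaving model from ever reaching the spreadsheet.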

The goal parsing prompt has different challenges. It has to extract a deadline from natural language like "by next summer" or "in three months," classify the goal type from context, and decide whether something is a financial goal or a task based on whether it carries an amount. The AI is given the current date as context and instructed to return a null date rather than invent one when the input is ambiguous. A goal with no deadline is fine. A goal with a fabricated deadline is not.
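The "null rather than invented" rule implies a validation step on the way out of the model as well. A hedged sketch, assuming ISO-format dates and a hypothetical `check_deadline` helper:

```python
from datetime import date

def check_deadline(deadline, today: date):
    """A null deadline is acceptable; a malformed or past one is rejected.

    Hypothetical helper illustrating the rule described above — the bot's
    actual validation code may differ.
    """
    if deadline is None:
        return None
    try:
        parsed = date.fromisoformat(deadline)
    except ValueError:
        raise ValueError(f"model returned a malformed deadline: {deadline!r}")
    if parsed < today:
        raise ValueError(f"deadline {deadline} is in the past")
    return deadline
```

The asymmetry is deliberate: missing data passes through, fabricated data fails loudly.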

The analytics engine runs on pandas. When a user calls /summary, the bot fetches the full Expenses sheet and caches the resulting DataFrame in memory for two minutes. This prevents repeated API calls to Google Sheets when family members hit the dashboard back-to-back, which would otherwise hit rate limits quickly. The dashboard uses Telegram’s inline keyboard to let users drill down without leaving the conversation. Clicking a user’s name opens their personal category breakdown. The interaction stays entirely within Telegram.
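The cache-then-group pattern can be sketched in a few lines. The function names and cache structure below are illustrative, not the bot's actual implementation:

```python
import time
import pandas as pd

_cache = {"df": None, "at": 0.0}
CACHE_TTL = 120  # seconds — the two-minute window described above

def get_expenses(fetch_rows) -> pd.DataFrame:
    """Return a cached DataFrame, refetching from Sheets only after the TTL."""
    now = time.monotonic()
    if _cache["df"] is None or now - _cache["at"] > CACHE_TTL:
        _cache["df"] = pd.DataFrame(fetch_rows())  # fetch_rows hits the Sheets API
        _cache["at"] = now
    return _cache["df"]

def category_breakdown(df: pd.DataFrame) -> pd.Series:
    """Total spend per category, largest first."""
    return df.groupby("category")["amount"].sum().sort_values(ascending=False)
```

Because the function instance can persist between warm serverless invocations, even a module-level cache like this absorbs back-to-back dashboard hits.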


The quality gate

The most useful testing was simply using the bot daily. Real failures surfaced quickly: "655 investment ETF" parsed as 6.55 EUR; "DM" classified as Deutsche Mark rather than the drugstore chain. Each was a wrong row in the spreadsheet, which is worse than no row at all. The parsing rules were tightened in the prompt until the failures stopped.

Security and correctness required explicit attention. Webhook secret token validation, an entry-point allowlist, and formula injection stripping were all added after testing showed the gaps. The undo handler uses a timestamp re-fetch before deleting to avoid corrupting data when two people log expenses at the same moment.
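Formula injection stripping is the standard spreadsheet-injection mitigation: any text cell starting with a character Sheets would treat as a formula trigger gets a quote prefix, forcing it to render as plain text. A sketch of the idea, not the bot's exact code:

```python
def strip_formula(cell: str) -> str:
    """Neutralise spreadsheet formula injection in text cells.

    A leading =, +, -, or @ (or tab/CR) would make Sheets evaluate the cell
    as a formula, so prefix a single quote to force plain-text rendering.
    Applied to text fields like merchant names, not to numeric amounts.
    """
    if cell and cell[0] in ("=", "+", "-", "@", "\t", "\r"):
        return "'" + cell
    return cell
```

This matters because a "merchant name" in a user message is attacker-controllable input the moment the bot is exposed to a webhook, allowlist or not.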


What comes next

Two extensions are clearly defined.

The goals feature currently tracks flat lists. A more useful version would connect goal targets to actual spending patterns in the Expenses sheet. If you are saving for a trip and logging expenses consistently, the system should tell you whether your savings rate maps to your target date.

Multi-currency support is the more pressing gap. The household spans three currencies regularly: INR and USD from travel to India and the US, and CHF from occasional trips to Switzerland. All amounts are currently normalised to EUR at entry, which requires a manual conversion before typing. The fix is to detect the currency from the input, fetch an exchange rate at log time, and store both the original and the EUR equivalent.
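Detection could be as simple as scanning the message for a symbol or code before parsing. A hypothetical sketch of that first step, with illustrative patterns rather than anything the bot currently ships:

```python
import re

# Hypothetical detection rules for the currencies named above.
CURRENCY_PATTERNS = {
    "USD": r"\$|USD",
    "INR": r"₹|INR",
    "CHF": r"CHF",
    "EUR": r"€|EUR",
}

def detect_currency(text: str, default: str = "EUR") -> str:
    """Return the first currency whose symbol or code appears in the message."""
    for code, pattern in CURRENCY_PATTERNS.items():
        if re.search(pattern, text, re.IGNORECASE):
            return code
    return default
```

The rate fetch and dual-column storage (original amount plus EUR equivalent) would then hang off whatever code this returns.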


What this taught me about building tools for actual use

A few things are worth stating plainly.

You do not need a frontend for a proof of concept. Telegram already existed on every phone in the household. Building inside it rather than building a new interface meant zero onboarding, no installation, and no habit to form. The tool gets used because it is already in a conversation thread people check anyway. The friction that stops a family member from logging a €15 lunch is the exact same friction that stops a sales team from updating their CRM. If the tool is not where the user already is, adoption fails regardless of how good the underlying system is.

Google Sheets is a legitimate persistence layer for personal tools. It is not a database. But for a household finance tracker, the value of the persistent store is that family members can open it directly, without going through the bot. Plain and accessible beats sophisticated and opaque.

Daily use is the only honest test. The parsing bugs that mattered did not appear in any test I designed. They appeared because the bot was used every day by real people. Ship to production early. That is where the real failures are.

The implementation was built using Claude Code as a coding assistant. The product decisions, prompt engineering, and architecture were mine.

© 2026 Arnav Amal Ray. All rights reserved.