TARS

A locally-hosted AI agent that remembers things, runs on my own hardware, and listens to voice notes over Telegram.

Running

I got tired of assistants that reset every time I opened a chat. No memory from last week, no context on active projects, no continuity. So I built one that keeps context. TARS runs on my Linux machine, takes voice notes through Telegram, transcribes locally with Whisper, handles the request, and writes useful details into an Obsidian vault that carries across sessions. It feels less like a chatbot and more like a second brain I can talk to.

Python · Ollama · Whisper · n8n · OpenClaw · Anthropic API

The memory architecture.

Most agent setups I tried either ignored memory or jammed too much into a system prompt. Neither held up in practice. Without persistence, the agent forgets everything. With prompt stuffing, it burns context on old details before the conversation even starts.

My setup uses an Obsidian vault at ~/brain/ with a clear directory structure. During a conversation, Haiku writes relevant details into today's daily note. At 11pm, a Sonnet cron job reads through the day's log, extracts anything worth keeping, and writes or updates files in memory/ with proper wikilinks.

~/brain/
  daily/       session logs, appended during conversations
  memory/
    people/    notes on individuals
    projects/  per-project context
  knowledge/   reference material
  inbox/       staging for unprocessed notes

This two-layer setup keeps live costs low and long-term memory useful. Haiku handles live interactions cheaply. Sonnet runs once at night and does the heavier consolidation work, deciding what should be kept and where it belongs. Daily notes stay as raw logs, and memory files stay organized and linked.
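The live layer is deliberately simple in code. Here is a minimal sketch of the append step, assuming a vault at ~/brain/ and a hypothetical helper name (the real agent calls something like this from its conversation loop):

```python
from datetime import datetime
from pathlib import Path

VAULT = Path.home() / "brain"

def append_to_daily(note: str, vault: Path = VAULT) -> Path:
    """Append a timestamped bullet to today's daily note, creating it if needed."""
    daily = vault / "daily" / f"{datetime.now():%Y-%m-%d}.md"
    daily.parent.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%H:%M")
    with daily.open("a", encoding="utf-8") as f:
        f.write(f"- {stamp} {note}\n")
    return daily
```

Because the nightly Sonnet job reads this file later, the live write can stay a raw append-only log; no structure is needed until consolidation.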


The ~/brain/ vault structure showing daily notes and consolidated memory files

Debugging the voice transcription pipeline during early setup, errors included

It started with n8n.

My first instinct was to build the agent workflow in n8n. I already had it running from earlier automation work, and I wanted predictable branching and deterministic steps. I wired nodes for Whisper transcription, model routing, vault writes, and responses. The structured parts worked.

The weak spot was conversation state. n8n is built for discrete workflows that start and finish. Keeping context alive across back-and-forth dialogue meant pushing the tool past what it is designed for. Session boundaries kept forcing fragile state patches.

Then I found OpenClaw.

That is when I moved to OpenClaw, a self-hosted framework built for persistent agents. It handles personality config, memory access, and conversation continuity directly. The workarounds I needed in n8n were not necessary there.

Starting with n8n still helped because it made the requirements obvious. Pipeline automation and stateful conversation are different problems, so I separated them. OpenClaw runs the agent and memory layer. n8n handles structured background jobs such as nightly vault consolidation. Each tool does what it is good at.

The stack.

The hardware runs everything locally: Ryzen 9 9900X, RTX 5060 Ti (16GB VRAM), 32GB DDR5, Pop!_OS Linux. Local inference runs through Ollama. Claude Haiku handles most conversations via the Anthropic API. Sonnet handles heavier reasoning and the nightly cron job that consolidates the day's notes into the vault's permanent memory layer.

Local inference: Ollama with qwen3:14b as the primary local model. It fits in VRAM without aggressive quantization and handles lightweight routing and scoring.
Anthropic API: Claude Haiku 4.5 for conversation. Claude Sonnet 4.6 for the nightly consolidation cron and anything that needs better reasoning quality.
Voice pipeline: faster-whisper for local transcription. Fully offline, low latency on the RTX 5060 Ti.
Agent framework: OpenClaw, self-hosted. It manages personality config, the conversation loop, and vault access.
Workflow automation: n8n for structured background pipelines. It runs scheduled tasks and deterministic multi-step jobs.
Memory: Obsidian vault. Haiku appends during conversations; Sonnet consolidates nightly.
Communication: Telegram. Voice notes in, text responses out.
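The three-tier routing above can be sketched as plain Python. The model identifiers, task kinds, and thresholds here are illustrative assumptions, not the exact logic TARS uses:

```python
from dataclasses import dataclass

# Illustrative model identifiers: the local tier runs through Ollama,
# the other two through the Anthropic API.
LOCAL, HAIKU, SONNET = "qwen3:14b", "claude-haiku", "claude-sonnet"

@dataclass
class Task:
    kind: str                    # "routing", "scoring", "chat", "consolidation", ...
    needs_reasoning: bool = False

def pick_model(task: Task) -> str:
    """Route cheap work locally, conversation to Haiku, heavy reasoning to Sonnet."""
    if task.kind in ("routing", "scoring"):
        return LOCAL             # fits in 16GB VRAM, effectively free per call
    if task.kind == "consolidation" or task.needs_reasoning:
        return SONNET            # nightly cron and genuinely hard requests
    return HAIKU                 # default conversational path
```

The point of centralizing this in one function is that cost policy changes (say, moving scoring to Haiku) touch one place instead of every pipeline.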

TARS responding to a voice message via Telegram

Token and rate-limit optimization.

After the core loop was stable, I treated token usage and API limits as system constraints instead of afterthoughts. Voice transcripts, retrieval results, and long-running context can inflate prompt size fast, so I added guardrails that keep each request inside a defined budget before it reaches a model.

Context packing: Retrieved notes are deduplicated and compressed into high-signal bullets so only relevant memory enters the prompt window.
Token budgets: Per-task input and output ceilings are enforced before execution, with different limits by model and workflow type.
Routing by cost/latency: Cheap local and Haiku paths handle routine tasks; Sonnet is reserved for high-value reasoning and nightly consolidation.
Rate-limit resilience: Queued execution, jittered exponential backoff, and fallback paths reduce failed requests during bursty traffic.
Token observability: Usage is tracked by pipeline stage so regressions in prompt size or output length can be caught and corrected quickly.
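The jittered exponential backoff mentioned above is a standard pattern; a minimal sketch, where the base delay, cap, and the RuntimeError stand-in for the API's rate-limit error are all assumptions:

```python
import random
import time

def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0):
    """Yield full-jitter delays: uniform over an exponentially growing ceiling."""
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield random.uniform(0.0, ceiling)

def call_with_retry(fn, attempts: int = 5):
    """Call fn, sleeping a jittered delay after each rate-limited failure."""
    last_err = None
    for delay in backoff_delays(attempts):
        try:
            return fn()
        except RuntimeError as err:   # stand-in for the API's 429 error type
            last_err = err
            time.sleep(delay)
    raise last_err
```

Full jitter (uniform from zero rather than around the ceiling) spreads retries out, so a burst of simultaneous failures does not retry in lockstep and hit the limit again.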

This reduced token burn significantly and made the system much less sensitive to API rate-limit spikes, while keeping response quality consistent for daily use.
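The context-packing step can be sketched as a small pure function. The four-characters-per-token estimate is a rough assumption, not a real tokenizer:

```python
def pack_context(snippets: list[str], max_tokens: int = 800) -> str:
    """Dedupe retrieved notes and pack them into bullets under a rough token budget."""
    seen: set[str] = set()
    bullets: list[str] = []
    used = 0
    for snippet in snippets:
        line = " ".join(snippet.split())      # collapse whitespace
        key = line.lower()
        if not line or key in seen:
            continue                          # drop empty lines and duplicates
        cost = len(line) // 4 + 1             # crude ~4 chars/token estimate
        if used + cost > max_tokens:
            break                             # budget enforced before the API call
        seen.add(key)
        bullets.append(f"- {line}")
        used += cost
    return "\n".join(bullets)
```

Enforcing the budget here, before anything reaches a model, is what keeps a bad retrieval day from silently doubling the per-request cost.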

What I'd do differently.

Define memory before writing code

I went through a few vault-structure iterations before landing on the current one, and they were slower than they needed to be because I didn't have a clear model of what I was building toward. The two-layer daily/permanent split seems obvious in hindsight. I'd start there.

Start from requirements, not available tooling

I moved to n8n partly because I already had it running. That is not a great architecture decision by itself, and it cost me a few weeks. Pick tools that fit the problem, not tools that happen to be available.

Running in daily use. The core loop is stable: voice in, transcription, reasoning, vault write, response out. I am still iterating on memory architecture to separate what is genuinely useful from what turns into noise.