## The hardware
- 5x Raspberry Pi Zero 2 W ($15 each)
- 5x ArduCam IMX708 12MP 120-degree wide-angle cameras
- 5x WM8960 audio HATs
- 1x 13TB Ugreen NAS
- Custom Python daemon: motion/audio detection, triggered MJPEG/H.264 + WAV recording, idle sleep
Total cost: under $500. Runs 24/7 across five rooms: office, kitchen, living room, bedroom, hallway.
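The trigger logic behind "motion/audio detection, triggered recording, idle sleep" can be sketched roughly like this. It is an illustration under stated assumptions, not the daemon's actual code: the frame-differencing motion score, the RMS audio level, the threshold values, and every name here (`Trigger`, `motion_score`, `audio_rms`) are hypothetical, and frames are plain lists of grayscale values to keep the example free of camera dependencies.

```python
import math
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    IDLE = "idle"
    RECORDING = "recording"


def motion_score(prev, curr):
    """Mean absolute difference between two equal-length grayscale frames."""
    return sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)


def audio_rms(samples):
    """Root-mean-square level of a chunk of signed PCM samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


@dataclass
class Trigger:
    motion_threshold: float = 8.0    # assumed value, tune per camera
    audio_threshold: float = 500.0   # assumed value for 16-bit PCM
    cooldown_ticks: int = 3          # quiet ticks before going back to idle
    state: State = State.IDLE
    quiet: int = 0

    def update(self, motion, rms):
        """Advance the state machine by one tick and return the new state."""
        active = motion > self.motion_threshold or rms > self.audio_threshold
        if active:
            # Any activity starts (or extends) a recording.
            self.state = State.RECORDING
            self.quiet = 0
        elif self.state is State.RECORDING:
            # Keep recording through short pauses, then drop back to idle.
            self.quiet += 1
            if self.quiet >= self.cooldown_ticks:
                self.state = State.IDLE
                self.quiet = 0
        return self.state
```

The cooldown is the design choice that matters: without it, a person who stands still for a second splits one event into many tiny clips.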
## Why
Your AI knows you through text. Only text. It has never seen your face, never watched you work, never noticed you pacing the room before a stressful call. That behavioral data is more valuable than anything you'll ever type into a prompt.
I spent weeks debugging device tree overlays, swapped camera modules three times (started with ov64a40 64MP, settled on IMX708 12MP after thermal issues killed the first setup), and burned through two Pi Zeros that couldn't handle the heat. This was infrastructure work, not a weekend hack.
## The stack
There are three layers to AI that actually understands you:
1. Observation: cameras, mics, sensors. Physical-world capture.
2. Memory: persistent, intelligent, cross-session. Not a vector dump.
3. Reasoning: the LLM. Already good enough.
Everyone builds layer 3. I built TrueMemory for layer 2 (arXiv paper). Now I'm building layer 1.
## What I learned
The capture hardware is the easy part. Cheap sensors on cheap boards, running cheap compute. The hard part is the pipeline between raw sensor data and LLM-ingestible context: motion detection, then recording, then vision model inference, then structured text, then memory, then retrieval. Every step introduces noise. Solving that pipeline is where the real work lives.
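To see why noise at every step matters, here is a back-of-the-envelope sketch. The stage names come from the pipeline above; the per-stage fidelity numbers are made up for illustration, not measurements, and the model assumes each stage loses information independently of the others.

```python
from math import prod

# Stage names follow the pipeline described in this post.
# The fidelity values are hypothetical: the fraction of useful signal
# each stage passes along intact.
STAGES = {
    "motion detection": 0.9,
    "recording": 0.9,
    "vision model inference": 0.9,
    "structured text": 0.9,
    "memory": 0.9,
    "retrieval": 0.9,
}


def end_to_end(fidelities):
    """Multiply per-stage fidelities, assuming losses are independent."""
    return prod(fidelities)


if __name__ == "__main__":
    total = end_to_end(STAGES.values())
    print(f"end-to-end fidelity: {total:.2f}")  # prints 0.53
```

Under those assumptions, six stages that each look 90% reliable leave only about half of the original signal by the time it reaches the LLM.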
Josh Adler builds persistent memory and physical-world awareness for AI. joshadler.com | arXiv paper