# Designing and Building an AI‑Enhanced CCTV System at Home
### Why Build Your Own AI‑Enhanced CCTV?
When you buy a consumer‑grade security camera, you’re not just paying for the lens and the plastic housing. You’re also paying for a subscription that ships every frame of your backyard to a cloud service you’ll never meet. That data can be used to train models, sold to advertisers, or handed over to authorities on a whim. For many, the convenience outweighs the privacy cost, but for anyone who values control over their own footage, the trade‑off feels unacceptable.

The goal of this project was simple: **keep every byte of video on‑premises, add a layer of artificial intelligence that makes the footage searchable and actionable, and do it all on a budget that wouldn’t break the bank**. Over the past six months I’ve iterated on a design that satisfies those constraints, and the result is a fully local, AI‑enhanced CCTV system that can tell you when a “red SUV” pulls into the driveway, or when a “dog wearing a bandana” wanders across the garden, without a single frame ever leaving the house.

---
### The Core Software – Frigate

At the heart of the system sits **Frigate**, an open‑source network video recorder (NVR) that runs in containers and is configured entirely via a single YAML file. The simplicity of the configuration is a breath of fresh air compared with the sprawling JSON or proprietary GUIs of many commercial solutions. A few key reasons Frigate became the obvious choice:

| Feature | Why It Matters |
|---------|----------------|
| **Container‑native** | Deploys cleanly on Docker, Kubernetes, or a lightweight LXC. No host‑level dependencies to wrestle with. |
| **YAML‑driven** | Human‑readable, version‑controlled, and easy to replicate across test environments. |
| **Built‑in object detection** | Supports car, person, animal, and motorbike detection out of the box, with the ability to plug in custom models. |
| **Extensible APIs** | Exposes detection events, snapshots, and stream metadata for downstream automation tools. |
| **GenAI integration** | Recent addition that lets you forward snapshots to a local LLM (via Ollama) for semantic enrichment. |

The documentation is thorough, and the community is active enough that most stumbling blocks are resolved within a few forum posts. Because the entire system is defined in a single YAML file, I can spin up a fresh test instance in minutes, tweak a camera’s FFmpeg options, and see the impact without rebuilding the whole stack.
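
To give a sense of how compact that file is, here is a minimal single‑camera sketch of a `config.yml`. The camera name, address, and credentials are placeholders, and some keys vary between Frigate releases, so treat it as a starting point rather than a drop‑in config:

```yaml
# config.yml – minimal single-camera sketch (all values are placeholders)
mqtt:
  enabled: false            # not needed for a quick standalone test

cameras:
  front_gate:               # hypothetical camera name
    ffmpeg:
      inputs:
        - path: rtsp://admin:CHANGE_ME@192.168.1.50:554/stream1
          roles:
            - detect
            - record
    detect:
      width: 1280
      height: 720
      fps: 5
    objects:
      track:
        - person
        - car
        - dog
```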
---
### Choosing the Cameras – TP‑Link Vigi C540

A surveillance system is only as good as the lenses feeding it. I needed cameras that could:

1. Deliver a reliable RTSP stream (the lingua franca of NVRs).
2. Offer pan‑and‑tilt so a single unit can cover a larger field of view.
3. Provide on‑board human detection to filter out false triggers before the NVR does any heavy lifting.
4. Remain affordable enough to allow for future expansion.

The **TP‑Link Vigi C540** checked all those boxes. Purchased during a Black Friday sale for roughly AUD 50 each, the three units I started with (a fourth is on the way) have proven surprisingly capable:

- **Pan/Tilt** – Allows a single camera to sweep a driveway or front porch, reducing the number of physical devices needed.
- **On‑board human detection** – The camera can flag a person locally, reducing false positives and sparing the NVR from processing empty frames.
- **RTSP output** – Perfectly compatible with Frigate’s ingest pipeline.
- **No zoom** – A minor limitation, but the 1080p field of view is wide enough for my modest property.
The cameras are wired via Ethernet, a decision driven by reliability concerns. Wireless links are prone to interference, especially when the cameras are placed near metal roofs or dense foliage. Running Ethernet required a bit of roof work (more on that later), but the resulting stable connection has paid dividends in stream consistency.
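
On the Frigate side, each Vigi unit then becomes a camera entry with two RTSP inputs: the full‑resolution main stream for recording and the lighter sub‑stream for detection. A sketch follows; the `stream1`/`stream2` paths are the ones commonly reported for Vigi cameras, so verify them against your model before copying anything:

```yaml
cameras:
  driveway:                 # hypothetical camera name
    ffmpeg:
      inputs:
        # Full-resolution main stream: used only for recording
        - path: rtsp://viewer:CHANGE_ME@192.168.1.51:554/stream1
          roles:
            - record
        # Lower-resolution sub-stream: cheaper to run detection against
        - path: rtsp://viewer:CHANGE_ME@192.168.1.51:554/stream2
          roles:
            - detect
```

Splitting the roles this way keeps the detection load on the CPU manageable while still archiving full‑quality footage.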
---
### The Host Machine – A Budget Dell Workstation

All the AI magic lives on a modest **Dell OptiPlex 7050 SFF** that I rescued for $150. Its specifications are:

- **CPU:** Intel i5‑7500 (4 cores, 3.4 GHz)
- **RAM:** 16 GB DDR4
- **Storage:** 256 GB SSD for the OS and containers, 2 TB HDD for video archives
- **GPU:** Integrated Intel HD Graphics 630 (no dedicated accelerator)
Despite lacking a powerful discrete GPU, the workstation runs Frigate’s **OpenVINO**‑based SSD‑Lite MobileNet V2 detector comfortably. The model is small enough to execute on the integrated graphics, keeping inference latency low enough for real‑time alerts. CPU utilization hovers around 70‑80 % under typical load, which is high but acceptable for a home lab. The system does run warm, so I’ve added a couple of case fans to keep temperatures in the safe zone.
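
Wiring the detector into Frigate is a few more lines of YAML. The snippet below mirrors the OpenVINO example in Frigate’s documentation, pointing at the SSDLite MobileNet V2 model bundled with recent Frigate images; exact paths and keys may differ on other versions:

```yaml
detectors:
  ov:
    type: openvino
    device: GPU             # run inference on the HD 630 iGPU; CPU also works

model:
  width: 300
  height: 300
  input_tensor: nhwc
  input_pixel_format: bgr
  path: /openvino-model/ssdlite_mobilenet_v2.xml
  labelmap_path: /openvino-model/coco_91cl_bkgr.txt
```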
The storage layout is intentional: the SSD hosts the OS, Docker engine, and Frigate container, ensuring fast boot and container start times. The 2 TB HDD stores raw video, detection clips, and alert snapshots. With the current retention policy (7 days of full footage, 14 days of detection clips, 30 days of alerts) the drive is comfortably sized, though I plan to monitor usage as I add more cameras.

---
### Wiring It All Together – Proxmox and Docker LXC

To keep the environment tidy and reproducible, I run the entire stack inside a **Proxmox VE** cluster. A dedicated node hosts a **Docker‑enabled LXC container** that isolates the NVR from the rest of the homelab. This approach offers several benefits:

- **Resource isolation** – CPU and memory limits can be applied per container, preventing a runaway process from starving other services.
- **Snapshot‑ready** – Proxmox can snapshot the whole container, giving me a quick rollback point if a configuration change breaks something.
- **Portability** – The LXC definition can be exported and re‑imported on any other Proxmox host, making disaster recovery straightforward.
Inside the container, Docker orchestrates the Frigate service, an Ollama server (hosting the LLM models), and a lightweight reverse proxy for HTTPS termination. All traffic stays within the local network; the only external connections are occasional model downloads from Hugging Face and the occasional software update.
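
A stripped‑down `docker-compose.yml` for that layout might look like the sketch below. Image tags, paths, and port mappings are examples, and the reverse proxy is omitted for brevity:

```yaml
services:
  frigate:
    image: ghcr.io/blakeblackshear/frigate:stable
    restart: unless-stopped
    shm_size: "256mb"                     # shared memory for decoded frames
    devices:
      - /dev/dri/renderD128               # expose the iGPU for OpenVINO/VAAPI
    volumes:
      - ./frigate/config:/config          # config lives on the SSD
      - /mnt/hdd/frigate:/media/frigate   # recordings go to the 2 TB HDD
    ports:
      - "5000:5000"                       # Frigate web UI

  ollama:
    image: ollama/ollama:latest
    restart: unless-stopped
    volumes:
      - ./ollama:/root/.ollama            # downloaded model cache
    ports:
      - "11434:11434"                     # Ollama API
```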
---
### From Detection to Context – The Ollama Integration

Frigate’s native object detection tells you *what* it sees (e.g., “person”, “car”, “dog”). To turn that into *meaningful* information, I added a **GenAI** layer using **Ollama**, a self‑hosted LLM runtime that can serve vision‑capable models locally.

The workflow is as follows:

1. **Frigate detects an object** and captures a snapshot of the frame.
2. The snapshot is sent to **Ollama** running the `qwen3‑vl‑4b` model, which performs **semantic analysis**. The model returns a textual description such as “a white ute with a surfboard on the roof”.
3. Frigate stores this enriched metadata alongside the detection event.
4. When a user searches the Frigate UI for “white ute”, the system can match the description generated by the LLM, dramatically narrowing the result set.
5. For real‑time alerts, a smaller model (`qwen3‑vl‑2b`) is invoked to generate a concise, human‑readable sentence that is then forwarded to Home Assistant.
Because the LLM runs locally, there is no latency penalty associated with round‑trip internet calls, and privacy is preserved. The only external dependency is the occasional model pull from Hugging Face during the initial setup or when a newer version is released.
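
Hooking Frigate up to Ollama is only a few lines of YAML. The sketch below follows the shape of Frigate’s GenAI configuration; the model name must match whatever tag the vision model was pulled under in Ollama, so `qwen3-vl-4b` here is an assumption rather than a canonical tag:

```yaml
genai:
  enabled: true
  provider: ollama
  base_url: http://ollama:11434   # the Ollama container from the compose file
  model: qwen3-vl-4b              # must match the tag used with `ollama pull`
```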
---
### Home Assistant – The Glue That Binds

While Frigate handles video ingestion and object detection, **Home Assistant** provides the automation backbone. By integrating Frigate’s webhook events into Home Assistant, I can:

- **Trigger notifications** via Matrix when a detection meets certain criteria.
- **Run conditional logic** to decide whether an alert is worth sending (e.g., ignore cars on the street but flag a delivery van stopping at the gate).
- **Log events** into a time‑series database for later analysis.
- **Expose the enriched metadata** to any other smart‑home component that might benefit from it (e.g., turning on porch lights when a person is detected after dark).
The Home Assistant configuration lives in its own YAML file, mirroring the philosophy of “infrastructure as code”. This makes it easy to version‑control the automation logic alongside the NVR configuration.
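
As a sketch, the core notification automation looks roughly like this. The Matrix notifier name depends on how the Matrix integration is configured, the camera name is a placeholder, and the payload fields follow Frigate’s MQTT event format:

```yaml
# automations.yaml – sketch; service and camera names are placeholders
- alias: "Person at the front gate"
  trigger:
    - platform: mqtt
      topic: frigate/events
  condition:
    - condition: template
      value_template: >
        {{ trigger.payload_json['type'] == 'new'
           and trigger.payload_json['after']['label'] == 'person'
           and trigger.payload_json['after']['camera'] == 'front_gate' }}
  action:
    - service: notify.matrix_notify     # hypothetical notifier name
      data:
        message: >
          Person detected at the front gate
          (confidence {{ (trigger.payload_json['after']['top_score'] * 100) | round }} %).
```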
---
### Semantic Search – Finding a Needle in a Haystack
One of the most satisfying features of the system is the ability to **search footage using natural language**. Traditional NVRs only let you filter by timestamps or simple motion events. With the GenAI‑enhanced metadata, the search bar becomes a powerful query engine:

- Typing “red SUV” returns all clips where the LLM described a vehicle as red and an SUV.
- Searching “dog with a bandana” surfaces the few moments a neighbour’s pet decided to wear a fashion accessory.
- Combining terms (“white ute with surfboard”) narrows the results to a single delivery that happened last weekend.
Under the hood, the search is a straightforward text match against the stored descriptions, but the quality of those descriptions hinges on the LLM prompts. Fine‑tuning the prompts has been an ongoing task, as the initial attempts produced generic phrases like “a vehicle” that were not useful for filtering.
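
Frigate exposes a global prompt and per‑object overrides for exactly this kind of tuning. The wording below is illustrative rather than the exact prompt I settled on:

```yaml
genai:
  # A structured prompt yields richer, more searchable descriptions
  prompt: >
    Describe the {label} in these images. Mention its colour, type,
    and any distinguishing features or actions. Be specific and concise.
  object_prompts:
    car: >
      Describe this vehicle: colour, body style (sedan, SUV, ute, van),
      and any visible cargo such as a surfboard or ladder.
```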
---
### Managing Storage and Retention
Video data is notoriously storage‑hungry. To keep the system sustainable, I adopted a tiered retention policy:

| Data Type | Retention | Approx. Size (4 cameras) |
|------------|-----------|--------------------------|
| Full video (raw RTSP) | 7 days | ~1.2 TB |
| Detection clips (30 s each) | 14 days | ~300 GB |
| Alert snapshots (high‑res) | 30 days | ~150 GB |

The SSD holds the operating system and container images, while the HDD stores the bulk of the video. When the HDD approaches capacity, a simple cron job rotates out the oldest files, ensuring the system never runs out of space. In practice, the 2 TB drive has been more than sufficient for the current camera count, but I have a spare 4 TB drive on standby for future expansion.
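
Frigate can also enforce these windows itself from the same YAML file. A sketch using the pre‑0.14 style keys (newer releases split event retention into separate `alerts` and `detections` sections, so check the docs for your version):

```yaml
record:
  enabled: true
  retain:
    days: 7             # continuous full-resolution footage
    mode: all
  events:
    retain:
      default: 14       # clips around detections

snapshots:
  enabled: true
  retain:
    default: 30         # high-res alert snapshots
```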
---
### Lessons Learned – The Good, the Bad, and the Ugly

#### 1. **Performance Is a Balancing Act**

Running inference on an integrated GPU is feasible, but the CPU load remains high. Adding a modest NVIDIA GTX 1650 would drop CPU usage dramatically and free headroom for additional cameras or more complex models.

#### 2. **Prompt Engineering Is Real Work**

The LLM’s output quality is directly tied to the prompt. Early attempts used a single sentence like “Describe the scene,” which resulted in vague answers. Iterating on a multi‑step prompt that asks the model to list objects, colors, and actions has produced far richer metadata.

#### 3. **Notification Fatigue Is Real**

Initially, every detection triggered a push notification, flooding my phone with alerts for passing cars and stray cats. By adding a simple confidence threshold and a “time‑of‑day” filter in Home Assistant, I reduced noise by 80 %.
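
The filter itself is just two extra conditions bolted onto the notification automation, sketched here with placeholder thresholds:

```yaml
condition:
  # Ignore low-confidence detections
  - condition: template
    value_template: "{{ trigger.payload_json['after']['top_score'] > 0.75 }}"
  # Only push alerts overnight; the time window wraps past midnight
  - condition: time
    after: "21:00:00"
    before: "06:30:00"
```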
#### 4. **Network Stability Matters**

Wired Ethernet eliminated the jitter that plagued my early Wi‑Fi experiments. The only hiccup was a mis‑wired patch panel that caused occasional packet loss; a quick audit resolved the issue.

#### 5. **Documentation Pays Off**

Because Frigate’s configuration is YAML‑based, I could version‑control the entire stack in a Git repository. When a change broke the FFmpeg pipeline, a `git revert` restored the previous working state in minutes.

---
### Future Enhancements – Where to Go From Here
- **GPU Upgrade** – Adding a dedicated inference accelerator (e.g., an Intel Arc or NVIDIA RTX) to improve detection speed and lower CPU load.
- **Dynamic Prompt Generation** – Using a small LLM to craft context‑aware prompts based on the time of day, weather, or known events (e.g., “delivery” vs. “visitor”).
- **Smart Notification Decision Engine** – Training a lightweight classifier that decides whether an alert is worth sending, based on historical user feedback.
- **Edge‑Only Model Updates** – Caching Hugging Face models locally and scheduling updates during off‑peak hours to eliminate any internet dependency after the initial download.
- **Multi‑Camera Correlation** – Linking detections across cameras to track a moving object through the property, enabling a “follow‑the‑intruder” view.

---
### A Personal Note – The Roof, the Cables, and My Dad
All the technical wizardry would have been for naught if I hadn’t managed to get Ethernet cables from the house’s main distribution board up to the roof where the cameras sit. I’m decent with Docker, YAML, and LLM prompts, but I’m hopeless when it comes to climbing ladders and threading cables through roof joists.

Enter my dad. He spent an entire Saturday hauling a coil of Cat‑6, drilling through the roof sheathing, and pulling the cables into the attic space while I fumbled with the tools. He didn’t care that I’d rather be writing code than wielding a hammer; he just wanted to see the project succeed. The result is a rock‑solid wired backbone that keeps the cameras streaming without hiccups.

So, Dad, thank you. Your patience, muscle, and willingness to get your hands dirty made this whole system possible. I owe you a cold one and a promise to fix that leaky tap in the kitchen.

---
### Bringing It All Together – The Architecture
```mermaid
graph LR
    A[Camera] --> B[Frigate Object Detections]
    J[Future Camera] --> B
    B --> C["Send snapshot to Ollama (qwen3-vl-4b) for semantic search AI enhancement"]
    C --> D[Frigate NVR]
    D --> E[Home Assistant]
    E --> F[Object Detection from Frigate]
    F --> G[Copy Image to Home Assistant]
    G --> H["Send image to Ollama (qwen3-vl-2b) for context enhancement"]
    H --> I[Send response back via Matrix]
```

---
### Closing Thoughts
Building an AI‑enhanced CCTV system from the ground up has been a rewarding blend of hardware tinkering, software orchestration, and a dash of machine‑learning experimentation. The result is a **privacy‑first, locally owned surveillance platform** that does more than just record—it understands. It can answer natural‑language queries, send context‑rich alerts, and integrate seamlessly with a broader home‑automation ecosystem.

If you’re a hobbyist, a small‑business owner, or anyone who values data sovereignty, the stack described here offers a solid foundation. Start with a single camera, get comfortable with Frigate’s YAML configuration, and gradually layer on the AI components. Remember that the most valuable part of the journey is the learning curve: each tweak teaches you something new about video streaming, inference workloads, and the quirks of your own network.

So, roll up your sleeves, grab a ladder (or enlist a dad), and give your home the eyes it deserves—without handing the footage over to a faceless cloud. The future of home surveillance is local, intelligent, and, most importantly, under your control. Cheers!