# Blog Creator
An automated blog generation system that uses CrewAI agents to research, write, and edit blog posts from Trilium notes.
## Architecture
The system uses three CrewAI crews orchestrated by a Flow:
1. **Research Crew** - A critical researcher agent with web search capabilities investigates the topic and produces verified findings
2. **Writing Crew** - Creative journalist agents (one per model in `CONTENT_CREATOR_MODELS`) write draft blog articles in parallel, each with a different creative style
3. **Editor Crew** - A critical editor loads the drafts into a vector database, queries for relevant context, and produces the final polished document with metadata
## Requirements
- Python 3.10 or later
- Ollama server running with required models
- ChromaDB server for vector storage
- Trilium notes instance
- Gitea instance (for automated workflows)
- n8n instance (for notifications)
## Environment Variables
Create a `.env` file in the project root with the following variables:
```
# Trilium Configuration
TRILIUM_HOST=
TRILIUM_PORT=
TRILIUM_PROTOCOL=https
TRILIUM_PASS=
TRILIUM_TOKEN=
# Ollama Configuration
OLLAMA_PROTOCOL=http
OLLAMA_HOST=
OLLAMA_PORT=11434
EMBEDDING_MODEL=nomic-embed-text
EDITOR_MODEL=llama3.1:8b
CONTENT_CREATOR_MODELS=["phi4-mini:latest", "qwen3:1.7b", "gemma3:latest"]
# ChromaDB Configuration
CHROMA_HOST=chroma
CHROMA_PORT=8000
# Git Configuration
GIT_USER=
GIT_PASS=
GIT_PROTOCOL=https
GIT_REMOTE=git.aridgwayweb.com/armistace/blog.git
# Notification Configuration
N8N_SECRET=
N8N_WEBHOOK_URL=
# Ollama Web Search (required for researcher agent)
OLLAMA_API_KEY=
```
### CONTENT_CREATOR_MODELS Format
The `CONTENT_CREATOR_MODELS` variable must be a JSON array of Ollama model names. One journalist agent is created per listed model. Example:
```
CONTENT_CREATOR_MODELS=["llama3.1:8b", "qwen2.5:7b", "phi4:latest"]
```
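At startup the application has to turn this string into a Python list. A minimal sketch of that parsing, assuming a simple `json.loads` approach (the function name `load_content_creator_models` is illustrative, not the project's actual loader):

```python
import json
import os

def load_content_creator_models() -> list[str]:
    """Parse CONTENT_CREATOR_MODELS from the environment into a list of model names."""
    raw = os.environ.get("CONTENT_CREATOR_MODELS", "[]")
    models = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(models, list) or not all(isinstance(m, str) for m in models):
        raise ValueError("CONTENT_CREATOR_MODELS must be a JSON array of strings")
    return models

# Example using the sample value above
os.environ["CONTENT_CREATOR_MODELS"] = '["llama3.1:8b", "qwen2.5:7b", "phi4:latest"]'
print(load_content_creator_models())  # ['llama3.1:8b', 'qwen2.5:7b', 'phi4:latest']
```

Note that the value must be valid JSON: double quotes around each model name, square brackets around the whole list.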
### OLLAMA_API_KEY
The researcher agent uses Ollama's native web search API. Create an API key from your Ollama account (https://ollama.com) and add it to your `.env` file. This uses your existing Ollama subscription for web searches.
## Project Structure
```
blog_creator/
├── .env # Environment variables (create this)
├── .gitea/workflows/deploy.yml # Gitea Actions workflow
├── docker-compose.yml # Local development setup
├── requirements.txt # Python dependencies
├── README.md # This file
└── src/
├── main.py # Entry point
└── ai_generators/
├── ollama_md_generator.py # Main interface (used by main.py)
├── blog_flow.py # CrewAI Flow orchestrator
├── crews/
│ ├── research_crew/ # Researcher agent with web search
│ ├── writing_crew/ # Journalist agents (one per model)
│ └── editor_crew/ # Editor agent with metadata generation
└── tools/
```
## Local Development Setup
### Using Docker Compose
1. Clone the repository and navigate to the project directory
2. Create your `.env` file with all required variables
3. Start the services:
```bash
docker-compose up -d
```
This starts:
- `blog_creator` - The main application container
- `chroma` - ChromaDB vector database
4. The container will run `main.py` automatically on startup. To run manually:
```bash
docker-compose exec blog_creator python src/main.py
```
### Manual Setup (without Docker)
1. Install system dependencies:
```bash
apt update && apt install -y rustc cargo python-is-python3 python3-pip python3-venv libmagic-dev git
```
2. Create and activate a virtual environment:
```bash
python -m venv .venv
source .venv/bin/activate
```
3. Install Python dependencies:
```bash
pip install -r requirements.txt
```
4. Configure Git:
```bash
git config --global user.name "Blog Creator"
git config --global user.email "your-email@example.com"
git config --global push.autoSetupRemote true
```
5. Run the application:
```bash
python src/main.py
```
## How It Works
### Trilium Integration
The system fetches notes from Trilium that are tagged for blog creation. Each note becomes one blog post. The note content is used as the basis for the AI-generated article.
### Blog Generation Flow
1. **Research Phase** - The researcher agent investigates the topic using web search, critically evaluates claims, and produces verified findings
2. **Writing Phase** - The journalist agents write creative drafts in parallel, one per configured model, each with different temperature and top_p settings for variety
3. **Editor Phase** - The editor:
- Chunks and embeds all drafts into ChromaDB
- Queries the vector database for relevant context
- Generates the final polished document with metadata header
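The exact chunking strategy lives in the editor crew's code, but the general idea is a sliding window with overlap so context is not lost at chunk boundaries. An illustrative sketch (the `chunk_size` and `overlap` values are assumptions, not the project's actual settings):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows for embedding.
    Illustrative only; the editor crew's real chunker may differ."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # advance by chunk_size minus overlap so consecutive chunks share context
        start += chunk_size - overlap
    return chunks
```

Each chunk is then embedded with `EMBEDDING_MODEL` and stored in ChromaDB for retrieval.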
### Output Format
Each blog post includes a metadata header followed by the markdown body:
```
Title: Designing and Building an AI Enhanced CCTV System
Date: 2026-02-02 20:00
Modified: 2026-02-02 20:00
Category: Homelab
Tags: proxmox, hardware, self host, homelab, ai_content, not_human_content
Slug: ai-enhanced-cctv
Authors: phi4-mini.ai, qwen3.ai, gemma3.ai
Summary: Home CCTV Security has become a bastion of cloud subscription awfulness. This blog describes creating your own AI enhanced system.
<full markdown blog body follows>
```
The metadata fields are generated as follows:
- **Title** - From the Trilium note title
- **Date/Modified** - Current datetime when generated
- **Category** - AI-generated single word (e.g., Homelab, DevOps, Security)
- **Tags** - AI-generated relevant tags plus `ai_content, not_human_content`
- **Slug** - AI-generated URL-friendly slug
- **Authors** - Derived from CONTENT_CREATOR_MODELS (model name + `.ai`)
- **Summary** - AI-generated 15-25 word summary
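As one example of the derivation rules above, the `Authors` field can be produced by dropping the Ollama tag (everything after the `:`) from each model name and appending `.ai`. A sketch, assuming that reading of the rule (the helper name is hypothetical):

```python
def authors_from_models(models: list[str]) -> str:
    """Turn Ollama model names into the Authors metadata value.
    'phi4-mini:latest' -> 'phi4-mini.ai'. Sketch only."""
    return ", ".join(m.split(":")[0] + ".ai" for m in models)

print(authors_from_models(["phi4-mini:latest", "qwen3:1.7b", "gemma3:latest"]))
# phi4-mini.ai, qwen3.ai, gemma3.ai
```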
### Git Workflow
After generation, the blog post is:
1. Committed to a new branch named after the slug
2. Pushed to the configured Git remote
3. A notification is sent via n8n to Matrix for review
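The branch-and-push steps above map onto a short sequence of Git commands. A hedged sketch of that sequence (function names and commit message are illustrative; the real implementation may differ):

```python
import subprocess

def publish_commands(slug: str, post_path: str) -> list[list[str]]:
    """Build the Git command sequence: branch named after the slug,
    commit the generated post, push to the remote. Sketch only."""
    return [
        ["git", "checkout", "-b", slug],
        ["git", "add", post_path],
        ["git", "commit", "-m", f"Add generated post: {slug}"],
        ["git", "push", "origin", slug],
    ]

def publish_post(slug: str, post_path: str) -> None:
    """Run the sequence, failing fast if any step errors."""
    for cmd in publish_commands(slug, post_path):
        subprocess.run(cmd, check=True)
```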
## Gitea Actions Workflow
The `.gitea/workflows/deploy.yml` file defines an automated workflow that:
- Runs on a schedule (daily at 18:15 UTC) or on a push to the master branch
- Installs all dependencies
- Creates the `.env` file from Gitea secrets and variables
- Runs the blog generation script
### Setting Up Gitea Variables
In your Gitea repository settings, configure the following:
**Variables** (Repository Settings -> Variables):
- `TRILIUM_HOST` - Your Trilium server hostname
- `TRILIUM_PORT` - Trilium port
- `TRILIUM_PROTOCOL` - http or https
- `OLLAMA_PROTOCOL` - http or https
- `OLLAMA_HOST` - Ollama server hostname
- `OLLAMA_PORT` - Ollama port (default 11434)
- `EMBEDDING_MODEL` - Embedding model name
- `EDITOR_MODEL` - Editor/Researcher model name
- `CONTENT_CREATOR_MODELS_1` through `CONTENT_CREATOR_MODELS_4` - Individual model names (the workflow joins these into an array)
- `GIT_PROTOCOL` - https or ssh
- `GIT_REMOTE` - Git repository URL
- `GIT_USER` - Git username for pushing
- `N8N_WEBHOOK_URL` - n8n webhook URL for notifications
- `CHROMA_HOST` - ChromaDB hostname
- `CHROMA_PORT` - ChromaDB port
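The join is straightforward: collect the four variables in order and emit a JSON array, skipping any that are unset. The workflow itself does this in shell when writing the `.env` file, but the transformation can be sketched in Python (`join_model_vars` is illustrative):

```python
import json

def join_model_vars(env: dict[str, str]) -> str:
    """Join CONTENT_CREATOR_MODELS_1..4 into the JSON array the app expects,
    skipping unset or empty entries. Sketch of the workflow's shell logic."""
    names = [env.get(f"CONTENT_CREATOR_MODELS_{i}", "") for i in range(1, 5)]
    return json.dumps([n for n in names if n])

print(join_model_vars({
    "CONTENT_CREATOR_MODELS_1": "phi4-mini:latest",
    "CONTENT_CREATOR_MODELS_2": "qwen3:1.7b",
    "CONTENT_CREATOR_MODELS_3": "gemma3:latest",
}))
# ["phi4-mini:latest", "qwen3:1.7b", "gemma3:latest"]
```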
**Secrets** (Repository Settings -> Secrets):
- `TRILIUM_PASS` - Trilium password
- `TRILIUM_TOKEN` - Trilium API token
- `GIT_PASS` - Git password or personal access token
- `N8N_SECRET` - n8n webhook secret key
- `OLLAMA_API_KEY` - Ollama API key for web search
### Workflow Triggers
The workflow runs automatically when:
- A push is made to the master branch
- The scheduled cron time is reached (18:15 UTC daily)
To trigger manually, push any change to master or modify the cron schedule in `.gitea/workflows/deploy.yml`.
## Customizing Agent Behavior
Agent personalities and task instructions are defined in YAML files under `src/ai_generators/crews/*/config/`. You can modify these without changing Python code:
- `research_crew/config/agents.yaml` - Researcher role, goal, backstory
- `research_crew/config/tasks.yaml` - Research task description
- `writing_crew/config/agents.yaml` - Journalist personalities
- `writing_crew/config/tasks.yaml` - Writing task descriptions
- `editor_crew/config/agents.yaml` - Editor role, goal, backstory
- `editor_crew/config/tasks.yaml` - Editing task and metadata format
After editing YAML files, restart the application or container to apply changes.
## Troubleshooting
### Ollama Connection Errors
Ensure the Ollama server is running and accessible from the blog_creator container. Check `OLLAMA_HOST` and `OLLAMA_PORT` in your `.env` file.
### ChromaDB Connection Errors
Verify ChromaDB is running and that the `CHROMA_HOST` and `CHROMA_PORT` variables are correct. In Docker Compose, use `chroma` as the hostname.
### Ollama Web Search Errors
If the researcher agent fails with web search errors, check that `OLLAMA_API_KEY` is set correctly. Verify your Ollama subscription is active and has web search access.
### Empty Output
If blog posts are generated but empty, check:
- Ollama models are downloaded and available
- `CONTENT_CREATOR_MODELS` contains valid model names
- Sufficient timeout for model inference (default is 30 minutes per operation)
### Git Push Failures
Verify `GIT_USER` and `GIT_PASS` are correct and the user has write access to the remote repository. Check that the remote URL in `GIT_REMOTE` is accessible.
## Development Notes
- The `main.py` entry point should not be modified for normal operation
- All AI generation logic is in `src/ai_generators/`
- The Flow pattern allows easy addition of new crews or steps
- Vector database collections are named `blog_{title}_{random_id}` and persist across runs
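The collection naming can be sketched as follows; the title sanitisation and ID length here are assumptions, and only the `blog_{title}_{random_id}` shape comes from the project:

```python
import re
import uuid

def collection_name(title: str) -> str:
    """Build a ChromaDB collection name of the form blog_{title}_{random_id}.
    Sanitisation and ID length are illustrative assumptions."""
    safe_title = re.sub(r"[^a-zA-Z0-9_]+", "_", title).strip("_").lower()
    return f"blog_{safe_title}_{uuid.uuid4().hex[:8]}"

print(collection_name("AI Enhanced CCTV"))
```

Because the random suffix makes each run's collection unique, old collections accumulate in ChromaDB; prune them periodically if disk usage matters.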