
Oori Web Scout is a tool for monitoring and summarizing web content. It processes a links file (e.g., links.md) based on configurable rules, fetching content from each URL and generating reports with LLM-powered summaries and insights.

Features

  • Flexible link format: Simple markdown-based format for managing monitored links
  • Multiple, selectable actions:
    • random-remind: Randomly selects pages for periodic review with AI-generated reminders
    • flag-update: Tracks content changes and flags substantive updates
  • Pluggable web fetchers: Protocol-based design supporting multiple web scraping backends
  • Knowledge Graphs: Generates Onya graphs from links and their content
  • Content caching: Stores markdown versions for update tracking
  • Tag-based filtering: Optional focus on specific content categories

Links File Format

The links file uses a simple markdown list format, in which each top-level list item is a target link and the sub-list items within it are the fields that govern how that link is processed.

```markdown
- https://example.com/
  - tags: tech | ai
  - action: random-remind
  - description: Optional description
  - key-quote: "Notable quote from the page"

- https://another-site.com/index.rss
  - type: rss-feed
  - action: flag-update
  - tags: news
```

Supported Fields

  • URL (required): The outer list item
  • type: webpage (default) or rss-feed
  • title: Override the page's title
  • action: random-remind (default) or flag-update
  • tags: Space or pipe-separated tags for filtering
  • description: Custom description
  • key-quote: Notable quote to highlight
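
The format above is simple enough to parse with a short function. The sketch below is hypothetical (the function name and entry shape are assumptions, not Web Scout's actual parser API), but it illustrates the dict each entry yields, with defaults applied for `type` and `action`:

```python
import re

def parse_links_file(text):
    """Parse the markdown links list into a list of entry dicts.

    Hypothetical sketch, not the project's actual parser: keys mirror
    the 'Supported Fields' list, with defaults for 'type' and 'action'.
    """
    entries = []
    current = None
    for line in text.splitlines():
        if line.startswith("- "):  # top-level item: the URL
            current = {"url": line[2:].strip(),
                       "type": "webpage", "action": "random-remind"}
            entries.append(current)
        elif line.startswith("  - ") and current is not None:
            key, _, value = line.strip()[2:].partition(":")
            if key.strip() == "tags":
                # tags may be space- or pipe-separated
                current["tags"] = [t for t in re.split(r"[|\s]+", value) if t]
            else:
                current[key.strip()] = value.strip()
    return entries
```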

Installation

  1. Install OgbujiPT with required dependencies:

```sh
uv pip install -U "ogbujipt[mega]"
```

or, to avoid installing a few dependencies you may not need:

```sh
uv pip install -U ogbujipt cssutils selectolax fire
```
  2. (Optional) For JavaScript-heavy sites, set up Crawl4AI:

```sh
docker run -p 11235:11235 unclecode/crawl4ai:basic
```

Or you can be more detailed, e.g.:

```sh
docker pull unclecode/crawl4ai:latest
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest
```

The dashboard should be at http://localhost:11235/dashboard

Usage

Basic usage with a local LLM endpoint:

```sh
webscout \
  --links-file=links.md \
  --output-dir=./output \
  --llm-url=http://localhost:8000 \
  --llm-model=llama-3.2-3b-instruct
```

Note: for this config you'll need to run a local LLM server, such as our sister project Toolio, or a better-known option such as Ollama or LM Studio.

Command-Line Options

  • --links-file: Path to the links file (required)
  • --output-dir: Directory for output files (default: ./web_scout_output)
  • --llm-url: OpenAI-compatible LLM endpoint URL (or set OPENAI_API_BASE env var)
  • --llm-model: Model name (default: gpt-3.5-turbo)
  • --llm-api-key: API key (or set OPENAI_API_KEY env var)
  • --fetcher: Web fetcher to use: simple, crawl4ai, or fallback (default: simple)
  • --crawl4ai-url: Crawl4AI service URL (default: http://localhost:11235)
  • --random-remind-count: Number of pages to randomly select (default: 3)
  • --focus-tags: Comma-separated tags to focus on
  • --exclude-tags: Comma-separated tags to exclude
  • --verbose: Enable verbose logging
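
The `--focus-tags` / `--exclude-tags` behavior amounts to set intersections per entry. A minimal sketch (function name and entry shape are assumptions, not the project's actual API):

```python
def filter_entries(entries, focus_tags=None, exclude_tags=None):
    """Keep entries that share at least one focus tag (when focus_tags
    is given) and carry no excluded tag. Hypothetical sketch; entries
    are dicts with an optional 'tags' list."""
    kept = []
    for entry in entries:
        tags = set(entry.get("tags", []))
        if exclude_tags and tags & set(exclude_tags):
            continue  # drop: carries an excluded tag
        if focus_tags and not tags & set(focus_tags):
            continue  # drop: shares no focus tag
        kept.append(entry)
    return kept
```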

Examples

Focus on AI-related links only:

```sh
webscout \
  --links-file=links.md \
  --focus-tags=ai,tech \
  --output-dir=./output \
  --llm-url=http://localhost:8000
```

Use Crawl4AI for JavaScript-heavy sites:

```sh
webscout \
  --links-file=links.md \
  --fetcher=crawl4ai \
  --output-dir=./output \
  --llm-url=http://localhost:8000
```

Use fallback fetcher (tries simple first, falls back to Crawl4AI):

```sh
webscout \
  --links-file=links.md \
  --fetcher=fallback \
  --output-dir=./output \
  --llm-url=http://localhost:8000
```

Output

Web Scout generates three main outputs in the specified output directory:

  1. report.txt: Human-readable report with:

    • Summary statistics
    • Random reminders with AI-generated summaries
    • Update flags for changed content
    • Error reports
  2. web_scout.onya: Onya knowledge graph containing:

    • Nodes for each monitored page
    • Metadata (tags, actions, status)
    • AI-generated summaries
  3. cache/: Directory containing:

    • Cached markdown content for each page
    • Used for tracking changes in flag-update mode

Architecture

Web Fetcher Protocol

The tool uses a pluggable protocol for web fetching, making it easy to swap different backends:

  • SimpleHttpFetcher uses httpx + OgbujiPT HTML processing (fast, works for most sites)
  • Crawl4AIFetcher uses Crawl4AI for JavaScript-heavy sites (requires docker service)
  • FallbackFetcher tries simple first, falls back to Crawl4AI on failure

New fetchers can be added by implementing the WebFetcher protocol.
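
In Python, a pluggable interface like this is naturally expressed with `typing.Protocol`. The sketch below is an assumption about the shape of the interface (the method name and the fallback composition are illustrative, not the project's exact API):

```python
from typing import Protocol

class WebFetcher(Protocol):
    """Anything with an async fetch(url) returning markdown text."""
    async def fetch(self, url: str) -> str: ...

class FallbackFetcher:
    """Composes two fetchers: try the primary, fall back on failure."""
    def __init__(self, primary: WebFetcher, secondary: WebFetcher):
        self.primary = primary
        self.secondary = secondary

    async def fetch(self, url: str) -> str:
        try:
            return await self.primary.fetch(url)
        except Exception:
            return await self.secondary.fetch(url)
```

Because the protocol is structural, any object with a matching `fetch` coroutine satisfies it without inheriting from a base class.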

Actions

Actions are processed by dedicated handlers:

  • RandomRemindHandler randomly selects entries and generates reminder summaries
  • FlagUpdateHandler compares cached content with current content, uses LLM to detect substantive changes
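
A rough sketch of the two handlers' core logic (names and cache layout are assumptions; the real FlagUpdateHandler additionally asks the LLM whether a change is substantive, whereas this version flags any change at all):

```python
import hashlib
import random
from pathlib import Path

def pick_reminders(entries, count=3, seed=None):
    """random-remind core: sample up to `count` entries for review."""
    rng = random.Random(seed)
    return rng.sample(entries, min(count, len(entries)))

def content_changed(cache_dir: Path, url: str, new_markdown: str) -> bool:
    """flag-update core: compare fetched markdown to the cached baseline.
    The first run writes the baseline and reports no change."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(url.encode()).hexdigest()[:16]
    cached = cache_dir / f"{key}.md"
    if not cached.exists():
        cached.write_text(new_markdown)  # establish baseline
        return False
    changed = cached.read_text() != new_markdown
    if changed:
        cached.write_text(new_markdown)  # roll the baseline forward
    return changed
```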

Testing

Run the included tests to verify basic functionality:

```sh
# Test the parser
python test_parser.py

# Test the web fetcher
python test_fetcher.py
```

Notes

  • The simple fetcher works well for most static HTML sites
  • Use crawl4ai or fallback for sites requiring JavaScript execution
  • The flag-update action builds up a cache over time; the first run establishes baselines
  • LLM costs scale with the number of links and content length
  • Substantive changes are determined by LLM judgment, tuned to ignore minor formatting changes

Web Scout started out as an OgbujiPT demo. It uses ogbujipt.llm.wrapper for LLM interactions, ogbujipt.text.html for HTML processing, ogbujipt.store.kgraph concepts for Onya graphs, and standard OgbujiPT patterns such as async, httpx and structlog.
