Ethics & Responsibility First
- Check robots.txt — if a path is disallowed, don't scrape it
- Respect rate limits — add delays between requests, use caching
- Read Terms of Service — some sites explicitly prohibit automated access
- Never circumvent authentication or paywalls
- Don't collect personal data
The data in this guide is from public-facing pages with no login required.
Four Scraping Targets
Target 1: Amazon Product Rankings
- Amazon Best Seller rank is public, no auth required
- Track rank changes for competitor products weekly
- Review count growth as a proxy for sales velocity
Target 2: Competitor Pricing
- Public pricing pages (SaaS pricing, retail product pages)
- Track price changes over time
- Alert on promotions, price drops, or increases
Target 3: Job Board Aggregate Count
- Company career pages often show total open roles without needing Greenhouse
- Track headcount direction via total job count
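For headcount tracking you only need a count of listing elements, not the listings themselves. A stdlib-only sketch (the `job-listing` class name is a placeholder; inspect each company's careers page and adjust):

```python
from html.parser import HTMLParser

class RoleCounter(HTMLParser):
    """Count elements whose class list contains a marker class."""
    def __init__(self, marker: str = "job-listing"):
        super().__init__()
        self.marker = marker
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.marker in classes:
            self.count += 1

def count_open_roles(html: str, marker: str = "job-listing") -> int:
    """Count job postings on a careers page by their listing class."""
    parser = RoleCounter(marker)
    parser.feed(html)
    return parser.count
```

Log the count each run; a sustained drop or spike is the headcount-direction signal.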
Target 4: Press Release & IR Page Monitoring
- Company investor relations pages publish press releases, earnings dates, events
- Monitor for new content without relying on SEC filings alone
HEARTBEAT Configuration
```yaml
name: web_intel_monitor
schedule: "0 7 * * 1,4"   # 07:00 on Mondays and Thursdays
steps:
  - scrape:
      targets:
        - name: "Competitor A pricing"
          url: "https://competitor-a.com/pricing"
          selector: ".pricing-card .price"
        - name: "Amazon product rank"
          url: "https://www.amazon.com/dp/B0EXAMPLE"
          selector: "#SalesRank"
      respect_robots_txt: true
      delay_seconds: 3
      cache_ttl_hours: 12
  - compare:
      to: last_run
      alert_on_change: true
  - llm:
      prompt: |
        Summarize what changed across these web intelligence targets since last check.
        Flag: price changes, rank movements, new content on IR pages.
        Data: {{ scrape_results }}
  - notify:
      subject: "🌐 Web Intel Update — {{ date }}"
      condition: changes_detected
```
Responsible Scraping Implementation
Check robots.txt Before Scraping
```python
import time
import urllib.robotparser
from urllib.parse import urlparse

import httpx
from bs4 import BeautifulSoup

def can_scrape(url: str, user_agent: str = "*") -> bool:
    """Check robots.txt before scraping."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp.can_fetch(user_agent, url)

def scrape_with_respect(url: str, selector: str, delay: float = 3.0) -> str:
    """Scrape a single element with robots.txt check and rate limiting."""
    if not can_scrape(url):
        raise PermissionError(f"robots.txt disallows scraping: {url}")
    time.sleep(delay)  # be polite: pace requests instead of hammering
    r = httpx.get(
        url,
        headers={"User-Agent": "AltDataBot/1.0 contact@youremail.com"},
        follow_redirects=True,
        timeout=10,
    )
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")
    el = soup.select_one(selector)
    return el.get_text(strip=True) if el else ""
```
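When a site does push back with 429s or 5xx errors, back off exponentially rather than retrying immediately. A transport-agnostic sketch (pass in any callable that returns a `(status_code, body)` tuple, e.g. a thin wrapper around `httpx.get`):

```python
import time
from typing import Callable, Tuple

RETRYABLE = {429, 500, 502, 503, 504}

def backoff_delays(max_retries: int = 4, base: float = 2.0) -> list:
    """Exponential backoff schedule in seconds: base, 2*base, 4*base, ..."""
    return [base * (2 ** i) for i in range(max_retries)]

def fetch_with_backoff(fetch: Callable[[], Tuple[int, str]],
                       max_retries: int = 4, base: float = 2.0) -> Tuple[int, str]:
    """Call `fetch` and retry with exponential backoff on 429/5xx responses.
    Returns the last (status_code, body) seen, successful or not."""
    status, body = fetch()
    for delay in backoff_delays(max_retries, base):
        if status not in RETRYABLE:
            return status, body
        time.sleep(delay)
        status, body = fetch()
    return status, body
```

Keeping the transport injectable also makes the retry logic trivially unit-testable without network access.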
Price Change Detection with Caching
```python
import json
import os
from datetime import datetime

CACHE_FILE = "price_cache.json"

def detect_price_change(product_id: str, current_price: float) -> dict:
    """Compare the current price to the cached prior and record the observation."""
    cache = {}
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            cache = json.load(f)
    prior = cache.get(product_id, {}).get("price")
    cache[product_id] = {"price": current_price, "updated": datetime.now().isoformat()}
    with open(CACHE_FILE, "w") as f:
        json.dump(cache, f)
    if prior is None:
        return {"status": "first_observation", "price": current_price}
    change_pct = ((current_price - prior) / prior) * 100
    return {
        "status": "changed" if abs(change_pct) > 0.01 else "unchanged",
        "prior": prior,
        "current": current_price,
        "change_pct": round(change_pct, 2),
    }
```
Multi-Signal Aggregation: Bringing It All Together
The real edge comes from combining signals. Example: a company showing (1) rising Google Trends brand interest, (2) aggressive engineering hiring, (3) improving App Store ratings, and (4) competitor price drops on Amazon — that's a convergent signal worth deeper research. Build a weekly "alternative data scorecard" that pulls from all five pipelines.
OpenClaw's scheduling and LLM steps make this natural: fetch all five data sources in parallel each Monday, aggregate into a single narrative, and alert only on multi-signal convergences. This dramatically reduces noise compared to single-source alerts.
Frequently Asked Questions
**Can I get by with httpx and BeautifulSoup, or do I need a headless browser?**
For simple HTML parsing, yes. For JavaScript-rendered pages, use Playwright or Selenium — but check whether a static API endpoint exists first.
**How do I avoid getting rate-limited or blocked?**
Rotate User-Agent strings, add delays, cache aggressively, and don't hammer pages. If a site rate-limits you, back off.
**Can I scrape Amazon directly?**
Tread carefully: Amazon's ToS prohibits scraping. The Product Advertising API (PA API) is the official route, and it requires an Associates account.
What's Next?
You now have five complete alternative data pipelines. The next step is automation: deploy these as HEARTBEAT schedules on a recurring basis (weekly, daily, or as-needed). Aggregate the outputs into a unified dashboard or weekly brief. The companies that win at alternative data aren't the ones with access to expensive data — they're the ones who can coordinate multiple free signals into a coherent narrative. OpenClaw makes that coordination straightforward.