🌐 PART 5 OF 5

Web Intelligence & Price Monitoring — Responsible Scraping for Market Signals

ℹ️ Informational only. Alternative data sources vary in accuracy and timeliness. Nothing here is investment advice. Always verify data independently before making any decisions.

Ethics & Responsibility First

Critical: Web scraping occupies a legal and ethical gray zone. Before building any scraper, always:
  1. Check robots.txt — if a path is disallowed, don't scrape it
  2. Respect rate limits — add delays between requests, use caching
  3. Read Terms of Service — some sites explicitly prohibit automated access
  4. Never circumvent authentication or paywalls
  5. Don't collect personal data

The data in this guide is from public-facing pages with no login required.

Four Scraping Targets

Target 1: Amazon Product Rankings

Target 2: Competitor Pricing

Target 3: Job Board Aggregate Count

Target 4: Press Release & IR Page Monitoring

HEARTBEAT Configuration

name: web_intel_monitor
schedule: "0 7 * * 1,4"
steps:
  - scrape:
      targets:
        - name: "Competitor A pricing"
          url: "https://competitor-a.com/pricing"
          selector: ".pricing-card .price"
        - name: "Amazon product rank"
          url: "https://www.amazon.com/dp/B0EXAMPLE"
          selector: "#SalesRank"
      respect_robots_txt: true
      delay_seconds: 3
      cache_ttl_hours: 12
  - compare:
      to: last_run
      alert_on_change: true
  - llm:
      prompt: |
        Summarize what changed across these web intelligence targets since last check.
        Flag: price changes, rank movements, new content on IR pages.
        Data: {{ scrape_results }}
  - notify:
      subject: "🌐 Web Intel Update — {{ date }}"
      condition: changes_detected
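Under the hood, a `compare: to: last_run` step amounts to fingerprinting each target and diffing against stored state. A minimal sketch of that idea in plain Python — the `STATE_FILE` name and `content_changed` helper are illustrative, not OpenClaw internals:

```python
import hashlib
import json
import os

STATE_FILE = "page_state.json"  # illustrative state-file name

def content_changed(name: str, html: str) -> bool:
    """Return True if a target's content differs from the last run.

    Stores one SHA-256 fingerprint per target, so full page copies
    never need to be kept on disk.
    """
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    state = {}
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            state = json.load(f)
    changed = state.get(name) != digest  # first observation also counts as a change
    state[name] = digest
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)
    return changed
```

Fingerprinting works well for IR pages and press-release lists, where any change is worth a look; for pricing, the numeric comparison shown later gives a more useful signal.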

Responsible Scraping Implementation

Check robots.txt Before Scraping

import time
import urllib.robotparser
from urllib.parse import urlparse

import httpx
from bs4 import BeautifulSoup

def can_scrape(url: str, user_agent: str = "*") -> bool:
    """Check robots.txt before scraping."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp.can_fetch(user_agent, url)

def scrape_with_respect(url: str, selector: str, delay: float = 3.0) -> str:
    """Scrape a single element with a robots.txt check and rate limiting."""
    if not can_scrape(url):
        raise PermissionError(f"robots.txt disallows scraping: {url}")
    time.sleep(delay)  # rate limit: be a polite client
    r = httpx.get(
        url,
        # Identify yourself honestly and give site owners a way to reach you
        headers={"User-Agent": "AltDataBot/1.0 contact@youremail.com"},
        follow_redirects=True,
        timeout=10,
    )
    r.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(r.text, "html.parser")
    el = soup.select_one(selector)
    return el.get_text(strip=True) if el else ""

Price Change Detection with Caching

import json
import os
from datetime import datetime

CACHE_FILE = "price_cache.json"

def detect_price_change(product_id: str, current_price: float) -> dict:
    """Compare the current price against the last cached observation."""
    cache = {}
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            cache = json.load(f)
    prior = cache.get(product_id, {}).get("price")
    # Record the new observation before comparing, so the cache always advances
    cache[product_id] = {"price": current_price, "updated": datetime.now().isoformat()}
    with open(CACHE_FILE, "w") as f:
        json.dump(cache, f)
    if prior is None:
        return {"status": "first_observation", "price": current_price}
    change_pct = ((current_price - prior) / prior) * 100
    # Treat moves under 0.01% as noise (e.g. float round-trips), not real changes
    return {
        "status": "changed" if abs(change_pct) > 0.01 else "unchanged",
        "prior": prior,
        "current": current_price,
        "change_pct": round(change_pct, 2),
    }
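Scraped prices arrive as text ("$49.99/mo", "1,299 USD"), so a small parser sits between the scraper and `detect_price_change`. A sketch, with the wiring shown as comments since the product ID and URL are placeholders:

```python
import re

def parse_price(text: str) -> float:
    """Extract the first numeric value from scraped price text."""
    m = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    if not m:
        raise ValueError(f"no price found in: {text!r}")
    return float(m.group())

# Hypothetical wiring of the helpers above:
# raw = scrape_with_respect("https://competitor-a.com/pricing", ".pricing-card .price")
# result = detect_price_change("competitor-a-pro-plan", parse_price(raw))
```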

Multi-Signal Aggregation: Bringing It All Together

The real edge comes from combining signals. Example: a company showing (1) rising Google Trends brand interest, (2) aggressive engineering hiring, (3) improving App Store ratings, and (4) competitor price drops on Amazon — that's a convergent signal worth deeper research. Build a weekly "alternative data scorecard" that pulls from all five pipelines.

OpenClaw's scheduling and LLM steps make this natural: fetch all five data sources in parallel each Monday, aggregate into a single narrative, and alert only on multi-signal convergences. This dramatically reduces noise compared to single-source alerts.
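A scorecard of this kind can be as simple as counting boolean signals and alerting only past a convergence threshold. A minimal sketch; the signal names below are illustrative:

```python
# Hypothetical readings; in practice these come from the five pipelines.
signals = {
    "google_trends_rising": True,
    "eng_hiring_up": True,
    "app_rating_improving": True,
    "competitor_price_cut": False,
    "press_release_activity": False,
}

def convergence_score(signals: dict, threshold: int = 3) -> dict:
    """Count agreeing signals; alert only when enough converge."""
    hits = [name for name, fired in signals.items() if fired]
    return {"score": len(hits), "alert": len(hits) >= threshold, "signals": hits}
```

With `threshold=3`, a single price drop or one hiring spike stays quiet; three or more agreeing signals trigger the alert, which is what cuts the noise.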

Frequently Asked Questions

Q: Is BeautifulSoup still the right tool?

For simple HTML parsing, yes. For JavaScript-rendered pages, use playwright or selenium — but check if a static API endpoint exists first.

Q: How do I avoid getting blocked?

Rotate User-Agent strings, add delays, cache aggressively, and don't hammer pages. If a site rate-limits you, back off.
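Backing off can be made systematic rather than ad hoc. A sketch of retry-with-exponential-backoff, written against a pluggable `fetch` callable so the policy can wrap httpx, requests, or anything returning a response-like object:

```python
import random
import time

def get_with_backoff(url: str, fetch, max_retries: int = 4):
    """Call fetch(url), retrying on 429/503 with exponential backoff.

    `fetch` is any callable returning an object with .status_code and
    .headers (e.g. lambda u: httpx.get(u, follow_redirects=True, timeout=10)).
    """
    for attempt in range(max_retries):
        r = fetch(url)
        if r.status_code not in (429, 503):
            return r
        # Honor Retry-After if the server sends it, else back off exponentially
        wait = float(r.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait + random.uniform(0, 0.5))  # jitter spreads retries out
    raise RuntimeError(f"still rate-limited after {max_retries} attempts: {url}")
```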

Q: Can I scrape Amazon?

Not directly: Amazon's Terms of Service prohibit automated scraping. The Product Advertising API (PA API) is the official route and requires an Amazon Associates account. Treat any scraping example targeting Amazon as illustrative only, and use the API for anything ongoing.

What's Next?

You now have five complete alternative data pipelines. The next step is automation: deploy them as HEARTBEAT schedules (weekly, daily, or as-needed) and aggregate the outputs into a unified dashboard or weekly brief. The companies that win at alternative data aren't the ones paying for expensive feeds; they're the ones who coordinate multiple free signals into a coherent narrative. OpenClaw makes that coordination straightforward.