Ethics & Responsibility First
- Check robots.txt — if a path is disallowed, don't scrape it
- Respect rate limits — add delays between requests, use caching
- Read Terms of Service — some sites explicitly prohibit automated access
- Never circumvent authentication or paywalls
- Don't collect personal data
The data in this guide is from public-facing pages with no login required.
Four Scraping Targets
Target 1: Amazon Product Rankings
- Amazon Best Seller rank is public, no auth required
- Track rank changes for competitor products weekly
- Review count growth as a proxy for sales velocity
Target 2: Competitor Pricing
- Public pricing pages (SaaS pricing, retail product pages)
- Track price changes over time
- Alert on promotions, price drops, or increases
Target 3: Job Board Aggregate Count
- Company career pages often show total open roles without needing Greenhouse
- Track headcount direction via total job count
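For headcount tracking you only need a count of listing elements, not the listings themselves. A stdlib-only sketch (the `job-listing` class name is a placeholder; inspect each company's careers page and adjust):

```python
from html.parser import HTMLParser

class RoleCounter(HTMLParser):
    """Count elements whose class list contains a marker class."""
    def __init__(self, marker: str = "job-listing"):
        super().__init__()
        self.marker = marker
        self.count = 0

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.marker in classes:
            self.count += 1

def count_open_roles(html: str, marker: str = "job-listing") -> int:
    """Count job postings on a careers page by their listing class."""
    parser = RoleCounter(marker)
    parser.feed(html)
    return parser.count
```

Log the count each run; a sustained drop or spike is the headcount-direction signal.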
Target 4: Press Release & IR Page Monitoring
- Company investor relations pages publish press releases, earnings dates, events
- Monitor for new content without relying on SEC filings alone
HEARTBEAT Configuration
```yaml
name: web_intel_monitor
schedule: "0 7 * * 1,4"   # 07:00 on Mondays and Thursdays
steps:
  - scrape:
      targets:
        - name: "Competitor A pricing"
          url: "https://competitor-a.com/pricing"
          selector: ".pricing-card .price"
        - name: "Amazon product rank"
          url: "https://www.amazon.com/dp/B0EXAMPLE"
          selector: "#SalesRank"
      respect_robots_txt: true
      delay_seconds: 3
      cache_ttl_hours: 12
  - compare:
      to: last_run
      alert_on_change: true
  - llm:
      prompt: |
        Summarize what changed across these web intelligence targets since last check.
        Flag: price changes, rank movements, new content on IR pages.
        Data: {{ scrape_results }}
  - notify:
      subject: "🌐 Web Intel Update — {{ date }}"
      condition: changes_detected
```
Responsible Scraping Implementation
Check robots.txt Before Scraping
```python
import time
import urllib.robotparser
from urllib.parse import urlparse

import httpx
from bs4 import BeautifulSoup

def can_scrape(url: str, user_agent: str = "*") -> bool:
    """Check robots.txt before scraping."""
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    return rp.can_fetch(user_agent, url)

def scrape_with_respect(url: str, selector: str, delay: float = 3.0) -> str:
    """Scrape a single element with robots.txt check and rate limiting."""
    if not can_scrape(url):
        raise PermissionError(f"robots.txt disallows scraping: {url}")
    time.sleep(delay)  # be polite: pace requests instead of hammering
    r = httpx.get(
        url,
        headers={"User-Agent": "AltDataBot/1.0 contact@youremail.com"},
        follow_redirects=True,
        timeout=10,
    )
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")
    el = soup.select_one(selector)
    return el.get_text(strip=True) if el else ""
```
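When a site does push back with 429s or 5xx errors, back off exponentially rather than retrying immediately. A transport-agnostic sketch (pass in any callable that returns a `(status_code, body)` tuple, e.g. a thin wrapper around `httpx.get`):

```python
import time
from typing import Callable, Tuple

RETRYABLE = {429, 500, 502, 503, 504}

def backoff_delays(max_retries: int = 4, base: float = 2.0) -> list:
    """Exponential backoff schedule in seconds: base, 2*base, 4*base, ..."""
    return [base * (2 ** i) for i in range(max_retries)]

def fetch_with_backoff(fetch: Callable[[], Tuple[int, str]],
                       max_retries: int = 4, base: float = 2.0) -> Tuple[int, str]:
    """Call `fetch` and retry with exponential backoff on 429/5xx responses.
    Returns the last (status_code, body) seen, successful or not."""
    status, body = fetch()
    for delay in backoff_delays(max_retries, base):
        if status not in RETRYABLE:
            return status, body
        time.sleep(delay)
        status, body = fetch()
    return status, body
```

Keeping the transport injectable also makes the retry logic trivially unit-testable without network access.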
Price Change Detection with Caching
```python
import json
import os
from datetime import datetime

CACHE_FILE = "price_cache.json"

def detect_price_change(product_id: str, current_price: float) -> dict:
    """Compare the current price to the cached prior and record the observation."""
    cache = {}
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            cache = json.load(f)
    prior = cache.get(product_id, {}).get("price")
    cache[product_id] = {"price": current_price, "updated": datetime.now().isoformat()}
    with open(CACHE_FILE, "w") as f:
        json.dump(cache, f)
    if prior is None:
        return {"status": "first_observation", "price": current_price}
    change_pct = ((current_price - prior) / prior) * 100
    return {
        "status": "changed" if abs(change_pct) > 0.01 else "unchanged",
        "prior": prior,
        "current": current_price,
        "change_pct": round(change_pct, 2),
    }
```
Multi-Signal Aggregation: Bringing It All Together
The real edge comes from combining signals. Example: a company showing (1) rising Google Trends brand interest, (2) aggressive engineering hiring, (3) improving App Store ratings, and (4) competitor price drops on Amazon — that's a convergent signal worth deeper research. Build a weekly "alternative data scorecard" that pulls from all five pipelines.
OpenClaw's scheduling and LLM steps make this natural: fetch all five data sources in parallel each Monday, aggregate into a single narrative, and alert only on multi-signal convergences. This dramatically reduces noise compared to single-source alerts.
Frequently Asked Questions
**Can I get by with httpx and BeautifulSoup, or do I need a headless browser?**
For simple HTML parsing, yes. For JavaScript-rendered pages, use Playwright or Selenium — but check whether a static API endpoint exists first.
**How do I avoid getting rate-limited or blocked?**
Rotate User-Agent strings, add delays, cache aggressively, and don't hammer pages. If a site rate-limits you, back off.
**Can I scrape Amazon directly?**
Tread carefully: Amazon's ToS prohibits scraping. The Product Advertising API (PA API) is the official route, and it requires an Associates account.
What's Next?
You now have five complete alternative data pipelines. The next step is automation: deploy these as HEARTBEAT schedules on a recurring basis (weekly, daily, or as-needed). Aggregate the outputs into a unified dashboard or weekly brief. The companies that win at alternative data aren't the ones with access to expensive data — they're the ones who can coordinate multiple free signals into a coherent narrative. OpenClaw makes that coordination straightforward.