HTTrack for Webflow: When the Old Crawler Still Works
HTTrack has been around since the late 90s. It's a recursive site crawler that downloads HTML and assets into a folder, mirroring the public structure. A surprising number of Webflow exits start with someone running HTTrack against their own published site — and a surprising number of those exits work fine.
This post is the honest comparison: where HTTrack wins (it's free and there's no SaaS in the way), where it falls apart on Webflow specifically, and what tools like Webflow Export actually do that HTTrack can't.
What HTTrack does, exactly
You point HTTrack at a URL. It loads the HTML, parses every link, image reference, CSS rule, and JS file, downloads them all, rewrites the URLs inside the HTML to point at the local files, and recurses into every internal link. The output is a folder structure mirroring the site, openable in a browser.
httrack "https://your-site.webflow.io" \
-O ./mirror \
"+*.webflow.io/*" "-mime:application/javascript" \
-%v
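That command crawls a Webflow staging site into ./mirror/. The + filter keeps the crawl on webflow.io hosts, the -mime filter skips JavaScript files, and -%v prints file names as they download. Adjust to taste: HTTrack has dozens of flags for crawl depth, file patterns, and rate limiting.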
What HTTrack gets right
Three real strengths, no marketing:
- It's free, offline, and yours. No SaaS account, no token, no per-export fee. The output is a folder on your disk. If you trust nobody and want full local control, this matters.
- It works on any platform. Webflow, Squarespace, WordPress, a static site you don't have credentials for: anything reachable by a browser is reachable by HTTrack. Webflow Export can't crawl WordPress; HTTrack can crawl all of them.
- It handles arbitrary URL graphs. Modern exporters often assume a CMS shape. HTTrack doesn't assume anything; it just follows links. For sites with weird URL structures, custom code embeds linking to unexpected places, or oddly-organized media folders, HTTrack's “just crawl whatever” behavior is sometimes more robust.
Where HTTrack falls apart on Webflow
Five specific problems we've watched users hit:
1. CMS items it can't reach
HTTrack only sees what's linked from somewhere. If a Webflow site has a Blog Posts collection but no “all posts” index page (or the index uses JavaScript-rendered pagination that HTTrack can't follow), entire collections silently don't get crawled. The user sees a clean run with no errors and a missing third of the content.
A real exporter using the Webflow API enumerates collections directly. You can't miss what you've listed.
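If you stay with HTTrack anyway, one rough way to catch unlinked pages is to compare the crawl against the site's sitemap. The sketch below assumes the site publishes a sitemap.xml that lists its CMS pages, that you have saved a copy locally, and that you have a plain-text list of crawled URLs, one per line. The function names are made up for this example and are not part of HTTrack.

```shell
# Print every <loc> value from a sitemap file, one URL per line.
# grep -o emits each match separately, so minified sitemaps work too.
sitemap_urls() {
  grep -oE '<loc>[^<]+</loc>' "$1" | sed -E 's#</?loc>##g'
}

# Strip the scheme and any trailing slash so both lists compare cleanly.
normalize_urls() {
  sed -E 's#^https?://##; s#/$##' | sort -u
}

# Usage: missing_from_crawl sitemap.xml crawled.txt
# Prints sitemap URLs that do not appear in the crawled list.
missing_from_crawl() {
  sitemap_file="$1"
  crawled_file="$2"
  crawled_norm=$(mktemp)
  normalize_urls < "$crawled_file" > "$crawled_norm"
  # comm -23 keeps lines that appear only in the first (sitemap) input.
  sitemap_urls "$sitemap_file" | normalize_urls | comm -23 - "$crawled_norm"
  rm -f "$crawled_norm"
}
```

An empty result only means the crawl matches the sitemap. It says nothing about pages the sitemap itself leaves out.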
2. JavaScript-rendered content
Some Webflow interactions and certain CMS templates render parts of the page through Webflow.js. HTTrack downloads the JS file but doesn't execute it — the resulting mirror is missing whatever the script was adding. The site looks identical until you scroll and notice an empty container where a carousel should be.
3. Image URL rewriting is fragile
HTTrack rewrites URLs by string-matching the HTML. CSS background images set via Webflow's style system, srcset attributes for responsive images, and inline-style backgrounds sometimes get missed. The result is a mirror that loads — but loads images from uploads-ssl.webflow.com. You haven't actually escaped the Webflow CDN.
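A quick way to see whether a mirror still leans on the CDN is to search the output for the CDN host. This is a minimal sketch: the host name comes from the paragraph above, the function name is invented for the example, and it only looks inside .html and .css files.

```shell
# Host to look for; override via the environment if your assets are
# served from somewhere else (the default is taken from the text above).
CDN_HOST="${CDN_HOST:-uploads-ssl.webflow.com}"

# Usage: leftover_cdn_urls ./mirror
# Prints each distinct URL in the mirror that still points at CDN_HOST.
leftover_cdn_urls() {
  mirror_dir="$1"
  # Escape the dots so the host is matched literally by the regex.
  host_re=$(printf '%s' "$CDN_HOST" | sed 's/\./\\./g')
  # -h drops file names, -o prints only the matching URL, -E enables "+".
  # The URL ends at a quote, parenthesis, space, or ">" so that src,
  # srcset, and CSS url(...) references are all cut off cleanly.
  find "$mirror_dir" -type f \( -name '*.html' -o -name '*.css' \) \
    -exec grep -hoE "https?://${host_re}/[^\"') >]+" {} + | sort -u
}
```

If this prints anything, those assets will disappear the day the Webflow site is unpublished, so they need to be downloaded and relinked by hand.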
4. No CMS data extraction
HTTrack's output is pages, not collections. A mirrored blog is 200 separate .html files; rebuilding it on Next.js or Astro requires either keeping the static HTML forever or manually extracting content from the rendered pages. That's the worst possible source format for content migration.
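To give a sense of what "manually extracting content" looks like, here is a small sketch that pulls the page title out of every mirrored HTML file. The function name is hypothetical, and it assumes each page has a plain-text title element; body copy, authors, and dates would need a real HTML parser and per-template selectors.

```shell
# Usage: page_titles ./mirror
# Prints "relative/path.html<TAB>title" for every HTML file in the mirror.
page_titles() {
  mirror_dir="$1"
  find "$mirror_dir" -type f -name '*.html' | sort | while IFS= read -r file; do
    # Join the file into one line so a title split across lines still
    # matches, take the first <title> element, then strip the tags.
    title=$(tr '\n' ' ' < "$file" |
      grep -o '<title[^>]*>[^<]*</title>' |
      head -n 1 |
      sed 's/<[^>]*>//g')
    printf '%s\t%s\n' "${file#"$mirror_dir"/}" "$title"
  done
}
```

Even this one field takes a pipeline of text tools, and it recovers none of the structure (slug, publish date, rich text fields) that the CMS collection originally held.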
5. Crawl politeness and bans
If your Webflow site is large and HTTrack hits it aggressively, Webflow's rate limiting can kick in mid-crawl. The mirror ends up partial, with no clear signal of which pages failed. We've seen mirrors that looked complete but had 12% of pages silently 404ed.
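HTTrack does keep a per-request log you can check after the fact. The sketch below assumes the log is the tab-separated hts-cache/new.txt file inside the output folder, with the HTTP status in column 4 and the URL in column 8. Check the header line of your own file before trusting those numbers; both can be overridden. The function name is made up for this example.

```shell
# Usage: non_200_fetches ./mirror/hts-cache/new.txt [status_col] [url_col]
# Prints "status<TAB>url" for every logged request whose status is not 200.
# Column positions default to 4 and 8, which is an assumption about the
# log layout rather than something this script can verify.
non_200_fetches() {
  log_file="$1"
  status_col="${2:-4}"
  url_col="${3:-8}"
  awk -F '\t' -v s="$status_col" -v u="$url_col" '
    # Skip the header row and any line whose status field is not numeric.
    NR > 1 && $s ~ /^[0-9]+$/ && $s != 200 { print $s "\t" $u }
  ' "$log_file"
}
```

Redirects show up in this list alongside real failures, so read the status column rather than just counting lines.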
When HTTrack is the right call
Honestly, sometimes it's exactly the right tool:
- You need to archive a Webflow site you don't own and don't have API access to. This is the case where API-based exporters are non-starters. HTTrack works because all it needs is a URL.
- You're mirroring for archival purposes, not migration. You want a Wayback Machine-style snapshot, not a rebuild on a new stack.
- You have an unusual site that doesn't fit the CMS model. Hand-coded layouts, very custom Webflow setups, or sites with externally-embedded data sources sometimes work better with a generic crawler than a model-aware one.
For these cases, HTTrack is mature, free, and well-documented. Use it.
When to skip HTTrack
For anything else, the trade is straightforward:
| Situation | HTTrack | API-based exporter |
|---|---|---|
| You own the site and have API access | OK | Better — sees more |
| You need CMS as structured data, not just HTML | No | Yes |
| You need drafts or archived items | No | Yes (toggle on) |
| You're moving to Next.js / Astro / Hugo | Painful | Designed for this |
| You need a one-pass solution for a large site | Risky | Yes (batched API calls) |
| You need every CMS asset on local paths | Spotty | Yes (downloaded + hashed) |
| You're archiving someone else's site | Yes | Need their API token |
A short HTTrack tutorial for the cases where it's right
If you've decided HTTrack is the right tool, the practical setup:
# Install
brew install httrack # macOS
sudo apt install httrack # Debian/Ubuntu
# Crawl
httrack "https://example.com" \
--depth=10 \
--robots=0 \
--keep-alive \
--max-rate=200000 \
--user-agent="Mozilla/5.0 (compatible; mirror)" \
-O ./mirror
# Open the result (macOS; use xdg-open on Linux)
open ./mirror/index.html
Key flags:
- --depth=N: how many link-hops from the start URL. 10 is usually safe overkill.
- --robots=0: ignore robots.txt (you own the site, this is fine; don't do this on sites you don't own).
- --max-rate=200000: bytes/sec cap, keeps you from triggering rate limits.
- --user-agent: HTTrack's default UA is sometimes blocked; setting it to a normal browser string helps.
Expect to spend an hour iterating on flags for a medium site before you have a clean mirror. That's part of the deal with HTTrack.
The summary
HTTrack is a sharp tool for a narrow set of jobs: archiving sites you don't control, mirroring for offline access, and the occasional unusual-site exit where a model-aware exporter doesn't fit. It's genuinely free and it works.
For the more common case (“I own a Webflow site and want to move it to a real stack”), an API-based tool gives you CMS as data, drafts, deterministic crawl completeness, and proper asset handling. That's Webflow Export. For the broader comparison across that category, see Webflow Export vs ExFlow and vs NoCodeExport.