forked from mangal/mangal
Release v5.0.0: Browser impersonation, CAPTCHA solving, proxy support, Go 1.24 #1
Summary
Major release bringing mangal back to life after years without maintenance.
Highlights
See CHANGELOG.md for full details.
Adds proxy.url, proxy.username, and proxy.password config keys (also available as MANGAL_PROXY_* env vars). When set, the proxy applies to both the Chrome-impersonated and plain transports via a new network.SetupProxy dispatcher, so page downloads and scraper requests honour the same setting regardless of impersonation mode.

Embeds the Bright Data proxy CA certificate in the binary so HTTPS through their MITM WebUnlocker works without system-level cert setup. The CA is injected into both the utls client config (direct TLS handshake) and the transport TLS config (CONNECT proxy path) because the utls handshake is bypassed when going through HTTP CONNECT.

Adds a SourceOptions() hook so Lua scrapers can declare metadata such as {requires_proxy = true}, giving future tooling a way to skip or warn when a required proxy is absent.

patchHTTPClient now always runs (not only when impersonation is on) so the Bright Data CA is injected into plain Lua HTTP transports too. When swapping a Lua client's transport for the impersonated one, any proxy already set on the original transport is preserved.

Disabling TLS verification was considered and rejected; embedding the CA is the correct fix. Using HTTP_PROXY env-only was also dismissed in favour of explicit config so the password can be passed separately via MANGAL_PROXY_PASSWORD without leaking it into the URL.

Previously, configuring proxy.url applied the proxy to every source, meaning any scrape, even for sites that work without a proxy, would incur residential proxy costs. The new proxy.mode key (default: auto) controls scope:

- auto: proxy only reaches sources whose SourceOptions() returns requires_proxy=true, or that are listed in proxy.sources. A cost reminder is logged at source-load time.
- always: previous behaviour; every source routes through the proxy.
- off: proxy is never used; sources that require it fail fast with an actionable error rather than silently misbehaving.
"always" was considered as the default for backwards compatibility, but "auto" is safer because users who set proxy.url expecting selective behaviour would be surprised to pay for every source. The breaking change is documented below. httpclient.Factory gains SetAutoProxy/AutoProxyOptions/HasProxy to store proxy opts separately from global defaults, so the loader can wire them per-source without touching the global transport. BREAKING CHANGE: proxy.mode defaults to "auto". Users with proxy.url set who relied on the old all-sources behaviour should run: mangal config set proxy.mode alwaysManga hosting sites increasingly gate content behind Cloudflare Turnstile, reCAPTCHA v2/v3, and image-text challenges. This adds an opt-in, per-source CAPTCHA-solving pipeline that integrates with external solver APIs so scrapers can transparently bypass challenges. Architecture: captcha/solver.go — Solver interface (Solve/Balance/Name), task and solution types, provider registry, global accessor (SetGlobalSolver/GlobalSolver) captcha/detect.go — Header-first challenge detection; inspects status + Cf-Mitigated header before reading up to 256 KB of body for Turnstile/reCAPTCHA fingerprints. Always restores resp.Body. captcha/middleware.go — HTTP RoundTripper wrapper: cache check → detect → solve (singleflight-deduped) → cache cookies → retry with solution applied captcha/cookiejar.go — Per-domain solution cookie cache with TTL and lazy eviction; injectable clock for tests captcha/capmonster/ — CapMonster Cloud provider (self-registers) captcha/twocaptcha/ — 2Captcha + Anti-Captcha provider (shared API protocol, per-instance Name()) Integration: Lua scrapers declare `requires_captcha_solver = true` in their SourceOptions table. The loader (provider/custom/loader.go) wires the middleware via SetSourceOptions — which now appends rather than replaces, so proxy + captcha coexist on the same source. 
Three config keys control the feature:

- captcha.provider: capmonster | 2captcha | anticaptcha
- captcha.api_key: API key (prefer MANGAL_CAPTCHA_API_KEY env)
- captcha.cookie_ttl: cache duration (default 15m)

A shared global SolutionJar means solve results for a domain are reused across scrapers on the same CDN, and singleflight prevents concurrent requests from paying for duplicate solves.

Design decisions:

- Solver middleware wraps the outermost layer of the transport so retries traverse impersonation + error context, preserving Chrome-like TLS fingerprints on the retry request.
- Detection is header-first: only reads the body when status is 403/503 AND Cloudflare headers are present, adding zero overhead to normal responses.
- Per-source opt-in (not global) was chosen because solver API calls cost real money; accidental activation on every source would surprise users. Sources without the flag are unaffected.
- Immediate first poll (before ticker) in both providers halves latency for fast solves from the solver API.

Considered but rejected:

- Browser-based solving (Playwright/Rod): too heavy for a CLI tool, requires headless Chrome, and is fragile across versions.
- Global middleware on all sources: would silently consume API credits on sites that don't need it.
- Per-source SolutionJar: wastes solve results when multiple scrapers hit the same CDN domain.