Release v5.0.0: Browser impersonation, CAPTCHA solving, proxy support, Go 1.24 #1

Merged
marvin merged 20 commits from augustohp/up into main 2026-03-23 00:37:39 +00:00

Summary

Major release bringing mangal back to life after years unmaintained.

Highlights

  • Chrome TLS/HTTP fingerprint impersonation (anti-bot bypass)
  • Automated CAPTCHA solving (CapMonster + 2Captcha)
  • Per-source proxy routing
  • Go 1.18 → 1.24 upgrade
  • Removed dead built-in sources (Manganato, Manganelo)
  • Restored Mangapill
  • Woodpecker CI pipeline

See CHANGELOG.md for full details.

Adds AGENTS.md with repository guidance intended for AI agents (layout,
packages, workflows, and conventions) so automated changes are more
consistent and easier to review.

Updates the Dockerfile to build the binary in a builder stage, embed build
metadata (BuiltAt/BuiltBy/Revision), and copy the built artifact into the
runtime image. This was done while setting up a devcontainer workspace.

Moves the module from Go 1.18 to Go 1.24 and refreshes the dependency
set, including a full vendor update, so builds use a supported toolchain
and current upstream APIs.

Replaces golang.org/x/exp usage with stdlib slices/cmp, updates SortFunc
comparators to the new int-return signature, and adapts lipgloss Style
rendering to its variadic API to keep the TUI compiling cleanly.
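As an illustration of the comparator change, the sketch below sorts a slice with the standard library `slices.SortFunc`, whose comparator returns an int rather than a bool. The `chapter` type and `sortChapters` helper are hypothetical and not taken from the mangal codebase.

```go
package main

import (
	"cmp"
	"fmt"
	"slices"
)

// chapter is a hypothetical type used only for this illustration.
type chapter struct {
	Name  string
	Index int
}

// sortChapters orders chapters by Index, then by Name. The comparator
// returns a negative, zero, or positive int, as slices.SortFunc expects.
func sortChapters(chs []chapter) {
	slices.SortFunc(chs, func(a, b chapter) int {
		return cmp.Or(
			cmp.Compare(a.Index, b.Index),
			cmp.Compare(a.Name, b.Name),
		)
	})
}

func main() {
	chs := []chapter{{"c", 3}, {"a", 1}, {"b", 2}}
	sortChapters(chs)
	fmt.Println(chs)
}
```

With golang.org/x/exp the comparator was a `less` function returning bool; the stdlib version needs a three-way result, which `cmp.Compare` provides.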

Updates the PDF converter to pdfcpu v0.11’s high-level ImportImages API
after upstream removed the prior low-level entrypoints.

Updates Docker build images to golang:1.24-* and runs go mod tidy/vendor
to align go.mod/go.sum and vendored sources.

Adds a scrapers-tagged Go test harness that loads each Lua scraper, reads its
TestQueries() table, and runs a Search → Chapters → Pages smoke flow against
live sites.

Keeps these networked checks out of the default unit test path via a separate
make target and build tag to reduce flakiness in normal CI/dev loops.

Stores “known-good” queries next to scraper logic (instead of hardcoding in Go)
so scraper maintenance includes keeping its test inputs valid.

Manganelo and Manganato no longer work, so they are removed from the built-in
provider set and from init-time registration.

Built-in sources are now Mangadex and Mangapill, and documentation is updated
to match.

CHANGELOG.md now starts the 5.0.0 section to frame this as a maintenance
release focused on restoring a working baseline, and tests are adjusted to
validate provider lookup against an existing built-in source.

go vet treats LState.RaiseError as printf-style and flags dynamic
strings passed as the format argument.

Wrap error messages with a constant "%s" format string so the
message is treated as data, satisfying vet and avoiding accidental
format expansion when messages include percent sequences.
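The "%s" wrapping described above can be sketched with a stand-in function. `raiseError` here is hypothetical: it only mimics the printf-style signature of an API such as LState.RaiseError by delegating to `fmt.Sprintf`, so the example runs without gopher-lua.

```go
package main

import "fmt"

// raiseError is a hypothetical stand-in for a printf-style API.
// It formats the message and returns it instead of raising a Lua error.
func raiseError(format string, args ...any) string {
	return fmt.Sprintf(format, args...)
}

func main() {
	// A dynamic message that happens to contain a formatting verb.
	msg := "server returned 50%d of pages"

	// With a constant "%s" format, msg is treated purely as data and
	// comes back unchanged. Passing msg itself as the format would let
	// fmt interpret "%d", and go vet would flag the call.
	fmt.Println(raiseError("%s", msg))
}
```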
This is a patch-level change with no user-visible or runtime behavior
impact.

Use keyed A:/B: fields for all lo.Tuple2 struct literals so go vet no
longer warns about unkeyed fields, keeping maintenance friction low.

Mangapill search was sending an incorrectly encoded query. It
replaced spaces with '+' and then QueryEscape encoded '+' as
%2B, so the server received "death%2Bnote" instead of "death+note".
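The double encoding can be reproduced with the standard library alone. In this sketch `buggyQuery` and `fixedQuery` are hypothetical names; the fix shown (letting `url.QueryEscape` encode the space itself) is one way to obtain "death+note" and is not necessarily how the scraper was changed.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// buggyQuery reproduces the described bug: spaces become '+' first,
// and QueryEscape then encodes that '+' as %2B.
func buggyQuery(q string) string {
	return url.QueryEscape(strings.ReplaceAll(q, " ", "+"))
}

// fixedQuery lets QueryEscape handle the space, which it encodes as '+'.
func fixedQuery(q string) string {
	return url.QueryEscape(q)
}

func main() {
	fmt.Println(buggyQuery("death note")) // death%2Bnote
	fmt.Println(fixedQuery("death note")) // death+note
}
```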

The generic provider was also sending a Host header that included
the URL scheme. It now parses BaseURL and uses the hostname.
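A minimal sketch of deriving the Host value from a base URL, assuming the base URL is a string such as "https://example.com/search". The helper name `hostFromBaseURL` is hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
)

// hostFromBaseURL parses base and returns only the hostname, so the
// scheme ("https://") never ends up in a Host header.
func hostFromBaseURL(base string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", fmt.Errorf("parse base URL: %w", err)
	}
	if u.Hostname() == "" {
		return "", fmt.Errorf("base URL %q has no host", base)
	}
	return u.Hostname(), nil
}

func main() {
	host, err := hostFromBaseURL("https://example.com/search")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(host) // example.com
}
```

A base URL without a scheme (for example "example.com") parses as a path with an empty host, so the helper reports an error instead of returning an empty Host value.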

Manga extraction relied on a brittle selector with breakpoint
classes and a rigid ancestor chain. It now uses a shorter selector
that tolerates small DOM changes.

Many manga sites use TLS and HTTP/2 fingerprinting (Cloudflare, Akamai)
to distinguish Go's net/http from real browsers and block automated
clients. This adds browser impersonation via imroc/req, which uses uTLS
to produce a Chrome-identical TLS ClientHello and HTTP/2 SETTINGS frame.

Three HTTP paths needed coverage:
- Lua scrapers: wraps package.preload["http"] to patch the transport on
  every http.client() call, avoiding vendored code modifications
- Page/API downloads: swaps network.Client.Transport at startup
- Colly-based builtins: injects via baseCollector.WithTransport()

All share a singleton req.Client configured with ImpersonateChrome().

Enabled by default (downloader.impersonate = true). Can be toggled off
via config or MANGAL_DOWNLOADER_IMPERSONATE=false.

Considered modifying the vendored mangal-lua-libs http/client directly,
but GOFLAGS=-mod=readonly in the devcontainer causes Go to compile from
the module cache rather than vendor/, making vendored patches invisible
to the build. The Lua preloader-wrapping approach avoids this entirely.

A CGO-based libcurl-impersonate wrapper was also evaluated but rejected:
it breaks cross-compilation and inflates binaries. Pure-Go uTLS covers
the vast majority of fingerprinting defenses without a C toolchain.

Adds proxy.url, proxy.username, and proxy.password config keys
(also available as MANGAL_PROXY_* env vars). When set, the proxy
applies to both the Chrome-impersonated and plain transports via a
new network.SetupProxy dispatcher, so page downloads and scraper
requests honour the same setting regardless of impersonation mode.
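The sketch below shows one way the three proxy settings could be combined into a plain `net/http` transport. The `proxyConfig` struct and `newProxyTransport` function are hypothetical; the real network.SetupProxy dispatcher is not reproduced here.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// proxyConfig is a hypothetical mirror of the proxy.url, proxy.username,
// and proxy.password settings.
type proxyConfig struct {
	URL      string
	Username string
	Password string
}

// newProxyTransport builds a transport that routes requests through the
// configured proxy. Credentials are attached separately, so the password
// does not need to be written into the URL setting.
func newProxyTransport(cfg proxyConfig) (*http.Transport, *url.URL, error) {
	u, err := url.Parse(cfg.URL)
	if err != nil {
		return nil, nil, fmt.Errorf("parse proxy URL: %w", err)
	}
	if cfg.Username != "" {
		u.User = url.UserPassword(cfg.Username, cfg.Password)
	}
	return &http.Transport{Proxy: http.ProxyURL(u)}, u, nil
}

func main() {
	_, u, err := newProxyTransport(proxyConfig{
		URL:      "http://proxy.example:8080",
		Username: "user",
		Password: "secret",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Redacted hides the password, which makes the URL safe to log.
	fmt.Println(u.Redacted())
}
```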

Embeds the Bright Data proxy CA certificate in the binary so HTTPS
through their MITM WebUnlocker works without system-level cert setup.
The CA is injected into both the utls client config (direct TLS
handshake) and the transport TLS config (CONNECT proxy path) because
the utls handshake is bypassed when going through HTTP CONNECT.

Adds a SourceOptions() hook so Lua scrapers can declare metadata such
as {requires_proxy = true}, giving future tooling a way to skip or
warn when a required proxy is absent.

patchHTTPClient now always runs (not only when impersonation is on)
so the Bright Data CA is injected into plain Lua HTTP transports too.
When swapping a Lua client's transport for the impersonated one, any
proxy already set on the original transport is preserved.

Disabling TLS verification was considered and rejected; embedding the
CA is the correct fix. Using HTTP_PROXY env-only was also dismissed in
favour of explicit config so the password can be passed separately via
MANGAL_PROXY_PASSWORD without leaking it into the URL.

Every consumer now obtains *http.Client instances from a single
httpclient.Factory instead of sharing a global network.Client
var or calling impersonate helpers directly.

The factory uses a functional-options pattern (WithImpersonation,
WithProxy, WithRootCAs, etc.) and produces fully-configured
clients with the correct transport chain. Per-source overrides
are supported via SetSourceOptions, and Lua scrapers can still
specify their own proxy which is applied as an additional
override on top of the factory config.
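The functional-options pattern with per-source overrides might look roughly like the sketch below. It borrows the option names mentioned above (WithImpersonation, WithProxy, SetSourceOptions) but everything else, including the `settings` struct and the `resolve` helper, is an assumption made for illustration and does not reproduce the real httpclient package. Impersonation is only recorded as a flag because applying it needs a third-party library.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// settings is the resolved configuration for one client (hypothetical).
type settings struct {
	impersonate bool
	proxy       *url.URL
}

// Option mutates settings; options are applied in order.
type Option func(*settings)

func WithImpersonation(on bool) Option {
	return func(s *settings) { s.impersonate = on }
}

// WithProxy ignores URLs that fail to parse or lack a host.
func WithProxy(raw string) Option {
	return func(s *settings) {
		if u, err := url.Parse(raw); err == nil && u.Host != "" {
			s.proxy = u
		}
	}
}

// Factory holds global defaults plus per-source overrides.
type Factory struct {
	defaults  []Option
	perSource map[string][]Option
}

func NewFactory(defaults ...Option) *Factory {
	return &Factory{defaults: defaults, perSource: map[string][]Option{}}
}

// SetSourceOptions appends overrides for a single source.
func (f *Factory) SetSourceOptions(source string, opts ...Option) {
	f.perSource[source] = append(f.perSource[source], opts...)
}

// resolve applies defaults first, then the source's overrides.
func (f *Factory) resolve(source string) settings {
	var s settings
	for _, opt := range f.defaults {
		opt(&s)
	}
	for _, opt := range f.perSource[source] {
		opt(&s)
	}
	return s
}

// Client builds a client for the given source from the resolved settings.
func (f *Factory) Client(source string) *http.Client {
	s := f.resolve(source)
	tr := &http.Transport{Proxy: http.ProxyFromEnvironment}
	if s.proxy != nil {
		tr.Proxy = http.ProxyURL(s.proxy)
	}
	return &http.Client{Transport: tr}
}

func main() {
	f := NewFactory(WithImpersonation(true))
	f.SetSourceOptions("mangapill", WithProxy("http://proxy.example:8080"))
	fmt.Println(f.resolve("mangapill").proxy != nil, f.resolve("mangadex").proxy != nil)
	_ = f.Client("mangapill")
}
```

Because overrides are applied after the defaults, a source-level option wins over the global one without mutating shared state.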

Removed in favour of the factory:
- network.Client (global *http.Client var)
- network.SetupImpersonation / SetupProxy
- impersonatedRoundTripper helper in provider/custom
- Direct usage of the impersonate package in main.go, network,
  and provider/custom

The impersonate package itself is kept (httpclient uses it
internally via imroc/req) but no longer referenced from
application wiring or tests.

Clearing a string config key via `mangal config set` could hit an empty
value slice and panic (seen when clearing proxy configuration).

Treat missing string values as an explicit empty string so proxy.url can
be cleared safely. Reject missing values for boolean and integer keys to
fail fast with a clear error.
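The rule described above can be sketched as a small helper that never indexes into an empty slice. The `kind` type and `valueFor` function are hypothetical and only illustrate the behaviour; they are not the actual command implementation.

```go
package main

import "fmt"

// kind is a hypothetical classification of a config key's type.
type kind string

const (
	kindString kind = "string"
	kindBool   kind = "boolean"
	kindInt    kind = "integer"
)

// valueFor returns the value to store for a key. A missing value means
// "clear" for string keys and is an error for boolean and integer keys.
func valueFor(k kind, values []string) (string, error) {
	if len(values) > 0 {
		return values[0], nil
	}
	if k == kindString {
		return "", nil
	}
	return "", fmt.Errorf("%s key requires a value", k)
}

func main() {
	v, err := valueFor(kindString, nil)
	fmt.Printf("%q %v\n", v, err)

	_, err = valueFor(kindBool, nil)
	fmt.Println(err)
}
```

Checking the slice length before reading `values[0]` is what removes the panic; the per-kind branch keeps the "clear" semantics limited to string keys.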

Adds regression tests covering empty, omitted, and normal string values.

A dead Bright Data proxy silently caused all chapter downloads to
fail with opaque 502/403 errors. No diagnostic information showed
which proxy was in use, whether impersonation was active, or how
to verify the pipeline independently.

Wraps every factory-built transport with an error context
middleware that appends proxy URL (redacted) and impersonation
status to transport errors, and logs HTTP 4xx/5xx responses.
Adds "mangal config test" to send a HEAD request through the
full HTTP pipeline, reporting proxy, impersonation, and latency.

Other fixes included:
- WithProxy with an invalid URL no longer silently disables
  ProxyFromEnvironment (proxySet only set after successful parse)
- WithProxyFunc now takes a rawURL parameter for accurate
  redacted display, replacing the separate WithProxyRawURL option
- RedactProxyURL consolidated to url.URL.Redacted() across both
  the network and httpclient packages
- config test uses RunE with immediate Body.Close() instead of
  Run+os.Exit which skipped the deferred close

Considered keeping WithProxyFunc and WithProxyRawURL as separate
options but merged them to eliminate the coupling foot-gun where
callers could forget to set one. The rawURL="" default preserves
backward compatibility for Lua proxy overrides that lack a URL.

Previously, configuring proxy.url applied the proxy to every source,
meaning any scrape — even for sites that work without a proxy — would
incur residential proxy costs.

The new proxy.mode key (default: auto) controls scope:
  auto   — proxy only reaches sources whose SourceOptions() returns
           requires_proxy=true, or that are listed in proxy.sources.
           A cost-reminder is logged at source-load time.
  always — previous behaviour; every source routes through the proxy.
  off    — proxy is never used; sources that require it fail fast with
           an actionable error rather than silently misbehaving.
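The three modes can be summarised as a small decision function. This is a sketch under assumptions: `useProxy` is a hypothetical name, an empty mode is treated as "auto" because that is the stated default, and in "off" mode only sources that declare requires_proxy are rejected.

```go
package main

import (
	"errors"
	"fmt"
)

// useProxy reports whether a source should be routed through the proxy.
// requiresProxy reflects SourceOptions(); listed reflects proxy.sources.
func useProxy(mode string, requiresProxy, listed bool) (bool, error) {
	switch mode {
	case "", "auto":
		return requiresProxy || listed, nil
	case "always":
		return true, nil
	case "off":
		if requiresProxy {
			return false, errors.New("source requires a proxy but proxy.mode is off; " +
				"set proxy.mode to auto or always")
		}
		return false, nil
	default:
		return false, fmt.Errorf("unknown proxy.mode %q", mode)
	}
}

func main() {
	for _, mode := range []string{"auto", "always", "off"} {
		use, err := useProxy(mode, true, false)
		fmt.Println(mode, use, err)
	}
}
```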

"always" was considered as the default for backwards compatibility, but
"auto" is safer because users who set proxy.url expecting selective
behaviour would be surprised to pay for every source. The breaking
change is documented below.

httpclient.Factory gains SetAutoProxy/AutoProxyOptions/HasProxy to
store proxy opts separately from global defaults, so the loader can
wire them per-source without touching the global transport.

BREAKING CHANGE: proxy.mode defaults to "auto". Users with proxy.url
set who relied on the old all-sources behaviour should run:
  mangal config set proxy.mode always

Manga hosting sites increasingly gate content behind Cloudflare
Turnstile, reCAPTCHA v2/v3, and image-text challenges. This adds
an opt-in, per-source CAPTCHA-solving pipeline that integrates with
external solver APIs so scrapers can transparently bypass challenges.

Architecture:
  captcha/solver.go     — Solver interface (Solve/Balance/Name), task
                          and solution types, provider registry, global
                          accessor (SetGlobalSolver/GlobalSolver)
  captcha/detect.go     — Header-first challenge detection; inspects
                          status + Cf-Mitigated header before reading
                          up to 256 KB of body for Turnstile/reCAPTCHA
                          fingerprints. Always restores resp.Body.
  captcha/middleware.go — HTTP RoundTripper wrapper: cache check →
                          detect → solve (singleflight-deduped) →
                          cache cookies → retry with solution applied
  captcha/cookiejar.go  — Per-domain solution cookie cache with TTL
                          and lazy eviction; injectable clock for tests
  captcha/capmonster/   — CapMonster Cloud provider (self-registers)
  captcha/twocaptcha/   — 2Captcha + Anti-Captcha provider (shared API
                          protocol, per-instance Name())

Integration:
  Lua scrapers declare `requires_captcha_solver = true` in their
  SourceOptions table. The loader (provider/custom/loader.go) wires
  the middleware via SetSourceOptions — which now appends rather than
  replaces, so proxy + captcha coexist on the same source.

  Three config keys control the feature:
    captcha.provider   — capmonster | 2captcha | anticaptcha
    captcha.api_key    — API key (prefer MANGAL_CAPTCHA_API_KEY env)
    captcha.cookie_ttl — cache duration (default 15m)

  A shared global SolutionJar means solve results for a domain are
  reused across scrapers on the same CDN, and singleflight prevents
  concurrent requests from paying for duplicate solves.

Design decisions:
  - Solver middleware wraps the outermost layer of the transport
    so retries traverse impersonation + error context, preserving
    Chrome-like TLS fingerprints on the retry request.
  - Detection is header-first: only reads the body when status is
    403/503 AND Cloudflare headers are present, adding zero overhead
    to normal responses.
  - Per-source opt-in (not global) was chosen because solver API
    calls cost real money — accidental activation on every source
    would surprise users. Sources without the flag are unaffected.
  - Immediate first poll (before ticker) in both providers halves
    latency for fast solves from the solver API.

Considered but rejected:
  - Browser-based solving (Playwright/Rod): too heavy for a CLI tool,
    requires headless Chrome, and is fragile across versions.
  - Global middleware on all sources: would silently consume API
    credits on sites that don't need it.
  - Per-source SolutionJar: wastes solve results when multiple
    scrapers hit the same CDN domain.

The scraper test harness now calls config.Setup before building the HTTP
factory, so viper loads mangal.toml and env bindings exactly like main.go.

The proxy URL is read via viper.GetString(key.ProxyURL) and, when enabled,
wired into the factory as an auto-proxy. Scrapers that fail only because a
proxy is missing are skipped, keeping CI green while still exercising them
when MANGAL_PROXY_URL (or toml) is configured.

Directory creation failures were previously ignored, which could lead to later
save failures without clear root cause. This change propagates errors when
creating the downloads directory and wraps permission-related save errors with
actionable context, including the configured directory path when available.

The wrapper preserves `errors.Is(..., os.ErrPermission)` behavior. Tests cover
`os.ErrPermission`, `*os.PathError`, nil input, and pass-through for non-perm
errors.

Searching across multiple sources could panic or race when a source
failed to initialize. loadSources previously shared an err across
goroutines and wrote into a pre-sized slice by index.

Collect successfully created sources under a mutex and let each
goroutine handle its own init error. searchManga now skips nil sources,
returns after Search errors, and guards concurrent appends to the
results slice.

Considered canceling the whole search on first failure, but partial
results are more useful than aborting.

Add Woodpecker CI pipeline
Some checks failed
ci/woodpecker/push/woodpecker Pipeline failed
27375dd467
marvin merged commit 27375dd467 into main 2026-03-23 00:37:39 +00:00
Reference
mangal2/mangal!1