Pagination is where many scraping jobs quietly fail. The parser still works, selectors still match, and the site still loads, but records go missing because the listing now advances with a cursor instead of a page number, or because a frontend swapped numbered links for a “Load more” button. This guide is a practical reference for handling the main pagination patterns in web scraping: offset pagination, cursor pagination, infinite scroll, and load more interfaces. It also explains what to track over time so your scraper keeps working as modern frontends and APIs evolve.
Overview
If you scrape lists of products, jobs, articles, reviews, search results, or directory entries, pagination is not a small implementation detail. It defines how you discover every item, how you deduplicate records, how you estimate coverage, and how you resume interrupted jobs.
In practice, pagination usually appears in one of four patterns:
- Offset or page-number pagination: requests include values like ?page=3 or ?offset=60&limit=20.
- Cursor pagination: the next request depends on a token such as cursor=abc123, after=..., or a timestamp/id boundary.
- Load more pagination: a button triggers another request and appends more items to the current DOM.
- Infinite scroll: scrolling near the bottom causes the frontend to fetch and render the next batch automatically.
These patterns may look different in the browser, but the scraper’s job is always the same: identify the data source, understand the continuation mechanism, and stop cleanly without missing or duplicating items.
The most reliable workflow is usually:
- Inspect the page in your browser’s developer tools.
- Watch the network calls while you click the next page, press load more, or scroll.
- Prefer direct API or XHR requests when they expose the same data as the rendered page.
- Use a browser automation tool only when the continuation logic truly depends on client-side execution.
- Record stable checkpoints so the scraper can resume after failures.
That approach matters because DOM-first scraping often breaks earlier than request-level scraping. A button label can change. A container class can be renamed. But a JSON payload that drives the listing may remain fairly stable for longer.
If you are deciding between stacks, it helps to separate browser automation from HTML parsing. For JavaScript-heavy sites, browser tools are often useful for discovery and fallback. For server-rendered pages or stable endpoints, lightweight request-based scrapers are often easier to maintain. Related comparisons are covered in JavaScript Web Scraping in 2026: Puppeteer vs Playwright vs Cheerio and Python Web Scraping Stack Comparison: Requests vs BeautifulSoup vs Scrapy vs Playwright.
Pattern 1: Offset and page-number pagination
This is the most familiar pattern and often the simplest to scrape. You may see URLs such as:
- /products?page=4
- /search?offset=120&limit=30
- /articles?p=2
Typical extraction tactic:
- Start from page 1 or offset 0.
- Increment by one page or by the requested batch size.
- Stop when the response is empty, when it returns fewer items than the requested batch size (keep those items, since they are the final page), or when it repeats the previous page’s identifiers.
Offset pagination looks stable, but it has a hidden weakness: if the underlying list changes while you scrape, offsets can shift. New items inserted at the top can cause duplicates or skipped records downstream. For frequently changing datasets, store stable item IDs and deduplicate after collection.
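The offset loop described above can be sketched in a few lines. This is a minimal illustration rather than a definitive implementation: fetch_page is a hypothetical callable standing in for whatever HTTP request the target needs, and the "id" field name is an assumption about the payload.

```python
from typing import Callable, Iterator

# Hypothetical callable: (offset, limit) -> list of item dicts.
# In a real scraper it would wrap an HTTP request to the listing endpoint.
FetchPage = Callable[[int, int], list[dict]]


def scrape_by_offset(
    fetch_page: FetchPage, limit: int = 20, max_pages: int = 1000
) -> Iterator[dict]:
    """Yield unique items from an offset-paginated source."""
    seen_ids: set[str] = set()
    previous_page_ids: list[str] = []

    for page_index in range(max_pages):  # hard cap guards against endless loops
        items = fetch_page(page_index * limit, limit)
        if not items:
            return  # empty response: we are past the end of the list

        page_ids = [str(item["id"]) for item in items]  # "id" is an assumed field
        if page_ids == previous_page_ids:
            return  # the server repeated the previous page
        previous_page_ids = page_ids

        for item, item_id in zip(items, page_ids):
            if item_id not in seen_ids:  # drop duplicates caused by shifting offsets
                seen_ids.add(item_id)
                yield item

        if len(items) < limit:
            return  # short page: this was the final batch
```

Deduplicating inside the loop handles items that reappear when the list shifts. It cannot recover items that were skipped, so a second pass or a stable sort may still be needed on fast-moving lists.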
Pattern 2: Cursor pagination
Cursor pagination is common in APIs and modern apps because it scales better for large datasets and changing lists. Instead of asking for “page 5,” the client asks for “the next batch after this known record or token.”
You may see parameters such as:
- cursor=eyJpZCI6...
- after=last_seen_id
- next_token=...
- updated_before=timestamp
Typical extraction tactic:
- Request the first batch.
- Extract both the records and the continuation token.
- Repeat until no next token is returned.
The main rule here is simple: do not try to fabricate the next cursor unless you fully understand the API contract. Many cursors are opaque by design. Treat them as pass-through values captured from the previous response.
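A pass-through cursor loop might look like the sketch below. It assumes a hypothetical fetch_batch callable that returns the items and the next token. The real response shape depends on the API you are scraping.

```python
from typing import Callable, Iterator, Optional

# Hypothetical callable: cursor (None for the first request) -> (items, next_cursor).
# next_cursor is None or empty when the API reports no further data.
FetchBatch = Callable[[Optional[str]], tuple[list[dict], Optional[str]]]


def scrape_by_cursor(
    fetch_batch: FetchBatch, max_batches: int = 10_000
) -> Iterator[dict]:
    """Yield items from a cursor-paginated source without altering cursors."""
    cursor: Optional[str] = None
    seen_cursors: set[str] = set()

    for _ in range(max_batches):  # hard cap in case the API never ends
        items, next_cursor = fetch_batch(cursor)
        yield from items

        if not next_cursor:
            return  # normal end: no continuation token returned
        if next_cursor in seen_cursors:
            return  # repeated token: stop instead of looping forever
        seen_cursors.add(next_cursor)
        cursor = next_cursor  # passed through exactly as received
```

The cursor is never parsed or rebuilt here. The only logic applied to it is the repeated-token check, which protects against a broken continuation loop.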
Pattern 3: Load more interfaces
A load more button usually hides a paginated request behind a user action. To the user, the page stays the same while more items are appended. To the scraper, this often means there is still an underlying API call with predictable parameters.
Typical extraction tactic:
- Click the button manually while watching network requests.
- Identify whether a JSON, GraphQL, or HTML fragment request powers the appended content.
- Reproduce that request directly if possible.
- Use browser automation to click repeatedly only when no stable request interface is available.
If you automate the browser path, use a loop that detects when the button disappears, becomes disabled, or stops adding new item IDs.
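One way to structure that loop is to keep it independent of any particular browser library. In this sketch the three callables are hypothetical hooks that you would implement with your automation tool, and the stop reason strings are illustrative names.

```python
from typing import Callable


def click_until_exhausted(
    button_is_clickable: Callable[[], bool],
    click_button: Callable[[], None],
    collect_item_ids: Callable[[], set[str]],
    max_clicks: int = 500,
) -> tuple[set[str], str]:
    """Click "Load more" until the button is gone or stops adding new item IDs.

    Returns the collected IDs and a short stop reason for logging.
    """
    seen_ids = set(collect_item_ids())

    for _ in range(max_clicks):
        if not button_is_clickable():
            return seen_ids, "button_missing_or_disabled"

        click_button()  # expected to wait for the appended items before returning
        current_ids = collect_item_ids()
        if not (current_ids - seen_ids):
            return seen_ids, "no_new_ids"  # the click added nothing new
        seen_ids |= current_ids

    return seen_ids, "max_clicks_reached"
```

With Playwright's sync API, for example, the hooks could wrap locator calls such as is_enabled() and click(). The selectors and the wait strategy remain site-specific assumptions.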
Pattern 4: Infinite scroll
Infinite scroll is conceptually similar to load more, but the trigger is a scroll threshold instead of a click. The page watches the viewport, sends a request, and appends more cards or rows.
Typical extraction tactic:
- Scroll incrementally, not just once to the absolute bottom.
- Pause long enough for requests and rendering to finish.
- Track known item IDs after each scroll cycle.
- Stop when multiple scroll cycles produce no new IDs.
For many infinite scroll pages, the real work is still happening through an API call. The browser automation layer is just triggering it. Whenever possible, inspect the network panel and move to request-level scraping.
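When you do have to drive the scroll, the tactic above reduces to a loop with an idle counter. The sketch below assumes two hypothetical hooks, scroll_step and collect_item_ids, supplied by your browser automation layer. The default pause and idle limits are placeholders to tune per site.

```python
import time
from typing import Callable


def scroll_until_stable(
    scroll_step: Callable[[], None],
    collect_item_ids: Callable[[], set[str]],
    pause_seconds: float = 1.5,
    max_idle_cycles: int = 3,
    max_cycles: int = 1000,
) -> set[str]:
    """Scroll in steps until several consecutive cycles add no new item IDs."""
    seen_ids = set(collect_item_ids())
    idle_cycles = 0

    for _ in range(max_cycles):  # hard cap for feeds that never end
        scroll_step()              # e.g. scroll down by about one viewport height
        time.sleep(pause_seconds)  # give requests and rendering time to finish
        new_ids = collect_item_ids() - seen_ids
        if new_ids:
            seen_ids |= new_ids
            idle_cycles = 0
        else:
            idle_cycles += 1
            if idle_cycles >= max_idle_cycles:
                break  # several cycles in a row produced nothing new
    return seen_ids
```

Requiring several idle cycles instead of one reduces the chance of stopping early because a single batch was slow to render.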
What to track
A scraper that handles pagination once is useful. A scraper that keeps handling it after the site changes is much more valuable. To get there, track the variables that usually drift over time.
1. The continuation mechanism
Record exactly how the next batch is requested:
- Query parameters like page, offset, limit
- Cursor fields like cursor, after, endCursor
- Headers required for XHR or fetch requests
- POST body values used by load more or GraphQL pagination
This is the first thing to check when a scraper suddenly plateaus early.
2. Batch size and expected page length
Track how many items usually arrive per request. If your scraper expects 24 items but starts receiving 12, the site may have changed layout rules, testing buckets, geolocation behavior, or filtering defaults.
Batch size is also useful for anomaly detection. A sudden drop in average items per page can signal throttling, partial rendering, or request rejection.
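A batch-size check can be as small as the sketch below. The 25% tolerance and the use of a median are illustrative choices, not recommendations from any particular tool.

```python
from statistics import median


def batch_size_alert(
    recent_counts: list[int], expected: int, tolerance: float = 0.25
) -> bool:
    """Return True when typical page size drifts from the expected batch size.

    recent_counts should exclude the final page, which is legitimately short.
    """
    if not recent_counts:
        return True  # no pages at all is itself an anomaly
    typical = median(recent_counts)
    return abs(typical - expected) > tolerance * expected
```

For example, batch_size_alert([12, 12, 12], expected=24) flags the halved page size from the scenario above, while a run of 24-item pages with one 23-item page does not.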
3. Stable item identifiers
Every paginated scraper should try to capture a stable per-item identifier such as a product ID, slug, canonical URL, or internal record key. This lets you:
- Deduplicate across shifting pages
- Detect repeated pages
- Measure true coverage
- Resume without relying only on page numbers
If no obvious ID exists, derive a fallback key from normalized URLs or a careful hash of critical fields.
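Both fallbacks can be sketched with the standard library. The list of tracking parameters and the field names used for hashing are assumptions to adapt per target.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query parameters assumed to be tracking noise; adjust per target.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref"}


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop fragment and tracking params, sort the query."""
    parts = urlsplit(url.strip())
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    )
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def fallback_key(
    item: dict, fields: tuple[str, ...] = ("title", "seller", "price")
) -> str:
    """Prefer a normalized URL; otherwise hash a few normalized fields."""
    if item.get("url"):
        return normalize_url(item["url"])
    # Collapse whitespace and case so cosmetic changes do not alter the key.
    joined = "\x1f".join(
        " ".join(str(item.get(name, "")).split()).lower() for name in fields
    )
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()
```

A hashed key changes whenever one of its fields changes, so it is worth choosing fields that rarely move. Price, for instance, may be a poor choice on sites that reprice often.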
4. Stop conditions
Define and log the reason the scraper stops. Good stop conditions include:
- No next page link or no next cursor returned
- Response contains zero items
- Response repeats a previously seen cursor or page
- No new item IDs after several load-more or scroll attempts
Weak stop conditions lead to silent loops, duplicated data, or incomplete collections.
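Making the stop reason an explicit value keeps it loggable. The sketch below maps the four conditions above onto a hypothetical StopReason enum. The names and the idle threshold are illustrative.

```python
from enum import Enum
from typing import Optional


class StopReason(Enum):
    EMPTY_RESPONSE = "empty_response"
    NO_NEXT_CURSOR = "no_next_cursor"
    REPEATED_CURSOR = "repeated_cursor"
    NO_NEW_IDS = "no_new_ids"


def check_stop(
    items: list[dict],
    next_cursor: Optional[str],
    seen_cursors: set[str],
    idle_attempts: int,
    max_idle_attempts: int = 3,
) -> Optional[StopReason]:
    """Return the reason to stop after this batch, or None to keep going."""
    if not items:
        return StopReason.EMPTY_RESPONSE
    if next_cursor is None:
        return StopReason.NO_NEXT_CURSOR
    if next_cursor in seen_cursors:
        return StopReason.REPEATED_CURSOR
    if idle_attempts >= max_idle_attempts:
        return StopReason.NO_NEW_IDS
    return None
```

Logging the returned value on every run makes it obvious whether a job ended normally or hit one of the defensive conditions.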
5. Ordering fields
Track whether results are ordered by newest, relevance, price, popularity, or an opaque ranking. Pagination behaves differently depending on sorting. A rapidly changing “most recent” feed is much harder to scrape by offset than a stable alphabetical directory.
If the ordering can be controlled through parameters, store that explicitly. A stable sort often makes extraction and deduplication easier.
6. Request fingerprints and auth dependencies
Some paginated endpoints require:
- CSRF tokens
- Session cookies
- Authorization headers
- GraphQL operation names and variables
- Signed parameters or short-lived tokens
Track which values are static, which are session-bound, and which expire quickly. This is especially important for infinite scroll scraping where the page appears public but the underlying request includes hidden client state. A JWT decoder, JSON formatter, base64 decode tool, or URL encoder/decoder can speed up this analysis when tokens and payloads are involved.
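If you prefer to inspect tokens locally, a few lines of standard-library Python can decode base64url segments. The sketch is for inspection only: it does not verify signatures, and it assumes the segment contains JSON, which is common for JWT payloads but not guaranteed for arbitrary tokens.

```python
import base64
import json


def decode_base64url_json(segment: str) -> dict:
    """Decode a base64url segment that is assumed to contain JSON."""
    padded = segment + "=" * (-len(segment) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(padded))


def jwt_claims(token: str) -> dict:
    """Return the payload of a JWT WITHOUT verifying its signature.

    Useful for reading fields such as "exp" to see how quickly a token expires.
    """
    _header, payload, _signature = token.split(".")  # raises if not three parts
    return decode_base64url_json(payload)
```

Reading the "exp" claim, when present, tells you whether a captured token can be reused across a run or must be refreshed between batches.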
7. Rendering dependency
Determine whether the next batch is available from plain HTTP requests or only after JavaScript execution. That single distinction affects stack choice, runtime cost, and failure modes.
If data is available in raw JSON, a lightweight scraper may be enough. If requests only appear after client-side logic runs, Playwright or Puppeteer may be the safer route.
8. Anti-bot responses and throttling signals
Pagination failures are not always parsing failures. Track signs of enforcement such as:
- Unexpected redirects
- Blank pages after several requests
- Reduced page sizes
- Captcha or challenge interstitials
- HTTP status changes or inconsistent payload schemas
Use polite request rates, caching, and bounded retries with backoff. Also review robots guidance and the legal context for your use case; both are covered in Robots.txt for Web Scraping: What It Means and What It Does Not and Web Scraping Legality Guide by Country: What Changes in 2026.
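The enforcement signals listed above can be collected per response and logged alongside the data. The sketch below is heuristic: the status codes and challenge markers are assumptions that vary by site, and the backoff helper uses exponential backoff with jitter as one common retry policy.

```python
import random

# Assumed markers of a challenge page; real sites differ.
CHALLENGE_MARKERS = ("captcha", "verify you are human")


def enforcement_signals(
    status: int,
    requested_url: str,
    final_url: str,
    body: str,
    item_count: int,
    expected_count: int,
) -> list[str]:
    """Return labels for signs of throttling or blocking in one response."""
    signals = []
    if status in (403, 429, 503):  # assumed set of suspicious statuses
        signals.append(f"status_{status}")
    if final_url != requested_url:
        signals.append("unexpected_redirect")
    if not body.strip():
        signals.append("blank_page")
    if any(marker in body.lower() for marker in CHALLENGE_MARKERS):
        signals.append("challenge_page")
    if 0 < item_count < expected_count:
        signals.append("reduced_page_size")
    return signals


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based), with full jitter."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

A reduced page size on the final page is normal, so that signal is most useful when it appears in the middle of a run.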
Cadence and checkpoints
The best way to keep pagination scraping reliable is to treat it as something you inspect on a schedule, not only when it breaks. This is especially useful for recurring datasets such as product catalogs, search results, or job listings.
Monthly checkpoints for stable targets
If the target site rarely changes, a monthly review is often enough. Check:
- Whether the continuation parameters are unchanged
- Whether the average batch size is still similar
- Whether total page counts or item counts changed in plausible ways
- Whether sample pages still return the same key fields
This kind of lightweight review catches gradual frontend migrations before a full failure occurs.
Quarterly checkpoints for broader workflow health
On a quarterly cadence, review not just the target site but your scraper design:
- Can a browser-based flow now be replaced with direct requests?
- Are there brittle selectors you can remove?
- Is your resume logic still robust?
- Are duplicates increasing due to feed churn?
- Do logging and alerts surface pagination failures clearly?
This is also a good time to revisit tool choices and update internal documentation.
Per-run checkpoints for production jobs
For automated pipelines, log a few pagination-specific checkpoints on every run:
- First page item count
- Last successful cursor or page number
- Total unique IDs collected
- Number of duplicate IDs encountered
- Stop reason
- Any non-200 responses, retries, or challenge pages
These logs turn debugging from guesswork into comparison. When a scraper breaks, you can see whether it stopped too early, looped, or started receiving partial data.
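A small record type is enough to capture these checkpoints. The class and field names below are hypothetical. The point is that every run emits the same comparable summary.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunLog:
    pages_fetched: int = 0
    first_page_item_count: int = 0
    last_position: Optional[str] = None  # last successful cursor or page number
    unique_ids: set = field(default_factory=set)
    duplicate_count: int = 0
    stop_reason: Optional[str] = None
    incidents: list = field(default_factory=list)  # non-200s, retries, challenges

    def record_page(self, position: str, item_ids: list) -> None:
        """Update counters after one successfully fetched page or batch."""
        if self.pages_fetched == 0:
            self.first_page_item_count = len(item_ids)
        self.pages_fetched += 1
        self.last_position = position
        for item_id in item_ids:
            if item_id in self.unique_ids:
                self.duplicate_count += 1
            else:
                self.unique_ids.add(item_id)

    def summary(self) -> str:
        """Serialize the checkpoints as one JSON line for the run log."""
        return json.dumps(
            {
                "first_page_item_count": self.first_page_item_count,
                "last_position": self.last_position,
                "total_unique_ids": len(self.unique_ids),
                "duplicate_ids": self.duplicate_count,
                "stop_reason": self.stop_reason,
                "incidents": self.incidents,
            }
        )
```

Writing one JSON line per run makes it easy to diff today's run against last week's when counts look wrong.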
A simple checkpoint template
For each target, keep a small record with:
- Pagination type: offset, cursor, load more, infinite scroll
- Entry URL: the listing page you start from
- Request pattern: endpoint, method, params, headers, body
- Unique key: item ID or normalized URL
- Stop rule: no cursor, empty page, repeated IDs, button disabled
- Known risks: unstable sorting, short-lived token, regional results
- Last reviewed: date of manual validation
This is simple, but it makes recurring maintenance far easier.
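The template can live in code so that review dates are checked automatically. The sketch below mirrors the fields above. The class name, the example values, and the 30-day default are all illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class TargetCheckpoint:
    pagination_type: str    # "offset", "cursor", "load_more", or "infinite_scroll"
    entry_url: str          # the listing page you start from
    request_pattern: dict   # endpoint, method, params, headers, body
    unique_key: str         # item ID field or "normalized_url"
    stop_rule: str          # e.g. "no cursor", "empty page", "button disabled"
    known_risks: list[str]  # e.g. unstable sorting, short-lived token
    last_reviewed: date     # date of the last manual validation

    def review_due(self, today: date, max_age_days: int = 30) -> bool:
        """True when the last manual validation is older than the cadence."""
        return (today - self.last_reviewed).days >= max_age_days


# Hypothetical example record; the URL and values are placeholders.
example = TargetCheckpoint(
    pagination_type="cursor",
    entry_url="https://example.com/jobs",
    request_pattern={"endpoint": "/api/jobs", "method": "GET", "params": ["after"]},
    unique_key="id",
    stop_rule="no cursor",
    known_risks=["short-lived token"],
    last_reviewed=date(2025, 1, 1),
)
```

Calling example.review_due(date.today()) from a scheduled job is one way to turn the monthly or quarterly cadence into a reminder.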
How to interpret changes
When a pagination scraper starts underperforming, the visible symptom is often “fewer rows than usual.” The useful question is why. Different patterns point to different root causes.
If page counts drop suddenly
A sharp drop may mean:
- The site changed filters or default sorting
- The endpoint now requires an extra parameter
- You are being throttled or served partial content
- The scraper is failing after the first continuation step
Compare the first request and the second request side by side. Pagination bugs often appear not at the start, but at the handoff to the next batch.
If duplicates increase
Growing duplicates usually indicate one of three things:
- Offset-based scraping on a changing feed
- A broken cursor loop reusing the same continuation token
- Scroll or load-more automation that is not waiting for new content before repeating
The fix is usually to rely more heavily on unique IDs, stronger stop conditions, and explicit checks for repeated cursors or repeated last-item IDs.
If the DOM changed but the data did not
This is a common and often fixable case. If your selectors broke after a redesign, inspect the network calls before rewriting the whole scraper. The rendered markup may have changed completely while the backing JSON endpoint remained mostly the same.
If the request still works but the schema changed
When item fields move or are renamed, pagination may still succeed while downstream parsing fails. Separate your pagination logic from your extraction logic when possible. That way, you can verify that the scraper is still discovering all pages even if individual field mapping needs an update.
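One way to keep the two concerns apart is to let pagination yield raw items and handle field mapping in a separate function. In this sketch, fetch_batch is a hypothetical callable and the "id" and "title" fields are assumed names.

```python
from typing import Callable, Iterator, Optional

# Hypothetical callable: cursor (None at first) -> (raw_items, next_cursor).
FetchBatch = Callable[[Optional[str]], tuple[list[dict], Optional[str]]]


def iter_raw_items(fetch_batch: FetchBatch, max_batches: int = 10_000) -> Iterator[dict]:
    """Pagination only: yield raw items and know nothing about their fields."""
    cursor = None
    for _ in range(max_batches):
        items, cursor = fetch_batch(cursor)
        yield from items
        if not cursor:
            return


def extract(raw: dict) -> dict:
    """Extraction only: map raw fields to the output schema (assumed names)."""
    return {"id": raw["id"], "title": raw["title"]}


def run(fetch_batch: FetchBatch) -> dict:
    """Count discovery and parsing separately so schema drift is visible."""
    discovered, parsed, failures = 0, 0, 0
    for raw in iter_raw_items(fetch_batch):
        discovered += 1
        try:
            extract(raw)
            parsed += 1
        except KeyError:
            failures += 1  # field moved or renamed; pagination is unaffected
    return {"discovered": discovered, "parsed": parsed, "parse_failures": failures}
```

If discovered stays at its usual level while parse_failures climbs, the pagination is healthy and only the field mapping needs attention.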
If browser automation became flaky
Flaky scroll timing, lazy rendering, and hydration issues are signs to check whether you can shift from UI-driven pagination to direct requests. Browser automation is powerful, but it should not be the default if a stable endpoint exists underneath.
When to revisit
Use this article as a recurring checklist whenever a paginated target is important enough to monitor over time. In practice, revisit your pagination strategy in five situations:
- After a frontend redesign: even small layout changes can hide a new request flow.
- When total collected records move outside the normal range: especially if the change is abrupt.
- When duplicate rates rise: often a sign of offset drift or repeated cursors.
- When anti-bot behavior appears: challenge pages and partial payloads can look like pagination bugs.
- On a monthly or quarterly review cadence: even when nothing seems broken.
A practical action plan looks like this:
- Classify the target as offset, cursor, load more, or infinite scroll.
- Identify the real data source in the network panel.
- Store stable item IDs and explicit stop conditions.
- Log the last successful page or cursor on every run.
- Compare expected versus actual unique item counts.
- Schedule a manual validation check on a recurring basis.
If you do only one thing differently, do this: stop treating pagination as just a loop. Treat it as a moving interface with observable signals. Once you log continuation tokens, page lengths, unique IDs, stop reasons, and enforcement signals, pagination becomes much easier to maintain.
That habit is what keeps a one-off scraper from turning into a recurring maintenance burden. And because pagination patterns change as sites move between server rendering, client-side apps, GraphQL, and hybrid architectures, this is one of the few scraping topics worth revisiting regularly.