How to Handle Pagination in Web Scraping: Offset, Cursor, Infinite Scroll, and Load More

A practical guide to scraping paginated data reliably across offset, cursor, load more, and infinite scroll patterns.

Pagination is where many scraping jobs quietly fail. The parser still works, selectors still match, and the site still loads, but records go missing because the listing now advances with a cursor instead of a page number, or because a frontend swapped numbered links for a “Load more” button. This guide is a practical reference for handling the main pagination patterns in web scraping: offset pagination, cursor pagination, infinite scroll, and load more interfaces. It also explains what to track over time so your scraper keeps working as modern frontends and APIs evolve.

Overview

If you scrape lists of products, jobs, articles, reviews, search results, or directory entries, pagination is not a small implementation detail. It defines how you discover every item, how you deduplicate records, how you estimate coverage, and how you resume interrupted jobs.

In practice, pagination usually appears in one of four patterns:

Offset or page-number pagination: requests include values like ?page=3 or ?offset=60&limit=20.
Cursor pagination: the next request depends on a token such as cursor=abc123, after=..., or a timestamp/id boundary.
Load more pagination: a button triggers another request and appends more items to the current DOM.
Infinite scroll: scrolling near the bottom causes the frontend to fetch and render the next batch automatically.

These patterns may look different in the browser, but the scraper’s job is always the same: identify the data source, understand the continuation mechanism, and stop cleanly without missing or duplicating items.

The most reliable workflow is usually:

Inspect the page in your browser’s developer tools.
Watch the network calls while you click the next page, press load more, or scroll.
Prefer direct API or XHR requests when they expose the same data as the rendered page.
Use a browser automation tool only when the continuation logic truly depends on client-side execution.
Record stable checkpoints so the scraper can resume after failures.

That approach matters because DOM-first scraping often breaks earlier than request-level scraping. A button label can change. A container class can be renamed. But a JSON payload that drives the listing may remain fairly stable for longer.

If you are deciding between stacks, it helps to separate browser automation from HTML parsing. For JavaScript-heavy sites, browser tools are often useful for discovery and fallback. For server-rendered pages or stable endpoints, lightweight request-based scrapers are often easier to maintain. Related comparisons are covered in “JavaScript Web Scraping in 2026: Puppeteer vs Playwright vs Cheerio” and “Python Web Scraping Stack Comparison: Requests vs BeautifulSoup vs Scrapy vs Playwright.”

Pattern 1: Offset and page-number pagination

This is the most familiar pattern and often the simplest to scrape. You may see URLs such as:

/products?page=4
/search?offset=120&limit=30
/articles?p=2

Typical extraction tactic:

Start from page 1 or offset 0.
Increment by one page or by the requested batch size.
Stop when the response is empty, shorter than expected, or repeats the previous page’s identifiers.

Offset pagination looks stable, but it has a hidden weakness: if the underlying list changes while you scrape, offsets can shift. New items inserted at the top can cause duplicates or skipped records downstream. For frequently changing datasets, store stable item IDs and deduplicate after collection.
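
The offset tactic above can be sketched as a small loop. This is a minimal illustration, not a definitive implementation: `fetch_page` is a hypothetical callable that you would write around the real endpoint, and the sketch assumes every item carries a stable `id` field.

```python
from typing import Callable, Dict, Iterator, List

Item = Dict[str, object]


def scrape_offset(
    fetch_page: Callable[[int, int], List[Item]],  # hypothetical: (offset, limit) -> items
    limit: int = 20,
    max_pages: int = 1000,
) -> Iterator[Item]:
    """Yield each unique item once from an offset-paginated listing."""
    seen_ids = set()
    offset = 0
    for _ in range(max_pages):  # hard cap so a misbehaving endpoint cannot loop forever
        items = fetch_page(offset, limit)
        if not items:
            break  # stop: empty response
        new_count = 0
        for item in items:
            if item["id"] in seen_ids:
                continue  # duplicate caused by shifting offsets or a repeated page
            seen_ids.add(item["id"])
            new_count += 1
            yield item
        if new_count == 0:
            break  # stop: the page only repeated known identifiers
        if len(items) < limit:
            break  # stop: short page, assumed to be the last one
        offset += limit
```

Deduplicating inside the loop rather than after collection is a design choice here: it lets the repeated-page stop condition fire as soon as a page contributes nothing new.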

Pattern 2: Cursor pagination

Cursor pagination is common in APIs and modern apps because it scales better for large datasets and changing lists. Instead of asking for “page 5,” the client asks for “the next batch after this known record or token.”

You may see parameters such as:

cursor=eyJpZCI6...
after=last_seen_id
next_token=...
updated_before=timestamp

Typical extraction tactic:

Request the first batch.
Extract both the records and the continuation token.
Repeat until no next token is returned.

The main rule here is simple: do not try to fabricate the next cursor unless you fully understand the API contract. Many cursors are opaque by design. Treat them as pass-through values captured from the previous response.
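
A cursor loop following those steps might look like the sketch below. `fetch_batch` is a hypothetical callable that takes the previous cursor (or `None` for the first request) and returns the items plus whatever continuation token the response contained. The cursor is passed through untouched and never constructed by the scraper.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Item = Dict[str, object]
Batch = Tuple[List[Item], Optional[str]]


def scrape_cursor(
    fetch_batch: Callable[[Optional[str]], Batch],  # hypothetical: cursor -> (items, next_cursor)
    max_batches: int = 10_000,
) -> Iterator[Item]:
    """Yield items batch by batch until no continuation token is returned."""
    seen_cursors = set()
    cursor: Optional[str] = None
    for _ in range(max_batches):
        items, next_cursor = fetch_batch(cursor)
        yield from items
        if not next_cursor:
            return  # stop: the API reported no further batch
        if next_cursor in seen_cursors:
            return  # stop: repeated cursor, which would otherwise loop forever
        seen_cursors.add(next_cursor)
        cursor = next_cursor  # opaque value, reused exactly as received
```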

Pattern 3: Load more interfaces

A load more button usually hides a paginated request behind a user action. To the user, the page stays the same while more items are appended. To the scraper, this often means there is still an underlying API call with predictable parameters.

Typical extraction tactic:

Click the button manually while watching network requests.
Identify whether a JSON, GraphQL, or HTML fragment request powers the appended content.
Reproduce that request directly if possible.
Use browser automation to click repeatedly only when no stable request interface is available.

If you automate the browser path, use a loop that detects when the button disappears, becomes disabled, or stops adding new item IDs.
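
One way to structure that loop is shown below, kept independent of any particular automation library. The three callables are hypothetical wrappers you would write around your browser tool, and `click_button` is assumed to wait for the appended content before returning.

```python
from typing import Callable, Set


def click_until_exhausted(
    button_is_clickable: Callable[[], bool],  # hypothetical: False if missing or disabled
    click_button: Callable[[], None],         # hypothetical: clicks and waits for new content
    current_ids: Callable[[], Set[str]],      # hypothetical: item IDs currently in the DOM
    max_clicks: int = 500,
) -> str:
    """Click "Load more" until it stops being useful; return the stop reason."""
    known = set(current_ids())
    for _ in range(max_clicks):
        if not button_is_clickable():
            return "button_gone_or_disabled"
        click_button()
        ids = set(current_ids())
        if not (ids - known):
            return "no_new_ids"  # the click added nothing we had not already seen
        known |= ids
    return "max_clicks_reached"
```

Returning the stop reason as a string makes it easy to log alongside the collected items.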

Pattern 4: Infinite scroll

Infinite scroll is conceptually similar to load more, but the trigger is a scroll threshold instead of a click. The page watches the viewport, sends a request, and appends more cards or rows.

Typical extraction tactic:

Scroll incrementally, not just once to the absolute bottom.
Pause long enough for requests and rendering to finish.
Track known item IDs after each scroll cycle.
Stop when multiple scroll cycles produce no new IDs.

For many infinite scroll pages, the real work is still happening through an API call. The browser automation layer is just triggering it. Whenever possible, inspect the network panel and move to request-level scraping.
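
When the browser path is unavoidable, the scroll tactic above can be sketched as a loop that counts consecutive idle cycles. The callables are hypothetical hooks into whatever automation tool you use, and the default of three idle cycles is an assumption to tune per target.

```python
from typing import Callable, Hashable, Set


def scroll_until_idle(
    scroll_step: Callable[[], None],           # hypothetical: scroll down by one increment
    current_ids: Callable[[], Set[Hashable]],  # hypothetical: item IDs currently rendered
    wait: Callable[[], None],                  # hypothetical: pause for requests and rendering
    max_idle_cycles: int = 3,
    max_cycles: int = 1000,
) -> Set[Hashable]:
    """Scroll in increments and stop after several cycles without new item IDs."""
    known = set(current_ids())
    idle = 0
    for _ in range(max_cycles):
        scroll_step()
        wait()
        ids = set(current_ids())
        if ids - known:
            known |= ids
            idle = 0  # progress was made, so reset the idle counter
        else:
            idle += 1
            if idle >= max_idle_cycles:
                break  # several cycles in a row produced nothing new
    return known
```

Requiring several idle cycles rather than one guards against slow responses being mistaken for the end of the list.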

What to track

A scraper that handles pagination once is useful. A scraper that keeps handling it after the site changes is much more valuable. To get there, track the variables that usually drift over time.

1. The continuation mechanism

Record exactly how the next batch is requested:

Query parameters like page, offset, limit
Cursor fields like cursor, after, endCursor
Headers required for XHR or fetch requests
POST body values used by load more or GraphQL pagination

This is the first thing to check when a scraper suddenly plateaus early.

2. Batch size and expected page length

Track how many items usually arrive per request. If your scraper expects 24 items but starts receiving 12, the site may have changed its layout rules, A/B testing buckets, geolocation behavior, or filtering defaults.

Batch size is also useful for anomaly detection. A sudden drop in average items per page can signal throttling, partial rendering, or request rejection.
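
A simple check along these lines is sketched below. It uses the median rather than the mean so that one odd page does not dominate, and the 0.75 tolerance is an arbitrary assumption rather than a recommended value.

```python
from statistics import median
from typing import Sequence


def batch_size_alert(page_sizes: Sequence[int], expected: int, tolerance: float = 0.75) -> bool:
    """Return True when the typical page is clearly smaller than expected."""
    full_pages = page_sizes[:-1]  # the final page is allowed to be short
    if not full_pages:
        return False  # not enough pages to judge
    return median(full_pages) < expected * tolerance
```

For example, a run that expects 24 items per page and records sizes of 12, 12, 12, and 3 would raise the alert, while 24, 24, 24, and 7 would not.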

3. Stable item identifiers

Every paginated scraper should try to capture a stable per-item identifier such as a product ID, slug, canonical URL, or internal record key. This lets you:

Deduplicate across shifting pages
Detect repeated pages
Measure true coverage
Resume without relying only on page numbers

If no obvious ID exists, derive a fallback key from normalized URLs or a careful hash of critical fields.
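
The fallback could be derived roughly as follows. The tracking parameter list and the `title`, `price`, and `seller` field names are assumptions chosen for illustration; pick the fields that actually distinguish items on your target.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of parameters that do not change which item a URL points to.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref"}


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop tracking params and fragment, sort the query."""
    parts = urlsplit(url)
    query = sorted((k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), ""))


def fallback_key(item: dict, fields=("title", "price", "seller")) -> str:
    """Prefer the normalized URL; otherwise hash a few normalized fields."""
    if item.get("url"):
        return normalize_url(item["url"])
    raw = "|".join(str(item.get(f, "")).strip().lower() for f in fields)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]
```

A hash of fields is only as stable as those fields, so a price change would produce a new key under this sketch. Leave volatile fields out if that matters for your dataset.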

4. Stop conditions

Define and log the reason the scraper stops. Good stop conditions include:

No next page link or no next cursor returned
Response contains zero items
Response repeats a previously seen cursor or page
No new item IDs after several load-more or scroll attempts

Weak stop conditions lead to silent loops, duplicated data, or incomplete collections.

5. Ordering fields

Track whether results are ordered by newest, relevance, price, popularity, or an opaque ranking. Pagination behaves differently depending on sorting. A rapidly changing “most recent” feed is much harder to scrape by offset than a stable alphabetical directory.

If the ordering can be controlled through parameters, store that explicitly. A stable sort often makes extraction and deduplication easier.

6. Request fingerprints and auth dependencies

Some paginated endpoints require:

CSRF tokens
Session cookies
Authorization headers
GraphQL operation names and variables
Signed parameters or short-lived tokens

Track which values are static, which are session-bound, and which expire quickly. This is especially important for infinite scroll scraping where the page appears public but the underlying request includes hidden client state. A JWT decoder, JSON formatter, base64 decode tool, or URL encoder/decoder can speed up this analysis when tokens and payloads are involved.
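
The same inspection can be done in a few lines of standard-library Python. This sketch only decodes values for analysis: it does not verify JWT signatures, and it assumes the cursor, if decodable at all, is base64url-encoded JSON, which many are not.

```python
import base64
import json


def b64url_decode(segment: str) -> bytes:
    """Decode base64url text, restoring any stripped '=' padding."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def peek_cursor(cursor: str):
    """Return the decoded JSON if the cursor is base64url JSON, else None."""
    try:
        return json.loads(b64url_decode(cursor))
    except ValueError:  # covers bad base64, bad UTF-8, and invalid JSON
        return None


def peek_jwt_payload(token: str) -> dict:
    """Decode (not verify) the middle segment of a three-part JWT."""
    _header, payload, _signature = token.split(".")
    return json.loads(b64url_decode(payload))
```

Decoding is for understanding what a token depends on, such as an expiry time or a last-seen ID. The scraper should still send the value exactly as the server issued it.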

7. Rendering dependency

Determine whether the next batch is available from plain HTTP requests or only after JavaScript execution. That single distinction affects stack choice, runtime cost, and failure modes.

If data is available in raw JSON, a lightweight scraper may be enough. If requests only appear after client-side logic runs, Playwright or Puppeteer may be the safer route.

8. Anti-bot responses and throttling signals

Pagination failures are not always parsing failures. Track signs of enforcement such as:

Unexpected redirects
Blank pages after several requests
Reduced page sizes
Captcha or challenge interstitials
HTTP status changes or inconsistent payload schemas

Use polite request rates, caching, and sane retries. Also review robots.txt guidance and the legal context for your use case in “Robots.txt for Web Scraping: What It Means and What It Does Not” and “Web Scraping Legality Guide by Country: What Changes in 2026.”
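
As one possible shape for "sane retries", the sketch below retries a request with exponential backoff. `fetch` is a hypothetical zero-argument callable returning a status code and body, and the set of retryable status codes is an assumption to adjust per target.

```python
import time
from typing import Callable, Tuple

# Assumed set of statuses worth retrying; other responses are returned immediately.
RETRYABLE = {429, 500, 502, 503, 504}


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, str]],  # hypothetical: () -> (status_code, body)
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Retry throttled or failing requests, doubling the delay each time."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = fetch()
        if status not in RETRYABLE:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ... with the defaults
    return status, body  # still failing: hand the last response to the caller
```

The last response is returned rather than raised so the caller can record it as a throttling signal instead of treating it as a parsing failure.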

Cadence and checkpoints

The best way to keep pagination scraping reliable is to treat it as something you inspect on a schedule, not only when it breaks. This is especially useful for recurring datasets such as product catalogs, search results, or job listings.

Monthly checkpoints for stable targets

If the target site rarely changes, a monthly review is often enough. Check:

Whether the continuation parameters are unchanged
Whether the average batch size is still similar
Whether total page counts or item counts changed in plausible ways
Whether sample pages still return the same key fields

This kind of lightweight review catches gradual frontend migrations before a full failure occurs.

Quarterly checkpoints for broader workflow health

On a quarterly cadence, review not just the target site but your scraper design:

Can a browser-based flow now be replaced with direct requests?
Are there brittle selectors you can remove?
Is your resume logic still robust?
Are duplicates increasing due to feed churn?
Do logging and alerts surface pagination failures clearly?

This is also a good time to revisit tool choices and update internal documentation.

Per-run checkpoints for production jobs

For automated pipelines, log a few pagination-specific checkpoints on every run:

First page item count
Last successful cursor or page number
Total unique IDs collected
Number of duplicate IDs encountered
Stop reason
Any non-200 responses, retries, or challenge pages

These logs turn debugging from guesswork into comparison. When a scraper breaks, you can see whether it stopped too early, looped, or started receiving partial data.
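
The checkpoints listed above could be gathered in a small record like the one sketched here. The class and field names are illustrative only, and the output is a JSON line you could append to whatever log store you already use.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional, Set


@dataclass
class RunLog:
    """Pagination checkpoints for a single scraper run (illustrative names)."""
    first_page_count: Optional[int] = None
    last_position: Optional[str] = None  # last successful cursor or page number
    unique_ids: int = 0
    duplicate_ids: int = 0
    stop_reason: Optional[str] = None
    incidents: List[str] = field(default_factory=list)  # non-200s, retries, challenges
    _seen: Set[str] = field(default_factory=set, repr=False)

    def record_page(self, position: str, ids: List[str]) -> None:
        """Update the counters after one successfully fetched page or batch."""
        if self.first_page_count is None:
            self.first_page_count = len(ids)
        for item_id in ids:
            if item_id in self._seen:
                self.duplicate_ids += 1
            else:
                self._seen.add(item_id)
                self.unique_ids += 1
        self.last_position = position

    def to_json(self) -> str:
        data = asdict(self)
        data.pop("_seen")  # internal bookkeeping, not part of the log line
        return json.dumps(data, sort_keys=True)
```

Comparing two of these lines from a healthy run and a failing run usually shows quickly whether the job stopped early, looped, or received partial data.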

A simple checkpoint template

For each target, keep a small record with:

Pagination type: offset, cursor, load more, infinite scroll
Entry URL: the listing page you start from
Request pattern: endpoint, method, params, headers, body
Unique key: item ID or normalized URL
Stop rule: no cursor, empty page, repeated IDs, button disabled
Known risks: unstable sorting, short-lived token, regional results
Last reviewed: date of manual validation

This is simple, but it makes recurring maintenance far easier.

How to interpret changes

When a pagination scraper starts underperforming, the visible symptom is often “fewer rows than usual.” The useful question is why. Different patterns point to different root causes.

If page counts drop suddenly

A sharp drop may mean:

The site changed filters or default sorting
The endpoint now requires an extra parameter
You are being throttled or served partial content
The scraper is failing after the first continuation step

Compare the first request and the second request side by side. Pagination bugs often appear not at the start, but at the handoff to the next batch.

If duplicates increase

Growing duplicates usually indicate one of three things:

Offset-based scraping on a changing feed
A broken cursor loop reusing the same continuation token
Scroll or load-more automation that is not waiting for new content before repeating

The fix is usually to rely more heavily on unique IDs, stronger stop conditions, and explicit checks for repeated cursors or repeated last-item IDs.

If the DOM changed but the data did not

This is a common and often fixable case. If your selectors broke after a redesign, inspect the network calls before rewriting the whole scraper. The rendered markup may have changed completely while the backing JSON endpoint remained mostly the same.

If the request still works but the schema changed

When item fields move or are renamed, pagination may still succeed while downstream parsing fails. Separate your pagination logic from your extraction logic when possible. That way, you can verify that the scraper is still discovering all pages even if individual field mapping needs an update.
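
A minimal sketch of that separation follows. The payload keys `items` and `next_cursor`, and the renamed `title`/`name` field, are hypothetical and stand in for whatever the target actually returns.

```python
from typing import Callable, Dict, Iterator, List, Optional


def iter_payloads(
    fetch: Callable[[Optional[str]], dict],  # hypothetical: cursor -> raw JSON payload
    max_pages: int = 10_000,
) -> Iterator[dict]:
    """Pagination only: yield raw payloads without looking at item fields."""
    cursor: Optional[str] = None
    for _ in range(max_pages):
        payload = fetch(cursor)
        yield payload
        cursor = payload.get("next_cursor")
        if not cursor:
            break


def extract_items(payload: dict) -> List[Dict[str, object]]:
    """Extraction only: map raw records to output fields."""
    items = []
    for raw in payload.get("items", []):
        # Tolerate an assumed rename from "title" to "name" without touching pagination.
        items.append({"id": raw["id"], "title": raw.get("title") or raw.get("name")})
    return items
```

With this split, counting the payloads from `iter_payloads` confirms that page discovery still works even while `extract_items` is being repaired after a schema change.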

If browser automation became flaky

Flaky scroll timing, lazy rendering, and hydration issues are signs to check whether you can shift from UI-driven pagination to direct requests. Browser automation is powerful, but it should not be the default if a stable endpoint exists underneath.

When to revisit

Use this article as a recurring checklist whenever a paginated target is important enough to monitor over time. In practice, revisit your pagination strategy in five situations:

After a frontend redesign: even small layout changes can hide a new request flow.
When total collected records move outside the normal range: especially if the change is abrupt.
When duplicate rates rise: often a sign of offset drift or repeated cursors.
When anti-bot behavior appears: challenge pages and partial payloads can look like pagination bugs.
On a monthly or quarterly review cadence: even when nothing seems broken.

A practical action plan looks like this:

Classify the target as offset, cursor, load more, or infinite scroll.
Identify the real data source in the network panel.
Store stable item IDs and explicit stop conditions.
Log the last successful page or cursor on every run.
Compare expected versus actual unique item counts.
Schedule a manual validation check on a recurring basis.

If you do only one thing differently, do this: stop treating pagination as just a loop. Treat it as a moving interface with observable signals. Once you log continuation tokens, page lengths, unique IDs, stop reasons, and enforcement signals, pagination becomes much easier to maintain.

That habit is what keeps a one-off scraper from turning into a recurring maintenance burden. And because pagination patterns change as sites move between server rendering, client-side apps, GraphQL, and hybrid architectures, this is one of the few scraping topics worth revisiting regularly.

Webscraper.app Editorial
