Crawlee Cheatsheet
3.10.0 · 2026-03-19 · apify/crawlee

⚡ Entry Point & Positioning >>>

Crawlee is a web scraping and browser automation library for Node.js. It unifies HTTP crawling and real-browser crawling behind one API, with anti-blocking evasion, automatic request queuing, and data storage working out of the box.
Crawlee covers the full crawling workflow: crawl links → scrape content → store data, so you can build reliable scrapers quickly.

Core capabilities

  • HTTP crawling : zero-config HTTP/2, browser-like request headers, TLS fingerprint replication
  • Browser crawling : Playwright / Puppeteer behind one switchable interface
  • Anti-blocking : human-like fingerprints generated by default, automatic proxy and session rotation
  • Persistent queue : the URL queue is persisted automatically; breadth-first / depth-first crawling
  • Autoscaling : concurrency adjusts automatically to available system resources
  • Storage abstraction : pluggable data and file storage, local or cloud

Quick start

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
  async requestHandler({ request, page, enqueueLinks, pushData, log }) {
    const title = await page.title();
    log.info(`Title of ${request.loadedUrl} is '${title}'`);
    await pushData({ title, url: request.loadedUrl });
    await enqueueLinks();
  },
});

await crawler.run(['https://crawlee.dev']);
const data = await Dataset.getData();
console.table(data.items);

🏗️ Repository Skeleton / Module Responsibilities >>>

Crawlee divides a crawler into a Fetcher layer (fetching), a Processor layer (processing), and a Storage layer (persistence); understanding this split is the prerequisite for writing good crawlers.

Four core modules

| Module    | Responsibility                       | Representative classes                 |
|-----------|--------------------------------------|----------------------------------------|
| Fetcher   | Low-level HTTP/browser requests      | HttpFetcher, PlaywrightFetcher         |
| Crawler   | Task scheduling, queue, autoscaling  | CheerioCrawler, PlaywrightCrawler      |
| Processor | Page parsing, data extraction        | requestHandler callback                |
| Storage   | Data and file persistence            | Dataset, KeyValueStore, RequestQueue   |

Crawler class decision tree

Static page / JSON API
    │
    ├─ simple HTML parsing ──→ CheerioCrawler (fastest)
    │
    └─ needs JS rendering
            │
            ├─ lightweight rendering ──→ JSDOMCrawler
            │
            ├─ Playwright available ──→ PlaywrightCrawler
            │
            └─ Puppeteer preferred ──→ PuppeteerCrawler

🚀 Getting-Started Workflow >>>

The most common way to start: single-page scraping → link crawling → data storage. These three steps cover most scenarios.

Step 1: scrape a single page + store the data

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
  async requestHandler({ $, request, pushData }) {
    const title = $('h1').text();
    await pushData({ title, url: request.url });
  },
});

await crawler.run(['https://example.com']);
await crawler.exportData('./result.csv');

Step 2: automatic link crawling + crawl limits

const crawler = new CheerioCrawler({
  async requestHandler({ $, pushData, enqueueLinks }) {
    await pushData({ title: $('h1').text() });
    await enqueueLinks(); // follow discovered same-origin links
  },
  maxRequestsPerCrawl: 50,
});

await crawler.run(['https://example.com']);

Step 3: one unified interface across crawler classes

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  async requestHandler({ page, pushData, enqueueLinks }) {
    await enqueueLinks({ globs: ['**/products/**'] });
    const price = await page.locator('.price').textContent();
    await pushData({ price });
  },
  headless: true,
});

await crawler.run(['https://shop.example.com']);

📋 High-Frequency Recipes >>>

Ready-to-adapt templates for everyday high-frequency scenarios.

R1: scrape a paginated list

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  async requestHandler({ page, enqueueLinks, pushData }) {
    const items = await page.locator('.product-item').all();
    for (const item of items) {
      await pushData({
        title: await item.locator('h3').textContent(),
        price: await item.locator('.price').textContent(),
      });
    }
    await enqueueLinks({ selector: '.next-page' });
  },
});

await crawler.run(['https://shop.example.com/category']);

R2: scrape protected content after login

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  useSessionPool: true,
  persistCookiesPerSession: true, // keep the login cookies across requests
  async requestHandler({ page, request, enqueueLinks }) {
    if (request.label === 'LOGIN') {
      await page.fill('input[name="username"]', 'user');
      await page.fill('input[name="password"]', 'pass');
      await page.click('button[type="submit"]');
      await page.waitForSelector('.user-profile');
      await enqueueLinks({ globs: ['**/dashboard/**'] });
    }
  },
});

await crawler.run([{ url: 'https://app.example.com/login', label: 'LOGIN' }]);

R3: crawl a JSON API

import { HttpCrawler } from 'crawlee';

const crawler = new HttpCrawler({
  async requestHandler({ json, pushData }) {
    // `json` is the parsed body when the response is application/json
    await pushData(json.results);
  },
});
});

await crawler.run(['https://api.example.com/data?page=1']);

R4: download files (PDF / PNG)

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  async requestHandler({ page }) {
    const downloadPromise = page.waitForEvent('download');
    await page.locator('button.download').click();
    const download = await downloadPromise;
    await download.saveAs('./downloads/' + download.suggestedFilename());
  },
});

await crawler.run(['https://example.com/report']);

R5: capture screenshots

import { PlaywrightCrawler, KeyValueStore } from 'crawlee';

const crawler = new PlaywrightCrawler({
  async requestHandler({ page, request }) {
    // the crawler has already navigated; store binary data in the key-value store
    const screenshot = await page.screenshot({ fullPage: true });
    await KeyValueStore.setValue(`screenshot-${request.id}`, screenshot, {
      contentType: 'image/png',
    });
  },
});

await crawler.run(['https://example.com']);

R6: infinite-scroll pages

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  async requestHandler({ page, pushData }) {
    // scroll until the page height stops growing
    let prevHeight = 0;
    while (true) {
      const currHeight = await page.evaluate(() => document.body.scrollHeight);
      if (currHeight === prevHeight) break;
      prevHeight = currHeight;
      await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
      await page.waitForTimeout(1000);
    }
    // collect items once, after all content has loaded, to avoid duplicates
    for (const item of await page.locator('.item').all()) {
      await pushData({ text: await item.textContent() });
    }
  },
});

await crawler.run(['https://example.com/feed']);

🪝 Hooks / Lifecycle >>>

Hooks control behavior before and after each request and page navigation; they are the right place to centrally inject anti-blocking logic, login state, or proxies.

Request-level hooks

const crawler = new PlaywrightCrawler({
  preNavigationHooks: [
    async ({ page, session }, gotoOptions) => {
      await page.setExtraHTTPHeaders({ 'X-Custom-Header': 'value' });
    },
  ],
  postNavigationHooks: [
    async ({ page, session }) => {
      await page.waitForLoadState('networkidle');
    },
  ],
});

Route-level hooks

const crawler = new PlaywrightCrawler({
  requestHandler: async ({ request, page, enqueueLinks }) => {
    if (request.userData.label === 'detail') {
      await enqueueLinks({ userData: { label: 'detail' } });
    }
  },
});

🔐 Proxies & Session Management >>>

Proxy rotation and session management are essential for crawling at scale; Crawlee ships a built-in SessionPool and ProxyConfiguration.

Proxy configuration

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
  ],
});

const crawler = new PlaywrightCrawler({
  proxyConfiguration,
  async requestHandler({ page }) {
    const title = await page.title();
    console.log(title);
  },
});

Session management

const crawler = new PlaywrightCrawler({
  useSessionPool: true,
  sessionPoolOptions: { maxPoolSize: 10 },
  async requestHandler({ page, session }) {
    // setCookies takes the cookie array plus the URL they apply to
    session.setCookies([{ name: 'session_id', value: 'abc123' }], 'https://example.com');
  },
});

Session tips

// inside a requestHandler: retire a blocked session so the pool rotates in a fresh one
if (response?.status() === 403) session?.retire();

await crawler.run(['https://example.com/page1', 'https://example.com/page2']);

Crawler class quick reference

CheerioCrawler — HTTP crawler (fastest)

import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
  requestHandler: async ({ $, request, pushData, enqueueLinks }) => {
    await pushData({ content: $('body').text() });
    await enqueueLinks();
  },
  maxConcurrency: 50,
});

await crawler.run(['https://example.com']);

JSDOMCrawler — HTTP crawler with DOM API support

import { JSDOMCrawler } from 'crawlee';

const crawler = new JSDOMCrawler({
  async requestHandler({ window, request, pushData }) {
    const { document } = window;
    const title = document.querySelector('h1').textContent;
    await pushData({ title, url: request.url });
  },
});

PlaywrightCrawler — Playwright browser crawler

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
  headless: true,
  requestHandler: async ({ page, pushData }) => {
    await pushData({ title: await page.title() });
  },
});

PuppeteerCrawler — Puppeteer browser crawler

import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
  requestHandler: async ({ page, pushData }) => {
    await pushData({ title: await page.title() });
  },
});

Core configuration options

Common options

| Option                    | Type   | Default   | Description                            |
|---------------------------|--------|-----------|----------------------------------------|
| maxRequestRetries         | number | 3         | Max retries per request                |
| maxConcurrency            | number | 200       | Upper bound on concurrent requests     |
| maxRequestsPerCrawl       | number | undefined | Max requests per run                   |
| requestHandlerTimeoutSecs | number | 60        | Handler timeout                        |
| navigationTimeoutSecs     | number | 60        | Navigation timeout (browser crawlers)  |

Browser-specific options

| Option              | Type                 | Description                              |
|---------------------|----------------------|------------------------------------------|
| headless            | boolean              | Run the browser headless                 |
| launchContext       | BrowserLaunchContext | Launch options, e.g. a custom userAgent  |
| proxyConfiguration  | ProxyConfiguration   | Proxy settings                           |
| sessionPoolOptions  | SessionPoolOptions   | Session pool settings (e.g. maxPoolSize) |
| preNavigationHooks  | Hook[]               | Run before navigation                    |
| postNavigationHooks | Hook[]               | Run after navigation                     |

Example: full configuration

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const crawler = new PlaywrightCrawler({
  maxRequestRetries: 5,
  maxConcurrency: 10,
  maxRequestsPerCrawl: 100,
  navigationTimeoutSecs: 60,
  headless: false,
  launchContext: {
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  },
  proxyConfiguration: new ProxyConfiguration({
    proxyUrls: ['http://proxy.example.com:8080'],
  }),
  useSessionPool: true,
  sessionPoolOptions: { maxPoolSize: 5 },
  preNavigationHooks: [
    async ({ page }, gotoOptions) => {
      await page.setViewportSize({ width: 1920, height: 1080 });
    },
  ],
  async requestHandler({ page, pushData, enqueueLinks }) {
    await enqueueLinks({ globs: ['**/products/**'] });
    await pushData({ title: await page.title() });
  },
});

Data storage

Dataset operations

import { Dataset } from 'crawlee';

await Dataset.pushData({ title: 'Example', url: 'https://example.com' });

const { items } = await Dataset.getData();
console.log(items);

await crawler.exportData('./result.csv');
await crawler.exportData('./result.json');

KeyValueStore operations

import { KeyValueStore } from 'crawlee';

await KeyValueStore.setValue('screenshot', screenshotBuffer, { contentType: 'image/png' });

const value = await KeyValueStore.getValue('key');

RequestQueue operations

import { RequestQueue } from 'crawlee';

const queue = await RequestQueue.open();
await queue.addRequest({ url: 'https://example.com', userData: { label: 'detail' } });

// crawlers normally drive the queue; manual consumption looks like this:
const request = await queue.fetchNextRequest();
await queue.markRequestHandled(request);

Routing & request dispatch

Routing by URL pattern

const crawler = new PlaywrightCrawler({
  maxRequestRetries: 3,
  async requestHandler({ request, page, enqueueLinks }) {
    if (request.url.includes('/products/')) {
      await enqueueLinks({ userData: { label: 'product' } });
    } else if (request.url.includes('/categories/')) {
      await enqueueLinks({ userData: { label: 'category' } });
    }
  },
});

Custom router

import { createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();

router.addHandler('product', async ({ page, pushData }) => {
  await pushData({
    title: await page.locator('h1').textContent(),
    price: await page.locator('.price').textContent(),
  });
});

router.addHandler('category', async ({ enqueueLinks }) => {
  await enqueueLinks({ label: 'product' });
});

// pass it to the crawler: new PlaywrightCrawler({ requestHandler: router })
export default router;

Error handling

Error handlers

import { PlaywrightCrawler, Dataset } from 'crawlee';

const crawler = new PlaywrightCrawler({
  // called after all retries for a request are exhausted
  failedRequestHandler: async ({ request }, error) => {
    console.error(`Request ${request.url} failed: ${error.message}`);
    await Dataset.pushData({ url: request.url, error: error.message });
  },
});

Retry configuration

const crawler = new PlaywrightCrawler({
  maxRequestRetries: 5,
  // errorHandler runs before each retry; use it to react to specific failures
  errorHandler: async ({ session }, error) => {
    if (/rate.limit/i.test(error.message)) session?.retire();
  },
});

CLI commands

Scaffold a project

npx crawlee create my-crawler
cd my-crawler
npm start

Install dependencies

npm install crawlee playwright

Pre-release builds

npm install crawlee@next

Anti-blocking configuration

Fingerprint management

const crawler = new PlaywrightCrawler({
  browserPoolOptions: {
    useFingerprints: true, // on by default
    fingerprintOptions: {
      fingerprintGeneratorOptions: {
        browsers: ['chrome'],
        operatingSystems: ['windows'],
      },
    },
  },
});

Browser channels

const crawler = new PlaywrightCrawler({
  launchContext: {
    // use an installed Chrome build instead of the bundled Chromium
    launchOptions: { channel: 'chrome' },
  },
});

Apify platform integration

import { Actor } from 'apify';

await Actor.init();

const crawler = new PlaywrightCrawler({
  async requestHandler({ page, pushData }) {
    await pushData({ title: await page.title() });
  },
});

await crawler.run([{ url: 'https://example.com' }]);

await Actor.exit();