Building an SEO Spider with Puppeteer

During my time at Double Dot, our digital agency, I had the opportunity to closely observe my colleagues as they navigated through the world of SEO tools. I watched them use industry-leading tools like Screaming Frog for crawling, extracting metadata, detecting broken links, and conducting detailed audits on websites.

It was fascinating to see how these tools could provide an in-depth look into a website’s structure and performance. However, as someone who loves to experiment and build things from scratch, I found myself thinking, “What if I could build something similar?”

Inspired by this curiosity and my interest in automating SEO tasks, I decided to try my hand at creating my own SEO Spider using Puppeteer. The goal was to replicate some of the core functionalities of tools like Screaming Frog, focusing on tasks like crawling websites, extracting SEO metadata, and checking for issues such as broken links and redirects.

In this guide, I’ll walk you through how I built this simple SEO Spider with Puppeteer. By the end of this project, you’ll have a tool that can:

  • Extract Title & Meta Descriptions
  • Check Canonical & Open Graph Tags
  • Analyze H1-H6 Structure
  • Detect Broken Links & Redirects

Let’s dive in and see how you can build your own SEO Spider from the ground up!

1. Installing Puppeteer

Before we start writing the script, we need to install Puppeteer, which is a Node.js library that allows us to control headless Chrome or Chromium browsers. Puppeteer can simulate user actions on a webpage, making it perfect for SEO tasks like crawling and extracting data.

To begin, install Puppeteer using npm:

npm install puppeteer

After installing Puppeteer, create a new file named index.js. This is where we’ll write our script to interact with the web pages and extract the SEO data.
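
To make sure the installation works before we add any SEO logic, index.js can start as a minimal sanity check that launches a headless browser, opens a page, and prints its title (the URL here is just a placeholder — any page will do):

const puppeteer = require("puppeteer");

(async () => {
    // Launch a headless browser and open a new tab
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Navigate to a page and print its <title> as a quick sanity check
    await page.goto("https://example.com", { waitUntil: "domcontentloaded" });
    console.log("Page title:", await page.title());

    await browser.close();
})();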

To run the script, we just need to type:

node index.js

2. Extracting SEO Metadata

A core part of SEO auditing is collecting metadata like title tags, meta descriptions, canonical URLs, and Open Graph tags. These are essential for understanding how search engines perceive a page.

Here’s how we can extract this metadata using Puppeteer. This script navigates to a webpage, collects key SEO data, and then logs it to the console:

const puppeteer = require("puppeteer");

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto("https://wearedoubledot.com", {
        waitUntil: "domcontentloaded",
    });

    const seoData = await page.evaluate(() => {
        return {
            canonical:
                document.querySelector('link[rel="canonical"]')?.href ||
                "No Canonical",
            title: document.title || "No Title",
            metaDescription:
                document.querySelector('meta[name="description"]')?.content ||
                "No Meta Description",
            openGraphTitle:
                document.querySelector('meta[property="og:title"]')?.content ||
                "No OG Title",
            openGraphImage:
                document.querySelector('meta[property="og:image"]')?.content ||
                "No OG Image",
        };
    });

    console.log("Extracted SEO Data:", seoData);
    await browser.close();
})();

In the code snippet above:

  • document.querySelector('link[rel="canonical"]')?.href: Grabs the canonical URL.
  • document.title: Extracts the page title.
  • document.querySelector('meta[name="description"]')?.content: Gets the meta description.
  • document.querySelector('meta[property="og:title"]')?.content: Retrieves the Open Graph title.
  • document.querySelector('meta[property="og:image"]')?.content: Gets the Open Graph image URL.

With this script, we can extract the essential metadata from any webpage in a single pass.

3. Crawling Multiple Pages and Analyzing the H1-H6 Structure

While crawling a single page is useful, SEO audits typically require crawling many pages of a website to check for consistency across the entire site. Ideally, we would fetch all internal links dynamically instead of manually defining a list of URLs.

In real-world scenarios, hardcoding URLs is not practical: we should start from a landing page, extract all internal links, and recursively crawl them (we’ll cover that approach below). For the sake of simplicity in this example, however, we will use a static list of URLs. Here’s how we can expand our script to handle multiple pages:

const puppeteer = require("puppeteer");

const urls = [
    "https://wearedoubledot.com",
    "https://wearedoubledot.com/stories",
    "https://wearedoubledot.com/who-we-are",
];

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    for (let url of urls) {
        await page.goto(url, { waitUntil: "domcontentloaded" });
        const seoData = await page.evaluate(() => {
            let headings = {};
            document.querySelectorAll("h1, h2, h3, h4, h5, h6").forEach((h) => {
                if (!headings[h.tagName]) {
                    headings[h.tagName] = [];
                }
                headings[h.tagName].push(h.innerText.trim());
            });

            const sortedHeadings = Object.keys(headings)
                .sort((a, b) => {
                    const order = { H1: 1, H2: 2, H3: 3, H4: 4, H5: 5, H6: 6 };
                    return order[a] - order[b];
                })
                .reduce((obj, key) => {
                    obj[key] = headings[key];
                    return obj;
                }, {});
            headings = sortedHeadings;

            return {
                canonical:
                    document.querySelector('link[rel="canonical"]')?.href ||
                    "No Canonical",
                title: document.title || "No Title",
                metaDescription:
                    document.querySelector('meta[name="description"]')
                        ?.content || "No Meta Description",
                openGraphTitle:
                    document.querySelector('meta[property="og:title"]')
                        ?.content || "No OG Title",
                openGraphImage:
                    document.querySelector('meta[property="og:image"]')
                        ?.content || "No OG Image",
                headings,
            };
        });

        console.log(`SEO Data for ${url}:`, seoData);
    }

    await browser.close();
})();

Explanation

  • We define an array urls that holds the URLs of the pages we want to crawl.
  • The script then iterates over each URL, opens the page, extracts the SEO metadata, and logs it.

We also added a step that extracts all the headings from each page, groups them by tag, and sorts them into hierarchical order (H1, H2, H3, etc.).
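
As a small, optional extension (not part of the script above), we could also flag pages whose H1 usage looks off, since a missing or duplicated H1 is a common audit finding. A minimal sketch, assuming it runs right after the page.evaluate() call inside the loop:

// Hypothetical extra check, placed right after seoData is returned in the loop
const h1s = seoData.headings.H1 || [];

if (h1s.length === 0) {
    console.warn(`Warning: ${url} has no H1 tag`);
} else if (h1s.length > 1) {
    console.warn(`Warning: ${url} has ${h1s.length} H1 tags`);
}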

Dynamically Extracting Internal Links

If we wanted to dynamically generate the list of URLs instead of using a static array, we could extract all internal links from the landing page and recursively crawl them.
To achieve this, we would define a function crawlPage(), which would:

  • Find all internal links using:
    const links = await page.evaluate(() => {
        return Array.from(document.querySelectorAll("a"))
            .map((link) => link.href)
            .filter((href) => href.startsWith(window.location.origin));
    });
  • Loop through the extracted links and recursively call crawlPage(page, currentUrl) for each one.

This approach ensures that our SEO Spider dynamically discovers and audits all pages within a website, making it more practical for real-world SEO analysis.
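
Putting those pieces together, a recursive crawler could look roughly like the sketch below. This is a simplified illustration rather than a production implementation: it tracks visited URLs in a Set to avoid infinite loops and ignores hash fragments, but a real crawler would also want depth limits, error handling, and robots.txt support.

const puppeteer = require("puppeteer");

const visited = new Set();

async function crawlPage(page, url) {
    // Normalize the URL so anchors like /page#section don't count as new pages
    const normalizedUrl = url.split("#")[0];
    if (visited.has(normalizedUrl)) return;
    visited.add(normalizedUrl);

    await page.goto(normalizedUrl, { waitUntil: "domcontentloaded" });
    console.log(`Crawled: ${normalizedUrl}`);

    // Collect internal links only (same origin as the current page)
    const links = await page.evaluate(() => {
        return Array.from(document.querySelectorAll("a"))
            .map((link) => link.href)
            .filter((href) => href.startsWith(window.location.origin));
    });

    // Recursively crawl every link we haven't visited yet
    for (const link of links) {
        await crawlPage(page, link);
    }
}

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await crawlPage(page, "https://wearedoubledot.com");
    await browser.close();
})();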

4. Detecting Broken Links & Redirects

Broken links and unnecessary redirects can significantly impact a website’s SEO. Let’s extend our spider to check for broken links and redirects.

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto("https://wearedoubledot.com", {
        waitUntil: "domcontentloaded",
    });

    const links = await page.evaluate(() => {
        return Array.from(document.querySelectorAll("a")).map(
            (link) => link.href
        );
    });

    console.log(`Checking ${links.length} links...`);

    for (let link of links) {
        if (link.includes("mailto:") || link.includes("tel:")) {
            console.log(`Skipping link: ${link}`);
            continue;
        }
        try {
            const response = await page.goto(link, {
                waitUntil: "domcontentloaded",
            });
            console.log(`${link} -> Status: ${response.status()}`);
        } catch (error) {
            console.log(`${link} -> Error: ${error.message}`);
        }
    }

    await browser.close();
})();

Explanation

  • We use document.querySelectorAll('a') to grab all the <a> elements on the page and extract their href attribute (the URL).
  • We then visit each link, check its HTTP status, and log the result.
  • If a link returns an error status such as a 404, or if navigation fails entirely, that will be captured and logged.

This is a great way to identify broken links and redirects that could affect the user experience and SEO.
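
One refinement worth mentioning: page.goto() follows redirects automatically, so the status we log is the final status, and a redirect on its own is not flagged. Puppeteer exposes the redirect chain on the request behind each response, so the status check inside the try block could be extended along these lines (a sketch, not the exact version used above):

const response = await page.goto(link, { waitUntil: "domcontentloaded" });

if (!response) {
    // Can happen when navigation is a no-op, e.g. an in-page anchor
    console.log(`${link} -> No response`);
} else if (response.request().redirectChain().length > 0) {
    // The link was redirected at least once before reaching its final URL
    console.log(`${link} -> Redirected to ${response.url()} (status ${response.status()})`);
} else if (response.status() >= 400) {
    console.log(`${link} -> Broken (status ${response.status()})`);
} else {
    console.log(`${link} -> OK (status ${response.status()})`);
}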

5. Combining Everything: The Full SEO Spider

Now, let’s combine everything we’ve built into a single powerful SEO Spider that crawls multiple pages, extracts metadata, and detects broken links & redirects.

const puppeteer = require("puppeteer");

const urls = [
    "https://wearedoubledot.com",
    "https://wearedoubledot.com/stories",
    "https://wearedoubledot.com/who-we-are",
];

(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    for (let url of urls) {
        await page.goto(url, { waitUntil: "domcontentloaded" });

        const seoData = await page.evaluate(() => {
            let headings = {};
            document.querySelectorAll("h1, h2, h3, h4, h5, h6").forEach((h) => {
                if (!headings[h.tagName]) {
                    headings[h.tagName] = [];
                }
                headings[h.tagName].push(h.innerText.trim());
            });

            const sortedHeadings = Object.keys(headings)
                .sort((a, b) => {
                    const order = { H1: 1, H2: 2, H3: 3, H4: 4, H5: 5, H6: 6 };
                    return order[a] - order[b];
                })
                .reduce((obj, key) => {
                    obj[key] = headings[key];
                    return obj;
                }, {});
            headings = sortedHeadings;

            return {
                canonical:
                    document.querySelector('link[rel="canonical"]')?.href ||
                    "No Canonical",
                title: document.title || "No Title",
                metaDescription:
                    document.querySelector('meta[name="description"]')
                        ?.content || "No Meta Description",
                openGraphTitle:
                    document.querySelector('meta[property="og:title"]')
                        ?.content || "No OG Title",
                openGraphImage:
                    document.querySelector('meta[property="og:image"]')
                        ?.content || "No OG Image",
                headings,
            };
        });

        console.log(`SEO Data for ${url}:`, seoData);

        const links = await page.evaluate(() => {
            return Array.from(document.querySelectorAll("a")).map(
                (link) => link.href
            );
        });

        console.log(`Checking ${links.length} links on ${url}...`);

        for (let link of links) {
            if (link.includes("mailto:") || link.includes("tel:")) {
                console.log(`Skipping link: ${link}`);
                continue;
            }

            try {
                const response = await page.goto(link, {
                    waitUntil: "domcontentloaded",
                });
                console.log(`${link} -> Status: ${response.status()}`);
            } catch (error) {
                console.log(`${link} -> Error: ${error.message}`);
            }
        }
    }

    await browser.close();
})();

This script combines all the previous steps into one: it crawls multiple URLs, extracts metadata and headings, and checks the status of every link found on those pages.

Conclusion

Building our own SEO Spider with Puppeteer allowed us to replicate some of the powerful features found in tools like Screaming Frog. This script provides us with a simple way to:

  • Extract Title & Meta Descriptions
  • Check Canonical URLs
  • Check Open Graph Tags
  • Analyze H1-H6 Structure
  • Detect Broken Links & Redirects

By automating these tasks, we can improve our SEO audits, identify issues more efficiently, and ensure that websites are optimized for search engines.

This is just the beginning. You can extend this spider by adding more features, like exporting the data to CSV/JSON, implementing a sitemap crawler, or even running it on a schedule for automated audits.
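
For instance, exporting the results to JSON takes only a few extra lines with Node’s built-in fs module (seo-report.json and results are just placeholder names):

const fs = require("fs");

const results = [];

// Inside the crawl loop, collect each page's data instead of only logging it:
// results.push({ url, ...seoData });

// After the loop, before browser.close(), write everything to disk:
fs.writeFileSync("seo-report.json", JSON.stringify(results, null, 2));
console.log(`Saved ${results.length} page reports to seo-report.json`);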

Let’s keep building!
