Building an Enterprise-Level Smart Scraping & Data Enrichment System

In today’s data-driven world, collecting reliable and structured information from the web is more important than ever. In this blog post, I'll walk you through the process of building a robust, enterprise-level system for web scraping and data enrichment using modern tools and APIs. This project integrated multiple components: the Google Places API for generating an initial list of targets, Puppeteer for web scraping, LangChain with OpenAI for intelligent data extraction, recursive internal link processing, and Hunter.io for contact data enrichment. Finally, we built a script to convert our JSON output into CSV files for easy reporting and analysis.

The Challenge

Our goal was to create a nested smart scraping system that could:

Generate an initial list of target companies using the Google Places API.

Scrape each target’s website for a wealth of information (including hyperlinks, page titles, meta descriptions, images, and text).

Leverage AI via LangChain and OpenAI to extract structured company details—such as company name, contact names (with first and last names), phone numbers, emails, and address fields—from the scraped content.

Recursively traverse internal links found on each website to gather more details and update our company records.

Enrich contact information using Hunter.io, specifically to retrieve missing email addresses.

Convert the final JSON output into CSV to support downstream analysis and reporting.

This multi-phase system had to be both robust and flexible—capable of handling inconsistencies in web page structures and adapting to changes in the data being scraped.

System Components and Architecture

1. Generating the Initial List with Google Places API

We started by building a script (createMainScrapeList.js) to generate our main list of companies. By sending a query (e.g., “nurseries in san bernardino county”) to the Google Places Text Search API, we retrieved an initial set of results. To handle pagination (the API typically returns only 20 results per page), we implemented logic that follows the next_page_token returned with each response, looping through available pages until we reached a specified limit (in our example, 500 results) or until there were no more pages available.
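
Here's a minimal sketch of that pagination loop. It assumes Node 18+ (for the global fetch) and an API key in the GOOGLE_API_KEY environment variable; note that a freshly issued next_page_token needs a short delay before Google accepts it.

// Sketch: paginate through Google Places Text Search results.
const GOOGLE_API_KEY = process.env.GOOGLE_API_KEY;
const MAX_RESULTS = 500;

async function fetchPlaces(query) {
  const results = [];
  let pageToken = null;

  while (results.length < MAX_RESULTS) {
    const url = new URL('https://maps.googleapis.com/maps/api/place/textsearch/json');
    url.searchParams.set('query', query);
    url.searchParams.set('key', GOOGLE_API_KEY);
    if (pageToken) url.searchParams.set('pagetoken', pageToken);

    const data = await (await fetch(url)).json();
    results.push(...(data.results || []));

    pageToken = data.next_page_token;
    if (!pageToken) break; // no more pages available

    // A next_page_token takes a moment to become valid on Google's side.
    await new Promise((resolve) => setTimeout(resolve, 2000));
  }

  return results.slice(0, MAX_RESULTS);
}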

For each place, we then used the Google Place Details API to extract key details—most notably the website URL. To ensure data quality, we filtered out unwanted companies using an exclusion list (e.g., skipping companies whose names contain “home depot”).
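
The details lookup and exclusion filter can be sketched like this; the fields list and the isExcluded() helper are illustrative rather than a definitive implementation:

// Sketch: resolve each place to its website and filter out excluded names.
const EXCLUDED_NAMES = ['home depot']; // extend with other names to skip

async function getPlaceDetails(placeId) {
  const url = new URL('https://maps.googleapis.com/maps/api/place/details/json');
  url.searchParams.set('place_id', placeId);
  url.searchParams.set('fields', 'name,website');
  url.searchParams.set('key', process.env.GOOGLE_API_KEY);

  const { result } = await (await fetch(url)).json();
  return result; // e.g. { name: '...', website: 'https://...' }
}

function isExcluded(companyName) {
  const lower = companyName.toLowerCase();
  return EXCLUDED_NAMES.some((term) => lower.includes(term));
}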

2. Web Scraping with Puppeteer

Once our main list was generated, we employed Puppeteer to perform headless browser scraping on each website. This allowed us not only to extract each page's content (title, meta description, body text, image alt text) but also to harvest all of its hyperlinks (see the sketch after this list). These hyperlinks served two purposes:

Initial collection: Internal links found on the main page were merged into a company-specific JSON file.

Recursive crawling: Internal links were revisited and scraped to gather additional company details, ensuring our dataset was as comprehensive as possible.
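
Here's a condensed sketch of the per-page scraper, with retries and error handling omitted for brevity:

// Sketch: scrape a single page with Puppeteer, collecting content and links.
const puppeteer = require('puppeteer');

async function scrapePage(url) {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle2', timeout: 30000 });

    // page.evaluate() runs in the browser, so only DOM APIs are available.
    return await page.evaluate(() => ({
      title: document.title,
      metaDescription:
        document.querySelector('meta[name="description"]')?.content || '',
      bodyText: document.body.innerText,
      imageAlts: [...document.querySelectorAll('img[alt]')].map((img) => img.alt),
      links: [...document.querySelectorAll('a[href]')].map((a) => a.href),
    }));
  } finally {
    await browser.close();
  }
}

In a real run you'd reuse a single browser instance across many pages rather than launching a fresh one per URL, but the shape of the extraction is the same.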

3. Intelligent Data Extraction with LangChain & OpenAI

Using LangChain’s ChatOpenAI integration along with a carefully designed prompt template, we processed the scraped data to extract structured information. The AI was tasked with pulling out:

Company name.

Contact names (with separation of first and last names).

Phone number, email, and address details.

This approach leverages state-of-the-art natural language understanding to handle diverse and unstructured web content.
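
A simplified sketch of the extraction step is below. The import paths and model name assume a recent @langchain/openai release and are illustrative; older LangChain versions expose these classes from different packages. Note the doubled braces, which LangChain's prompt templates require for literal braces.

// Sketch: structured extraction with LangChain's ChatOpenAI integration.
const { ChatOpenAI } = require('@langchain/openai');
const { ChatPromptTemplate } = require('@langchain/core/prompts');

const model = new ChatOpenAI({ model: 'gpt-4o-mini', temperature: 0 });

const prompt = ChatPromptTemplate.fromTemplate(
  `Extract the following fields as JSON from the page content below.
Use null for any field that is not present.
Fields: companyName, contacts (array of {{firstName, lastName}}),
phone, email, street, city, state, zip.

Page content:
{pageContent}`
);

async function extractCompanyDetails(pageContent) {
  const response = await prompt.pipe(model).invoke({ pageContent });
  return JSON.parse(response.content); // production code should validate this
}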

4. Recursive Internal Link Processing

Not every piece of information resides on a website’s homepage. To capture additional details, we implemented a recursive internal link processor (sketched after this list). This module:

Iteratively scans each internal link (ensuring only URLs matching the company’s base domain are considered).

Scrapes each page and passes the data through our AI extraction pipeline.

Merges new information into the main company record and updates the list of internal links.

Continues until all discovered internal links have been processed.
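
A sketch of that crawl loop, reusing the scrapePage() and extractCompanyDetails() helpers from the earlier sketches; the merge step shown here is deliberately naive:

// Sketch: breadth-first crawl of internal links on the company's own domain.
async function crawlCompany(startUrl, companyRecord) {
  const baseHost = new URL(startUrl).hostname;
  const queue = [startUrl];
  const visited = new Set();

  while (queue.length > 0) {
    const url = queue.shift();
    if (visited.has(url)) continue;
    visited.add(url);

    const pageData = await scrapePage(url);
    const details = await extractCompanyDetails(pageData.bodyText);
    mergeCompanyRecord(companyRecord, details);

    for (const link of pageData.links) {
      try {
        // Only follow links that stay on the company's base domain.
        if (new URL(link).hostname === baseHost && !visited.has(link)) {
          queue.push(link);
        }
      } catch {
        // Ignore hrefs that aren't valid URLs.
      }
    }
  }
}

// Naive merge: fill in fields the record doesn't have yet. Real code would
// also deduplicate contacts rather than only filling gaps.
function mergeCompanyRecord(record, details) {
  for (const [key, value] of Object.entries(details)) {
    if (value != null && record[key] == null) record[key] = value;
  }
}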

5. Data Enrichment via Hunter.io

After scraping and processing all available pages, we launched a final data enrichment phase. Using Hunter.io's API, we enriched our contact records by attempting to retrieve missing email addresses based on the contact’s first and last name, coupled with the company’s domain. This step ensures that our contact records are as complete as possible.
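
The enrichment itself is one request per contact to Hunter.io's Email Finder endpoint. A minimal sketch, assuming the API key lives in the HUNTER_API_KEY environment variable:

// Sketch: look up a missing email via Hunter.io's Email Finder endpoint.
async function findEmail(domain, firstName, lastName) {
  const url = new URL('https://api.hunter.io/v2/email-finder');
  url.searchParams.set('domain', domain);
  url.searchParams.set('first_name', firstName);
  url.searchParams.set('last_name', lastName);
  url.searchParams.set('api_key', process.env.HUNTER_API_KEY);

  const { data } = await (await fetch(url)).json();
  return data?.email || null; // null when Hunter has no confident match
}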

6. Converting JSON to CSV

In addition to the main scraping pipeline, we built a separate script to convert our JSON output into CSV files. By using an npm package like json2csv, we could dynamically flatten the JSON structure (especially for companies with multiple contacts) and generate CSV files for further analysis.
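
A sketch of the conversion using json2csv's Parser; the field names mirror the record shape assumed in the earlier sketches:

// Sketch: flatten company records into one CSV row per contact.
const fs = require('fs');
const { Parser } = require('json2csv');

function companiesToCsv(companies, outPath) {
  const rows = companies.flatMap((company) =>
    // Companies with no contacts still get one row with empty contact fields.
    (company.contacts?.length ? company.contacts : [{}]).map((contact) => ({
      companyName: company.companyName,
      website: company.website,
      phone: company.phone || '',
      firstName: contact.firstName || '',
      lastName: contact.lastName || '',
      email: contact.email || company.email || '',
    }))
  );

  fs.writeFileSync(outPath, new Parser().parse(rows));
}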

Key Takeaways

Modularity & Flexibility: By dividing the system into distinct modules (API integration, scraping, AI extraction, recursive processing, and data enrichment), we built a robust and maintainable architecture that can adapt to changes in the source data.

Leveraging Modern Tools: Combining Puppeteer with LangChain’s OpenAI integration allowed us to extract structured information from unstructured web pages—a task that is notoriously challenging with traditional scraping methods.

Scalability Considerations: The design anticipates large datasets by handling pagination from the Google Places API and recursively processing internal links. Additionally, post-scraping data enrichment with Hunter.io helps ensure contact details are comprehensive.

Data Pipeline End-to-End: From generating a target list to producing CSV outputs, this pipeline exemplifies an end-to-end solution that can serve various business intelligence and lead generation needs.

Conclusion

This project showcases how modern web scraping can be elevated by integrating multiple APIs and intelligent processing. By combining the Google Places API, Puppeteer, OpenAI via LangChain, recursive data processing, and Hunter.io enrichment, we built a comprehensive, enterprise-level system capable of gathering and structuring vast amounts of company data.

Whether you’re developing a sales lead tool, performing competitive analysis, or simply gathering business intelligence, this architecture provides a strong foundation for building advanced web scraping and data enrichment solutions.