Making the world wide web accessible to all businesses

The Common Crawl Foundation builds an archive of web crawl data that anyone can access and analyze. Billions of pages are added each month. Code 402 Crawls provides user-friendly services to process these archives.

Whether you’re building a vertical search engine, researching historic market trends, generating link-building opportunities, or something else altogether, our services can help you get started immediately without having to maintain your own costly distributed processing infrastructure.

Contact us at hello@code402.com for more information.

Features

With the Crawls Search tool, you can:

  • Access the complete archives underpinning the Common Crawl.
  • Search WARC, WET, and WAT archives by URL or page content.
  • Data mine billions of pages without the hassle of maintaining clusters of computers.
  • Receive results in JSON format, delivered directly to your own S3 bucket.

Examples

A common task with the Common Crawl is identifying websites that use a particular technology. To do this, search for a tag or phrase that is characteristic of that technology, such as a reference to “cdn.shopify.com” for Shopify sites.

You can use the results to compile a list of domains for further analysis, or you can access the full text of the Common Crawl results using the Source information.
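
To make the idea concrete, here is a minimal Python sketch of the matching step itself. It is not the Crawls Search tool or its API. It assumes you already have (URL, HTML) pairs in memory, and the function name `domains_matching` and the sample pages are hypothetical, made up for illustration.

```python
from urllib.parse import urlsplit


def domains_matching(pages, marker):
    """Return the sorted, de-duplicated hostnames of pages whose HTML contains marker.

    pages is an iterable of (url, html) tuples; this shape is an assumption
    for the sketch, not the format of any real search result.
    """
    found = set()
    for url, html in pages:
        if marker in html:
            host = urlsplit(url).hostname
            if host:  # skip URLs without a parseable hostname
                found.add(host)
    return sorted(found)


# Hypothetical sample data, made up for illustration.
sample_pages = [
    ("https://shop.example.com/products/1",
     '<script src="https://cdn.shopify.com/s/a.js"></script>'),
    ("https://shop.example.com/cart",
     '<link href="//cdn.shopify.com/s/b.css">'),
    ("https://blog.example.org/post",
     "<p>No storefront here.</p>"),
]

print(domains_matching(sample_pages, "cdn.shopify.com"))
```

Two pages on the same host collapse into a single entry, which is usually what you want when the goal is a list of domains rather than a list of pages.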

You can also filter the Common Crawl to pages that talk about a specific topic. If you input a list of words or phrases that identify a topic, the search will output any pages containing at least one of those phrases. For example, enter “labradoodle” to find pages talking about that breed of dog.
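
The "at least one of these phrases" rule can be sketched in a few lines of Python. This is only an illustration of the matching logic, assuming case-insensitive substring matching on extracted page text; the service's actual matching rules may differ, and the function name `mentions_topic` and the sample texts are hypothetical.

```python
def mentions_topic(text, phrases):
    """Return True if text contains at least one of the phrases, ignoring case.

    Plain substring matching is an assumption for this sketch, so
    "labradoodle" also matches "Labradoodles".
    """
    lowered = text.casefold()
    return any(phrase.casefold() in lowered for phrase in phrases)


# Hypothetical sample texts, made up for illustration.
sample_texts = {
    "https://dogs.example.com/breeds": "Labradoodles are a popular family dog.",
    "https://cats.example.com/breeds": "The Maine Coon is a large cat breed.",
}

topic_phrases = ["labradoodle", "golden doodle"]
matching_urls = [url for url, text in sample_texts.items()
                 if mentions_topic(text, topic_phrases)]
print(matching_urls)
```

Substring matching is deliberately loose: it catches plurals and compound words, at the cost of occasional false positives for short phrases.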

Pricing

The cost of a search varies with your search parameters and the expected output size. For pricing, contact us at hello@code402.com with some details about your search.

Frequently Asked Questions