Making the world wide web accessible to all businesses
The Common Crawl Foundation builds an archive of web crawl data that anyone can access and analyze. Billions of pages are added each month. Code 402 Crawls provides user-friendly services to process these archives.
Whether you're building a vertical search engine, researching historical market trends, generating link-building opportunities, or something else altogether, our services can help you get started immediately without having to maintain your own costly distributed processing infrastructure.
You'll benefit from:
- Intuitive interfaces with previews
- Intelligent pre-filtering to minimize compute time and expense
- Pay-as-you-go pricing
Find domains by URL, page content, and location.
Automatically discover similar sites starting from a seed set.
Run your code on a subset of the Common Crawl.
Extract structured data from the web with a human in the loop.
Search WARC, WET, and WAT archives by URL or page content.
Scan millions of homepages for less than the cost of a cup of coffee.
Generate leads, monitor trends, and do competitive analysis with ease. Provide the parameters for your search:
- Language of web page
- Domain name filter
- Content filter, either strings or a powerful regular expression
- Technology filter, using tags or keywords that indicate technologies used, e.g. web platform type or third-party services
- Geographic filter (best effort)
Then sit back and let our infrastructure perform the search with maximum efficiency. Download your results or use our other services to keep processing.
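To make those parameters concrete, here is a hypothetical sketch of what a search request might look like. The endpoint URL, field names, and authentication header are illustrative placeholders, not the actual Code 402 API.

```python
# Hypothetical Homepage Search request. Endpoint, field names, and
# auth header are placeholders, not the real Code 402 API.
import requests

search = {
    "language": "en",                              # language of web page
    "domain": {"suffix": ".ca"},                   # domain name filter
    "content": {"regex": r"shopify|woocommerce"},  # string or regex content filter
    "technologies": ["wordpress"],                 # technology tags/keywords
    "geo": {"country": "CA"},                      # geographic filter (best effort)
}

response = requests.post(
    "https://api.example.com/v1/homepage-search",  # placeholder URL
    json=search,
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
response.raise_for_status()
print(response.json())
```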
Integrates with: Sites Like This, Worker, Catalog
Sites Like This (Incubating)
There are hundreds of millions of sites on the web, but only a handful are relevant to your business. Machine learning can teach a computer to recognize those sites, and the Sites Like This service is built for exactly this task.
- Curate multiple sets for different tasks
- Label documents quickly with online preview and label hotkeys
- Use your Sites Like This model as a filter for future Homepage Searches
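For intuition, here is a minimal sketch of the kind of "more like this" classifier the service automates. It uses scikit-learn as a stand-in; this is not the service's actual implementation, and the seed pages and labels are invented for illustration.

```python
# Minimal sketch of a seed-set classifier, with scikit-learn as a
# stand-in for whatever Sites Like This uses internally.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled seed set: homepage text plus a relevance label.
pages = ["Handmade ceramics studio in Toronto", "Enterprise SaaS billing platform"]
labels = [1, 0]  # 1 = relevant to your business, 0 = not

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(pages, labels)

# Score an unseen homepage; a high probability suggests a similar site.
print(model.predict_proba(["Pottery classes and kiln rentals"])[0][1])
```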
Integrates with: Homepage Search, Worker, Catalog
Once you've found relevant documents, you'll want to process them further. Worker makes this simple:
- We invoke it for each relevant page, in batches for maximum efficiency
- We store the output of the transformation for you to download, or pass it to the Catalog service
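As a sketch, a Worker transformation might look like the function below: something invoked once per matching page. The signature and record shape are illustrative assumptions, not the documented Worker contract.

```python
# Hypothetical Worker transformation, invoked once per matching page.
# The (url, html) signature and record shape are assumptions.
import json
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def worker(url: str, html: str) -> str:
    """Extract a small structured record from one page's HTML."""
    record = {
        "url": url,
        # Mask emails (see the FAQ on email masking below).
        "masked_emails": [e.split("@")[0][:2] + "***@" + e.split("@")[1]
                          for e in EMAIL_RE.findall(html)],
        "mentions_careers": "careers" in html.lower(),
    }
    # The returned JSON is what you would download, or pass to Catalog.
    return json.dumps(record)
```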
Integrates with: Homepage Search, Sites Like This, Catalog
The web can be the Wild West. Occasionally, you'll need additional human oversight when processing documents. The Catalog service can help you:
- Declare a schema for your data
- Populate a record for each URL automatically (via Worker)
- Collaborate with others to update the record by hand as needed
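For example, a Catalog schema declaration might look like the following sketch. The field names and type vocabulary are illustrative assumptions, not Catalog's actual schema language.

```python
# Hypothetical Catalog schema: one record per URL, some fields filled
# automatically by Worker, others edited by hand.
schema = {
    "name": "storefronts",
    "fields": [
        {"name": "url",      "type": "string", "required": True},
        {"name": "platform", "type": "string"},   # populated via Worker
        {"name": "verified", "type": "boolean"},  # updated by hand as needed
        {"name": "notes",    "type": "string"},   # free-form, collaborative
    ],
}
```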
Integrates with: Homepage Search, Sites Like This, Worker
Mine data from billions of pages without the hassle of maintaining clusters of computers.
When homepages alone aren't enough, Enterprise Search grants access to the complete Common Crawl archives. Our cost-effective search platform delivers results in JSON format directly to your own S3 bucket for further processing.
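Once a run completes, you might collect the results from your bucket with a few lines of boto3. The bucket name and key prefix below are placeholders, and the sketch assumes the JSON is delivered as newline-delimited records.

```python
# Read Enterprise Search results from your own S3 bucket.
# Bucket name and prefix are placeholders; assumes newline-delimited JSON.
import json
import boto3

s3 = boto3.client("s3")
BUCKET, PREFIX = "your-results-bucket", "enterprise-search/run-001/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"]
        for line in body.iter_lines():
            record = json.loads(line)
            print(record)  # or pass to your own pipeline
```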
Get Started Today With A Free Preview
The best way to see how our tools can work for you is simply to try them. You can immediately explore some basic services with no commitment.
Why are email addresses masked in Homepage Search?
To comply with anti-spam laws, we don't include unmasked email addresses in our results. Most jurisdictions have strict rules about unsolicited commercial email (UCE). Many UCE laws include an exemption for "implied consent"; for example, Canada's Anti-Spam Legislation permits sending commercial email to someone who:
"has conspicuously published [...] the electronic address to which the message is sent, the publication is not accompanied by a statement that the person does not wish to receive unsolicited commercial electronic messages at the electronic address and the message is relevant to the person’s business, role, functions or duties in a business or official capacity"
Even so, this determination must be made on a case-by-case basis by visiting the website.
Therefore, we only include masked email addresses in our results. You can use their presence as a signal to visit the underlying site and see if it's permissible to send them email.
Do I need my own AWS account?
Only the Enterprise Search feature requires your own AWS account, specifically an S3 bucket for results. Homepage Search does not require an AWS account.
What is the complete list of metadata filters for Homepage Search?
We continually add filters based on user feedback. To see the most up-to-date list, create an account and use the free Preview option. If there is a specific technology you would like to see, contact us.
How many credits does a homepage search cost?
It varies. Let's say that you ran a search with a URL filter, a geographic filter, and a content filter. A crawl archive for one month has about 15,000,000 sites. If your URL and geographic filters excluded 13,000,000 of those sites, the remaining 2,000,000 would be scanned by the content filter, and you would be charged:
0.1 × 13,000,000 pages filtered by metadata + 1 × 2,000,000 pages scanned for content = 3,300,000 credits
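The same arithmetic, spelled out in a few lines of Python; the per-page rates are taken from this example and may not reflect current pricing.

```python
# Credit calculation from the example above (rates are illustrative).
sites_in_crawl    = 15_000_000
metadata_matches  =  2_000_000                          # survive URL + geo filters
metadata_filtered = sites_in_crawl - metadata_matches   # 13,000,000 excluded

credits = 0.1 * metadata_filtered + 1 * metadata_matches
print(credits)  # 3300000.0
```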
Can I use services that are Incubating?
Please contact us for details.
Is Code 402 related to the Common Crawl Foundation?
No. While our services use Common Crawl data, Code 402 is not affiliated with the Common Crawl Foundation.