Start today with the Common Crawl!


Making the world wide web accessible to all businesses

The Common Crawl Foundation builds an archive of web crawl data that anyone can access and analyze. Billions of pages are added each month. Code 402 Crawls provides user-friendly services to process these archives.

Whether you're building a vertical search engine, researching historic market trends, generating link-building opportunities, or something else altogether, our services can help you get started immediately without having to maintain your own costly distributed processing infrastructure.

You'll benefit from:

  • Intuitive interfaces with previews
  • Intelligent pre-filtering to minimize compute time and expense
  • Pay-as-you-go expenses

Services

Our services work well on their own, or in combination to build more complex data processing pipelines.

Homepage Search

Find domains by URL, page content, and location.

Incubating

Sites Like This

Automatically discover similar sites starting from a seed set.

Incubating

Worker

Run your code on a subset of the Common Crawl.

Incubating

Catalog

Extract structured data from the web with a human in the loop.

Incubating

Enterprise Search

Search WARC, WET, and WAT archives by URL or page content.

Sites Like This

Incubating

There are hundreds of millions of sites on the web, but only a small fraction of them are relevant to your business. Machine learning can teach a computer to recognize those sites, and the Sites Like This service is built for exactly that task.

  • Curate multiple sets for different tasks
  • Label documents quickly with online previews and labeling hotkeys
  • Use your Sites Like This model as a filter for future Homepage Searches

Integrates with: Homepage Search, Worker, Catalog
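
Code 402 hasn't published the model behind Sites Like This, but the underlying idea, training a text classifier on your labeled seed set, can be sketched in a few lines of scikit-learn. The homepage snippets and labels below are made up for illustration; the production model is certainly more involved:

    # Illustrative only: the actual Sites Like This model is not public.
    # This sketches the general idea of seed-set classification with
    # scikit-learn, using made-up homepage text.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Hand-labeled seed set: homepage text paired with a relevance label.
    seed_texts = [
        "We manufacture precision ball bearings for industrial machinery.",
        "Celebrity gossip, quizzes, and entertainment news updated daily.",
        "Wholesale supplier of roller bearings and bushings since 1982.",
        "Share your vacation photos with friends and family.",
    ]
    seed_labels = [1, 0, 1, 0]  # 1 = "site like this", 0 = not

    # TF-IDF features plus a linear classifier: a common baseline.
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(seed_texts, seed_labels)

    # Score an unseen homepage; a high probability suggests a similar site.
    candidate = "Custom linear bearings machined to your specifications."
    print(model.predict_proba([candidate])[0][1])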

Worker

Incubating

Once you've found relevant documents, you'll want to process them further. Worker makes this simple:

  • You write a JavaScript, Python, or Java program
  • We invoke it for each relevant page, in batch for maximum efficiency
  • We store the output of the transformation for you to download, or pass it to the Catalog service

Integrates with: Homepage Search, Sites Like This, Catalog
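
The exact interface Worker expects isn't documented on this page, so the sketch below assumes a hypothetical convention: you expose a process function, it's invoked once per relevant page with the page's URL and HTML, and the returned value becomes your stored output.

    # Hypothetical Worker script: the real invocation convention isn't shown
    # on this page. Assume the service calls a well-known entry point (here,
    # `process`) once per relevant page and collects whatever it returns.
    from html.parser import HTMLParser

    class TitleExtractor(HTMLParser):
        """Collects the text inside <title> elements."""
        def __init__(self):
            super().__init__()
            self.in_title = False
            self.title = ""

        def handle_starttag(self, tag, attrs):
            if tag == "title":
                self.in_title = True

        def handle_endtag(self, tag):
            if tag == "title":
                self.in_title = False

        def handle_data(self, data):
            if self.in_title:
                self.title += data

    def process(url: str, html: str) -> dict:
        """Called once per page; the returned dict is stored as output
        (or passed on to the Catalog service)."""
        parser = TitleExtractor()
        parser.feed(html)
        return {"url": url, "title": parser.title.strip()}

    # Local smoke test with a made-up page.
    if __name__ == "__main__":
        print(process("https://example.com",
                      "<html><head><title>Example</title></head></html>"))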

Catalog

Incubating

The web can be the Wild West. Occasionally, you'll need additional human oversight when processing documents. The Catalog service can help you:

  • Declare a schema for your data
  • Populate a record for each URL automatically (via Worker)
  • Collaborate with others to update the record by hand as needed

Integrates with: Homepage Search, Sites Like This, Worker
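
Catalog's actual schema syntax isn't shown on this page; purely to make the concept concrete, here is a sketch using a Python dataclass with hypothetical field names. A Worker populates the automatic fields, and reviewers fill in or correct the rest:

    # Hypothetical Catalog schema, sketched as a Python dataclass. The real
    # schema language may differ; the field names here are invented.
    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class CompanyRecord:
        """One record per URL: some fields are filled automatically by a
        Worker, others are completed or corrected by human reviewers."""
        url: str                          # populated automatically
        title: Optional[str] = None       # populated from Worker output
        industry: Optional[str] = None    # filled in by hand
        verified: bool = False            # flipped by a human reviewer
        notes: list = field(default_factory=list)

    # A Worker might create the initial record...
    record = CompanyRecord(url="https://example.com",
                           title="Example Bearings Inc.")

    # ...and a collaborator later updates it by hand.
    record.industry = "Manufacturing"
    record.verified = True
    print(record)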

Get Started Today With A Free Preview

The best way to see how our tools can work for you is simply to try them. You can immediately explore some basic services with no commitment.

Pricing

Buy services individually to build the subscription that meets your needs.

Storage - $10/month

  • 10 GB of storage
  • 10 GB of bandwidth
  • Additional bandwidth is charged at $0.10/GB
  • Additional storage can be purchased in 10 GB increments for $10/month

Homepage Search - $30/month

  • Requires Storage
  • Includes 200,000,000 search credits; unused credits expire
  • Searches cost 1 credit per page (0.1 credit per page for metadata-only searches)
  • Additional credits are automatically purchased at $10 per 100,000,000

Enterprise Search - $50/month

  • Requires your own S3 bucket
  • Charged the actual compute cost plus a 100% management fee
  • Pause and resume jobs to control spending

Frequently Asked Questions

Don't see what you're looking for? Contact us with your question.

  • Are there restrictions on how I can use the service?

    Yes. All use of our service must conform to our own Terms of Use as well as the Common Crawl Terms of Use.

  • Do your results include email addresses?

    To comply with anti-spam laws, we don't include unmasked email addresses in our results. Most jurisdictions have strict rules about unsolicited commercial email (UCE). Many UCE laws include an exemption for "implied consent"; for example, Canada's Anti-Spam Law permits sending mail to someone who:

    "has conspicuously published [...] the electronic address to which the message is sent, the publication is not accompanied by a statement that the person does not wish to receive unsolicited commercial electronic messages at the electronic address and the message is relevant to the person’s business, role, functions or duties in a business or official capacity"

    Even so, this determination must be made on a case-by-case basis by visiting the website.

    Therefore, we only include masked email addresses in our results. You can use their presence as a signal to visit the underlying site and see if it's permissible to send them email.

  • Do I need my own AWS account?

    Only the Enterprise Search feature requires your own AWS account, specifically an S3 bucket for results. Homepage Search does not require an AWS account.

  • Which filters are available?

    We continually add filters based on user feedback. To see the most up-to-date list, create an account and use the free Preview option. If there is a specific technology you would like to see, contact us.

  • How many credits does a search cost?

    It varies. Let's say that you ran a search with a URL filter, a geographic filter, and a content filter. A crawl archive for one month has about 15,000,000 sites. If your URL and geographic filters excluded 13,000,000 of those sites, you would be charged:

    0.1 ✕ 13,000,000 pages filtered by metadata + 1 ✕ 2,000,000 pages filtered by content = 3,300,000 credits
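
    The same estimate as a quick calculation, using the per-page rates from the Pricing section (the function name is ours, not part of any API):

        # 0.1 credits per page filtered on metadata alone; 1 credit per page
        # whose content must be scanned.
        def estimate_credits(metadata_only_pages, content_scanned_pages):
            return 0.1 * metadata_only_pages + 1 * content_scanned_pages

        # The example above: 13M pages excluded by metadata filters,
        # 2M pages that needed content filtering.
        print(estimate_credits(13_000_000, 2_000_000))  # 3300000.0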

  • Please contact us for details.

  • Are you affiliated with the Common Crawl Foundation?

    No. While we offer services that use their data, Code 402 is not affiliated with the Common Crawl Foundation.