Making the world wide web accessible to all businesses
The Common Crawl Foundation builds an archive of web crawl data that anyone can access and analyze. Billions of pages are added each month. Code 402 Crawls provides user-friendly services to process these archives.
Whether you're building a vertical search engine, researching historical market trends, generating link-building opportunities, or something else altogether, our services can help you get started immediately without having to maintain your own costly distributed processing infrastructure.
You'll benefit from:
- Intuitive interfaces with previews
- Intelligent pre-filtering to minimize compute time and expense
- Pay-as-you-go pricing
Find domains by URL, page content, and location.
Automatically discover similar sites starting from a seed set.
Run your code on a subset of the Common Crawl.
Extract structured data from the web with a human in the loop.
Search WARC, WET, and WAT archives by URL or page content.
Scan millions of homepages for less than the cost of a cup of coffee.
Generate leads, monitor trends, and do competitive analysis with ease. Provide the parameters for your search:
- Language of web page
- Domain name filter
- Content filter, either strings or a powerful regular expression
- Technology filter, using tags or keywords that indicate technologies used, e.g. web platform type or third-party services
- Geographic filter (best effort)
Then sit back and let our infrastructure perform the search with maximum efficiency. Download your results or use our other services to keep processing.
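To make those parameters concrete, here is a hypothetical sketch of what a search request might look like. The endpoint URL, field names, and authentication header are illustrative placeholders, not the actual Code 402 API.

```python
# Hypothetical Homepage Search request. Endpoint, field names, and
# auth header are placeholders, not the real Code 402 API.
import requests

search = {
    "language": "en",                              # language of web page
    "domain": {"suffix": ".ca"},                   # domain name filter
    "content": {"regex": r"shopify|woocommerce"},  # string or regex content filter
    "technologies": ["wordpress"],                 # technology tags/keywords
    "geo": {"country": "CA"},                      # geographic filter (best effort)
}

response = requests.post(
    "https://api.example.com/v1/homepage-search",  # placeholder URL
    json=search,
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
response.raise_for_status()
print(response.json())
```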
Integrates with: Sites Like This, Worker, Catalog
Sites Like This (Incubating)
There are hundreds of millions of sites on the web, but only a handful are relevant to your business. Machine learning can teach a computer to recognize those sites, and the Sites Like This service is built for exactly this task.
- Curate multiple sets for different tasks
- Label documents quickly with online preview and label hotkeys
- Use your Sites Like This model as a filter for future Homepage Searches
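For intuition, here is a minimal sketch of the kind of "more like this" classifier the service automates. It uses scikit-learn as a stand-in; this is not the service's actual implementation, and the seed pages and labels are invented for illustration.

```python
# Minimal sketch of a seed-set classifier, with scikit-learn as a
# stand-in for whatever Sites Like This uses internally.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled seed set: homepage text plus a relevance label.
pages = ["Handmade ceramics studio in Toronto", "Enterprise SaaS billing platform"]
labels = [1, 0]  # 1 = relevant to your business, 0 = not

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(pages, labels)

# Score an unseen homepage; a high probability suggests a similar site.
print(model.predict_proba(["Pottery classes and kiln rentals"])[0][1])
```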
Integrates with: Homepage Search, Worker, Catalog
Once you've found relevant documents, you'll want to process them further. Worker makes this simple:
- We invoke it for each relevant page, in batches for maximum efficiency
- We store the output of the transformation for you to download, or pass it to the Catalog service
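As a sketch, a Worker transformation might look like the function below: something invoked once per matching page. The signature and record shape are illustrative assumptions, not the documented Worker contract.

```python
# Hypothetical Worker transformation, invoked once per matching page.
# The (url, html) signature and record shape are assumptions.
import json
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def worker(url: str, html: str) -> str:
    """Extract a small structured record from one page's HTML."""
    record = {
        "url": url,
        # Mask emails (see the FAQ on email masking below).
        "masked_emails": [e.split("@")[0][:2] + "***@" + e.split("@")[1]
                          for e in EMAIL_RE.findall(html)],
        "mentions_careers": "careers" in html.lower(),
    }
    # The returned JSON is what you would download, or pass to Catalog.
    return json.dumps(record)
```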
Integrates with: Homepage Search, Sites Like This, Catalog
The web can be the Wild West. Occasionally, you'll need additional human oversight when processing documents. The Catalog service can help you:
- Declare a schema for your data
- Populate a record for each URL automatically (via Worker)
- Collaborate with others to update the record by hand as needed
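For example, a Catalog schema declaration might look like the following sketch. The field names and type vocabulary are illustrative assumptions, not Catalog's actual schema language.

```python
# Hypothetical Catalog schema: one record per URL, some fields filled
# automatically by Worker, others edited by hand.
schema = {
    "name": "storefronts",
    "fields": [
        {"name": "url",      "type": "string", "required": True},
        {"name": "platform", "type": "string"},   # populated via Worker
        {"name": "verified", "type": "boolean"},  # updated by hand as needed
        {"name": "notes",    "type": "string"},   # free-form, collaborative
    ],
}
```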
Integrates with: Homepage Search, Sites Like This, Worker
Mine data from billions of pages without the hassle of maintaining clusters of computers.
When homepages alone aren't enough, Enterprise Search grants access to the complete Common Crawl archives. Our cost-effective search platform delivers results in JSON format directly to your own S3 bucket for further processing.
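Once a run completes, you might collect the results from your bucket with a few lines of boto3. The bucket name and key prefix below are placeholders, and the sketch assumes the JSON is delivered as newline-delimited records.

```python
# Read Enterprise Search results from your own S3 bucket.
# Bucket name and prefix are placeholders; assumes newline-delimited JSON.
import json
import boto3

s3 = boto3.client("s3")
BUCKET, PREFIX = "your-results-bucket", "enterprise-search/run-001/"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"]
        for line in body.iter_lines():
            record = json.loads(line)
            print(record)  # or pass to your own pipeline
```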
Get Started Today With A Free Preview
The best way to see how our tools can work for you is simply to try them. You can immediately explore some basic services with no commitment.
Why are email addresses masked in Homepage Search?
To comply with anti-spam laws, we don't include unmasked email addresses in our results. Most jurisdictions have strict rules about unsolicited commercial email (UCE). Many UCE laws include an exemption for "implied consent"; for example, Canada's Anti-Spam Legislation permits sending commercial email to someone who:
"has conspicuously published [...] the electronic address to which the message is sent, the publication is not accompanied by a statement that the person does not wish to receive unsolicited commercial electronic messages at the electronic address and the message is relevant to the person’s business, role, functions or duties in a business or official capacity"
Even so, this determination must be made on a case-by-case basis by visiting the website.
Therefore, we only include masked email addresses in our results. You can use their presence as a signal to visit the underlying site and see if it's permissible to send them email.
Do I need my own AWS account?
Only the Enterprise Search feature requires your own AWS account, specifically an S3 bucket for results. Homepage Search does not require an AWS account.
What is the complete list of metadata filters for Homepage Search?
We continually add filters based on user feedback. To see the most up-to-date list, create an account and use the free Preview option. If there is a specific technology you would like to see, contact us.
How many credits does a homepage search cost?
It varies. Let's say that you ran a search with a URL filter, a geographic filter, and a content filter. A crawl archive for one month has about 15,000,000 sites. If your URL and geographic filters excluded 13,000,000 of those sites, the remaining 2,000,000 would be scanned by the content filter, and you would be charged:
0.1 × 13,000,000 pages filtered by metadata + 1 × 2,000,000 pages scanned for content = 3,300,000 credits
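The same arithmetic, spelled out in a few lines of Python; the per-page rates are taken from this example and may not reflect current pricing.

```python
# Credit calculation from the example above (rates are illustrative).
sites_in_crawl    = 15_000_000
metadata_matches  =  2_000_000                          # survive URL + geo filters
metadata_filtered = sites_in_crawl - metadata_matches   # 13,000,000 excluded

credits = 0.1 * metadata_filtered + 1 * metadata_matches
print(credits)  # 3300000.0
```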
Can I use services that are Incubating?
Please contact us for details.
Is Code 402 related to the Common Crawl Foundation?
No. While our services use Common Crawl data, Code 402 is not affiliated with the Common Crawl Foundation.