The Common Crawl Foundation builds an archive of web crawl data that anyone can access and analyze. Billions of pages are added each month. Code 402 Crawls provides user-friendly services to process these archives.
Whether you’re building a vertical search engine, researching historical market trends, generating link-building opportunities, or something else altogether, our services can help you get started immediately without having to maintain your own costly distributed processing infrastructure.
Contact us at hello@code402.com for more information.
With the Crawls Search tool, you can run several kinds of searches over the archive.
One common task using the Common Crawl is to identify websites that use a particular technology. You can search for a tag or phrase that is common to that technology, such as a reference to “cdn.shopify.com” for Shopify sites.
You can use this information to compile a list of domains for further analysis. Alternatively, you can access the full text of the matching pages using the Source information included with each result.
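At its core, this kind of technology detection is a substring match against each page's HTML. The sketch below illustrates the idea only; it is not the service's implementation, and the fingerprint strings beyond "cdn.shopify.com" are hypothetical examples.

```python
# Minimal sketch of fingerprint-based technology detection.
# The fingerprint table is a small, illustrative sample, not an
# authoritative list.
FINGERPRINTS = {
    "Shopify": ["cdn.shopify.com"],
    "WordPress": ["/wp-content/", "/wp-includes/"],
}

def detect_technologies(html: str) -> list[str]:
    """Return the technologies whose fingerprints appear in the page."""
    page = html.lower()
    return [
        tech
        for tech, needles in FINGERPRINTS.items()
        if any(needle.lower() in page for needle in needles)
    ]

sample = '<script src="https://cdn.shopify.com/s/assets/app.js"></script>'
print(detect_technologies(sample))  # → ['Shopify']
```

In practice the same check would be run against page bodies extracted from the crawl archives rather than a single string.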
You can also filter the Common Crawl to pages that discuss a specific topic. If you input a list of words or phrases that identify the topic, the search will return any pages containing at least one of those phrases. For example, enter “labradoodle” to find pages discussing that breed of dog.
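The topic filter behaves like a case-insensitive OR-match over a phrase list. A minimal sketch of that behavior follows; the phrase list and page texts are made up for illustration.

```python
# Sketch of OR-matching a phrase list against page text, the way a
# topic filter narrows the crawl. Phrases and pages are illustrative.
def matches_topic(text: str, phrases: list[str]) -> bool:
    """True if the text contains any of the phrases (case-insensitive)."""
    lowered = text.lower()
    return any(phrase.lower() in lowered for phrase in phrases)

topic_phrases = ["labradoodle", "goldendoodle"]
pages = {
    "example.com/a": "Our labradoodle loves the park.",
    "example.com/b": "Quarterly earnings report for Q3.",
}
matching = [url for url, text in pages.items()
            if matches_topic(text, topic_phrases)]
print(matching)  # → ['example.com/a']
```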
The cost of a search varies with your search parameters and the expected output size. For pricing, contact us at hello@code402.com with some details about your search.
Yes. All use of our service must conform to our own Terms of Use as well as the Common Crawl Terms of Use.
It depends. If the expected output of your search is small, the files can be emailed directly to you. For larger result sets, you will need your own AWS account with an S3 bucket, to which we will upload the results.
No. While we offer services that use their data, Code 402 is not affiliated with the Common Crawl Foundation.