Hello, WARC: Common Crawl code samples

Code samples to work with the Common Crawl in Java, Python, Go and JavaScript

November 6, 2019 | common crawl | by: Colin Dellow

The Common Crawl offers a wealth of material for data scientists, information security researchers and marketers. Accessing it requires using a platform like Code 402 Crawls or writing a computer program. This article explores the options available to you by providing the same example code in several popular programming languages, and compares their relative performance.

Just want the code? The code samples can be found at the @code402/warc-benchmark repository.

The Common Crawl

The Common Crawl is a US non-profit that archives billions of webpages each month. These webpages are packaged in a special format known as Web Archive, or WARC, format. The Common Crawl publishes these captures to an Amazon S3 bucket that is publicly accessible. If you run your processing code in Amazon EC2's us-east-1 region, you won't have to pay for the bandwidth used to transfer files from the bucket.

A typical WARC file is made up of thousands of entries that look like:

WARC-Type: response
WARC-Date: 2019-09-22T05:53:42Z
WARC-Record-ID: 
Content-Length: 6390
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: 
WARC-Concurrent-To: 
WARC-IP-Address: 185.199.108.153
WARC-Target-URI: https://cldellow.com/
WARC-Payload-Digest: sha1:IJIL6EAJ22TYOVBOUTHS6IF7MOZWFBV7
WARC-Block-Digest: sha1:UR5ECPRJZBVKCJUXNO34KMKBP2W4OZUW
WARC-Identified-Payload-Type: text/html

HTTP/1.1 200 OK
Server: GitHub.com
Content-Type: text/html; charset=utf-8
Last-Modified: Sun, 01 Sep 2019 21:01:04 GMT
ETag: W/"5d6c3190-1630"
Access-Control-Allow-Origin: *
Expires: Sun, 22 Sep 2019 06:03:42 GMT
Cache-Control: max-age=600
X-Crawler-Content-Encoding: gzip
X-Proxy-Cache: MISS
X-GitHub-Request-Id: 94B0:0B93:985CE7:C47294:5D870C64
X-Crawler-Content-Length: 1928
Content-Length: 5680
Accept-Ranges: bytes
Date: Sun, 22 Sep 2019 05:53:42 GMT
Via: 1.1 varnish
Age: 0
Connection: keep-alive
X-Served-By: cache-bwi5047-BWI
X-Cache: MISS
X-Cache-Hits: 0
X-Timer: S1569131622.014584,VS0,VE7
Vary: Accept-Encoding
X-Fastly-Request-ID: b1cd040b732adec6874e262669f1df743be6031a

...HTML content...

There are three blocks in each entry: the metadata about the entry itself, the HTTP response headers from the server, and the HTML response itself. The blocks are separated from one another by two consecutive \r\n sequences.
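To make the three-block layout concrete, here is a minimal Python sketch that splits one uncompressed entry on the blank lines described above. The function names `split_warc_entry` and `parse_headers` and the shortened sample record are illustrative assumptions, not code from the benchmark repository. A robust reader would rely on the Content-Length header rather than on searching for separators.

```python
def split_warc_entry(entry: bytes):
    """Split one uncompressed WARC response entry into its three blocks.

    Hypothetical helper for illustration only.
    """
    separator = b"\r\n\r\n"  # two consecutive \r\n sequences
    # The first blank line ends the WARC metadata block.
    warc_block, _, rest = entry.partition(separator)
    # The second blank line ends the HTTP headers; the rest is the payload,
    # which may itself contain blank lines and is left untouched.
    http_block, _, body = rest.partition(separator)
    return warc_block, http_block, body


def parse_headers(block: bytes) -> dict:
    """Turn 'Name: value' lines into a dict, skipping lines without a colon
    (for example the HTTP status line)."""
    headers = {}
    for line in block.decode("utf-8", errors="replace").split("\r\n"):
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip()] = value.strip()
    return headers


# A shortened, made-up entry modelled on the example above.
sample = (
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: https://cldellow.com/\r\n"
    b"\r\n"
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"\r\n"
    b"<html>...</html>"
)

warc_block, http_block, body = split_warc_entry(sample)
warc_headers = parse_headers(warc_block)
http_headers = parse_headers(http_block)
```

Because `partition` only splits on the first occurrence, blank lines inside the HTML payload do not affect the result.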

The most common type of WARC contains raw HTML captures, although others such as WET and WAT also exist. You can read about them in the Common Crawl's Navigating the WARC File Format post.

The Task

We'll focus on a common task: market research. Our goal is to search all .com webpages for links to YouTube videos.
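Restricting the search to .com pages comes down to checking the host of each entry's WARC-Target-URI. The sketch below is one way to do that in Python. The helper name `is_dot_com` is an assumption made for illustration, and the samples in the repository may filter differently.

```python
from urllib.parse import urlsplit


def is_dot_com(target_uri: str) -> bool:
    """Return True when the URI's host ends in .com (hypothetical helper)."""
    try:
        host = urlsplit(target_uri).hostname or ""
    except ValueError:
        # urlsplit rejects some malformed hosts; treat those as non-matches.
        return False
    # hostname is already lower-cased; drop a trailing dot from fully
    # qualified names before comparing.
    return host.rstrip(".").endswith(".com")
```

Checking the parsed host rather than the raw string avoids false positives such as a path that happens to end in ".com".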

We'll search the raw HTML captures contained in WARC files, which we'll stream directly from S3. One thing we notice immediately is that there are many different formats for YouTube URLs. https://www.youtube.com/watch?v=dQw4w9WgXcQ and https://youtu.be/dQw4w9WgXcQ both refer to a classic video. Luckily, GitHub user Aidan Feldman has written a regular expression that can be used to capture them all [1].
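As a rough illustration of the idea, the following Python sketch matches a few common YouTube URL shapes and extracts the 11-character video ID. The pattern is a simplified stand-in written for this example, not the regular expression by Aidan Feldman that the benchmarks use, and it will miss formats that the original handles.

```python
import re

# Simplified, hypothetical pattern: covers watch?v=, embed/, v/ and youtu.be
# links, with or without a scheme. Not the regex used in the benchmarks.
YOUTUBE_LINK = re.compile(
    r"(?:https?:)?//(?:www\.|m\.)?"
    r"(?:youtube\.com/(?:watch\?(?:[^\"'\s>]*[&;])?v=|embed/|v/)|youtu\.be/)"
    r"([A-Za-z0-9_-]{11})"
)


def find_video_ids(html: str) -> list:
    """Return the distinct video IDs found in html, in order of appearance."""
    ids = (match.group(1) for match in YOUTUBE_LINK.finditer(html))
    return list(dict.fromkeys(ids))
```

For example, calling `find_video_ids` on a page that links to both https://www.youtube.com/watch?v=dQw4w9WgXcQ and https://youtu.be/dQw4w9WgXcQ reports the ID once.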

All reported durations are the median duration of 3 trials on an a1.medium instance in the us-east-1 region. In order to distinguish the overhead of the library itself, we report two sets of numbers:

the time to download and parse the WARC's entries for .com pages, without doing any work
the time to download, parse, and search the WARC
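The measurement approach above can be sketched in a few lines of Python. The names `median_duration` and `search_overhead` are hypothetical, and subtracting the two timings to estimate the cost of searching is our reading of the two sets of numbers rather than something the benchmark harness is known to do.

```python
import statistics
import time


def median_duration(task, trials=3):
    """Run task() `trials` times and return the median wall-clock seconds."""
    durations = []
    for _ in range(trials):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


def search_overhead(parse_only_seconds, parse_and_search_seconds):
    """Approximate the time spent searching, given the two reported timings."""
    return parse_and_search_seconds - parse_only_seconds
```

Using the median rather than the mean keeps a single slow trial, such as one with a cold network connection, from skewing the result.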

Finally, a note of caution: the code samples used in this article focus on illustrating the basics. They do not rigorously handle errors, character set encodings, or cleaning up opened resource handles.

The Languages

Bash

Source Code | Parse: 33 sec | Parse and search: 128 sec

We don't recommend using Bash for processing petabytes of data! Still, you'd be surprised what you can achieve with decades-old technology. It can be a good way to get your hands dirty quickly, and it gives you a rough idea of how fast a code-based solution should be.

Java

Source Code | Parse: 35 sec | Parse and search: 60–235 sec

We used the IIPC's jwarc library on JDK 11. Note that we report a range for search times because we tested three different regular expression engines: the standard JDK engine, Google's re2j engine, and Anders Møller's Brics Automaton engine. Both re2j and the Brics Automaton engine take a different approach to regular expressions, one that sacrifices some power for faster searches. They each took 60 seconds, whereas the JDK engine took 235 seconds.

JavaScript (Node)

Source Code | Parse: 82 sec | Parse and search: 118 sec

We used the node-warc framework running on node v10.16.

Go

Source Code | Parse: 91 sec | Parse and search: 106 sec

We used the go-warc library on golang v1.10.4.

Python

Source Code | Parse: 100 sec | Parse and search: 105 sec

We used the warcio library on Python 3.6.8.

Code 402

Source Code Unavailable | Parse: 17 sec | Parse and search: 45 sec

Our Java-based platform is proprietary, but we plan to publish another article describing how it achieves the performance it does.

Conclusion

Looking at the chart below, it's evident that the Java ecosystem has spent the most effort tuning for efficiency. This is important since an additional 15 seconds of processing per WARC file costs about $2 across an entire crawl [2]. If using our platform for data mining the Common Crawl isn't an option, we'd recommend Java based on its performance.

Ultimately, pick the tool that best fits your needs based on budget, experience, and ecosystem of libraries.

Notes

[1] Learn more about regular expressions using our online brics.dk regex tester.
[2] Given 56,000 WARC files per crawl, an extra 15 seconds per WARC requires an additional 233 hours of compute time in Amazon EC2. Given a spot instance price of $0.0084/hour for an a1.medium, that results in $1.96 of extra charges.
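The cost estimate in the second note can be reproduced with a short Python calculation. The constant and function names are ours, and the figures are the ones quoted in the note. Treat it as a back-of-the-envelope check that assumes the total is billed as plain instance-hours at a fixed spot price.

```python
WARC_FILES_PER_CRAWL = 56_000
EXTRA_SECONDS_PER_WARC = 15
SPOT_PRICE_PER_HOUR = 0.0084  # USD per hour for an a1.medium, as quoted


def extra_cost(files=WARC_FILES_PER_CRAWL,
               seconds_per_file=EXTRA_SECONDS_PER_WARC,
               price_per_hour=SPOT_PRICE_PER_HOUR):
    """Return (extra compute hours, extra cost in USD) for a whole crawl."""
    extra_hours = files * seconds_per_file / 3600
    return extra_hours, extra_hours * price_per_hour


hours, dollars = extra_cost()
print(f"{hours:.0f} extra hours, ${dollars:.2f} extra cost")
```

Changing `seconds_per_file` shows how quickly small per-file differences between libraries add up over a full crawl.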