WARCannon was built to simplify and cheapify the process of ‘grepping the internet’.

With WARCannon, you can:

  • Build and test regex patterns against real Common Crawl data
  • Easily load Common Crawl datasets for parallel processing
  • Scale compute capabilities to asynchronously crunch through WARCs at frankly unreasonable capacity
  • Store and easily retrieve the results

$ git clone git@github.com:c6fc/warcannon.git
$ cd warcannon
warcannon$ cp settings.json.sample settings.json

Edit settings.json to taste; each field is described below, and a sample sketch follows the list:

  • backendBucket: The S3 bucket used to store the Terraform state. If it doesn’t exist, WARCannon will create it for you. Replace ‘<somerandomcharacters>’ with random characters to make the name unique, or specify another bucket you own.
  • awsProfile: The profile name in ~/.aws/credentials that you want to piggyback on for the installation.
  • nodeInstanceType: An array of instance types to use for parallel processing. ‘c’-types are best value for this purpose, and any size can be used. ["c5n.18xlarge"] is the recommended value for true campaigns.
  • nodeCapacity: The number of nodes to request during parallel processing. The resulting nodes will be an arbitrary distribution of the nodeInstanceTypes you specify.
  • nodeParallelism: The number of simultaneous WARCs to process per vCPU. 2 is a good number here. If nodes have insufficient RAM to run at this level of parallelism (as you might encounter with ‘c’-type instances), they’ll run at the highest safe parallelism instead.
  • nodeMaxDuration: The maximum lifespan of compute nodes in seconds. Nodes will be automatically terminated after this time if the job has still not completed. Default value is 24 hours.
  • sshPubkey: A public SSH key to facilitate remote access to nodes for troubleshooting.
  • allowSSHFrom: A CIDR mask to allow SSH from. Typically this will be <yourpublicip>/32
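
Taken together, a filled-in settings.json might look roughly like the sketch below. The values shown (bucket name, profile, instance counts, key, and CIDR) are placeholders for illustration only; settings.json.sample in the repository is the authoritative template.

{
  "backendBucket": "warcannon-tfstate-<somerandomcharacters>",
  "awsProfile": "default",
  "nodeInstanceType": ["c5n.18xlarge"],
  "nodeCapacity": 2,
  "nodeParallelism": 2,
  "nodeMaxDuration": 86400,
  "sshPubkey": "ssh-ed25519 AAAA... you@yourworkstation",
  "allowSSHFrom": "203.0.113.10/32"
}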

WARCannon is powered by Common Crawl, whose datasets are available via the AWS Open Data program. Common Crawl is unique in that the data retrieved by its spiders captures not only website text, but also other text-based content like JavaScript, TypeScript, full HTML, CSS, etc. By constructing suitable regular expressions capable of identifying unique components, researchers can identify websites by the technologies they use, and do so without ever touching the websites themselves. The problem is that this requires parsing hundreds of terabytes of data, which is a tall order no matter what resources you have at your disposal.

You can use RegExr with the ‘JavaScript’ format to build and test regular expressions against known-good matches.
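
For example, a hypothetical JavaScript-flavor pattern for fingerprinting pages that load a particular jQuery build (this pattern is illustrative only and does not ship with WARCannon) might look like:

/<script[^>]+src=["'][^"']*jquery-([0-9.]+)(\.min)?\.js["']/g

Patterns like this identify a site by the assets it serves rather than by visiting it, which is exactly the use case described above.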

You also have the option of capturing results only from specified domains. To do this, simply populate the domains array with the FQDNs you wish to include. It is recommended that you leave this empty ([]), since restricting domains is almost never worthwhile (the processing effort saved is very small), but it can be useful in some niche cases.

exports.domains = ["example1.com", "example2.com"];
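
For orientation, a minimal matches.js that scans every domain and defines a single named pattern might look something like the sketch below. The regex_patterns export name and overall shape are assumptions for illustration; the matches.js shipped in the repository defines the exact structure WARCannon expects, so start from that file.

// Illustrative sketch only; mirror the structure of the matches.js shipped with WARCannon.
exports.domains = []; // empty array = scan every domain (recommended)

// 'regex_patterns' is an assumed export name for this example.
exports.regex_patterns = {
    "example_api_key": /example_[a-z0-9]{32}/g // hypothetical named pattern
};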

Once matches.js is populated, run the following command:

warcannon$ ./warcannon testLocal <warc_path>
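
For example, using a path of the kind listed in a Common Crawl warc.paths manifest (the segment and filename below are placeholders, not a real object):

warcannon$ ./warcannon testLocal crawl-data/CC-MAIN-2021-31/segments/<segment_id>/warc/<warc_filename>.warc.gz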

WARCannon will then download the WARC and parse it with your configured matches. There are a few quality-of-life things that WARCannon does by default that you should be aware of:

  1. WARCannon will download the WARC to /tmp/warcannon.testLocal on first run, and will re-use the downloaded copy from then on even if you change the warc_path. If you wish to use a different WARC, you must delete this file first (see the example after this list).
  2. WARCs are large, with most coming in at just over 1GB. WARCannon uses the AWS CLI for multi-threaded downloads, but if you have a slow internet connection, you’ll need to exercise patience the first time around.
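
To switch to a different WARC, delete the cached copy noted in item 1 before re-running testLocal:

warcannon$ rm /tmp/warcannon.testLocal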

On top of everything else, WARCannon will attempt to evaluate the total compute cost of your regular expressions when they run locally. This way, you know whether a given regular expression will significantly impact performance before you execute your campaign.
