International Business Machines Corporation
Web crawler platform

Last updated:

Abstract:

Systems, methods, and computer program products for implementing a web crawler platform comprising one or more containerized web crawler programs that work in tandem to index web resources and reduce the redundancy that arises when multiple web crawlers independently index overlapping web resources. The platform provides a URL namespace in which crawlers register with the platform and create URL endpoints, so that other crawlers can discover the crawlers already registered and identify web resources that have previously been indexed. The platform also provides crawler-to-crawler communication and exchange of data and metadata obtained from previously indexed web resources, allowing crawlers to share existing data or metadata without having to crawl the web resource directly. As web crawlers move between data centers in different geolocations, each crawler's registered URL is mapped to its subsequent IP addresses, so the crawler remains transparently and continuously identifiable to other crawlers registered with the platform.
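
The registration and remapping behavior described in the abstract can be illustrated with a minimal in-memory sketch. This is not the patented implementation: the class names (CrawlerRecord, CrawlerRegistry), the method names, the crawler:// URL scheme, and the example addresses are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CrawlerRecord:
    """Hypothetical registry entry for one crawler."""
    ip_address: str
    # Maps a web resource URL to the metadata this crawler has shared for it.
    indexed: dict[str, dict] = field(default_factory=dict)


class CrawlerRegistry:
    """Hypothetical URL namespace: stable crawler URL -> current record."""

    def __init__(self) -> None:
        self._crawlers: dict[str, CrawlerRecord] = {}

    def register(self, crawler_url: str, ip_address: str) -> None:
        """Register a crawler under a stable URL."""
        self._crawlers[crawler_url] = CrawlerRecord(ip_address)

    def remap(self, crawler_url: str, new_ip: str) -> None:
        """Point the same registered URL at a new IP after the crawler moves."""
        self._crawlers[crawler_url].ip_address = new_ip

    def resolve(self, crawler_url: str) -> str:
        """Return the IP address the registered URL currently maps to."""
        return self._crawlers[crawler_url].ip_address

    def publish(self, crawler_url: str, resource_url: str, metadata: dict) -> None:
        """Record that a crawler has indexed a resource and share its metadata."""
        self._crawlers[crawler_url].indexed[resource_url] = metadata

    def find_indexed(self, resource_url: str) -> tuple[str, dict] | None:
        """Return (crawler URL, metadata) if any registered crawler indexed it."""
        for crawler_url, record in self._crawlers.items():
            if resource_url in record.indexed:
                return crawler_url, record.indexed[resource_url]
        return None


# Usage: one crawler indexes a page, then moves to another data center.
registry = CrawlerRegistry()
news = "crawler://platform.example/news"  # hypothetical registered URL
registry.register(news, "192.0.2.10")
registry.publish(news, "https://example.com/a", {"title": "A"})

# A second crawler checks the registry before crawling the page itself.
hit = registry.find_indexed("https://example.com/a")

# After the move, the registered URL is unchanged; only the mapping differs.
registry.remap(news, "198.51.100.7")
```

Keying every lookup on the registered URL rather than the IP address is what lets other crawlers keep identifying a crawler after it relocates, which is the property the abstract emphasizes.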

Status:

Grant

Type:

Utility

Filing date:

7 Aug 2019

Issue date:

11 Jan 2022