Microsoft Corporation
Efficient freshness crawl scheduling
Last updated:
Abstract:
The technology described herein builds an optimal refresh schedule by minimizing a cost function constrained by an available refresh bandwidth. The cost function receives an importance score for a content item and a change rate for the content item as input in order to optimize the schedule. The cost function is considered optimized when a refresh schedule is found that minimizes the cost while using the available bandwidth and no more. The technology can build an optimized schedule to refresh content with incomplete change data, content with complete change data, or a mixture of content with and without complete change data. It can also re-learn content item change rates from its own schedule execution history and re-compute the refresh schedule, ensuring that this schedule takes into account the latest trends in content item updates.
Utility
22 May 2019
5 Jul 2022