Cloudera, Inc.
Apparatus and method for sampling large data sets in a distributed data storage system
Abstract:
A system includes a distributed data storage system disseminated across worker machines connected by a network. A distributed data storage management module has instructions executed by a processor to utilize data block identifiers to track data block accesses to the distributed data storage system. A sampling module with instructions executed by the processor receives a new sample request from a client machine connected to the network. Initial data block samples are gathered from the distributed data storage system during a first time period. A revised sample request is received from the client machine during the first time period. The initial data block samples are gathered. New data block samples are collected from the distributed data storage system. The initial data block samples and the new data block samples are combined to form cumulative data block sample results. The cumulative data block sample results are supplied to the client machine.
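The sampling flow in the abstract can be illustrated with a small sketch. It is not the patented implementation: the class name BlockSampler, its methods, the in-memory dictionary standing in for the distributed data storage system, and the use of uniform random selection are all assumptions made for illustration. The sketch shows only the central idea, that a revised sample request keeps the initial data block samples and collects new blocks to add to them, using data block identifiers to avoid reading the same block twice.

```python
import random


class BlockSampler:
    """Toy block sampler (hypothetical; not the patent's actual design)."""

    def __init__(self, blocks, seed=0):
        self.blocks = blocks              # block_id -> list of records (stand-in for storage)
        self.rng = random.Random(seed)    # seeded so runs are repeatable
        self.sampled_ids = []             # block ids gathered so far, in gathering order

    def collect(self, target_blocks):
        """Grow the cumulative sample to target_blocks blocks, reusing earlier ones."""
        already = set(self.sampled_ids)
        # Only blocks whose identifiers have not been sampled yet are candidates.
        remaining = sorted(b for b in self.blocks if b not in already)
        needed = max(0, min(target_blocks - len(self.sampled_ids), len(remaining)))
        new_ids = self.rng.sample(remaining, needed)
        self.sampled_ids.extend(new_ids)
        return new_ids

    def cumulative_results(self):
        """Records from every block sampled so far (initial plus new)."""
        return [rec for b in self.sampled_ids for rec in self.blocks[b]]


# Eight made-up blocks with two records each.
blocks = {f"blk-{i}": [i * 10, i * 10 + 1] for i in range(8)}
sampler = BlockSampler(blocks, seed=42)

initial_ids = sampler.collect(2)         # new sample request: two blocks
new_ids = sampler.collect(5)             # revised request: five blocks in total
results = sampler.cumulative_results()   # initial and new samples combined
```

In this sketch the revised request reads only three additional blocks, because the two blocks gathered for the initial request are retained and combined with the new ones rather than sampled again.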
Utility
27 Jun 2019
15 Dec 2020