Microsoft Corporation
DISTRIBUTED HISTOGRAM COMPUTATION FRAMEWORK USING DATA STREAM SKETCHES AND SAMPLES
Abstract:
Methods for distributed histogram computation in a framework that uses data stream sketches and samples are performed by systems and devices. Distributions of large data sets are each scanned once and processed by a computing pool, without sorting, to generate local sketches and value samples for each distribution. The local sketches and samples are used to construct local histograms, from which cardinality estimates are obtained for query plan generation of distributed queries against the distributions. Local statistics of the distributions are also merged and consolidated to construct a global histogram representative of the entire data set. The global histogram is used to determine a cardinality estimate for query plan generation of incoming queries against the entire data set. When new data is added to a data set or distribution, the new data is scanned, new statistics are generated from it, and those statistics are merged with the existing statistics to produce a new global histogram.
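The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the claimed method: the abstract does not name specific sketch types, so this example assumes a row count plus a uniform reservoir sample as the per-distribution statistics, an equi-depth histogram, and a coarse "value <= x" estimate. All names (`LocalStats`, `scan_distribution`, `build_histogram`, `estimate_le`) are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class LocalStats:
    """Assumed per-distribution statistics: row count and a value sample."""
    count: int
    sample: list


def scan_distribution(values, sample_size=100, seed=0):
    """Scan a distribution once, without sorting it.

    Counts rows and keeps a uniform reservoir sample (an assumption;
    the abstract does not specify the sampling scheme).
    """
    rng = random.Random(seed)
    sample, count = [], 0
    for v in values:
        count += 1
        if len(sample) < sample_size:
            sample.append(v)
        else:
            j = rng.randrange(count)
            if j < sample_size:
                sample[j] = v
    return LocalStats(count, sample)


def build_histogram(stats_list, buckets=4):
    """Build an equi-depth histogram from one or more LocalStats.

    Each sampled value stands for count / len(sample) rows, so statistics
    from several distributions merge by pooling their weighted samples.
    Only the samples are sorted, never the underlying data.
    Returns a list of (upper_bound, estimated_rows) pairs.
    """
    points = sorted(
        (v, s.count / len(s.sample))
        for s in stats_list if s.sample
        for v in s.sample
    )
    if not points:
        return []
    target = sum(w for _, w in points) / buckets
    hist, acc = [], 0.0
    for v, w in points:
        acc += w
        # Close a bucket once it holds its share; the last one takes the rest.
        if acc >= target and len(hist) < buckets - 1:
            hist.append((v, acc))
            acc = 0.0
    if acc > 0:
        hist.append((points[-1][0], acc))
    return hist


def estimate_le(hist, x):
    """Coarse cardinality estimate for `value <= x`.

    Sums only buckets that lie entirely at or below x, so it can
    underestimate when x falls inside a bucket.
    """
    return sum(rows for upper, rows in hist if upper <= x)


# Two distributions scanned independently, then consolidated.
d1 = scan_distribution(range(0, 50))
d2 = scan_distribution(range(50, 100))
local_hist = build_histogram([d1])        # local histogram for d1
global_hist = build_histogram([d1, d2])   # global histogram

# New data: scan only the new rows, then merge with existing statistics.
d3 = scan_distribution(range(100, 200), sample_size=10)
new_global_hist = build_histogram([d1, d2, d3])
```

Weighting each sampled value by the rows it represents lets distributions of different sizes, sampled at different rates, be combined without rescanning any of them. A production system would likely add bucket interpolation and distinct-value sketches, which this sketch leaves out.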
Utility
31 Aug 2020
18 Nov 2021