Microsoft Corporation
Efficient distributed joining of two large data sets

Abstract:

A distributed join is performed on two large data sets that are shuffled on different keys, without shuffling the larger data set, even when the join is performed on the key of the smaller data set. A third data set is generated that is shuffled on the key of the smaller data set and includes data associated with the key of the larger data set. The third data set and the smaller data set are joined on the shuffle key of the smaller data set to create a fourth data set that includes both keys, that is, the key of the larger data set and the key of the smaller data set. The fourth data set is then shuffled on the key of the larger data set. The fourth data set and the larger data set are joined on the key of the larger data set to generate a fifth data set, which can be shuffled on the key of the smaller data set.
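
The steps in the abstract can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the patented implementation: partitions are modeled as Python lists, the column names `large_key` and `small_key` are hypothetical, each row of the larger data set is assumed to carry both key values, and the third data set is assumed to be a projection of the distinct key pairs found in the larger data set.

```python
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical number of partitions


def shuffle(rows, key, n=NUM_PARTITIONS):
    """Place rows into n partitions so equal key values share a partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts


def co_partitioned_join(left_parts, right_parts, keys):
    """Inner-join two data sets partition by partition, moving no rows."""
    out = []
    for left, right in zip(left_parts, right_parts):
        index = defaultdict(list)
        for r in right:
            index[tuple(r[k] for k in keys)].append(r)
        for l in left:
            for r in index.get(tuple(l[k] for k in keys), []):
                out.append({**l, **r})
    return out


def distributed_join(large_parts, small_parts):
    """large_parts is shuffled on large_key, small_parts on small_key."""
    # Third data set (assumed form): distinct key pairs from the larger
    # data set, shuffled on small_key. Only this narrow projection moves.
    pairs = {(r["small_key"], r["large_key"])
             for part in large_parts for r in part}
    third = shuffle(
        [{"small_key": s, "large_key": l} for s, l in pairs], "small_key")
    # Fourth data set: key pairs joined with the smaller data set.
    fourth_rows = co_partitioned_join(third, small_parts, ["small_key"])
    # Shuffle the fourth data set on large_key to line up with large_parts.
    fourth = shuffle(fourth_rows, "large_key")
    # Fifth data set: rows of the larger data set are joined in place.
    return co_partitioned_join(large_parts, fourth,
                               ["large_key", "small_key"])


large = [
    {"large_key": 1, "small_key": "a", "x": 10},
    {"large_key": 2, "small_key": "b", "x": 20},
    {"large_key": 3, "small_key": "a", "x": 30},
    {"large_key": 4, "small_key": "z", "x": 40},
]
small = [
    {"small_key": "a", "y": "A"},
    {"small_key": "b", "y": "B"},
    {"small_key": "c", "y": "C"},
]

result = distributed_join(shuffle(large, "large_key"),
                          shuffle(small, "small_key"))
```

In this sketch the full rows of the larger data set never leave their original partitions. Only the key-pair projection and the rows derived from the smaller data set are repartitioned.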

Status:

Grant

Type:

Utility

Filing date:

3 Feb 2020

Issue date:

31 May 2022