The Walt Disney Company
Joint heterogeneous language-vision embeddings for video tagging and search

Abstract:

Systems, methods and articles of manufacture for modeling a joint language-visual space. A textual query to be evaluated relative to a video library is received from a requesting entity. The video library contains a plurality of instances of video content. One or more instances of video content from the video library that correspond to the textual query are determined, by analyzing the textual query using a data model that includes a soft-attention neural network module that is jointly trained with a language Long Short-term Memory (LSTM) neural network module and a video LSTM neural network module. At least an indication of the one or more instances of video content is returned to the requesting entity.
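
The retrieval step described in the abstract can be illustrated with a minimal sketch. It assumes the language and video LSTM modules have already produced fixed-size vectors in a shared space, and it does not implement or train those modules. The dot-product attention score, the cosine ranking, and the names `attend` and `search` are illustrative assumptions, not details taken from the patent.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def attend(query_vec, frame_vecs):
    """Soft attention (assumed form): weight each frame embedding by the
    softmax of its dot product with the query, then sum the weighted frames."""
    weights = softmax([dot(query_vec, f) for f in frame_vecs])
    pooled = [
        sum(w * f[i] for w, f in zip(weights, frame_vecs))
        for i in range(len(query_vec))
    ]
    return pooled, weights


def search(query_vec, library, top_k=1):
    """Rank videos by cosine similarity between the query embedding and the
    attention-pooled video embedding; return the top_k video ids."""
    scored = []
    for video_id, frame_vecs in library.items():
        pooled, _ = attend(query_vec, frame_vecs)
        scored.append((cosine(query_vec, pooled), video_id))
    scored.sort(reverse=True)
    return [video_id for _, video_id in scored[:top_k]]


# Hypothetical 2-dimensional embeddings standing in for LSTM outputs.
library = {
    "beach_clip": [[1.0, 0.0], [0.9, 0.1]],
    "city_clip": [[0.0, 1.0], [0.1, 0.9]],
}
print(search([1.0, 0.0], library))
```

In this toy setup the query vector points in the same direction as the frames of `beach_clip`, so that clip ranks first. In a trained system the embeddings and attention parameters would be learned jointly rather than fixed by hand.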

Status:

Grant

Type:

Utility

Filing date:

12 Jun 2017

Issue date:

9 Aug 2022