Amazon.com, Inc.
Text encoding issue detection

Last updated: 24 Aug 2022

Abstract:

Method and apparatus for detecting text encoding errors caused by an electronic document having previously been encoded in multiple encoding formats. Non-word portions are removed from the electronic document. Embodiments determine whether a first word in the electronic document is likely to contain one or more text encoding errors by dividing the first word into a plurality of n-grams of length 2 or more. For each of the plurality of n-grams, a database is queried to determine a respective probability of the n-gram appearing in each of a plurality of recognized languages. Upon determining that the probabilities of two consecutive n-grams are each less than a predefined threshold probability, the first word is added to a list of words that likely contain text encoding errors. A confidence level that the first word includes the one or more text encoding errors is calculated, based on the lowest determined probability among the n-grams of the first word.
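
The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The following are assumptions made only for illustration: the in-memory table TOY_DB stands in for the n-gram database, n is fixed at 2, an n-gram's probability is taken as its highest value across the listed languages, an n-gram missing from the table has probability 0.0, the threshold is 0.005, and the confidence level is computed as 1 minus the lowest n-gram probability. All function names are hypothetical.

```python
import string
from typing import Dict, List, Tuple

# Hypothetical stand-in for the n-gram database: bigram -> probability per language.
# The numbers are invented for illustration only.
TOY_DB: Dict[str, Dict[str, float]] = {
    "th": {"en": 0.030, "fr": 0.004},
    "he": {"en": 0.025, "fr": 0.006},
    "ca": {"en": 0.020, "fr": 0.030},
    "af": {"en": 0.010, "fr": 0.012},
    "fé": {"en": 0.0001, "fr": 0.015},
}


def extract_words(text: str) -> List[str]:
    """Remove non-word portions: strip ASCII punctuation, drop tokens with no letters."""
    tokens = (token.strip(string.punctuation) for token in text.split())
    return [t for t in tokens if any(ch.isalpha() for ch in t)]


def bigrams(word: str) -> List[str]:
    """Divide a word into overlapping n-grams of length 2."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def best_probability(ngram: str) -> float:
    """Highest probability of the n-gram across all languages (assumed combination rule)."""
    return max(TOY_DB.get(ngram, {}).values(), default=0.0)


def find_suspect_words(text: str, threshold: float = 0.005) -> List[Tuple[str, float]]:
    """Return (word, confidence) for words with two consecutive low-probability n-grams."""
    suspects = []
    for word in extract_words(text):
        probs = [best_probability(g) for g in bigrams(word.lower())]
        # Flag the word only if two adjacent n-grams both fall below the threshold.
        if any(a < threshold and b < threshold for a, b in zip(probs, probs[1:])):
            # Assumed confidence formula: driven by the lowest n-gram probability.
            suspects.append((word, 1.0 - min(probs)))
    return suspects


if __name__ == "__main__":
    print(find_suspect_words("The café, or cafÃ©, in 2017!"))
```

In this sketch the correctly encoded word "café" passes because each of its bigrams is likely in at least one language, while the mis-decoded "cafÃ©" is flagged because its last two bigrams are absent from the table. Requiring two consecutive low-probability n-grams, as the abstract describes, keeps a single rare letter pair from flagging a word on its own.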

Status:

Grant

Type:

Utility

Filing date:

29 Nov 2017

Issue date:

23 Aug 2022

Full patent description

Patent application document
