In natural language processing a w-shingling is a set of unique "shingles"—contiguous subsequences of tokens in a document—that can be used to gauge the similarity of two documents. The w denotes the number of tokens in each shingle in the set.
The document, "a rose is a rose is a rose" can be tokenized as follows:
The set of all contiguous sequences of 4 tokens (N-grams, here: 4-grams) is
By removing duplicate elements from this set, a 4-shingling is obtained:
Contents |
For a given shingle size, the degree to which two documents A and B resemble each other can be expressed as the ratio of the magnitudes of their shinglings' intersection and union, or

where |A| is the size of set A. The resemblance is a number in the range [0,1], where 1 indicates that two documents are identical. This definition is identical with the Jaccard coefficient describing similarity and diversity of sample sets.
stock | retire | vm
Why are we here?
All text is available under the terms of the GNU Free Documentation License
This page is cache of Wikipedia. History