On the resemblance and containment of documents
A. Broder
The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that could be done independently for each document.