This is an attempt to group related log records using the Jaccard similarity coefficient.
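
As a rough illustration of the idea, the sketch below computes the Jaccard coefficient |A ∩ B| / |A ∪ B| between the token sets of two log records and greedily assigns each record to the first group whose representative is similar enough. The tokenizer, the `threshold` value, and the function names are assumptions made for this example; they are not taken from this project's code.

```python
import re


def tokens(record):
    """Split a log record into a set of lowercase word tokens.

    Digits are dropped on purpose (an assumption of this sketch) so that
    records differing only in ids or counters yield the same token set.
    """
    return set(re.findall(r"[a-z_]+", record.lower()))


def jaccard(a, b):
    """Jaccard similarity coefficient of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0  # convention: two empty sets are treated as identical
    return len(a & b) / len(a | b)


def group_records(records, threshold=0.5):
    """Greedy grouping: a record joins the first group whose representative
    (the token set of the group's first record) has Jaccard similarity
    >= threshold; otherwise it starts a new group."""
    groups = []  # list of (representative token set, member records)
    for record in records:
        toks = tokens(record)
        for representative, members in groups:
            if jaccard(representative, toks) >= threshold:
                members.append(record)
                break
        else:
            groups.append((toks, [record]))
    return [members for _, members in groups]
```

For example, `group_records` would place "Connection to db01 failed after 3 retries" and "Connection to db02 failed after 5 retries" in the same group, because their token sets are identical once digits are dropped. The greedy pass compares every record against every group representative, so it is only meant for small inputs.
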
Inspiration was taken from:
- Weighted MinHash on GPU helps to find duplicate GitHub repositories
- minhashcuda
- Finding Bieber: On removing duplicates from a set of documents
- Locality Sensitive Hashing for semantic similarity repository
What needs to be done if this attempt is to have any future:
- Use minhashcuda to run it at scale
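
To show what a MinHash-based approach would approximate, here is a minimal CPU-only sketch of plain (unweighted) MinHash in pure Python. It does not use minhashcuda and does not implement the weighted variant that library targets; the hash choice (salted BLAKE2b), the signature length, and the function names are assumptions made for this example.

```python
import hashlib


def minhash_signature(token_set, num_hashes=128):
    """Return a MinHash signature: for each of num_hashes salted hash
    functions, keep the minimum 64-bit hash value over all tokens."""
    if not token_set:
        raise ValueError("token_set must not be empty")
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "little")  # a distinct salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(t.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for t in token_set
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of signature positions
    at which the two signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)
```

Signatures have a fixed length regardless of how many tokens a record contains, which is what makes comparing very large numbers of records feasible. The estimate gets closer to the exact Jaccard coefficient as `num_hashes` grows.
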