Finding duplicates in a large set of files
In a large set of files, there are high chances to have files duplicated over the set. Finding these duplicates allow to link them, so that a modification of one copy or its metadata can be indicated on other copies, allowing better sharing and reuse of information.
For exact duplication, these duplicates can be easily detected by relying on the md5. We can assume that two files with same md5 have very high probabilities to be the same.
But we are also interested in detecting less obvious duplicates:
- Same file content with different type, for example a Word file that has been exported as PDF
- Different versions of the same file
Such files will have very similar contents, but different md5.
SOLR MORE LIKE THIS
Apache Solr (http://lucene.apache.org/solr/) is a search platform based on Apache Lucene. We already use Solr at Dexstr for its indexing and searching capabilities.
One interesting feature of Solr for our need is “More Like This (MLT)”. This feature allows to request Solr for similar documents, with a similarity score associated to each response document.
At first, this feature seemed to perfectly suits our needs, given that we already index the files content on a Solr index (the file content being extracted with Tika, another Apache tool – https://tika.apache.org/).
Our first idea was the following :
- Request solR for similar file content
- Take similar documents above a threshold
The first results were not as good as expected, indeed solR returns close similarity scores for documents only with a few different words and documents with similar topic.
To understand these results, we looked deeper how the MLT works.
MLT is based on “interesting words”, which are tokens (words in our case) that have high frequency in the requested document, and low frequency in the whole dataset. These frequencies (term frequency and document frequency) give scores by terms, and the terms with better scores will be used for the request.
MLT similarity matrix
To have more details about MLT mechanism, which is not very well documented in Solr or Lucene documentation, have a look at this post: http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/
There are two problems for our use case to use the Solr similarity score:
- The score is relative between the documents returned by a same request, it is not comparable between the results of two requests, so it would be very difficult to set a score threshold to consider two documents as similar enough. From Lucene FAQ itself, the score is not meaningful because of the way it is calculated (https://wiki.apache.org/lucene-java/LuceneFAQ#Can_I_filter_by_score.3F)
- The similarity being calculated only based on few terms that are considered as “interesting”, two very different documents with same topic will have a high similarity score, mostly if this topic is rare in the whole dataset.
So the MLT feature is very interesting to find documents with same topic, but in our case it will not be able to differentiate documents with same topics and documents with very similar content that can be two versions of the same file.
Still, even if there are false positives, the documents with similar content are returned by Solr.
We can then use Solr to find potential candidates for duplicates, which is fast given that the Solr index is used. Then we can use other similarity calculation methods to keep only very similar documents.
We studied different methods of similarity calculation. Here is a non-exhaustive list of these methods:
- Cosine similarity: this method is widely used for similarity calculation. It is based on a distance measure between two vectors of terms frequency (often TF-IDF, which is a ratio between term frequency in the document and in the dataset). In our case, this method has been tested and is not satisfactory because if two documents share nearly the same words in similar proportions, the score will be high. The consequence is the same than for the Solr More Like This, documents with similar topics will have high scores.
- Character-based similarity methods, for example Levenshtein or Needleman-Wunsch methods. These methods will compare texts character by character (words and punctuation can be used instead of characters). They take into account all differences between texts (two words transposed will be considered as a transformation). Some tests gave very good results in our case, but these methods are very heavy in terms of calculation, and will not be performant enough to be used in a large dataset.
- Dice coefficient: ratio between the common terms numbers and all terms number. It compares the similarity of bag of words, taking into account the occurrence of each term but not their order.
- Overlap: close to Dice coefficient, but considers the overlaps. The similarity score will be maximum if one of the documents is a subset of the other. It seemed interesting for a version detection, as versioning often add or remove text to documents, but there is a bias depending of the size of the documents: if a small document has same topic than a large one, almost all the terms of the small one will be on the other, and the similarity score will be very high.
For our case, the Dice coefficient has the best results for reasonable calculation time. The simplification of not considering the order of the words seemed an acceptable simplification with our tested examples.
We calculated the Dice coefficient of two documents by splitting them into tokens using Lucene and counting the number of commons terms between two documents.
ALGORITHM TO FIND DUPLICATES
The implemented algorithm to find versions of the same document has been the following:
- Find interesting documents with Solr MLT requests
- For the n documents with better score, calculate Dice coefficient
- As we have the file metadata, including file title, extracted by Tikka, we add a score “bonus” if two documents have same title
- Take documents with a score greater than a threshold
- For retained similar documents, do the same
→ We obtain a cluster of similar documents
We have then a group of multiple documents that we consider to be duplicates/same versions of the same file. We then want to find the history, and for this we use file metadata like last save date, last modification date…
In this first version of the duplicate finder, there are some limitations we are currently working on:
- If two versions of same file are very different, it will be difficult to detect it, we should give more importance to files technical metadata.
- The threshold to determine if documents are same versions is the same for all datasets. It could be interesting to be able to define different thresholds depending on business rules on files.
- It is difficult to know the history of the versions: between two versions of the same files, how to know which one is the more recent? On some files format, for example Office files, the files metadata like the last modification date allows to guess the file history, but for other formats such as text files, it is not possible. We can rely for example on file upload date in our system
We tested our algorithm with multiple versions and formats of the same files, and the algorithm found all the duplicates. After successful tests on controlled datasets, we deployed this tool to run on a real and very large dataset of one of our customers, and the results are very promising.
Charlotte, Back End Lead Developer.