How to provide relevant search results?
The relevance of search results is essential for finding information. Indeed, a user will almost never look further than the first few results of a search engine.
The relevant information must therefore be ranked as high as possible, so that what the user is looking for appears among the first results.
IMPORTANCE OF RANKING
The order or “ranking” of search results is essential for search engines, which will therefore use more or less complex algorithms to display the results that users will find most relevant first.
It is usually not possible to find out which algorithms popular search engines use: they are considered a major competitive advantage, on which the popularity of a search engine depends. These search engines may index more or fewer documents, but from the user's perspective the ability to order the results properly is decisive.
As search engines do not disclose their ranking criteria (or only partially), some people try to infer these criteria in order to improve their own ranking in Google-type searches. As a result, some widely used criteria and algorithms are known, as described below.
STATE OF THE ART IN RANKING STRATEGY
A natural starting point is to count how often the search terms occur in each document. For a search including several terms (for example “acid deoxyribonucleic”), the question is which count to use. Is it the number of times the words “acid” and “deoxyribonucleic” are found side by side and in this order? Or the total number of times either of these two words is found in the text?
To solve this problem, most search engines use the concept of TF-IDF (Term Frequency-Inverse Document Frequency). A term is weighted not only by its number of occurrences in a document (TF), but also by its specificity across the whole collection (IDF). For example, the term “acid” is found in many documents about chemistry or biology, so the proportion of documents containing it is high; its inverse document frequency, and hence its TF-IDF weight, is rather low. The term “deoxyribonucleic”, on the other hand, is only found in documents mentioning DNA, so the proportion of documents containing it is low and its TF-IDF weight is high. When scoring documents that contain both terms, occurrences of “deoxyribonucleic” therefore count for more than occurrences of “acid”, because its specificity is expected to be much higher.
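The weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the formula of any particular search engine: the toy corpus, the function names and the exact TF and IDF variants (relative frequency and a plain logarithm) are assumptions made for the example.

```python
import math
from collections import Counter

# Toy corpus (invented for illustration): each document is a list of lowercase tokens.
docs = {
    "d1": "deoxyribonucleic acid stores genetic information".split(),
    "d2": "sulfuric acid is a strong acid".split(),
    "d3": "amino acid chains form proteins".split(),
    "d4": "proteins fold into complex shapes".split(),
}

def idf(term, docs):
    """Inverse document frequency: terms found in fewer documents get a higher weight."""
    n_containing = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / n_containing) if n_containing else 0.0

def tf_idf_score(query_terms, tokens, docs):
    """Sum over the query terms of (relative term frequency x IDF)."""
    counts = Counter(tokens)
    return sum(counts[t] / len(tokens) * idf(t, docs) for t in query_terms)

query = ["deoxyribonucleic", "acid"]
scores = {name: tf_idf_score(query, tokens, docs) for name, tokens in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy corpus, d1 comes first even though d2 contains “acid” twice, because the single occurrence of the rarer term “deoxyribonucleic” outweighs the repeated occurrences of the common term.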
MACHINE LEARNING: “LEARNING TO RANK”
Machine learning (ML) techniques applied to search relevance are grouped under the term “learning to rank”. Most web search engines use this type of technique to sort their results.
Learning to rank methods train a model whose aim is to discover the best order in which to display search results. The score produced by the model is not absolute; it only serves to sort results relative to each other.
The variables used by the different algorithms vary between search engines, but remain very close to the criteria that can be used for classical ranking: TF-IDF, age or popularity of the result, term found in the title, … (see the section “OTHER COMMON CRITERIA” below)
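As a rough illustration, here is a minimal sketch of one family of learning to rank methods, the pairwise approach, using a perceptron-style update. The three features (TF-IDF, term in title, freshness), the training pairs and all names are hypothetical; real systems use far richer models and data.

```python
# Each result is described by hypothetical features: (tf_idf, term_in_title, freshness).
# Each training pair says: for the same query, the first result should rank above the second.
training_pairs = [
    ((0.9, 1.0, 0.2), (0.4, 0.0, 0.9)),
    ((0.7, 1.0, 0.5), (0.8, 0.0, 0.1)),
    ((0.6, 0.0, 0.8), (0.2, 0.0, 0.9)),
]

def score(weights, features):
    """Linear score; only comparisons between scores are meaningful."""
    return sum(w * x for w, x in zip(weights, features))

def train(pairs, epochs=20, lr=0.1):
    """Perceptron-style pairwise training: adjust weights whenever a pair is misordered."""
    weights = [0.0] * len(pairs[0][0])
    for _ in range(epochs):
        for better, worse in pairs:
            if score(weights, better) <= score(weights, worse):
                weights = [w + lr * (b - c) for w, b, c in zip(weights, better, worse)]
    return weights

weights = train(training_pairs)

# Rank two unseen (hypothetical) candidates with the learned weights.
candidates = {"doc_a": (0.5, 1.0, 0.3), "doc_b": (0.6, 0.0, 0.4)}
ranked = sorted(candidates, key=lambda d: score(weights, candidates[d]), reverse=True)
```

Multiplying all the weights by the same positive constant changes every score but leaves the order untouched, which is why the score has no absolute meaning.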
There are several types of algorithms for learning to rank, more or less fast or efficient. But the major problem is to obtain a training data set.
Indeed, as with any machine learning algorithm, a data set is needed to train the model. For learning to rank, this data set must include a set of queries, the results of those queries, and an assessment of the relevance of those results. This is especially hard in our life science and/or enterprise context, where there are far fewer users in the system than for web search engines!
The results of a query can be assessed in two ways, explicitly or implicitly:
- Explicit: experts give a score to each search result (early adopters, test team, …)
- Implicit: users’ behavior is used to infer judgments. We can for example take into account the number of clicks on a result or, on the contrary, the fact that the user did not click on the first three results, the fact that a request was rephrased, the time spent on a visited page, …
Explicit assessment is precise but very expensive to build. Implicit assessment is much easier to obtain, but contains noise that must be cleaned up.
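To make the implicit case concrete, here is a minimal sketch that turns a click log into relevance judgments. The log format, the function name and the min_impressions threshold are assumptions for illustration; the threshold is a very crude form of noise clean-up, and a production system would also have to account for effects such as the position of a result on the page.

```python
from collections import defaultdict

# Hypothetical interaction log: (query, document shown, was it clicked?).
log = [
    ("viral disease", "doc1", True),
    ("viral disease", "doc1", True),
    ("viral disease", "doc1", False),
    ("viral disease", "doc2", False),
    ("viral disease", "doc2", False),
    ("viral disease", "doc2", True),
    ("viral disease", "doc2", False),
    ("viral disease", "doc3", True),  # a single impression: too little evidence
]

def implicit_judgments(log, min_impressions=3):
    """Click-through rate per (query, document), ignoring rarely shown pairs."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for query, doc, was_clicked in log:
        shown[(query, doc)] += 1
        clicked[(query, doc)] += int(was_clicked)
    return {
        key: clicked[key] / shown[key]
        for key in shown
        if shown[key] >= min_impressions
    }

judgments = implicit_judgments(log)
```

Here doc1 receives a higher judgment than doc2, and doc3 is left out because one click on one impression says too little.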
CONTENT AUGMENTATION
Another way to improve relevance is “content augmentation”. The idea is to add pseudo-content to each document: a set of terms that are not present in the document but are linked to its concepts.
For example, an article that contains the word “cardiac” has a high probability of also containing the words “heart”, “arrest” and “circulatory”, but a low probability of containing the words “clown” or “pajamas”.
By analyzing the co-occurrence of the words in a set of documents, we can deduce correlations, and for each document, we can generate a set of words which “could” be in the document but are not.
In practice, search engines do not just count the co-occurrences of words, but use more sophisticated methods such as latent semantic analysis (LSA), latent Dirichlet allocation (LDA) or, more recently, Word2vec.
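The simple co-occurrence version of this idea (not the more sophisticated methods just mentioned) can be sketched as follows. The toy corpus, the function name and the 0.6 threshold are assumptions chosen for the example.

```python
from collections import Counter
from itertools import combinations

# Toy corpus (invented): each document is represented by its set of terms.
corpus = [
    {"cardiac", "heart", "arrest"},
    {"cardiac", "heart", "circulatory"},
    {"cardiac", "arrest", "heart"},
    {"clown", "pajamas"},
    {"cardiac", "surgery"},
]

# Count in how many documents each term, and each unordered pair of terms, appears.
term_counts = Counter(term for doc in corpus for term in doc)
pair_counts = Counter(
    frozenset(pair) for doc in corpus for pair in combinations(sorted(doc), 2)
)

def augment(doc_terms, threshold=0.6):
    """Return terms absent from the document that often co-occur with its terms."""
    extra = set()
    for term in doc_terms:
        if term not in term_counts:
            continue  # term never seen in the corpus: nothing to infer from it
        for other in term_counts:
            if other in doc_terms:
                continue
            # Estimate of P(other | term) from document counts.
            p = pair_counts[frozenset((term, other))] / term_counts[term]
            if p >= threshold:
                extra.add(other)
    return extra

pseudo_content = augment({"cardiac", "surgery"})
```

For the document containing only “cardiac” and “surgery”, the sketch proposes “heart” as pseudo-content, since three of the four documents mentioning “cardiac” also mention “heart”; “clown” and “pajamas” are never proposed.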
Use of ontologies in content augmentation
In our specific domain, biological ontologies (such as the Disease Ontology, ChEBI or the Pathway Ontology) provide links between concepts. For example, Alzheimer disease is classified as a central nervous system disease. One idea is to augment the content of a document with directly linked terms (SubClassOf), synonyms (hasExactSynonym) or external references (hasDbXref). At DEXSTR, we have found this to be an improvement.
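A minimal sketch of this kind of augmentation is shown below. It is not DEXSTR's implementation: the ontology is a hand-written dictionary rather than a loaded ontology file, the synonym and cross-reference values are illustrative placeholders, and concept detection is a naive substring match.

```python
# Hypothetical, hand-written ontology fragment; a real system would load these
# relations from an ontology file. The synonym and xref values are placeholders.
ontology = {
    "alzheimer disease": {
        "SubClassOf": ["central nervous system disease"],
        "hasExactSynonym": ["alzheimer's dementia"],
        "hasDbXref": ["EXAMPLE:0001"],
    },
}

RELATIONS = ("SubClassOf", "hasExactSynonym", "hasDbXref")

def ontology_terms(text, ontology, relations=RELATIONS):
    """Collect directly linked terms for every ontology concept mentioned in the text."""
    lowered = text.lower()
    extra = []
    for concept, links in ontology.items():
        if concept in lowered:  # naive substring match, for illustration only
            for relation in relations:
                for linked in links.get(relation, []):
                    # Skip terms already in the text or already collected.
                    if linked.lower() not in lowered and linked not in extra:
                        extra.append(linked)
    return extra

document = "Biomarker study in patients with Alzheimer disease."
augmented_terms = ontology_terms(document, ontology)
```

The returned terms would be indexed alongside the document, so that a search for “central nervous system disease” can also retrieve it.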
OTHER COMMON CRITERIA
Some criteria are only applicable to web pages (loading time, availability, HTML tags …), but others are more generic and may interest us.
Here are some examples:
- The terms do not include spelling errors.
- If the query contains several terms, documents in which these terms are close together will have a better score (“viral disease” > “viral […] disease”).
- Documents will have a higher score if the terms are found in an important field. In most contexts, this is especially true for the file name or path.
- Documents will have a higher score if the terms are found at the start of a field.
- More recent data will have a higher score.
- Popularity: number of clicks or downloads, …
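Two of the criteria above, term proximity and field importance, can be combined as in the following sketch. The field names, the boost values and the scoring formula are all assumptions for illustration; each search engine has its own way of weighting such signals.

```python
# Hypothetical field weights: a match in the file name counts more than one in the body.
FIELD_BOOSTS = {"filename": 3.0, "title": 2.0, "body": 1.0}

def min_distance(tokens, term_a, term_b):
    """Smallest gap (in tokens) between any occurrence of term_a and term_b, or None."""
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(a - b) for a in pos_a for b in pos_b)

def field_score(tokens, term_a, term_b):
    """One point per matched term, plus a bonus that shrinks as the terms move apart."""
    matched = (term_a in tokens) + (term_b in tokens)
    distance = min_distance(tokens, term_a, term_b)
    proximity_bonus = 1.0 / distance if distance else 0.0
    return matched + proximity_bonus

def document_score(fields, term_a, term_b):
    """Weighted sum of the per-field scores."""
    return sum(
        FIELD_BOOSTS[name] * field_score(text.lower().split(), term_a, term_b)
        for name, text in fields.items()
    )

adjacent = {"body": "a viral disease outbreak"}
spread = {"body": "viral infections can cause chronic disease"}
in_filename = {"filename": "viral disease report", "body": "summary"}
```

With the query terms “viral” and “disease”, the document where they are adjacent scores higher than the one where they are five tokens apart, and the document with both terms in its file name scores highest.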
Many search engines adapt the order of search results to the user who requests them: has the user already shown affinities for a type of content? Is it possible to identify other users similar to the one making the request?
Search history can also be used to influence search results.
TAKE AWAY MESSAGE
There are many ranking strategies: some are mentioned in this article, and others can be found (via a nicely ranked web search!). If you need to improve the quality of your own search engine, the first thing to do is to set up performance/relevance indicators (implicit or explicit, as described above). That way, you can playfully follow the improvement of your search page and the happiness of your users!
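As an example of such indicators, here is a minimal sketch of two commonly used relevance metrics, precision at k and mean reciprocal rank, computed from judged queries. The document identifiers and judgments are invented; in practice they would come from the explicit or implicit assessments discussed earlier.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are judged relevant."""
    top = ranked_ids[:k]
    return sum(1 for doc in top if doc in relevant_ids) / k

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)

# Hypothetical judged queries: the ranked results and the set judged relevant.
runs = [
    (["d3", "d1", "d7"], {"d1"}),        # first relevant result at rank 2
    (["d2", "d9", "d4"], {"d2", "d4"}),  # first relevant result at rank 1
]
mrr = mean_reciprocal_rank(runs)
p_at_3 = precision_at_k(*runs[1], k=3)
```

Tracking such numbers before and after each change to the ranking makes it possible to tell whether the change actually helped.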