Biomolecule management

Biomolecule management: taming the biochemical universe

Dealing with biomolecules is a challenge. Whereas interoperable formats exist for chemical entities, there are none for biomolecules. As a result, registering, storing and, most importantly, searching for biomolecules in a dataset takes effort.


According to the life science literature, a biomolecule is a biological macromolecule used as a therapeutic agent. It is a complex structure in which entities such as proteins, peptides, oligonucleotides, and small molecule drugs may be covalently linked to each other, or which may include chemically modified biological moieties. (Source: HELM: A Hierarchical Notation Language for Complex Biomolecule Structure Representation)

At the same time, we need to keep in mind that there is no clear boundary between small molecules and biomolecules. Existing entities lie on a continuum of size and nature.

What really matters in the distinction between a small molecule and a biomolecule is how we represent them, because the representation determines how we can compute on them.


Despite the challenges of using them as drugs (instability, potential immunogenicity, difficult manufacturing), the number of biomolecules reaching the market as drugs is increasing: 17 in 2018, against 12 in 2015 and in 2017.

The following website gives a better idea of the complexity of producing biological drugs:

The following figure is taken from:
G. de la Torre, B.; Albericio, F. The Pharmaceutical Industry in 2018. An Analysis of FDA Drug Approvals from the Perspective of Molecules. Molecules 2019, 24(4), 809.


In the current era of digitalisation, we need representations of substances that identify them using standard terminology, so that we can compute on them, compare them, and so on.

Biomolecules can be considered from various perspectives:

  • Chemical perspective: we can focus on the active site or on a modification of the biomolecule and treat it with the same principles as small chemical entities. This works well for atomic-level analysis, for example modelling a ligand-protein interaction. However, the size of such molecules makes the processing heavy and slow, if not impossible.
  • Sequence perspective: biomolecules are often composed of repeated units (amino acids, nucleotides or sugars) that are called monomers in this context. One way to represent a macromolecule is as a long string that spells out this sequence of monomers. This is obviously the case for genomic sequences and the familiar nucleobases A, C, G, T, U. However, non-natural chemical modifications of a biomolecule cannot be represented by a sequence alone. In addition, biomolecules used as drugs are often not a single entity but an assembly, for example a peptide, an antibody and a linker, which cannot be represented as a single sequence.
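The limits of the sequence perspective can be illustrated with a minimal Python sketch. The function name `as_plain_sequence`, the list-of-chains input format and the monomer symbol `meA` are assumptions made for this illustration; they do not come from any existing tool.

```python
# One-letter codes of the 20 standard amino acids.
NATURAL_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def as_plain_sequence(chains):
    """Return a one-letter sequence string, or None when a plain
    sequence cannot describe the entity.

    `chains` is a list of chains, each chain being a list of monomer
    symbols (hypothetical input format chosen for this sketch).
    """
    # An assembly of several chains (e.g. peptide + linker) is not one sequence.
    if len(chains) != 1:
        return None
    monomers = chains[0]
    # Any modified, non-natural monomer breaks the one-letter alphabet.
    if any(m not in NATURAL_AMINO_ACIDS for m in monomers):
        return None
    return "".join(monomers)


natural = as_plain_sequence([["A", "C", "G"]])
modified = as_plain_sequence([["A", "meA", "G"]])   # "meA" stands for a modified monomer
assembly = as_plain_sequence([["A", "C", "G"], ["K", "L"]])
```

Only the first call yields a sequence string; the other two return `None`, which is exactly the situation the newer representations try to address.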

To deal with this situation, new representations are emerging:

  • HELM (Hierarchical Editing Language for Macromolecules) representation: HELM is the counterpart, for macromolecules, of what SMILES (Simplified Molecular-Input Line-Entry System) is for chemical entities. The representation is a string with a defined vocabulary (symbols that represent natural and modified monomers) and a grammar, that is, a set of rules and symbols (such as {}, |, $, …) that model the links and structure of the entities. Examples can be found here:
  • SCSR (Self-Contained Sequence Representation): it combines the sequence strategy with chemical aspects in an extended version of Mol files. It uses templates to compress the chemical information of repetitive polymers (Self-Contained Sequence Representation: Bridging the Gap between Bioinformatics and Cheminformatics). This strategy can be used to tackle the performance issues raised by the size of biomolecules.
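To make the idea of a vocabulary and a grammar more concrete, here is a minimal Python sketch that splits a simple HELM string into its polymers and connections. It is a deliberate simplification and not a HELM parser: it assumes that `$`, `|` and `.` never appear inside a monomer (so no inline SMILES), it ignores RNA notation, and it reads only the first two `$`-separated sections. The function name `parse_helm` and the example string are illustrative and have not been validated against a monomer library.

```python
import re


def parse_helm(helm: str) -> dict:
    """Split a simple HELM string into polymers and connections (sketch only)."""
    sections = helm.split("$")
    if len(sections) < 5:
        raise ValueError("expected at least four '$' separators")

    # Section 1: simple polymers, separated by '|', e.g. PEPTIDE1{A.C.G}
    polymers = {}
    for chunk in sections[0].split("|"):
        match = re.fullmatch(r"([A-Z]+\d+)\{(.*)\}", chunk)
        if match is None:
            raise ValueError(f"unrecognised polymer: {chunk!r}")
        polymer_id, body = match.groups()
        # Monomers are separated by '.'; multi-character symbols are bracketed.
        polymers[polymer_id] = [m.strip("[]") for m in body.split(".")]

    # Section 2: connections between polymers, e.g. PEPTIDE1,CHEM1,2:R3-1:R1
    connections = []
    for chunk in filter(None, sections[1].split("|")):
        source, target, bond = chunk.split(",")
        connections.append((source, target, bond))

    return {"polymers": polymers, "connections": connections}


# Illustrative string: a three-residue peptide linked to a chemical moiety.
example = "PEPTIDE1{A.C.G}|CHEM1{[SMCC]}$PEPTIDE1,CHEM1,2:R3-1:R1$$$"
parsed = parse_helm(example)
```

In this example the connection `2:R3-1:R1` reads as "attachment point R3 of monomer 2 in the first polymer is bonded to attachment point R1 of monomer 1 in the second", which is the kind of information a plain sequence cannot carry.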


When it comes to digitalisation, three elements matter:

  • Registration: the capacity to detect and retrieve the entity in the various sources of information, in order to capture it.
  • Storage: the capacity to store the information.
  • Search and query: most important, the capacity to retrieve information, either as a simple query (which entities contain this substructure?) or as aggregated information (what proportion of the molecules in the dataset are antibody-drug conjugates?).
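These three elements can be sketched as a small in-memory registry in Python. Everything here is hypothetical: the `Entity` and `Registry` classes, their fields and the category labels are assumptions for illustration. Note also that `search` is a plain text match on the notation, which is much weaker than a true chemical substructure search.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    entity_id: str
    representation: str  # text notation of the entity, e.g. a HELM string
    category: str        # free-text label, e.g. "peptide" or "conjugate"


@dataclass
class Registry:
    entities: dict = field(default_factory=dict)

    def register(self, entity: Entity) -> None:
        """Registration and storage: keep one record per identifier."""
        if entity.entity_id in self.entities:
            raise ValueError(f"duplicate identifier: {entity.entity_id}")
        self.entities[entity.entity_id] = entity

    def search(self, fragment: str) -> list:
        """Simple query: identifiers whose notation contains the fragment."""
        return [e.entity_id for e in self.entities.values()
                if fragment in e.representation]

    def proportion(self, category: str) -> float:
        """Aggregated query: share of entities carrying the given category."""
        if not self.entities:
            return 0.0
        hits = sum(1 for e in self.entities.values() if e.category == category)
        return hits / len(self.entities)


registry = Registry()
registry.register(Entity("E1", "PEPTIDE1{A.C.G}$$$$", "peptide"))
registry.register(Entity(
    "E2",
    "PEPTIDE1{A.C.G}|CHEM1{[SMCC]}$PEPTIDE1,CHEM1,2:R3-1:R1$$$",
    "conjugate",
))

with_linker = registry.search("CHEM1")
share_of_peptides = registry.proportion("peptide")
```

The queries you expect to run against such a registry largely dictate which representation is worth storing in the `representation` field.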

Clearly, the last point is key to defining your strategy. Choose the most appropriate representation based on your query needs, then check the feasibility of the registration and storage elements.

For example, if you need to combine data from an existing structure-activity relationship database of small molecules with new biological entities, you may choose the SCSR representation, which allows you to keep building structure-activity relationships as part of your data asset.

If your activity deals only with biomolecules, you have no existing chemical entity database, and you do not need to query your data with atomic-level questions (such as which entities contain a given custom substructure), you should go for the HELM representation, which is text-based and easier to manipulate.
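The two situations above can be condensed into a small decision helper. This is only a sketch of the reasoning in this article: the function and parameter names are hypothetical, and routing atomic-level queries to SCSR even without an existing small-molecule database is an assumption, since that case is not discussed explicitly.

```python
def choose_representation(has_small_molecule_sar_data: bool,
                          needs_atomic_level_queries: bool) -> str:
    """Suggest a representation from two yes/no questions (sketch only)."""
    # Existing structure-activity data on small molecules must stay usable,
    # and atomic-level questions need chemical detail: both point to SCSR.
    if has_small_molecule_sar_data or needs_atomic_level_queries:
        return "SCSR"
    # Biomolecules only, no atomic-level questions: the text-based notation.
    return "HELM"


mixed_portfolio = choose_representation(True, False)
biologics_only = choose_representation(False, False)
```

A real decision would also weigh tooling, vendor support and the formats of the data you already hold.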

In both situations, data registration is a challenge.

  • For chemical entities, the multiple formats are mostly interoperable and can be imported and exported between various systems. Chemical entities can also be detected in text (for example using the Naming technology from Chemaxon).
  • For biomolecules, the question is harder. Converting from one format to another is difficult, existing databases are limited, and there is no equivalent of the document-to-structure approach for extracting them from text.

One direction is to rely on a text processing strategy to detect biomolecules in your dataset.
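As a rough illustration of the general idea, and not a description of how any particular product works, a dictionary-based detector can be sketched in a few lines of Python. The short name list and the function `count_protein_mentions` are assumptions made for this example; a real system would rely on a curated vocabulary with synonyms and on disambiguation.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary; a real vocabulary would be far larger.
PROTEIN_NAMES = ["insulin", "glucagon", "leptin"]


def count_protein_mentions(documents, names=PROTEIN_NAMES):
    """Count case-insensitive whole-word mentions of each name."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    counts = Counter()
    for text in documents:
        for match in pattern.finditer(text):
            counts[match.group(1).lower()] += 1
    return counts


documents = [
    "Insulin therapy in adolescents with type 1 diabetes.",
    "Leptin and insulin levels were measured at baseline.",
]
mentions = count_protein_mentions(documents)
```

The resulting counts give the kind of overview described here: which biomolecules are mentioned and how often, without needing any structural representation.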

The example below shows Inquiro detecting protein names in data related to diabetes in the adolescent age group. This gives researchers a quick overview of how this type of biomolecule is distributed across the data they are working on.


Querying a biomolecule in a dataset is not trivial, as there are few standard formats that aim to describe these complex structures. Nevertheless, relying on a text processing strategy makes it possible to reveal insights or, at a minimum, an overview of how biomolecules are distributed across the dataset.