Introduction
What is a biological macromolecule?
If you are reading this documentation, you probably already know what a macromolecule is. Macromolecules are essential components of the cell and are extremely heterogeneous. They can catalyze chemical reactions (e.g., proteins), carry information across generations (e.g., DNA) or make up the structure of the cell itself (e.g., lipids, carbohydrates and proteins).
Machine learning in structural biology
The study of the 3D structure of these macromolecules has grown dramatically in importance in recent years. The amount of structural data available in public databases such as the Protein Data Bank is increasing at an unprecedented rate. This not only expands our knowledge of structural biology, but also opens the door to applying machine learning algorithms to biological structural data.
The technological gap
However, biological structures are often hard to handle. Public databases let you download atomic coordinates, but this type of data cannot be fed directly to standard machine learning algorithms. For this reason, structural bioinformatics has been lagging behind other fields in its use of machine learning, such as computer vision, where modern neural network architectures such as 3D convolutional neural networks or transformers are widely used.
What is PyUUL?
To overcome this problem we built PyUUL, a PyTorch library that transforms biological structures into differentiable 3D objects suitable for machine learning algorithms developed for computer vision. This library therefore greatly increases the number of neural network architectures applicable to structural bioinformatics. Currently, the user can choose between three types of data representation: voxel-based, surface point cloud and volumetric point cloud. If you want to learn more about PyUUL, you can read the manuscript at:
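To make the idea of a differentiable voxel-based representation more concrete, the sketch below turns a handful of atomic coordinates into a 3D density grid using plain PyTorch. It is only an illustration of the general technique, not PyUUL's API or implementation: the function name voxelize, its parameters, the Gaussian density model and the toy coordinates and radii are all assumptions made for this example.

```python
import torch


def voxelize(coords, radii, resolution=1.0, padding=3.0):
    """Hypothetical helper: map atoms to a dense 3D density grid.

    coords: (N, 3) tensor of atomic coordinates, in angstroms.
    radii:  (N,) tensor of per-atom radii, in angstroms.
    Returns a tensor of shape (X, Y, Z).
    """
    # Bounding box of the structure, detached so that the grid itself
    # is treated as a constant during backpropagation.
    lo = coords.detach().min(dim=0).values - padding
    hi = coords.detach().max(dim=0).values + padding
    shape = torch.ceil((hi - lo) / resolution).long().tolist()

    # Coordinates of the voxel centres along each axis.
    axes = [
        lo[i] + resolution * (torch.arange(shape[i], dtype=coords.dtype) + 0.5)
        for i in range(3)
    ]
    grid = torch.stack(torch.meshgrid(*axes, indexing="ij"), dim=-1)  # (X, Y, Z, 3)

    # Squared distance between every voxel centre and every atom: (X, Y, Z, N).
    d2 = ((grid.unsqueeze(-2) - coords) ** 2).sum(dim=-1)

    # Each atom contributes a smooth Gaussian blob whose width is its radius
    # (an assumption of this sketch), so the grid is differentiable.
    density = torch.exp(-d2 / (2.0 * radii ** 2))
    return density.sum(dim=-1)


# Toy input: three atoms with made-up coordinates and radii.
coords = torch.tensor(
    [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 1.5, 0.0]], requires_grad=True
)
radii = torch.tensor([1.7, 1.7, 1.5])

volume = voxelize(coords, radii)
volume.sum().backward()  # gradients flow back to the atomic coordinates
```

Because every operation is a standard tensor operation, the resulting grid can be passed to a 3D convolutional network and any loss computed on it can be backpropagated to the coordinates, which is the property the word "differentiable" refers to here.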
Possible applications in the scientific world
PyUUL can be used to bring machine learning algorithms from computer vision into structural bioinformatics. Some of the algorithms listed below might be good candidates for future work:
Protein classification and domain identification: Point networks such as PointNet are used for classification and segmentation of point clouds (one of the representations PyUUL provides). This architecture might therefore be suitable for finding and classifying structural sub-modules (domains) of proteins (https://openaccess.thecvf.com/content_cvpr_2017/html/Qi_PointNet_Deep_Learning_CVPR_2017_paper.html).
End-to-end protein structure prediction: A volumetric representation of proteins might be used as an additional prediction step in end-to-end protein structure prediction, adding a 3D convolutional branch to the network and expanding the work of Mohammed AlQuraishi (https://www.biorxiv.org/content/10.1101/265231v1.full.pdf).
Protein structure superimposition: Point cloud registration algorithms, such as the one described by Yang et al. (https://arxiv.org/abs/2001.07715), might be used to perform alignment-free structural superimposition.
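As a rough illustration of the first candidate in the list, the sketch below shows the core idea behind PointNet-style classification: a per-point network with shared weights followed by a max pooling over points, which makes the prediction independent of the order of the points. It is a heavily simplified sketch written in plain PyTorch. It omits the input and feature transform networks of the original PointNet, and the class name TinyPointClassifier, the layer sizes and the random input standing in for a protein point cloud are all assumptions made for this example.

```python
import torch
from torch import nn


class TinyPointClassifier(nn.Module):
    """Hypothetical, simplified PointNet-style classifier for point clouds."""

    def __init__(self, n_classes, hidden=64):
        super().__init__()
        # Shared per-point MLP: the same weights are applied to every point.
        self.point_mlp = nn.Sequential(
            nn.Linear(3, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
        )
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, points):
        # points: (batch, n_points, 3)
        features = self.point_mlp(points)      # (batch, n_points, hidden)
        pooled = features.max(dim=1).values    # (batch, hidden), order-independent
        return self.head(pooled)               # (batch, n_classes)


torch.manual_seed(0)
model = TinyPointClassifier(n_classes=4)

# Random points standing in for a batch of two point clouds of 100 points each.
cloud = torch.randn(2, 100, 3)
logits = model(cloud)

# Shuffling the points of each cloud should leave the prediction unchanged.
shuffled = cloud[:, torch.randperm(100), :]
logits_shuffled = model(shuffled)
```

The max pooling step is what allows the network to consume an unordered set of points, which is why this family of architectures is a natural match for the point cloud representations mentioned above.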