Alpha Helices identification

This tutorial shows how to build a neural network that takes voxelized proteins as input. The tutorial replicates the example done in the paper.

alphaHelixTutorial

First of all we import the libraries we need

In [1]:

from pyuul import utils,VolumeMaker
import os,urllib,torch
import numpy as np

#the following line is required only if you use a python notebook
%matplotlib notebook 

import matplotlib
import matplotlib.pyplot as plt

We fetch the protein structure (pdbID 1wou) and we parse the PDB using the pyuul utils module

In [2]:

os.makedirs('exampleStructures',exist_ok=True)
urllib.request.urlretrieve('http://files.rcsb.org/download/1WOU.pdb', 'exampleStructures/1wou.pdb')
coords, atname = utils.parsePDB('exampleStructures/1wou.pdb')

we also can chose which device to use: GPU ("cuda") or CPU ("cpu"). Since not everybody has a cuda compatible GPU, let's use CPU for this time. However, GPU calculation is gonna be way faster

In [3]:

device = "cpu"

We defined which atoms belong to a secondary sturucture using DSSP. In this tutorial, for simplicity, we precalculated it. We therefore have an array with ones in position that correspond to atoms belonging to alpha helixes and zeros everywhere else

In [4]:

alpha_helices = [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
         1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]

for every atom in the coordinates there is a label in alpha_helices

In [5]:

print("number of labels:",len(alpha_helices[0]),"number of atoms:",len(coords[0]))

number of labels: 947 number of atoms: 947

Let's now define our convolutional 3D model. We will use a very simple model with only 3 layers, followed by a 3 layers feed forward neural netwrk. As usual, the network is a child of the torch.nn.Module class

In [6]:

class Conv3dModel(torch.nn.Module):

    def __init__(self, chIn, name="conv3dModel",dev="cpu"):
        super(Conv3dModel, self).__init__()
        self.name = name

        chout=100
        self.conv = torch.nn.Sequential(torch.nn.Conv3d(chIn, 10, 7, 1, 3), torch.nn.Dropout(0.1), torch.nn.InstanceNorm3d(10),
                                    torch.nn.LeakyReLU(), torch.nn.Conv3d(10, 10, 7, 1, 3), torch.nn.InstanceNorm3d(10),
                                    torch.nn.LeakyReLU(), torch.nn.Conv3d(10, chout, 1, 1, 0)).to(dev)
       
        self.final = torch.nn.Sequential(torch.nn.Linear(chout,chout),torch.nn.LeakyReLU(),
                                     torch.nn.Linear(chout, chout), torch.nn.LeakyReLU(),
                                     torch.nn.Linear(chout, 1), torch.nn.Sigmoid()
                                     ).to(dev)

    def forward(self, x):

        o = self.conv(x)
        o = o.permute(0,2,3,4,1)
        o= self.final(o)
        return o.squeeze()

We now can create our neural network object, using the Conv3dModel class we just created

In [7]:

model = Conv3dModel(chIn=5, dev=device)

we also need some other standard neural network tools: an optimizer and a loss function. As optimizer we use Adam algorithm. It is quite standard but feel free to try any other pytorch optimizer. As loss function, we are gonna use binary crossentropy, since we are going to perform a binary classification

In [8]:

optimizer = torch.optim.Adam(model.parameters())
lossFunction = torch.nn.BCELoss()

Now we can calculate atom radius and channels of the protein using the utils of PyUUL

In [9]:

atom_channel = utils.atomlistToChannels(atname).to(device)
radius = utils.atomlistToRadius(atname).to(device)

We can now use the main volumeMaker object to build both the volumetric representation and the labels. For what concerns the volumetric representation, we can simply run

In [10]:

volmaker = VolumeMaker.Voxels(device=device)
voxellizedVolume = volmaker(coords, radius, atom_channel,resolution=0.5).to_dense()

For what concerns the labels, we can use DSSP output (alpha_helices) to define the voxels that contain atoms belonging to alpha helices. We therefore use alpha_helices as channels. Doing so, will make all the atoms labelled as 0 (not beloning to alpha helices) to go to channel 0, while the ones labelled as 1 (belonging to alpha helices) to channel 1. We therefore select channel 1 and take all the voxels with an occupancy greater than 0.5 as positive labels.

In [11]:

label = volmaker(coords, radius, torch.tensor(alpha_helices,device=device),resolution=0.5).to_dense()[:,[1],].ge(0.5).float()

print("size of labels tensor:",label.shape,"size of labels tensor:", voxellizedVolume.shape)

size of labels tensor: torch.Size([1, 1, 76, 85, 93]) size of labels tensor: torch.Size([1, 5, 76, 85, 93])

Lets now have a look at the volume we just generated. In order to visualize the volume, let's sum over the channels dimension.

In [12]:

ax = plt.figure().add_subplot(1, 2, 1,projection='3d')
voxels = voxellizedVolume.sum(1) > 0.5
ax.voxels(voxels[0])
plt.show()

Now we have the main optimization loop. We iterate the training for 10 epochs and we optimize the network's weights. This is a resources intensive step!

In [13]:

epochs = 10
for e in range(epochs):

    yp = model(voxellizedVolume)
    loss = lossFunction(yp, label[0,0, :, :, :])
    loss.backward()

    optimizer.step()
    optimizer.zero_grad()

    print("epoch",e, "loss ", float(loss.data.cpu()))
    

epoch 0 loss  0.6997795701026917
epoch 1 loss  0.6748469471931458
epoch 2 loss  0.6535153388977051
epoch 3 loss  0.6341975927352905
epoch 4 loss  0.6127128601074219
epoch 5 loss  0.5914099216461182
epoch 6 loss  0.5696786642074585
epoch 7 loss  0.5486479997634888
epoch 8 loss  0.5237216353416443
epoch 9 loss  0.4978470802307129

Our model now has been trained!