3D Equivariant Diffusion For Target-Aware Molecule Generation and Affinity Prediction
TargetDiff
ICLR 2023
1、Contributions
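* An end-to-end framework for generating molecules conditioned on a protein target, which explicitly considers the physical interactions between proteins and molecules in 3D space.
* To the best of our knowledge, this is the first probabilistic diffusion formulation for target-aware drug design, in which the training and sampling procedures are aligned in a non-autoregressive and SE(3)-equivariant fashion, thanks to a shifting-center operation and an equivariant GNN.
* Several new evaluation metrics and additional insights are proposed, allowing the generated molecules to be evaluated along many different dimensions. Empirical results show that the model outperforms two other representative baseline models.
* An effective way to evaluate the quality of generated molecules based on the framework is proposed, in which the model can serve either as a scoring function to help ranking or as an unsupervised feature extractor to improve binding affinity prediction.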
2、Problem definition
A protein binding site is represented as a set of atoms $P = \{(x^{(i)}_P, v^{(i)}_P)\}^{N_P}_{i=1}$, where $N_P$ is the number of protein atoms, $x_P \in \mathbb{R}^3$ represents the 3D coordinates of the atom, and $v_P \in \mathbb{R}^{N_f}$ represents protein atom features such as element types and amino acid types. Our goal is to generate binding molecules $M = \{(x^{(i)}_L, v^{(i)}_L)\}^{L_M}_{i=1}$ conditioned on the protein target. For brevity, we denote molecules as $M = [x, v]$, where $[\cdot, \cdot]$ is the concatenation operator and $x \in \mathbb{R}^{M \times 3}$ and $v \in \mathbb{R}^{M \times K}$ denote atom Cartesian coordinates and one-hot atom types respectively.
3、Molecular diffusion process
We use a Gaussian distribution $\mathcal{N}$ to model continuous atom coordinates $x$ and a categorical distribution $\mathcal{C}$ to model discrete atom types $v$. The atom types are constructed as a one-hot vector containing information such as element types and membership in an aromatic ring. We formulate the molecular distribution as a product of the atom coordinate distribution and the atom type distribution. At each time step $t$, a small Gaussian noise and a uniform noise across all categories are added to atom coordinates and atom types separately, according to a Markov chain with fixed variance schedules $\beta_1, \ldots, \beta_T$ (here $K$ refers to the uniform noise spread over the $K$ atom-type categories; in practice, the schedules for $x$ and $v$ are not the same):
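As a rough illustration of this step, the sketch below noises a single atom once. It assumes the usual form of such a chain: coordinates are scaled by $\sqrt{1-\beta_t}$ and perturbed with variance $\beta_t$, and the type probabilities are mixed with a uniform distribution over the $K$ categories. The function name `forward_step` and the use of one shared `beta` for both branches are illustrative choices, not taken from the paper.

```python
import math
import random


def forward_step(x_prev, v_prev, beta, rng):
    """One noising step q(M_t | M_{t-1}) for a single atom (illustrative).

    x_prev: three coordinates; v_prev: one-hot list over K atom types.
    Returns the noised coordinates, a sampled one-hot type, and the
    categorical probabilities the type was drawn from.
    """
    k = len(v_prev)
    # Coordinates: x_t ~ N(sqrt(1 - beta) * x_{t-1}, beta * I)
    x_t = [math.sqrt(1.0 - beta) * c + math.sqrt(beta) * rng.gauss(0.0, 1.0)
           for c in x_prev]
    # Types: v_t ~ Categorical((1 - beta) * v_{t-1} + beta / K)
    probs = [(1.0 - beta) * p + beta / k for p in v_prev]
    idx = rng.choices(range(k), weights=probs)[0]
    v_t = [1.0 if i == idx else 0.0 for i in range(k)]
    return x_t, v_t, probs


rng = random.Random(0)
x_t, v_t, probs = forward_step([1.0, 0.0, -1.0], [0.0, 1.0, 0.0, 0.0], 0.1, rng)
```

With a small `beta` the original type keeps most of the probability mass, and the remainder is shared equally among all categories.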
Denoting $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, a desirable property of the diffusion process is that the noisy data distribution $q(M_t \mid M_0)$ at any time step can be computed directly in closed form:
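The closed form can be sketched as follows, assuming the standard result that $x_t$ is Gaussian around $\sqrt{\bar{\alpha}_t}\,x_0$ with variance $1-\bar{\alpha}_t$, and that $v_t$ is categorical with parameters $\bar{\alpha}_t v_0 + (1-\bar{\alpha}_t)/K$, where $\bar{\alpha}_t$ is the cumulative product of $\alpha_s = 1-\beta_s$. The helper names and the three-step schedule are made up for the example.

```python
import math
import random


def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def sample_xt(x0, alpha_bar_t, rng):
    """Draw x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    return [math.sqrt(alpha_bar_t) * c
            + math.sqrt(1.0 - alpha_bar_t) * rng.gauss(0.0, 1.0)
            for c in x0]


def type_probs_t(v0, alpha_bar_t):
    """Categorical parameters alpha_bar_t * v_0 + (1 - alpha_bar_t) / K."""
    k = len(v0)
    return [alpha_bar_t * p + (1.0 - alpha_bar_t) / k for p in v0]


betas = [0.1, 0.2, 0.3]          # hypothetical schedule
abar = alpha_bars(betas)
probs_t3 = type_probs_t([1.0, 0.0], abar[2])
x_t3 = sample_xt([1.0, 2.0, 3.0], abar[2], random.Random(0))
```

Jumping straight to step $t$ this way is what makes training cheap: a random time step can be sampled per example without simulating the whole chain.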
Using Bayes' theorem, the normal posterior of atom coordinates and the categorical posterior of atom types can both be computed in closed form:
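A minimal sketch of both posteriors for one atom, written under the assumption that they take the standard forms used in Gaussian and multinomial diffusion: the coordinate posterior mean is a weighted average of $x_0$ and $x_t$, and the type posterior is the normalised elementwise product of a term in $v_t$ and a term in $v_0$. Function names are illustrative, and $\bar{\alpha}_0$ is taken to be 1.

```python
import math


def coord_posterior(x0, xt, alpha_t, abar_t, abar_prev):
    """Mean and variance of q(x_{t-1} | x_t, x_0) for one atom."""
    beta_t = 1.0 - alpha_t
    w0 = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    wt = math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    mean = [w0 * a + wt * b for a, b in zip(x0, xt)]
    var = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t
    return mean, var


def type_posterior(v0, vt, alpha_t, abar_prev):
    """Parameters of q(v_{t-1} | v_t, v_0), normalised to sum to one."""
    k = len(v0)
    star = [(alpha_t * b + (1.0 - alpha_t) / k)
            * (abar_prev * a + (1.0 - abar_prev) / k)
            for a, b in zip(v0, vt)]
    total = sum(star)
    return [s / total for s in star]


# At t = 1 (alpha_bar_0 = 1) the coordinate posterior collapses onto x_0.
mean, var = coord_posterior([1.0, 2.0, 3.0], [0.0, 0.0, 0.0], 0.9, 0.9, 1.0)
post = type_posterior([1.0, 0.0], [1.0, 0.0], 0.8, 0.9)
```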
4、Molecular generative process
The generative process, in reverse, recovers the ground-truth molecule $M_0$ from the initial noise $M_T$, and we approximate the reverse distribution with a neural network parameterized by $\theta$ (given $t$, $P$ and $M_t$, the network has to produce $\mu_\theta$ and $c_\theta$):
There are different ways to parameterize $\mu_\theta([x_t, v_t], t, P)$ and $c_\theta([x_t, v_t], t, P)$. Here, we choose to let the neural network predict $[x_0, v_0]$ and feed it through equation 4 to obtain $\mu_\theta$ and $c_\theta$, which define the posterior distributions. We model the interaction between the ligand molecule atoms and the protein atoms with an SE(3)-equivariant GNN:
At the $l$-th layer, the atom hidden embedding $h$ and coordinates $x$ are updated alternately as follows:
where $d_{ij} = \|x_i - x_j\|$ is the Euclidean distance between two atoms $i$ and $j$, and $e_{ij}$ is an additional pairwise feature (it can be viewed as an adjacency-like description of the connection type) indicating whether the connection is between protein atoms, ligand atoms, or a protein atom and a ligand atom. $\mathbb{1}_{\text{mol}}$ is the ligand molecule mask, since we do not want to update protein atom coordinates. The initial atom hidden embedding $h^0$ is obtained by an embedding layer that encodes the atom information. The final atom hidden embedding $h^L$ is fed into a multi-layer perceptron and a softmax function to obtain $\hat{v}_0$. Since $\hat{x}_0$ is rotation equivariant to $x_t$, and it is easy to see that $x_{t-1}$ is rotation equivariant to $x_0$ according to equation 4, we achieve the desired equivariance for the Markov transition.
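To make the equivariance argument concrete, here is a toy layer in plain Python. It follows the pattern described above: hidden features are updated from invariant quantities only, and ligand coordinates move along relative position vectors scaled by an invariant weight. It is only a sketch: `f_h` and `f_x` are arbitrary placeholder functions (the paper uses graph attention layers), features are single scalars, the graph is fully connected, and the edge type is encoded as a number.

```python
import math


def f_h(d, hi, hj, e):
    """Placeholder message function on invariant inputs."""
    return math.tanh(hi * hj - d + e)


def f_x(d, hi, hj, e):
    """Placeholder scalar weight for the coordinate update."""
    return math.tanh(hi + hj + e) / (1.0 + d)


def egnn_layer(h, x, is_ligand):
    """Update features h, then ligand coordinates x; protein atoms stay fixed."""
    n = len(h)

    def edge(i, j):
        # 0: protein-protein, 1: protein-ligand, 2: ligand-ligand
        return float(is_ligand[i] + is_ligand[j])

    h_new = [h[i] + sum(f_h(math.dist(x[i], x[j]), h[i], h[j], edge(i, j))
                        for j in range(n) if j != i)
             for i in range(n)]
    x_new = []
    for i in range(n):
        if not is_ligand[i]:
            x_new.append(list(x[i]))  # ligand mask: skip protein atoms
            continue
        delta = [0.0, 0.0, 0.0]
        for j in range(n):
            if j == i:
                continue
            w = f_x(math.dist(x[i], x[j]), h_new[i], h_new[j], edge(i, j))
            for a in range(3):
                delta[a] += (x[i][a] - x[j][a]) * w
        x_new.append([x[i][a] + delta[a] for a in range(3)])
    return h_new, x_new


def rot_z(p, theta):
    """Rotate a point about the z-axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [c * p[0] - s * p[1], s * p[0] + c * p[1], p[2]]


h0 = [0.3, -0.2, 0.5, 0.1]
x0 = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.5, 1.0, 0.2], [1.0, -0.5, 1.0]]
mask = [False, False, True, True]
h1, x1 = egnn_layer(h0, x0, mask)
```

Rotating every input coordinate with `rot_z` and running the layer again should give the same `h1` and a rotated copy of `x1`, because distances are unchanged by rotation and the coordinate update is a weighted sum of difference vectors.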
Note: the likelihood $p_\theta(M_0 \mid P)$ should be invariant to translation and rotation of the protein-ligand complex. Denoting the SE(3)-transformation as $T_g$, we can achieve an invariant likelihood w.r.t. $T_g$ on the protein-ligand complex, $p_\theta(T_g(M_0 \mid P)) = p_\theta(M_0 \mid P)$, if we shift the Center of Mass (CoM) of protein atoms to zero and parameterize the Markov transition $p(x_{t-1} \mid x_t, x_P)$ with an SE(3)-equivariant network.
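The centering step can be sketched in a few lines. This version assumes the center is the unweighted mean of the protein atom positions (no atomic masses), and the function name is hypothetical.

```python
def shift_to_protein_com(protein_xyz, ligand_xyz):
    """Translate the whole complex so the protein center sits at the origin.

    The center is the unweighted mean of protein atom positions (assumption).
    """
    n = len(protein_xyz)
    com = [sum(p[a] for p in protein_xyz) / n for a in range(3)]

    def shift(points):
        return [[p[a] - com[a] for a in range(3)] for p in points]

    return shift(protein_xyz), shift(ligand_xyz)


prot = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
lig = [[2.0, 2.0, 5.0]]
prot_c, lig_c = shift_to_protein_com(prot, lig)
```

Because the same offset is subtracted from protein and ligand, any global translation of the complex is removed before the network sees the coordinates, which leaves rotation to be handled by the equivariant layers.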
5、Training
The combination of $q$ and $p$ is a variational auto-encoder (Kingma and Welling, 2013). The model can be trained by optimizing the variational bound on the negative log-likelihood. For the atom coordinate loss, since $q(x_{t-1} \mid x_t, x_0)$ and $p_\theta(x_{t-1} \mid x_t)$ are both Gaussian distributions, the KL-divergence can be written in closed form:
where $\gamma_t$ is a time-dependent weight and $C$ is a constant. In practice, training the model with an unweighted MSE loss (setting $\gamma_t = 1$) can also achieve better performance, as Ho et al. (2020) suggested. For the atom type loss, we can directly compute the KL-divergence of categorical distributions as follows:
The final loss is a weighted sum of the atom coordinate loss and the atom type loss: $L = L^{(x)}_{t-1} + \lambda L^{(v)}_{t-1}$. We summarize the overall training and sampling procedure of TargetDiff in Appendix E.
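The combined objective can be sketched per atom as below, assuming the unweighted squared error ($\gamma_t = 1$) for coordinates and a KL-divergence between the true and predicted categorical posteriors for types. The function names, the `eps` guard and the value passed for `lam` are illustrative; the notes above do not give the value of $\lambda$.

```python
import math


def coord_loss(x0, x0_hat):
    """Unweighted squared error ||x_0 - x0_hat||^2 (gamma_t = 1)."""
    return sum((a - b) ** 2 for a, b in zip(x0, x0_hat))


def type_kl(c_post, c_pred, eps=1e-12):
    """KL(c_post || c_pred) between two categorical distributions."""
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in zip(c_post, c_pred) if p > 0.0)


def total_loss(x0, x0_hat, c_post, c_pred, lam):
    """Weighted sum: coordinate loss + lam * type loss."""
    return coord_loss(x0, x0_hat) + lam * type_kl(c_post, c_pred)


loss = total_loss([0.0, 0.0, 0.0], [1.0, 2.0, 2.0],
                  [0.5, 0.5], [0.9, 0.1], lam=1.0)
```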
(1) training
(2) sampling
At the $l$-th layer, we dynamically construct the protein-ligand complex as a k-nearest neighbors (kNN) graph based on known protein atom coordinates and current ligand atom coordinates, which are the output of the $(l-1)$-th layer. We choose $k = 32$ in our experiments. The protein atom features include chemical elements, amino acid types and whether the atoms are backbone atoms. The ligand atom types are one-hot vectors consisting of the chemical element types and aromatic information. The edge features are the outer products of distance embedding and bond types, where we expand the distance with radial basis functions located at 20 centers between 0 Å and 10 Å, and the bond type is a 4-dim one-hot vector indicating whether the connection is between protein atoms, ligand atoms, protein-ligand atoms or ligand-protein atoms.
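The graph construction and edge featurisation described here might look roughly like this in plain Python. The brute-force neighbour search, the Gaussian shape of the basis functions and their width (set equal to the center spacing) are assumptions made for the sketch; only the 20 centers on the 0 to 10 Å range and the 4-dim bond type come from the text.

```python
import math


def knn_edges(coords, k):
    """Directed edges (i, j) where j is one of the k nearest neighbours of i."""
    edges = []
    for i, p in enumerate(coords):
        others = sorted((math.dist(p, q), j)
                        for j, q in enumerate(coords) if j != i)
        edges.extend((i, j) for _, j in others[:k])
    return edges


def rbf_expand(d, n_centers=20, d_min=0.0, d_max=10.0):
    """Gaussian radial basis features of a distance d (in angstroms)."""
    step = (d_max - d_min) / (n_centers - 1)
    centers = [d_min + m * step for m in range(n_centers)]
    return [math.exp(-((d - c) / step) ** 2) for c in centers]


def edge_features(d, bond_onehot):
    """Flattened outer product of the distance embedding and the bond type."""
    return [r * b for r in rbf_expand(d) for b in bond_onehot]


coords = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [3.0, 0.0, 0.0], [7.0, 0.0, 0.0]]
edges = knn_edges(coords, k=1)
feats = edge_features(1.0, [0.0, 0.0, 1.0, 0.0])
```

Since the ligand coordinates change at every layer, `knn_edges` would be re-run per layer on the current positions; a spatial index would replace the brute-force search for realistic pocket sizes.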
6、Experiments
Data: CrossDocked2020
Baselines: liGAN, AR, Pocket2Mol, GraphBP
TargetDiff: Our model contains 9 equivariant layers described in equation 7, where $f_h$ and $f_x$ are specifically implemented as graph attention layers with 16 attention heads and 128 hidden features. We first decide on the number of atoms for sampling by drawing from a prior distribution estimated from training complexes with similar binding pocket sizes. After the model finishes the generative process, we then use OpenBabel (O'Boyle et al., 2011) to construct the molecule from individual atom coordinates, as done in AR and liGAN.
7、Results
8、Target Binding Affinity
We first establish the connection between unsupervised generative models and binding affinity ranking / prediction. Under our parameterization, the network predicts the denoised $[\hat{x}_0, \hat{v}_0]$. Given the protein-ligand complex, we can feed $\phi_\theta$ with $[x_0, v_0]$ while freezing the x-update branch (i.e. only the atom hidden embedding $h$ is updated), and we finally obtain $h^L$ and $\hat{v}_0$:
Our assumption is that if the ligand molecule has a good binding affinity to the protein, the flexibility of atom types should be low, which could be reflected in the entropy of $\hat{v}_0$ (v_ent). Therefore, it can be used as a scoring function to help ranking, whose effectiveness is justified in the experiments. In addition, $h^L$ also includes useful global information. We found the binding affinity ranking performance can be greatly improved by utilizing this feature with a simple linear transformation.
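An entropy score of this kind can be sketched as below. Averaging the per-atom Shannon entropy over the ligand is an assumption (the notes do not say whether v_ent is a mean or a sum), and `type_entropy` is a hypothetical name.

```python
import math


def type_entropy(v_hat):
    """Mean Shannon entropy (in nats) of per-atom predicted type distributions.

    v_hat: list of probability vectors, one per ligand atom.
    """
    total = 0.0
    for probs in v_hat:
        total -= sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(v_hat)


confident = type_entropy([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]])
uncertain = type_entropy([[0.25, 0.25, 0.25, 0.25]])
```

Under the stated assumption, a lower value means the network is more certain about each atom type, so candidate ligands could be ranked by ascending entropy.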