June 2022 ESRFnews 16
ARTIFICIAL INTELLIGENCE
The ESRF's Daniele de Sanctis helped on the project led by the Karolinska Institute to combine AlphaFold AI with MX protein structure determination.
STEF CANDÉ/ESRF

Structure Prediction (CASP), a biennial, community-wide experiment that objectively rates the performance of structure predictions from teams around the world. Until 2016, the best predictions for proteins in CASP's hardest category, those for which no model structures were available, were still, on its measure, only 30-40% accurate. Then in 2018 there was suddenly a jump, with one team achieving a median accuracy of 58%. In the next round, two years later, the same team's performance was even more startling, reaching a score in the hardest category of 87%. That team was DeepMind, a research laboratory owned by Alphabet (which also owns Google) and based in London, which develops AlphaFold. Structural biologists the world over were stunned. Venki Ramakrishnan, a former ESRF user who shared the 2009 Nobel Prize in Chemistry for his structural studies of the ribosome, said that the computational advance "occurred decades before many people in the field would have predicted". John Moult, one of the co-founders of CASP, was equally impressed. "We have been stuck on this one problem, how do proteins fold up, for nearly 50 years," he said. "To see DeepMind produce a solution is a very special moment." While many structural biologists were dwelling on
UNFOLDING ALPHAFOLD
In the last two CASP community experiments, particularly CASP14 in 2020, the AlphaFold AI system has delivered by far the most accurate predictions of protein structure for a task in which no model structures are available (see chart). How does it do it? The full workings of the latest iteration, AlphaFold 2, are described in a 62-page supplement to the DeepMind team's paper last year (Nature 596 583). In short, the system begins by identifying amino-acid sequences similar to the target protein, a multiple sequence alignment (MSA), from several databases. In parallel, it also identifies any known protein structures that could serve as models, and constructs a draft structure, known as a pair representation. Next, a neural network goes through an iteration cycle during which it attempts to identify, by comparisons between the MSA and the pair representation, which data are the most useful. In every cycle the geometry improves, before another neural network generates a 3D model, containing coordinates for every atom. This model is taken back to the beginning of the process for a new iteration, so it can be refined further, until a high confidence in the structure is finally achieved.
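For readers who think in code, the control flow described above (gather similar sequences, start from a draft, refine, and feed the result back until confidence is high) can be caricatured in a few lines of Python. This is a toy sketch only and not DeepMind's method: every function name, the similarity cut-off, the confidence formula and the threshold are assumptions invented for illustration, the neural networks are replaced by simple arithmetic, the inner network cycle and the outer feedback loop are merged into a single loop, and a list of numbers stands in for atomic coordinates.

```python
from typing import List, Tuple


def sequence_identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def build_msa(target: str, database: List[str], min_identity: float = 0.5) -> List[str]:
    """Toy stand-in for the MSA search (hypothetical rule): keep database
    sequences of the same length that share enough residues with the target.
    The target itself is always the first row."""
    hits = [s for s in database
            if len(s) == len(target) and sequence_identity(target, s) >= min_identity]
    return [target] + hits


def refine(estimate: List[float], evidence: List[float], rate: float = 0.5) -> List[float]:
    """Toy stand-in for one network cycle: move the current estimate
    part of the way towards what the evidence suggests."""
    return [e + rate * (v - e) for e, v in zip(estimate, evidence)]


def confidence(estimate: List[float], evidence: List[float]) -> float:
    """Toy confidence score (assumed formula): 1.0 when estimate and
    evidence agree exactly, lower as the worst disagreement grows."""
    worst = max(abs(e - v) for e, v in zip(estimate, evidence))
    return 1.0 / (1.0 + worst)


def predict(target: str, database: List[str],
            threshold: float = 0.95, max_cycles: int = 20) -> Tuple[List[float], float, int]:
    """Run the refine-and-feed-back loop until the toy confidence is high."""
    msa = build_msa(target, database)
    # Toy "evidence": how often each position of the MSA matches the target.
    evidence = [sum(s[i] == target[i] for s in msa) / len(msa)
                for i in range(len(target))]
    estimate = [0.0] * len(target)  # crude draft, standing in for a draft structure
    score, cycles = 0.0, 0
    for cycles in range(1, max_cycles + 1):
        estimate = refine(estimate, evidence)  # each pass improves the estimate
        score = confidence(estimate, evidence)
        if score >= threshold:                 # stop once confidence is high
            break
    return estimate, score, cycles


if __name__ == "__main__":
    toy_database = ["MKTAYIAK", "MKSAYLAK", "GGGGGGGG", "MKTAY"]
    estimate, score, cycles = predict("MKTAYIAK", toy_database)
    print(f"stopped after {cycles} cycles with toy confidence {score:.3f}")
```

The point of the sketch is the shape of the loop rather than the arithmetic: each pass starts from the previous output instead of from scratch, and the stopping rule is a confidence score, not a fixed number of passes.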
[Chart: median free-modelling accuracy (scale 0 to 100) at each CASP round: CASP7 2006, CASP8 2008, CASP9 2010, CASP10 2012, CASP11 2014, CASP12 2016, CASP13 2018 (AlphaFold) and CASP14 2020 (AlphaFold 2).]