A place to cache linked articles (think custom and personal wayback machine)
title: Reverse Engineering Source Code of the Biontech Pfizer Vaccine: Part 2
url: https://berthub.eu/articles/posts/part-2-reverse-engineering-source-code-of-the-biontech-pfizer-vaccine/
hash_url: 5f1c22e9a41d209ff84218b3d6faf676
<p>All BNT162b2 vaccine data on this page is sourced from this <a href="https://mednet-communities.net/inn/db/media/docs/11889.doc" target="_blank">World Health
Organization
document</a>.</p>
<blockquote>
<p>This is a living page, shared already so people can get going! But
check back frequently for updates.</p>
</blockquote>
<p><em>Translation</em>:
<a href="https://renaudguerin.net/posts/partie-2-explorons-le-code-source-du-vaccin-biontech-pfizer/" target="_blank">Français</a>
/ <a href="https://msakai.github.io/bnt162b2/part-2-reverse-engineering-source-code-of-the-biontech-pfizer-vaccine.ja/" target="_blank">日本語</a></p>
<p>In short: the vaccine mRNA has been optimized by the manufacturer by
changing bits of RNA from (say) <code>UUU</code> to <code>UUC</code>, and people would like to
understand the logic behind these changes. This challenge is quite close to what
cryptologists and reverse engineering people encounter regularly. On this
page, you’ll find all the details you need to get cracking to reverse
engineer just HOW the vaccine has been optimized.</p>
<p>I thought this would just be a fun puzzle, but I have just been informed that
figuring out the optimization procedure &amp; documenting it is tremendously
important for researchers around the world, as this would help them design
code for proteins and vaccines.</p>
<p>So, if you want to help vaccine research, do read on!</p>
<h2 id="the-leader-board">The leader board</h2>
<p>Here are the current best entrants to the optimization algorithm (average of 20 runs):</p>
<h2 id="biontech">BioNTech</h2>
<p>We should all be very grateful that BioNTech has shared this data with us.
And of course we should also be grateful to the many many researchers and
lab workers that worked for decades to bring the state of the art to the
point that such a vaccine could be developed. It is marvelous.</p>
<p>Because it is so marvelous, I want to understand everything about the
vaccine. I wrote a page <a href="https://berthub.eu/articles/posts/reverse-engineering-source-code-of-the-biontech-pfizer-vaccine/" target="_blank">Reverse Engineering the source code of the BioNTech/Pfizer SARS-CoV-2
Vaccine</a>
that describes in some detail what is in the mRNA of the vaccine. It helps
to read this page before continuing; I promise you it will be interesting.</p>
<p>The post left open some questions however, and this is where it gets
fascinating.</p>
<h2 id="the-codon-optimization">The codon optimization</h2>
<p>The vaccine contains RNA code for a very <em>slightly</em> modified copy of the
SARS-CoV-2 S protein.</p>
<p>The RNA code of the vaccine itself however is <em>highly</em> modified from the viral original!
This has been done by the manufacturer, based on their understanding of
nature.</p>
<p>And from what we understand, these modifications make the vaccine <strong>much
much more</strong> effective. It would be a lot of fun to understand these
modifications. It might for example explain why the Moderna vaccine needs
100 micrograms and the BioNTech vaccine only 30 micrograms.</p>
<p>Here is the beginning of the S protein in both the virus and the BNT162b2
vaccine RNA code. Exclamation marks denote differences.</p>
<pre><code>Virus:   AUG UUU GUU UUU CUU GUU UUA UUG CCA CUA GUC UCU AGU CAG UGU GUU
Vaccine: AUG UUC GUG UUC CUG GUG CUG CUG CCU CUG GUG UCC AGC CAG UGU GUG
              !   !   !   !   !   !   !   !   !   !   !   !           !
</code></pre>
<p>RNA is a string (literally) of RNA characters, <code>A</code>, <code>C</code>, <code>G</code> and <code>U</code>. There is no
physical framing on there, but it makes sense to analyse it in groups of
three.</p>
<p>Each group (called a codon) maps to an amino acid (denoted by a capital
letter). A string of amino acids is a protein. Here is what that looks
like:</p>
<pre><code>Virus:   AUG UUU GUU UUU CUU GUU UUA UUG CCA CUA GUC UCU AGU CAG UGU GUU
          M   F   V   F   L   V   L   L   P   L   V   S   S   Q   C   V
Vaccine: AUG UUC GUG UUC CUG GUG CUG CUG CCU CUG GUG UCC AGC CAG UGU GUG
              !   !   !   !   !   !   !   !   !   !   !   !           !
</code></pre>
<p>Here we can see that while the codons are different, the amino acid version
is the same. There are 4*4*4 = 64 codons but only 20 amino acids. This means
that, on average, about three codons code for each amino acid, so you can
typically change a codon into one of two others and still code for
the same amino acid.</p>
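<p>As a small illustration of this mapping, here is a sketch in Go that translates both snippets shown above. The helper names <code>codonTable</code> and <code>translate</code> are made up for this example and do not come from the repository; the 64-letter string is the standard genetic code written out in UCAG order, with <code>*</code> for a stop codon.</p>

```go
package main

import (
	"fmt"
	"strings"
)

// codonTable builds the standard genetic code for RNA codons.
// The 64-letter string lists the amino acid for each codon in UCAG order
// (first base changes slowest, third base fastest); '*' marks a stop codon.
func codonTable() map[string]byte {
	const bases = "UCAG"
	const aminos = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
	table := make(map[string]byte, 64)
	i := 0
	for _, b1 := range bases {
		for _, b2 := range bases {
			for _, b3 := range bases {
				table[string([]rune{b1, b2, b3})] = aminos[i]
				i++
			}
		}
	}
	return table
}

// translate turns space-separated codons into a string of amino acid letters.
func translate(rna string, table map[string]byte) string {
	var protein strings.Builder
	for _, codon := range strings.Fields(rna) {
		protein.WriteByte(table[codon])
	}
	return protein.String()
}

func main() {
	virus := "AUG UUU GUU UUU CUU GUU UUA UUG CCA CUA GUC UCU AGU CAG UGU GUU"
	vaccine := "AUG UUC GUG UUC CUG GUG CUG CUG CCU CUG GUG UCC AGC CAG UGU GUG"
	table := codonTable()
	fmt.Println(translate(virus, table))   // MFVFLVLLPLVSSQCV
	fmt.Println(translate(vaccine, table)) // MFVFLVLLPLVSSQCV
}
```

<p>Both lines print the same protein fragment, even though 13 of the 16 codons differ.</p>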
<p>So in the second codon, <code>UUU</code> was changed to <code>UUC</code>. This is a net addition
of one <code>C</code> to the vaccine. The third codon changed from <code>GUU</code> to <code>GUG</code>, which is
a net addition of one <code>G</code>.</p>
<p><strong>It is known that a higher fraction of <code>G</code> and <code>C</code> characters improves the
efficiency of an mRNA vaccine</strong>.</p>
<p>Now, if that was all there was to it, this could be the end of this page.
“The algorithm is: change codons so we get more G and C in there”. But then
we meet the 9th codon, which changes <code>CCA</code> to <code>CCU</code>: a swap that adds neither a
<code>G</code> nor a <code>C</code>, even though <code>CCG</code> and <code>CCC</code> code for the same amino acid.</p>
<p>Throughout the ~4000 characters of the vaccine, this happens many times.</p>
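<p>To make the G/C bookkeeping concrete, the following Go sketch lists every changed codon in the 16-codon snippet together with its net change in G/C count. The function names <code>gcCount</code> and <code>gcDelta</code> are invented for this illustration.</p>

```go
package main

import (
	"fmt"
	"strings"
)

// gcCount returns how many characters in seq are G or C.
func gcCount(seq string) int {
	return strings.Count(seq, "G") + strings.Count(seq, "C")
}

// gcDelta reports the net change in G/C count when codon from becomes codon to.
func gcDelta(from, to string) int {
	return gcCount(to) - gcCount(from)
}

func main() {
	virus := strings.Fields("AUG UUU GUU UUU CUU GUU UUA UUG CCA CUA GUC UCU AGU CAG UGU GUU")
	vaccine := strings.Fields("AUG UUC GUG UUC CUG GUG CUG CUG CCU CUG GUG UCC AGC CAG UGU GUG")

	for i := range virus {
		if virus[i] != vaccine[i] {
			fmt.Printf("codon %2d: %s -> %s, G/C change %+d\n",
				i+1, virus[i], vaccine[i], gcDelta(virus[i], vaccine[i]))
		}
	}
	fmt.Printf("G/C total: virus %d, vaccine %d, out of %d nucleotides\n",
		gcCount(strings.Join(virus, "")), gcCount(strings.Join(vaccine, "")), 3*len(virus))
}
```

<p>In this snippet the G/C count goes from 16 to 28 out of 48 nucleotides, yet codon 9 (<code>CCA</code> to <code>CCU</code>) and codon 11 (<code>GUC</code> to <code>GUG</code>) change without adding any G or C, so maximizing G/C alone cannot be the whole story.</p>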
<h2 id="our-challenge">Our challenge</h2>
<p>The goal is: find an algorithm that turns the ‘wild type’ RNA code into
the BNT162b2 one, because everyone would like to understand how to turn
viral RNA into an effective vaccine. The algorithm does not need to
reproduce the <em>exact</em> RNA code of course, but it would be super nice if it
came up with something very similar, while also being brief.</p>
<p>To help you, I have provided the data in a number of forms, as described on
<a href="https://github.com/berthubert/bnt162b2" target="_blank">the GitHub page</a>.</p>
<blockquote>
<p>Note that in these files the <code>U</code> mentioned above appears as a <code>T</code>. <code>U</code> and
<code>T</code> are the RNA and DNA manifestations of the same information.</p>
</blockquote>
<p>The easiest place to start might be the
‘<a href="https://github.com/berthubert/bnt162b2/blob/master/side-by-side.csv" target="_blank">side-by-side.csv</a>’
file. This lists the original and modified version of each codon, side by
side:</p>
<pre><code>abspos,codonOrig,codonVaccine
0,ATG,ATG
3,TTT,TTC
6,GTT,GTG
...
3813,TAC,TAC
3816,ACA,ACA
3819,TAA,TGA
</code></pre>
<p>There is also an equivalency table that shows which codons can be
interchanged without changing the amino acid output. Please find this in
<a href="https://github.com/berthubert/bnt162b2/blob/master/codon-table-grouped.csv" target="_blank">codon-table-grouped.csv</a>.
There is also a visual version
<a href="https://en.wikipedia.org/wiki/DNA_and_RNA_codon_tables#Standard_DNA_codon_table" target="_blank">here</a>.</p>
<h2 id="a-sample-algorithm">A sample algorithm</h2>
<p>On the <a href="https://github.com/berthubert/bnt162b2" target="_blank">GitHub repository</a> you can
find
<a href="https://github.com/berthubert/bnt162b2/blob/master/3rd-gc.go" target="_blank">3rd-gc.go</a>
(and
<a href="https://github.com/berthubert/bnt162b2/blob/master/3rd-gc.py" target="_blank">3rd-gc.py</a>).</p>
<p>These implement a simple strategy that works like this:</p>
<ul>
<li>If a virus codon already ends in G or C, copy it to the vaccine mRNA</li>
<li>If not, replace the last nucleotide in the codon with a G and see if the amino acid
still matches; if so, copy the result to the vaccine mRNA</li>
<li>If it does not match, try the same with a C</li>
<li>Otherwise copy as is</li>
</ul>
<p>Or in <code>golang</code>:</p>
<pre><code>// Excerpt: vir holds the current virus codon and c2s maps a codon to its
// amino acid; both are set up before this point in the program.
// base case, don't do anything
our = vir
// don't do anything if codon ends on G or C already
if vir[2] == 'G' || vir[2] == 'C' {
	fmt.Printf("Codon ended on G or C already, not doing anything.")
} else {
	prop = vir[:2] + "G"
	fmt.Printf("Attempting G substitution, new candidate '%s'. ", prop)
	if c2s[vir] == c2s[prop] {
		fmt.Printf("Amino acid still the same, done!")
		our = prop
	} else {
		// build the C candidate first, so the message shows the right codon
		prop = vir[:2] + "C"
		fmt.Printf("Oops, amino acid changed. Trying C, new candidate '%s'. ", prop)
		if c2s[vir] == c2s[prop] {
			fmt.Printf("Amino acid still the same, done!")
			our = prop
		}
	}
}
</code></pre>
<p>This achieves a rather poor 53.1% match with the BioNTech RNA vaccine, but
it is a start.</p>
<p>When you design your algorithm, be sure to only base your choices on the
virus RNA. Do not peek into the BioNTech RNA!</p>
<p>If you have achieved a score beyond 53.1% please email a link to your code
to bert@hubertnet.nl (or <a href="https://twitter.com/PowerDNS_Bert" target="_blank">@PowerDNS_Bert</a>)
and I’ll put it on the leader board at the top of this page!</p>
<h2 id="things-that-will-help">Things that will help</h2>
<p>As with every form of reverse engineering or cryptanalysis, it helps to
understand what we are looking at.</p>
<h2 id="gc-ratio">GC ratio</h2>
<p>We know that one goal of the ‘codon optimization’ is to get more <code>C</code>s and
<code>G</code>s into the vaccine version of the RNA. However, there is also a limit to
that. In DNA, which is also used to manufacture the vaccine, <code>G</code> and <code>C</code>
bind together strongly, to the point that if you put too many of these
‘nucleotides’ in there, the DNA will no longer be replicated efficiently.</p>
<p>So some modifications may actually happen to manage <em>down</em> the GC percentage of a
stretch of DNA if it was getting too high.</p>
<p>I <a href="https://twitter.com/PowerDNS_Bert/status/1344036143961169920" target="_blank">tweeted about this</a> earlier.</p>
<h2 id="codon-optimization">Codon optimization</h2>
<p>Some codons are rare in human DNA, or in certain cells. It may be that some
codons are replaced by other ones simply because they are more frequently
used by some cells.</p>
<p>I <a href="https://twitter.com/PowerDNS_Bert/status/1344400081802448897" target="_blank">tweeted about this</a>
earlier.</p>
<h2 id="rna-folding">RNA folding</h2>
<p>We’ve been looking at codons up to here. The RNA itself however does not
know about codons, there are no markers that say where a codon begins and
ends. The first codon of a protein however is always ATG (or AUG in RNA).</p>
<p>RNA curls up into a shape. This shape might help evade the immune system or
it might improve translation into amino acids. This only depends on the
sequence of RNA nucleotides and not on specific codons.</p>
<p>You can submit RNA sequences to <a href="http://rna.tbi.univie.ac.at/cgi-bin/RNAWebSuite/RNAfold.cgi" target="_blank">this server of the Institute for
Theoretical Chemistry at the University of
Vienna</a> and it
will fold RNA for you. This is a very advanced server that does meticulous
calculations.</p>
<p>This <a href="https://en.wikipedia.org/wiki/Nucleic_acid_structure_prediction" target="_blank">Wikipedia
page</a>
describes how this works.</p>
<p>It may be that some optimizations improve folding.</p>
<p>I am also told that this paper by Moderna (another mRNA vaccine
manufacturer) may be relevant:
<a href="https://www.pnas.org/content/116/48/24075" target="_blank">mRNA structure regulates protein expression through changes in functional
half-life</a>.</p>