Reanalysis of the probabilities of stop codons for each pair of aminoacids

PROBABILITIES IN THE GENES:

>> genes=genoma.gene.sec’;
>> genes=genes(genes~=0);
>> [probstop,casosfavorables,casostotales]=gen2probstop(genes,codigo);
>> subplot(1,2,1)
>> imagesc(sum(probstop{1}(:,:,:),3))
>> colorbar
>> subplot(1,2,2)
>> imagesc(sum(probstop{2}(:,:,:),3))
>> colorbar

THEORETICAL PROBABILITIES:

>> probcodones=hist(codones,1:65);
>> probcodones=probcodones(1:64);
>> probcodones=probcodones/sum(probcodones);
>> probstop_teor=paresaa2probs(codigo,probcodones);
>> clf
>> subplot(1,2,1)
>> imagesc(sum(probstop_teor{1}(:,:,:),3))
>> codones=gen2codones(genes);
>> colorbar
>> subplot(1,2,2)
>> imagesc(sum(probstop_teor{2}(:,:,:),3))
>> colorbar

RATIO:

>> subplot(1,2,1)
>> imagesc(sum(probstop{1}(:,:,:),3)./sum(probstop_teor{1}(:,:,:),3))
>> colorbar
>> subplot(1,2,2)
>> imagesc(sum(probstop{2}(:,:,:),3)./sum(probstop_teor{2}(:,:,:),3))
>> colorbar

Advertisement

Catastrophe

All previous results about hidden codons are wrong. When I generate the vector with the complete genome I make

>> genes=genoma.gene.sec(:);
>> genes=genes(genes~=0);

The second instruction goes column-wise, so it is not taking the genes properly. Instead, it takes all first bases, then all second bases and so on.

Agh.

Note that some results were significant. We should check the reason for that…

Interaction between frames and between residues, and other small mistakes

I have just realized that the presence of a hidden stop in frame 1 may prevent the presence of a hidden stop in frame 2, and vice-versa. Therefore, the probabilities are not independent.

Also, the presence of a stop between two amino-acids may prevent (or facilitate) the presence of another stop between the neighbouring ones.

Also, our method is not taking everything into account: For the same pair of aminoacids, in some cases we may be at one single-point mutation of forming a stop codon, sometimes at two, three, etc.

Thus, I am going to switch to another analysis method, taking into account single-point mutations.

BUT ATTENTION: The very different probabilities for different aminoacid pairs are still unexplained. Especially, the large amount of aminoacid pairs with significantly lower probability of having hidden stops codons is unexplained. The new analysis will probably mix all the data, and we may lose this detail.

Analysis of hidden stops, separating the three stop codons

IMPORTANT: I change the programs gen2probstop.m and paresaa2probs.m

>> [probstop,casosfavorables,casostotales]=gen2probstop(genes,codigo);

First stop codon:

subplot(1,2,1)
imagesc(probstop{1}(:,:,1))
colorbar
subplot(1,2,2)
imagesc(probstop{2}(:,:,1))
colorbar

Second:
subplot(1,2,1)
imagesc(probstop{1}(:,:,2))
colorbar
subplot(1,2,2)
imagesc(probstop{2}(:,:,2))
colorbar

Third:

>> subplot(1,2,1)
imagesc(probstop{1}(:,:,3))
colorbar
subplot(1,2,2)
imagesc(probstop{2}(:,:,3))
colorbar

Now I compare with the theoretical probabilities:

>> codones=gen2codones(genes);
>> probcodones=hist(codones,1:65);
>> probcodones=probcodones(1:64);
>> probcodones=probcodones/sum(probcodones);
>> probstop_teor=paresaa2probs(codigo,probcodones);

Codón uaa:
>> c=1;
subplot(1,2,1)
imagesc(probstop{1}(:,:,c)./probstop_teor{1}(:,:,c))
colorbar
subplot(1,2,2)
imagesc(probstop{2}(:,:,c)./probstop_teor{2}(:,:,c))
colorbar

Codón uga:

c=2;
subplot(1,2,1)
imagesc(probstop{1}(:,:,c)./probstop_teor{1}(:,:,c))
colorbar
subplot(1,2,2)
imagesc(probstop{2}(:,:,c)./probstop_teor{2}(:,:,c))
colorbar

Codón uag:

c=3;
subplot(1,2,1)
imagesc(probstop{1}(:,:,c)./probstop_teor{1}(:,:,c))
colorbar
subplot(1,2,2)
imagesc(probstop{2}(:,:,c)./probstop_teor{2}(:,:,c))
colorbar

 

Probabilities of hidden stops in the intergenetic sequences

>> interg=genoma.intergenetica(:);
>> interg=interg(interg~=0);
>> rem(length(interg),3)
ans =
1
>> interg=interg(1:end-1);
>> rem(length(interg),3)
ans =
0
>> prob_noefecto_interg=gen2prob_noefecto(interg,codigo);

>> hist(log10(prob_noefecto_interg(:)))

Random, as expected.

Matrices:
>> subplot(1,2,1)
>> imagesc(prob_noefecto_interg(:,:,1))
>> colorbar
>> subplot(1,2,2)
>> imagesc(prob_noefecto_interg(:,:,2))
>> colorbar

>> subplot(1,2,1)
imagesc(log10(prob_noefecto_interg(:,:,1)))
colorbar
subplot(1,2,2)
imagesc(log10(prob_noefecto_interg(:,:,2)))
colorbar

Bayesian study of the probabilities of hidden stop codons

There was a mistake yesterdary in the normalization of probabilities. Corrected today.

First, comparison of theoretical probabilities and actual ones (this is not new):

>> load codigo_MycobTub
>> load MT_CDC1551
>> genes=genoma.gene.sec(:);
>> genes=genes(genes~=0);
>> codones=gen2codones(genes);
>> probcodones=hist(codones,1:65);
>> probcodones=probcodones(1:64);
>> probcodones=probcodones/sum(probcodones);
>> [probstop1_teor,probstop2_teor]=paresaa2probs(codigo,probcodones);
>> [probstop1,probstop2,casosfavorables,casostotales,n_saltados]=gen2probstop(genes,codigo);
>> subplot(1,2,1)
>> imagesc(probstop1./probstop1_teor)
>> colorbar
>> subplot(1,2,2)
>> imagesc(probstop2./probstop2_teor)
>> colorbar

I look for the probability, k, of finding a hidden stop codon between two given amino-acids. In the random case, k should be the proportion of synimous codons that give a hidden stop codon, and this may be weighted with the relative proportions of codons in the genome. Then, for the real genome, I count how many times I find the two aminoacids together, and in how many of those cases I find a hidden stop codon. With that information and Bayes, I can compute the probability of k. I do it, for example, for element (2,3) in frame 1:

>> load codigo_MycobTub
>> load MT_CDC1551
>> genes=genoma.gene.sec(:);
>> genes=genes(genes~=0);
>> codones=gen2codones(genes);
>> probcodones=hist(codones,1:65);
>> probcodones=probcodones(1:64);
>> probcodones=probcodones/sum(probcodones);
>> [probstop1_teor,probstop2_teor]=paresaa2probs(codigo,probcodones);
>> [probstop1,probstop2,casosfavorables,casostotales,n_saltados]=gen2probstop(genes,codigo);
>> probstop1_teor(2,3,1)
ans =
0.6051
>> probstop1(2,3,1)
ans =
0.6246
>> [probk,k]=cuentas2probk_bayes(casosfavorables(2,3,1),casostotales(2,3,1));
>> [m,ind]=min(abs(k-probstop1_teor(2,3,1)));
>> sum(probk(1:ind))*diff(k(1:2))
ans =
0.0169
>> plot(k,probk)

I find that, for this element, the probability in the genes is higher than the theoretical probability. And that this difference is significant. Now, I do it for all pairs of aminoacids, in both reading frames:

>> prob_noefecto=gen2prob_noefecto(genes,codigo);
>> subplot(1,2,1)
>> imagesc(prob_noefecto(:,:,1))
>> colorbar
>> subplot(1,2,2)
>> imagesc(prob_noefecto(:,:,2))
>> colorbar

In logarithmic scale:

>> subplot(1,2,1)
imagesc(log10(prob_noefecto(:,:,1)))
colorbar
subplot(1,2,2)
imagesc(log10(prob_noefecto(:,:,2)))
colorbar


Some elements are significant, and some are not. This is compatible with randomness, but there are more significant elements than one would expect at random:
>> hist(prob_noefecto(:),20)


>> hist(log10(prob_noefecto(:)))

>> sum(probs_todas<.05)/length(probs_todas)
ans =
0.3047
>> sum(probs_todas<.01)/length(probs_todas)
ans =
0.2109

So 30% have p<0.05, and 21% have p<0.01. This suggests that there is an effect, but it might also be that some assumptions in the significance calculation are wrong.

I check, by randomizing the order of the codons, and performing the test:

>> codones_perm=codones(randperm(length(codones)));
>> genes_perm=codones2gen(codones_perm);
>> prob_noefecto_perm=gen2prob_noefecto(genes_perm,codigo);
>> hist(prob_noefecto_perm(:),20)

>> hist(log10(prob_noefecto_perm(:)))

>> probs_todas_perm=prob_noefecto_perm(~isnan(prob_noefecto_perm));
>> sum(probs_todas_perm<.05)/length(probs_todas_perm)
ans =
0.0391
>> sum(probs_todas_perm<.01)/length(probs_todas_perm)
ans =
0.0156

This one works fine.

So I would say there is something non-random in the genome.

>> sum(probs_todas>.95)/length(probs_todas)
ans =
0.1406
>> sum(probs_todas>.99)/length(probs_todas)
ans =
0.0859

There are also more pairs of aminoacids with fewer hidden stops than expected…

Next steps:

– Look for something special in the pairs of aminoacids with significant proportion of hidden stop codons (both too many or too few). Any aminoacid specially present? Any stop codon is more likely than the other two?

– Do the study gene by gene, and see whether there are non-random differences among genes.

¿Error o equivocación?

He calculado la probabilidad de que, para cada par de aminoácidos, la cantidad de hidden stops que albergan se deba al azar. Así que es una especie de p-value.

Es para el genoma completo (las variables son las mismas que ayer)
>> prob_noefecto=gen2prob_noefecto(genes,codigo);
>> subplot(1,2,1)
>> imagesc(prob_noefecto(:,:,1))
>> subplot(1,2,2)
>> imagesc(prob_noefecto(:,:,2))

Sale significativo para todos los pares de aminoácidos. Teniendo en cuenta que ayer salía que para algunos pares de aminoácidos los stops eran incluso menos probables en los genes que en la teoría, hay un error en alguna parte.

EN EFECTO: Hay un error. Ver siguiente post.

Hidden stops in the complete genome of M. Tuberculosis

>> load MT_CDC1551

I put all the genes together:

>> genes=genoma.gene.sec(:);
>> genes=genes(genes~=0);

Compute the probabilities of hidden stop codons between each pair of aminoacids:

>> load codigo_MycobTub
>> [probstop_fr1_gen,probstop_fr2_gen]=gen2probstop(genes,codigo);
>> subplot(1,2,1)
>> imagesc(probstop_fr1_gen)
>> colorbar
>> subplot(1,2,2)
>> imagesc(probstop_fr2_gen)
>> colorbar

>> sum(probstop_fr1_gen(:))
ans =
16.8621
>> sum(probstop_fr2_gen(:))
ans =
19.1280

I compare them with the theoretical ones:

>> [probstop_fr1,probstop_fr2]=paresaa2probs(codigo);
>> figure
>> subplot(1,2,1)
>> imagesc(probstop_fr1)
>> colorbar
>> subplot(1,2,2)
>> imagesc(probstop_fr2)
>> colorbar
>> sum(probstop_fr1(:))
ans =
18.7500
>> sum(probstop_fr2(:))
ans =
24.5000

So the probability in the genes is actually LOWER than expected. Puaj.

But now, I will take into account the codon bias:

>> codones=gen2codones(genes);
>> hist(codones,1:65)

(codon 65 are unidentifiable codons, due to an error in sequentiation)

>> probcodones=hist(codones,1:65);
>> probcodones=probcodones(1:64);
>> probcodones=probcodones/sum(probcodones);
>> [probstop_fr1,probstop_fr2]=paresaa2probs(codigo,probcodones);
>> close all
>> subplot(1,2,1)
>> imagesc(probstop_fr1)
>> colorbar
>> subplot(1,2,2)
>> imagesc(probstop_fr2)
>> colorbar
>> sum(probstop_fr1(:))
ans =
16.6937
>> sum(probstop_fr2(:))
ans =
18.4024

With the codon bias, the theoretical prediction is slightly lower than the experimental result, especially in the frame 2 (frame -1). Let us see the relative probability:

>> subplot(1,2,1)
>> imagesc(probstop_fr1_gen./probstop_fr1)
>> colorbar
>> subplot(1,2,2)
>> imagesc(probstop_fr2_gen./probstop_fr2)
>> colorbar
>> nanmean(nanmean(probstop_fr1_gen./probstop_fr1))
ans =
1.0142
>> nanmean(nanmean(probstop_fr2_gen./probstop_fr2))
ans =
1.0366

Psé.

Genetic code of Mycobacterium tuberculosis

I take the genome from genebank. The code for M. Tuberculosis is the following (in two formats):


 

11. The Bacterial, Archaeal and Plant Plastid Code (transl_table=11)

AAs = FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG

Starts = —M—————M————MMMM—————M————

Base1 = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG

Base2 = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG

Base3 = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG

 

11. The Bacterial, Archaeal and Plant Plastid Code (transl_table=11)

TTT F Phe TCT S Ser TAT Y Tyr TGT C Cys

TTC F Phe TCC S Ser TAC Y Tyr TGC C Cys

TTA L Leu TCA S Ser TAA * Ter TGA * Ter

TTG L Leu i TCG S Ser TAG * Ter TGG W Trp

 

CTT L Leu CCT P Pro CAT H His CGT R Arg

CTC L Leu CCC P Pro CAC H His CGC R Arg

CTA L Leu CCA P Pro CAA Q Gln CGA R Arg

CTG L Leu i CCG P Pro CAG Q Gln CGG R Arg

 

ATT I Ile i ACT T Thr AAT N Asn AGT S Ser

ATC I Ile i ACC T Thr AAC N Asn AGC S Ser

ATA I Ile i ACA T Thr AAA K Lys AGA R Arg

ATG M Met i ACG T Thr AAG K Lys AGG R Arg

 

GTT V Val GCT A Ala GAT D Asp GGT G Gly

GTC V Val GCC A Ala GAC D Asp GGC G Gly

GTA V Val GCA A Ala GAA E Glu GGA G Gly

GTG V Val i GCG A Ala GAG E Glu GGG G Gly

 

It is the standard genetic code, except that several codons may be initiation codons, and in all cases code for Methionine.