Reanalysis of the probabilities of stop codons for each pair of aminoacids

PROBABILITIES IN THE GENES:

>> genes=genoma.gene.sec’;
>> genes=genes(genes~=0);
>> [probstop,casosfavorables,casostotales]=gen2probstop(genes,codigo);
>> subplot(1,2,1)
>> imagesc(sum(probstop{1}(:,:,:),3))
>> colorbar
>> subplot(1,2,2)
>> imagesc(sum(probstop{2}(:,:,:),3))
>> colorbar

THEORETICAL PROBABILITIES:

>> probcodones=hist(codones,1:65);
>> probcodones=probcodones(1:64);
>> probcodones=probcodones/sum(probcodones);
>> probstop_teor=paresaa2probs(codigo,probcodones);
>> clf
>> subplot(1,2,1)
>> imagesc(sum(probstop_teor{1}(:,:,:),3))
>> codones=gen2codones(genes);
>> colorbar
>> subplot(1,2,2)
>> imagesc(sum(probstop_teor{2}(:,:,:),3))
>> colorbar

RATIO:

>> subplot(1,2,1)
>> imagesc(sum(probstop{1}(:,:,:),3)./sum(probstop_teor{1}(:,:,:),3))
>> colorbar
>> subplot(1,2,2)
>> imagesc(sum(probstop{2}(:,:,:),3)./sum(probstop_teor{2}(:,:,:),3))
>> colorbar

Advertisements

Catastrophe

All previous results about hidden codons are wrong. When I generate the vector with the complete genome I make

>> genes=genoma.gene.sec(:);
>> genes=genes(genes~=0);

The second instruction goes column-wise, so it is not taking the genes properly. Instead, it takes all first bases, then all second bases and so on.

Agh.

Note that some results were significant. We should check the reason for that…

Interaction between frames and between residues, and other small mistakes

I have just realized that the presence of a hidden stop in frame 1 may prevent the presence of a hidden stop in frame 2, and vice-versa. Therefore, the probabilities are not independent.

Also, the presence of a stop between two amino-acids may prevent (or facilitate) the presence of another stop between the neighbouring ones.

Also, our method is not taking everything into account: For the same pair of aminoacids, in some cases we may be at one single-point mutation of forming a stop codon, sometimes at two, three, etc.

Thus, I am going to switch to another analysis method, taking into account single-point mutations.

BUT ATTENTION: The very different probabilities for different aminoacid pairs are still unexplained. Especially, the large amount of aminoacid pairs with significantly lower probability of having hidden stops codons is unexplained. The new analysis will probably mix all the data, and we may lose this detail.

Analysis of hidden stops, separating the three stop codons

IMPORTANT: I change the programs gen2probstop.m and paresaa2probs.m

>> [probstop,casosfavorables,casostotales]=gen2probstop(genes,codigo);

First stop codon:

subplot(1,2,1)
imagesc(probstop{1}(:,:,1))
colorbar
subplot(1,2,2)
imagesc(probstop{2}(:,:,1))
colorbar

Second:
subplot(1,2,1)
imagesc(probstop{1}(:,:,2))
colorbar
subplot(1,2,2)
imagesc(probstop{2}(:,:,2))
colorbar

Third:

>> subplot(1,2,1)
imagesc(probstop{1}(:,:,3))
colorbar
subplot(1,2,2)
imagesc(probstop{2}(:,:,3))
colorbar

Now I compare with the theoretical probabilities:

>> codones=gen2codones(genes);
>> probcodones=hist(codones,1:65);
>> probcodones=probcodones(1:64);
>> probcodones=probcodones/sum(probcodones);
>> probstop_teor=paresaa2probs(codigo,probcodones);

Codón uaa:
>> c=1;
subplot(1,2,1)
imagesc(probstop{1}(:,:,c)./probstop_teor{1}(:,:,c))
colorbar
subplot(1,2,2)
imagesc(probstop{2}(:,:,c)./probstop_teor{2}(:,:,c))
colorbar

Codón uga:

c=2;
subplot(1,2,1)
imagesc(probstop{1}(:,:,c)./probstop_teor{1}(:,:,c))
colorbar
subplot(1,2,2)
imagesc(probstop{2}(:,:,c)./probstop_teor{2}(:,:,c))
colorbar

Codón uag:

c=3;
subplot(1,2,1)
imagesc(probstop{1}(:,:,c)./probstop_teor{1}(:,:,c))
colorbar
subplot(1,2,2)
imagesc(probstop{2}(:,:,c)./probstop_teor{2}(:,:,c))
colorbar

 

Probabilities of hidden stops in the intergenetic sequences

>> interg=genoma.intergenetica(:);
>> interg=interg(interg~=0);
>> rem(length(interg),3)
ans =
1
>> interg=interg(1:end-1);
>> rem(length(interg),3)
ans =
0
>> prob_noefecto_interg=gen2prob_noefecto(interg,codigo);

>> hist(log10(prob_noefecto_interg(:)))

Random, as expected.

Matrices:
>> subplot(1,2,1)
>> imagesc(prob_noefecto_interg(:,:,1))
>> colorbar
>> subplot(1,2,2)
>> imagesc(prob_noefecto_interg(:,:,2))
>> colorbar

>> subplot(1,2,1)
imagesc(log10(prob_noefecto_interg(:,:,1)))
colorbar
subplot(1,2,2)
imagesc(log10(prob_noefecto_interg(:,:,2)))
colorbar

Bayesian study of the probabilities of hidden stop codons

There was a mistake yesterdary in the normalization of probabilities. Corrected today.

First, comparison of theoretical probabilities and actual ones (this is not new):

>> load codigo_MycobTub
>> load MT_CDC1551
>> genes=genoma.gene.sec(:);
>> genes=genes(genes~=0);
>> codones=gen2codones(genes);
>> probcodones=hist(codones,1:65);
>> probcodones=probcodones(1:64);
>> probcodones=probcodones/sum(probcodones);
>> [probstop1_teor,probstop2_teor]=paresaa2probs(codigo,probcodones);
>> [probstop1,probstop2,casosfavorables,casostotales,n_saltados]=gen2probstop(genes,codigo);
>> subplot(1,2,1)
>> imagesc(probstop1./probstop1_teor)
>> colorbar
>> subplot(1,2,2)
>> imagesc(probstop2./probstop2_teor)
>> colorbar

I look for the probability, k, of finding a hidden stop codon between two given amino-acids. In the random case, k should be the proportion of synimous codons that give a hidden stop codon, and this may be weighted with the relative proportions of codons in the genome. Then, for the real genome, I count how many times I find the two aminoacids together, and in how many of those cases I find a hidden stop codon. With that information and Bayes, I can compute the probability of k. I do it, for example, for element (2,3) in frame 1:

>> load codigo_MycobTub
>> load MT_CDC1551
>> genes=genoma.gene.sec(:);
>> genes=genes(genes~=0);
>> codones=gen2codones(genes);
>> probcodones=hist(codones,1:65);
>> probcodones=probcodones(1:64);
>> probcodones=probcodones/sum(probcodones);
>> [probstop1_teor,probstop2_teor]=paresaa2probs(codigo,probcodones);
>> [probstop1,probstop2,casosfavorables,casostotales,n_saltados]=gen2probstop(genes,codigo);
>> probstop1_teor(2,3,1)
ans =
0.6051
>> probstop1(2,3,1)
ans =
0.6246
>> [probk,k]=cuentas2probk_bayes(casosfavorables(2,3,1),casostotales(2,3,1));
>> [m,ind]=min(abs(k-probstop1_teor(2,3,1)));
>> sum(probk(1:ind))*diff(k(1:2))
ans =
0.0169
>> plot(k,probk)

I find that, for this element, the probability in the genes is higher than the theoretical probability. And that this difference is significant. Now, I do it for all pairs of aminoacids, in both reading frames:

>> prob_noefecto=gen2prob_noefecto(genes,codigo);
>> subplot(1,2,1)
>> imagesc(prob_noefecto(:,:,1))
>> colorbar
>> subplot(1,2,2)
>> imagesc(prob_noefecto(:,:,2))
>> colorbar

In logarithmic scale:

>> subplot(1,2,1)
imagesc(log10(prob_noefecto(:,:,1)))
colorbar
subplot(1,2,2)
imagesc(log10(prob_noefecto(:,:,2)))
colorbar


Some elements are significant, and some are not. This is compatible with randomness, but there are more significant elements than one would expect at random:
>> hist(prob_noefecto(:),20)


>> hist(log10(prob_noefecto(:)))

>> sum(probs_todas<.05)/length(probs_todas)
ans =
0.3047
>> sum(probs_todas<.01)/length(probs_todas)
ans =
0.2109

So 30% have p<0.05, and 21% have p<0.01. This suggests that there is an effect, but it might also be that some assumptions in the significance calculation are wrong.

I check, by randomizing the order of the codons, and performing the test:

>> codones_perm=codones(randperm(length(codones)));
>> genes_perm=codones2gen(codones_perm);
>> prob_noefecto_perm=gen2prob_noefecto(genes_perm,codigo);
>> hist(prob_noefecto_perm(:),20)

>> hist(log10(prob_noefecto_perm(:)))

>> probs_todas_perm=prob_noefecto_perm(~isnan(prob_noefecto_perm));
>> sum(probs_todas_perm<.05)/length(probs_todas_perm)
ans =
0.0391
>> sum(probs_todas_perm<.01)/length(probs_todas_perm)
ans =
0.0156

This one works fine.

So I would say there is something non-random in the genome.

>> sum(probs_todas>.95)/length(probs_todas)
ans =
0.1406
>> sum(probs_todas>.99)/length(probs_todas)
ans =
0.0859

There are also more pairs of aminoacids with fewer hidden stops than expected…

Next steps:

– Look for something special in the pairs of aminoacids with significant proportion of hidden stop codons (both too many or too few). Any aminoacid specially present? Any stop codon is more likely than the other two?

– Do the study gene by gene, and see whether there are non-random differences among genes.

¿Error o equivocación?

He calculado la probabilidad de que, para cada par de aminoácidos, la cantidad de hidden stops que albergan se deba al azar. Así que es una especie de p-value.

Es para el genoma completo (las variables son las mismas que ayer)
>> prob_noefecto=gen2prob_noefecto(genes,codigo);
>> subplot(1,2,1)
>> imagesc(prob_noefecto(:,:,1))
>> subplot(1,2,2)
>> imagesc(prob_noefecto(:,:,2))

Sale significativo para todos los pares de aminoácidos. Teniendo en cuenta que ayer salía que para algunos pares de aminoácidos los stops eran incluso menos probables en los genes que en la teoría, hay un error en alguna parte.

EN EFECTO: Hay un error. Ver siguiente post.