There was a mistake yesterdary in the normalization of probabilities. Corrected today.
First, comparison of theoretical probabilities and actual ones (this is not new):
>> load codigo_MycobTub
>> load MT_CDC1551
>> genes=genoma.gene.sec(:);
>> genes=genes(genes~=0);
>> codones=gen2codones(genes);
>> probcodones=hist(codones,1:65);
>> probcodones=probcodones(1:64);
>> probcodones=probcodones/sum(probcodones);
>> [probstop1_teor,probstop2_teor]=paresaa2probs(codigo,probcodones);
>> [probstop1,probstop2,casosfavorables,casostotales,n_saltados]=gen2probstop(genes,codigo);
>> subplot(1,2,1)
>> imagesc(probstop1./probstop1_teor)
>> colorbar
>> subplot(1,2,2)
>> imagesc(probstop2./probstop2_teor)
>> colorbar
I look for the probability, k, of finding a hidden stop codon between two given amino-acids. In the random case, k should be the proportion of synimous codons that give a hidden stop codon, and this may be weighted with the relative proportions of codons in the genome. Then, for the real genome, I count how many times I find the two aminoacids together, and in how many of those cases I find a hidden stop codon. With that information and Bayes, I can compute the probability of k. I do it, for example, for element (2,3) in frame 1:
>> load codigo_MycobTub
>> load MT_CDC1551
>> genes=genoma.gene.sec(:);
>> genes=genes(genes~=0);
>> codones=gen2codones(genes);
>> probcodones=hist(codones,1:65);
>> probcodones=probcodones(1:64);
>> probcodones=probcodones/sum(probcodones);
>> [probstop1_teor,probstop2_teor]=paresaa2probs(codigo,probcodones);
>> [probstop1,probstop2,casosfavorables,casostotales,n_saltados]=gen2probstop(genes,codigo);
>> probstop1_teor(2,3,1)
ans =
0.6051
>> probstop1(2,3,1)
ans =
0.6246
>> [probk,k]=cuentas2probk_bayes(casosfavorables(2,3,1),casostotales(2,3,1));
>> [m,ind]=min(abs(k-probstop1_teor(2,3,1)));
>> sum(probk(1:ind))*diff(k(1:2))
ans =
0.0169
>> plot(k,probk)
I find that, for this element, the probability in the genes is higher than the theoretical probability. And that this difference is significant. Now, I do it for all pairs of aminoacids, in both reading frames:
>> prob_noefecto=gen2prob_noefecto(genes,codigo);
>> subplot(1,2,1)
>> imagesc(prob_noefecto(:,:,1))
>> colorbar
>> subplot(1,2,2)
>> imagesc(prob_noefecto(:,:,2))
>> colorbar
In logarithmic scale:
>> subplot(1,2,1)
imagesc(log10(prob_noefecto(:,:,1)))
colorbar
subplot(1,2,2)
imagesc(log10(prob_noefecto(:,:,2)))
colorbar

Some elements are significant, and some are not. This is compatible with randomness, but there are more significant elements than one would expect at random:
>> hist(prob_noefecto(:),20)

>> hist(log10(prob_noefecto(:)))
>> sum(probs_todas<.05)/length(probs_todas)
ans =
0.3047
>> sum(probs_todas<.01)/length(probs_todas)
ans =
0.2109
So 30% have p<0.05, and 21% have p<0.01. This suggests that there is an effect, but it might also be that some assumptions in the significance calculation are wrong.
I check, by randomizing the order of the codons, and performing the test:
>> codones_perm=codones(randperm(length(codones)));
>> genes_perm=codones2gen(codones_perm);
>> prob_noefecto_perm=gen2prob_noefecto(genes_perm,codigo);
>> hist(prob_noefecto_perm(:),20)

>> hist(log10(prob_noefecto_perm(:)))
>> probs_todas_perm=prob_noefecto_perm(~isnan(prob_noefecto_perm));
>> sum(probs_todas_perm<.05)/length(probs_todas_perm)
ans =
0.0391
>> sum(probs_todas_perm<.01)/length(probs_todas_perm)
ans =
0.0156
This one works fine.
So I would say there is something non-random in the genome.
>> sum(probs_todas>.95)/length(probs_todas)
ans =
0.1406
>> sum(probs_todas>.99)/length(probs_todas)
ans =
0.0859
There are also more pairs of aminoacids with fewer hidden stops than expected…
Next steps:
– Look for something special in the pairs of aminoacids with significant proportion of hidden stop codons (both too many or too few). Any aminoacid specially present? Any stop codon is more likely than the other two?
– Do the study gene by gene, and see whether there are non-random differences among genes.