## Hidden stops in the complete genome of M. tuberculosis

I concatenate all the genes into a single sequence and drop the zero entries:

>> genes=genoma.gene.sec(:);
>> genes=genes(genes~=0);

I compute the probability of a hidden stop codon for each pair of adjacent amino acids:

>> [probstop_fr1_gen,probstop_fr2_gen]=gen2probstop(genes,codigo);
>> subplot(1,2,1)
>> imagesc(probstop_fr1_gen)
>> colorbar
>> subplot(1,2,2)
>> imagesc(probstop_fr2_gen)
>> colorbar

>> sum(probstop_fr1_gen(:))
ans =
16.8621
>> sum(probstop_fr2_gen(:))
ans =
19.1280
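
As a rough illustration of what is being counted here: a hidden stop is a stop codon (TAA, TAG or TGA) that appears when the junction between two consecutive codons is read out of frame. The following is a minimal sketch, not the `gen2probstop` function used above (whose code is not shown here). It assumes the sequence is a character string already in frame, that frame 1 is the reading frame shifted by one base, and that frame 2 is the one shifted by two bases. The sequence is a made-up toy example.

```matlab
% Toy in-frame sequence (hypothetical, not taken from the genome).
seq = 'ATGATAGCTGACCTG';          % codons: ATG ATA GCT GAC CTG
stops = {'TAA', 'TAG', 'TGA'};

ncod = floor(numel(seq) / 3);     % number of complete codons
hidden1 = false(1, ncod - 1);     % hidden stop in frame 1 at junction k
hidden2 = false(1, ncod - 1);     % hidden stop in frame 2 at junction k

for k = 1:ncod - 1
    i = 3 * (k - 1) + 1;          % index of the first base of codon k
    t1 = seq(i + 1:i + 3);        % bases 2-3 of codon k, base 1 of codon k+1
    t2 = seq(i + 2:i + 4);        % base 3 of codon k, bases 1-2 of codon k+1
    hidden1(k) = any(strcmp(t1, stops));
    hidden2(k) = any(strcmp(t2, stops));
end

disp(hidden1)   % junctions 1 and 2 hide a stop in frame 1 (TGA, TAG)
disp(hidden2)   % junction 3 hides a stop in frame 2 (TGA)
```

Grouping such counts by the amino acids encoded on each side of the junction would give one matrix per frame, which is presumably what the images above show.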

I compare them with the theoretical ones:

>> [probstop_fr1,probstop_fr2]=paresaa2probs(codigo);
>> figure
>> subplot(1,2,1)
>> imagesc(probstop_fr1)
>> colorbar
>> subplot(1,2,2)
>> imagesc(probstop_fr2)
>> colorbar
>> sum(probstop_fr1(:))
ans =
18.7500
>> sum(probstop_fr2(:))
ans =
24.5000

So the probability in the genes is actually LOWER than expected. Yuck.
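
The theoretical sums can be checked by hand under two assumptions about `paresaa2probs` (its code is not shown here): all synonymous codons of an amino acid are equally likely, and frame 1 reads bases 2-3 of the first codon followed by base 1 of the second, while frame 2 reads base 3 of the first codon followed by bases 1-2 of the second. Write $f_X(a)$ for the fraction of the codons of amino acid $a$ that end in $X$, and $g_X(b)$ for the fraction of the codons of $b$ that start with $X$. The three stops TAA, TAG and TGA then give

$$
\begin{aligned}
P_1(a,b) &= f_{TA}(a)\,\bigl[g_{A}(b)+g_{G}(b)\bigr] + f_{TG}(a)\,g_{A}(b),\\
P_2(a,b) &= f_{T}(a)\,\bigl[g_{AA}(b)+g_{AG}(b)+g_{GA}(b)\bigr].
\end{aligned}
$$

Summing over all pairs factorizes into sums over single amino acids:

$$
\begin{aligned}
\sum_a f_{TA}(a) &= \tfrac13+\tfrac13+\tfrac14=\tfrac{11}{12} \quad(\text{Leu, Ile, Val}), &
\sum_a f_{TG}(a) &= \tfrac13+1+\tfrac14=\tfrac{19}{12} \quad(\text{Leu, Met, Val}),\\
\sum_b g_{A}(b) &= 5+\tfrac13+\tfrac13=\tfrac{17}{3}, &
\sum_b g_{G}(b) &= 5,\\
\sum_{a,b} P_1(a,b) &= \tfrac{11}{12}\cdot\tfrac{32}{3}+\tfrac{19}{12}\cdot\tfrac{17}{3}=\tfrac{675}{36}=18.75, &&\\
\sum_a f_{T}(a) &= 6\cdot\tfrac12+5\cdot\tfrac14+2\cdot\tfrac13+2\cdot\tfrac16=\tfrac{21}{4}, &
\sum_b \bigl[g_{AA}+g_{AG}+g_{GA}\bigr](b) &= 2+\tfrac23+2=\tfrac{14}{3},\\
\sum_{a,b} P_2(a,b) &= \tfrac{21}{4}\cdot\tfrac{14}{3}=\tfrac{49}{2}=24.5. &&
\end{aligned}
$$

Both values agree with the sums printed above (18.7500 and 24.5000), which suggests the assumptions are at least consistent with what the function computes.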

Now I take the codon bias into account:

>> codones=gen2codones(genes);
>> hist(codones,1:65)

(index 65 collects the unidentifiable codons, which are caused by sequencing errors)
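
For illustration, here is one way a sequence could be mapped to codon indices 1 to 64, with 65 reserved for unreadable codons. This is only a sketch under assumptions: `gen2codones` is not shown here, so the character encoding, the base order T, C, A, G (borrowed from the NCBI tables) and the toy sequence are all hypothetical choices.

```matlab
seq = 'ATGTTTNNAGGGTAA';                 % toy sequence; N marks an unreadable base
bases = 'TCAG';                          % assumed base order (as in the NCBI tables)

ncod = floor(numel(seq) / 3);            % number of complete codons
idx = zeros(1, ncod);
for k = 1:ncod
    cod = seq(3*k - 2:3*k);              % k-th codon
    [known, d] = ismember(cod, bases);   % d: position of each base in 'TCAG'
    if all(known)
        idx(k) = 16*(d(1) - 1) + 4*(d(2) - 1) + d(3);
    else
        idx(k) = 65;                     % unidentifiable codon
    end
end

counts = accumarray(idx(:), 1, [65 1])'; % occurrences of each index 1..65
freq = counts(1:64) / sum(counts(1:64)); % relative codon usage, index 65 dropped
disp(idx)                                % 36 1 65 64 11
```

With this ordering ATG gets index 36 and TAA index 11; the last two lines play the role of the `hist` and normalization steps in the session.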

>> probcodones=hist(codones,1:65);
>> probcodones=probcodones(1:64);
>> probcodones=probcodones/sum(probcodones);
>> [probstop_fr1,probstop_fr2]=paresaa2probs(codigo,probcodones);
>> close all
>> subplot(1,2,1)
>> imagesc(probstop_fr1)
>> colorbar
>> subplot(1,2,2)
>> imagesc(probstop_fr2)
>> colorbar
>> sum(probstop_fr1(:))
ans =
16.6937
>> sum(probstop_fr2(:))
ans =
18.4024
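
One way the codon bias could enter the calculation is to weight each synonymous codon by its observed relative frequency instead of weighting all synonymous codons equally. The sketch below is not the actual `paresaa2probs` (its code is not shown here). It assumes codons are ordered as in the NCBI table (bases T, C, A, G), that frame 1 is shifted by one base and frame 2 by two, and that stop codons are left out of the amino-acid pairs. The weight vector `w` is a stand-in for `probcodones`.

```matlab
bases = 'TCAG';
aas   = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
stops = {'TAA', 'TAG', 'TGA'};

% List the 64 codons in NCBI order (third base varies fastest).
codons = repmat(' ', 64, 3);
n = 0;
for i = 1:4
    for j = 1:4
        for k = 1:4
            n = n + 1;
            codons(n, :) = bases([i j k]);
        end
    end
end

% H1(c1,c2) = 1 if codon c1 followed by c2 hides a stop in frame 1; H2 likewise.
H1 = zeros(64);
H2 = zeros(64);
for c1 = 1:64
    for c2 = 1:64
        H1(c1, c2) = any(strcmp([codons(c1, 2:3), codons(c2, 1)], stops));
        H2(c1, c2) = any(strcmp([codons(c1, 3), codons(c2, 1:2)], stops));
    end
end

% Codon weights: uniform here; replace with observed codon frequencies
% (same codon order assumed) to include the codon bias.
w = ones(1, 64) / 64;

% M(a,c) = probability of codon c given amino acid a.
% (An amino acid with zero total weight would produce NaN here.)
aalist = unique(aas(aas ~= '*'));
M = zeros(numel(aalist), 64);
for a = 1:numel(aalist)
    sel = (aas == aalist(a));
    M(a, sel) = w(sel) / sum(w(sel));
end

P1 = M * H1 * M';     % hidden-stop probability per amino-acid pair, frame 1
P2 = M * H2 * M';     % same for frame 2
fprintf('%.4f %.4f\n', sum(P1(:)), sum(P2(:)));   % 18.7500 24.5000
```

With uniform weights the sums equal the unbiased theoretical values reported earlier (18.7500 and 24.5000), so the model is at least consistent with those numbers. Whether it also reproduces 16.6937 and 18.4024 depends on the actual codon frequencies and on the codon order used by `gen2codones`.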

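With the codon bias, the theoretical prediction is slightly lower than the value observed in the genes, especially in frame 2 (frame -1): 16.69 vs. 16.86 in frame 1 and 18.40 vs. 19.13 in frame 2. Let us look at the ratio of observed to predicted probability for each pair: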

>> subplot(1,2,1)
>> imagesc(probstop_fr1_gen./probstop_fr1)
>> colorbar
>> subplot(1,2,2)
>> imagesc(probstop_fr2_gen./probstop_fr2)
>> colorbar
>> nanmean(nanmean(probstop_fr1_gen./probstop_fr1))
ans =
1.0142
>> nanmean(nanmean(probstop_fr2_gen./probstop_fr2))
ans =
1.0366

Meh. The average ratio is only about 1.01 in frame 1 and 1.04 in frame 2.
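
A small caveat on how the average is taken: `nanmean(nanmean(X))` is the mean of the column means, which can differ from the mean over all valid entries when the columns contain different numbers of NaN values (for example from 0/0 pairs). The toy matrix below is made up purely to show the difference; it uses only base functions, so it does not need the Statistics Toolbox.

```matlab
r = [1 NaN; 3 5];   % toy ratio matrix; NaN stands for a 0/0 entry

% Mean of column means, ignoring NaN within each column
% (for this matrix, the same value nanmean(nanmean(r)) would give):
colmeans = zeros(1, size(r, 2));
for j = 1:size(r, 2)
    col = r(~isnan(r(:, j)), j);    % valid entries of column j
    colmeans(j) = mean(col);
end
mean_of_means = mean(colmeans);     % (2 + 5) / 2 = 3.5

% Mean over all non-NaN entries, each pair weighted equally:
overall = mean(r(~isnan(r)));       % (1 + 3 + 5) / 3 = 3
```

For ratios this close to 1 the two averages are unlikely to differ much, but the second one weights every amino-acid pair equally.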

## Genetic code of Mycobacterium tuberculosis

I take the genome from GenBank. The genetic code for M. tuberculosis is the following (in two formats):

11. The Bacterial, Archaeal and Plant Plastid Code (transl_table=11)

Starts = ---M---------------M------------MMMM---------------M------------

Base1 = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG

Base2 = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG

Base3 = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG

11. The Bacterial, Archaeal and Plant Plastid Code (transl_table=11)

| 2nd base T | 2nd base C | 2nd base A | 2nd base G |
|---|---|---|---|
| TTT F Phe | TCT S Ser | TAT Y Tyr | TGT C Cys |
| TTC F Phe | TCC S Ser | TAC Y Tyr | TGC C Cys |
| TTA L Leu | TCA S Ser | TAA * Ter | TGA * Ter |
| TTG L Leu i | TCG S Ser | TAG * Ter | TGG W Trp |
| CTT L Leu | CCT P Pro | CAT H His | CGT R Arg |
| CTC L Leu | CCC P Pro | CAC H His | CGC R Arg |
| CTA L Leu | CCA P Pro | CAA Q Gln | CGA R Arg |
| CTG L Leu i | CCG P Pro | CAG Q Gln | CGG R Arg |
| ATT I Ile i | ACT T Thr | AAT N Asn | AGT S Ser |
| ATC I Ile i | ACC T Thr | AAC N Asn | AGC S Ser |
| ATA I Ile i | ACA T Thr | AAA K Lys | AGA R Arg |
| ATG M Met i | ACG T Thr | AAG K Lys | AGG R Arg |
| GTT V Val | GCT A Ala | GAT D Asp | GGT G Gly |
| GTC V Val | GCC A Ala | GAC D Asp | GGC G Gly |
| GTA V Val | GCA A Ala | GAA E Glu | GGA G Gly |
| GTG V Val i | GCG A Ala | GAG E Glu | GGG G Gly |

It is the standard genetic code, except that several codons besides ATG (those marked i above) may act as initiation codons; in that role they all code for methionine.
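
The initiation codons can be read directly from the first format by taking the positions where the Starts line has an M. A minimal sketch, with the four strings typed in by hand (gaps in Starts written as hyphens) and variable names chosen only for this example:

```matlab
starts = '---M---------------M------------MMMM---------------M------------';
base1  = 'TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG';
base2  = 'TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG';
base3  = 'TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG';

pos = find(starts == 'M');                          % codon positions flagged as starts
initcodons = [base1(pos)', base2(pos)', base3(pos)'];   % one codon per row
disp(initcodons)   % TTG, CTG, ATT, ATC, ATA, ATG, GTG
```

These are the same seven codons that carry the i mark in the second format.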