
Proteogenomics: On the correct use of databases

Proteogenomics researchers at our laboratory describe why protein identification against databases that are too small yields, through a statistical artifact, an underestimated misidentification rate. This is particularly so when the database retains only the RNAs expressed in the biological tissue studied. This work is a good example of interdisciplinary cooperation.

Published on 19 September 2022
Proteogenomics aims to use personalized databases, in addition to the canonical databases of the species, to better characterize the proteins of samples analyzed by mass spectrometry. To avoid multiplying sequence variants in these databases, which would then become too large and ambiguous, researchers have proposed reducing them to the transcripts expressed in the biological sample analyzed. However, it appears that this reduction in database size artificially inflates the confidence of peptide identifications. What biostatistical method, then, can be considered to restore the reliability of the results?
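The transcriptome-informed reduction described above amounts to filtering the canonical protein database down to the entries whose transcripts were detected in the sample. A minimal sketch, with purely illustrative identifiers and sequences:

```python
# Sketch: reduce a canonical protein database to the entries whose source
# transcripts are expressed in the analyzed sample (e.g. detected by RNA-seq).
# All identifiers and sequences below are illustrative, not real data.

def reduce_database(proteins, expressed_transcripts):
    """Keep only proteins whose source transcript was detected in the sample."""
    return {pid: seq for pid, seq in proteins.items() if pid in expressed_transcripts}

canonical = {
    "ENST0001": "MKWVTFISLLFLFSSAYS",
    "ENST0002": "MALWMRLLPLLALLALWGP",
    "ENST0003": "MGDVEKGKKIFIMKCSQCH",
}
expressed = {"ENST0001", "ENST0003"}  # transcripts observed in the sample

reduced = reduce_database(canonical, expressed)
print(sorted(reduced))  # ['ENST0001', 'ENST0003']
```

The reduced database is what the search engine would then use in place of the full canonical one.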

Reference databases are essential for mass-spectrometry-based identification. From them, the software generates theoretical mass spectra, so as to propose a list of amino acid sequences together with the probability that each matches an experimental mass spectrum. To validate the resulting identifications, it is then necessary to estimate both the proportion of true matches (between a sequence and a spectrum) and the so-called "false discovery rate" (arising from random matches devoid of biochemical ground). The most classical method for doing so is to include biologically irrelevant "decoy" sequences in the database, and then count the number of experimental spectra that match them.
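The decoy-counting idea can be sketched in a few lines: among all spectrum matches above a score threshold, the number of decoy matches serves as an estimate of the number of false target matches. Scores and the threshold below are illustrative:

```python
# Sketch of the classical target-decoy false discovery rate estimate:
# after searching spectra against a database containing both real ("target")
# and biologically irrelevant ("decoy") sequences, the count of decoy matches
# above a score threshold estimates the count of false target matches.

def decoy_fdr(psms, threshold):
    """psms: list of (score, is_decoy) pairs. Returns the estimated FDR
    among target matches scoring at or above the threshold."""
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0

# Illustrative peptide-spectrum matches: (score, is_decoy)
psms = [(42.0, False), (38.5, False), (35.1, True), (33.0, False), (21.2, True)]
print(decoy_fdr(psms, 30.0))  # 1 decoy / 3 targets ≈ 0.33
```

Lowering the threshold admits more identifications but raises the estimated FDR, which is the trade-off practitioners tune.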

Researchers at our laboratory have shown that the smaller the database (because of its transcriptome-informed pruning), the more the decoy count underestimates the false discovery rate. They explain that the apparent gain in identification sensitivity on a transcriptome-informed reduced database is in fact a statistical artefact: with a smaller database, fewer decoys are generated, which reduces the probability that some of them are realistic enough to mimic identification errors. Because the false discovery rate is critical to the reliability of downstream biological conclusions, the researchers propose alternative statistical methods to control it, and show that these methods are less sensitive to database size.
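The article does not detail the alternative methods proposed. As one generic illustration of FDR control that does not rely on counting decoys, here is a minimal Benjamini-Hochberg procedure applied to hypothetical match p-values (the p-values are invented for the example):

```python
# Minimal Benjamini-Hochberg procedure: a standard, decoy-free way to control
# the false discovery rate given a p-value per match. Shown only as a generic
# illustration; the laboratory's actual methods are not specified here.

def benjamini_hochberg(pvalues, alpha=0.05):
    """Return the indices of the matches accepted at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    k = 0  # largest rank whose p-value passes its BH threshold
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])  # accept the k smallest p-values

pvals = [0.001, 0.008, 0.039, 0.041, 0.27, 0.60]  # hypothetical match p-values
print(benjamini_hochberg(pvals, alpha=0.05))  # [0, 1]
```

Because the procedure depends only on the p-values themselves, not on how many decoy sequences the database happens to generate, its behavior is inherently less tied to database size.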

This calls into question the sensitivity gain attributed to transcriptome-informed reduced databases, but the approach remains interesting because it facilitates the identification of ambiguous proteins by reducing the proportion of sequence homologies between different proteins. These results, implemented in data-processing routines, contribute to the future of computational proteogenomics and are a good example of interdisciplinary cooperation.
Proteogenomics combines proteomics (the identification and quantification of all the proteins in a sample) with genomic and transcriptomic approaches. While genomics studies the DNA sequences of living beings, transcriptomics identifies and quantifies transcripts, i.e. the RNAs resulting from the transcription of DNA. Transcriptomics makes it possible to estimate gene expression levels, whereas genomics does not.
Personalized database: a database built from genomic and/or transcriptomic knowledge specific to the pathology studied, or even derived directly from the genome and/or transcriptome of each patient.
Sequence variants: versions of a gene sequence that differ from one individual to another.
