SynVar
Variant synonyms generation and normalization
Background
Genetic variants are drawing increasing interest for their role in pathologies, for the design of new drugs and for refining treatment efficacy through stratification. However, variant interpretation depends on time-consuming curation tasks. To support variant interpretation efforts and decisions based on the latest evidence, we propose Variomes [1], a service performing variant-specific triage of publications.
To increase the comprehensiveness of Variomes, we developed SynVar. This tool generates synonyms for variants and normalizes them. This task faces several challenges:
- Variants can be represented at different levels - genomic, transcript or protein - with a combinatorial (many-to-many) relationship between them.
- Variant descriptions depend on a reference sequence on which the variation is described, to avoid positional ambiguity.
- The majority of variants mentioned in the literature do not follow a standard nomenclature.
While many databases of polymorphisms and somatic variants exist, such as ClinVar, ClinGen or dbSNP, using them as terminologies has several drawbacks:
- Depending on the database, variants are mapped on different levels, hindering a linear relationship between them, and some are position- but not change-specific, reducing specificity.
- Dependence on databases prevents the retrieval of variants newly described in the literature.
- Databases do not provide the non-standard expressions found in the literature.
Description
To enable smooth and effective retrieval of variants in the literature, we developed a synonym generation tool. For a given variant, including variants not described in existing databases, it generates the corresponding descriptions at the genome, cDNA/transcript and protein levels, both in the HGVS format and in many non-standard yet frequently used forms found in the literature. It supports variant expansion and normalization from any description level.
Supported variant types
SynVar supports the following variant types according to HGVS nomenclature:
- Substitutions (SNPs): Single nucleotide or amino acid changes (e.g. V600E, c.1799T>A, g.55181378G>A)
- Deletions: Deletion of one or more nucleotides or amino acids (e.g. E746_A750del, c.2235_2249del)
- Duplications: Duplication of one or more nucleotides or amino acids (e.g. V600dup, c.1799dup)
- Insertions: Insertion of one or more nucleotides or amino acids (e.g. c.7397_7398insT)
- Deletion-insertions (delins): Combined deletion and insertion (e.g. c.112_117delinsAT)
- Frameshifts: Variants causing a frameshift (e.g. p.Arg97fs, c.289delC)
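As an illustration of how these variant types differ syntactically, the following minimal sketch labels a description by looking for HGVS keywords. It is not SynVar's parser: the function name classify_variant and the regular expressions are assumptions made for this example. It is purely syntactic, so c.289delC is labelled as a deletion even though it causes a frameshift at the protein level.

```python
import re

# Ordered (pattern, label) pairs. "delins" is tested before "ins" and "del"
# so that a deletion-insertion is not mistaken for either of them.
_PATTERNS = [
    (re.compile(r"delins"), "delins"),
    (re.compile(r"fs"), "frameshift"),
    (re.compile(r"dup"), "duplication"),
    (re.compile(r"ins"), "insertion"),
    (re.compile(r"del"), "deletion"),
    # Nucleotide substitution, e.g. c.1799T>A or g.55181378G>A
    (re.compile(r"^(?:[cg]\.)?\d+[ACGT]>[ACGT]$"), "substitution"),
    # Amino acid substitution in 1- or 3-letter code, e.g. V600E or p.Val600Glu
    (re.compile(r"^(?:p\.)?(?:[A-Z]|[A-Z][a-z]{2})\d+(?:[A-Z]|[A-Z][a-z]{2})$"),
     "substitution"),
]


def classify_variant(description: str) -> str:
    """Return a coarse variant type for a description, or 'unknown'."""
    for pattern, label in _PATTERNS:
        if pattern.search(description):
            return label
    return "unknown"


for example in ["V600E", "c.2235_2249del", "c.1799dup",
                "c.7397_7398insT", "c.112_117delinsAT", "p.Arg97fs"]:
    print(example, "->", classify_variant(example))
```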
Isoform support
SynVar can recognize and process variants specified on protein isoforms. When the optional parameter iso=true is provided, the tool expands the variant to all available isoforms of the gene. The system accepts:
- Gene names (e.g. TP53): the variant is validated against the canonical isoform first. If it is not valid on the canonical isoform, the system automatically searches the other isoforms. When iso=true, the variant is expanded to all isoforms.
- RefSeq protein identifiers (e.g. NP_001119586.1): the system recognizes the specific isoform corresponding to the RefSeq ID, and expands the variant to all isoforms when iso=true.
Example: TP53 R248W with iso=true returns synonyms for all 9 TP53 isoforms. The variant is first validated on the canonical isoform (P04637-1) and, if it is not valid there, on the other isoforms. All 9 isoforms are returned regardless of which isoform the variant was validated on.
Workflow
Use-cases
Protein variant: by default, the change is validated on the reference sequence of the canonical isoform, as retrieved through the UniProt API [2]. The valid variant is then back-translated into the possible cDNA/transcript variants using the back-translator tool from Mutalyzer [3]. Finally, each cDNA variant is mapped onto its genomic position (GRCh37 and GRCh38 builds) using VariantValidator [4].
cDNA/transcript variant: the variant is validated and mapped onto its genomic position using VariantValidator [4], which also translates it into the corresponding protein variant.
Genomic variant: the variant is validated and, if not intergenic, converted to the corresponding cDNA/transcript variants using VariantValidator [4], which also provides the translation into protein variants. For intergenic variants, only genomic variant synonyms are generated.
dbSNP id: the genomic variants associated with the dbSNP [5] id are retrieved through the NCBI E-utilities. They are then converted and translated as described above for genomic variants.
ClinGen Allele Registry ID: the genomic variant corresponding to the ClinGen Allele Registry ID (CA ID) is retrieved through the ClinGen Allele Registry [6]. It is then mapped and translated as described above for genomic variants.
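The five input kinds above can often be told apart by syntax alone. The sketch below is a hypothetical helper, not SynVar's own detection logic, which according to this documentation also checks the validity of the variant at each level. It returns one of the level values accepted by the service, falling back to any when the syntax is ambiguous (for example 1799T>A, which could be a transcript or a genomic position).

```python
import re


def guess_level(variant: str) -> str:
    """Guess the description level of a variant from its syntax only."""
    v = variant.strip()
    if re.fullmatch(r"rs\d+", v, flags=re.IGNORECASE):
        return "dbsnp"        # e.g. rs113488022
    if re.fullmatch(r"CA\d+", v):
        return "clingen"      # e.g. CA251544
    if v.startswith("p."):
        return "protein"
    if v.startswith("c."):
        return "transcript"
    if v.startswith("g."):
        return "genome"
    # No HGVS prefix: an amino acid code before the position suggests protein.
    if re.match(r"(?:[A-Z][a-z]{2}|[A-Z])\d+", v):
        return "protein"      # e.g. V600E, Val600Glu
    return "any"              # let the service test every level


for example in ["rs113488022", "CA251544", "V600E", "c.1799T>A", "1799T>A"]:
    print(example, "->", guess_level(example))
```

Passing the guessed value as the level parameter avoids testing every level on the server side, at the cost of a wrong guess when the heuristic misfires.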
Output
Results are returned as a list of genomic variants (unique position and change), along with their corresponding transcript and protein variants, grouped by gene and isoform. The default output format is XML. The main elements are the following:
- synonym: Synonyms of gene and protein names.
- hgvs: Variant description in the standard HGVS format. The main HGVS description can be used as a unique identifier. Other HGVS descriptions are given for each level of description, using the NCBI reference sequences.
- syntactic-variation: Variant expressions as encountered in the literature.
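A minimal sketch of how these three elements could be collected from an XML response with the Python standard library follows. Only the element names synonym, hgvs and syntactic-variation come from this documentation. The sample document, its root and wrapper element names, and the function name collect_elements are assumptions made for illustration, so the real response should be inspected before relying on this.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only: the nesting and the wrapper element names
# (variant-list, variant) are assumed, not taken from the real service.
SAMPLE_XML = """\
<variant-list>
  <variant>
    <synonym>BRAF</synonym>
    <hgvs>p.Val600Glu</hgvs>
    <hgvs>c.1799T&gt;A</hgvs>
    <syntactic-variation>V600E</syntactic-variation>
    <syntactic-variation>Val600Glu</syntactic-variation>
  </variant>
</variant-list>
"""


def collect_elements(xml_text: str) -> dict:
    """Gather the text of the main elements, wherever they are nested."""
    root = ET.fromstring(xml_text)
    collected = {}
    for tag in ("synonym", "hgvs", "syntactic-variation"):
        collected[tag] = [
            element.text.strip()
            for element in root.iter(tag)
            if element.text and element.text.strip()
        ]
    return collected


print(collect_elements(SAMPLE_XML))
```

Using root.iter keeps the sketch independent of the actual nesting depth, which is not specified here.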
Programmatic access
URL
https://synvar.sibils.org/generate/literature/fromMutation
Parameters
- variant: Variant description, ClinGen Allele Registry ID, or dbSNP id (e.g. V617F, Val600Glu, rs113488022, CA251544, BRAF V600E). Required.
Optional parameters
- ref: Gene name, chromosome number or name (e.g. JAK2, BRAF, 9, X). Optional. If not provided and the variant parameter contains the gene/reference information (e.g. BRAF V600E), the system will automatically extract it. Also optional when using database identifiers (dbSNP, ClinGen).
- level: Level of the provided variant description: protein, transcript, genome, dbsnp, or clingen. Optional (default: any). When set to any or omitted, the system attempts to detect the variant level automatically based on the variant syntax and on the validity of the variant at each level. Note: Specifying the level explicitly is more efficient as it avoids testing all possible levels.
- iso: Validate on and generate synonyms for isoforms: false (default) or true. When set to true, detects and expands the variant to all available isoforms of the gene.
- map: Require genome mapping for output: true (default) or false. When set to false, outputs syntactic variations even if the variant could not be mapped to the genome. Useful for generating literature search terms for variants that cannot be validated or mapped.
- norm: Return only normalized identifiers: false (default) or true. When set to true, returns only HGVS, dbSNP ID, and ClinGen Allele ID without syntactic variations.
- format: Output format: xml (default), json (same structure as XML but in JSON format), or beacon (Beacon v2 JSON format).
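The parameters above can be assembled into a query URL as sketched below. The base URL and the parameter names are those listed in this section. The function name build_query_url and its keyword names (map_ and fmt, chosen to avoid shadowing Python built-ins) are assumptions of this example. Parameters left unset are omitted so that the service defaults apply.

```python
from urllib.parse import urlencode

BASE_URL = "https://synvar.sibils.org/generate/literature/fromMutation"

LEVELS = {"any", "protein", "transcript", "genome", "dbsnp", "clingen"}
FORMATS = {"xml", "json", "beacon"}


def build_query_url(variant, ref=None, level=None, iso=None,
                    map_=None, norm=None, fmt=None):
    """Build a SynVar query URL; only 'variant' is required."""
    if level is not None and level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    if fmt is not None and fmt not in FORMATS:
        raise ValueError(f"unknown format: {fmt}")
    params = {"variant": variant}
    optional = {"ref": ref, "level": level, "iso": iso,
                "map": map_, "norm": norm, "format": fmt}
    for name, value in optional.items():
        if value is None:
            continue  # keep the service default
        if isinstance(value, bool):
            value = "true" if value else "false"
        params[name] = value
    # urlencode escapes characters such as '>' and spaces.
    return BASE_URL + "?" + urlencode(params)


print(build_query_url("V600E", ref="BRAF", level="protein"))
print(build_query_url("R248W", ref="TP53", iso=True, fmt="json"))
print(build_query_url("1799T>A", ref="BRAF", level="transcript"))
```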
Examples
Substitutions (SNPs)
Deletions
Duplications and Insertions
Isoform-specific queries
Database identifiers
Special cases with map parameter
Automatic detection (without ref or level parameters)
Variant extraction from complex text
Normalization only (norm parameter)
Search interface
Fields
- Gene/Chromosome: Gene name or chromosome number/name (e.g. JAK2, BRAF, 9, X, MT). The field can be empty if a dbSNP or ClinGen Allele Registry ID is searched.
- Variant: Variant in the following format: V600E (for amino acid sequence) or 1799T>A (for DNA sequence) or a dbSNP id (e.g. rs113488022) or ClinGen Allele Registry ID (e.g. CA251544).
- Level: Level of the provided variant description (protein, transcript, genome, dbsnp or clingen).
Template program
To query the service and parse the output: queryVariant.py
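For readers without access to the template program, the following is a rough sketch of what a small client might look like. It is not the content of queryVariant.py. It assumes that the service answers a plain GET request and that format=json returns a UTF-8 encoded JSON document. The function names make_request and fetch_synonyms are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "https://synvar.sibils.org/generate/literature/fromMutation"


def make_request(variant, **params):
    """Prepare a GET request asking for JSON output."""
    query = urlencode({"variant": variant, **params, "format": "json"})
    return Request(f"{BASE_URL}?{query}")


def fetch_synonyms(variant, timeout=30, **params):
    """Query the service and decode the JSON response (network access needed)."""
    with urlopen(make_request(variant, **params), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call, left commented out because it needs network access:
# data = fetch_synonyms("V600E", ref="BRAF", level="protein")
# print(json.dumps(data, indent=2))
```

Separating request construction from the network call keeps the URL logic easy to check offline.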
References
- Mottaz A, Pasche E, Michel PA, Mottin L, Teodoro D, Ruch P. Designing an Optimal Expansion Method to Improve the Recall of a Genomic Variant Curation-Support Service. Stud Health Technol Inform. 2022 May 25;294:839-843. doi: 10.3233/SHTI220603. PubMed
- Pasche E, Mottaz A, Caucheteur D, Gobeill J, Michel PA, Ruch P. Variomes: a high recall search engine to support the curation of genomic variants. Bioinformatics. 2022 Apr 28;38(9):2595-2601. doi: 10.1093/bioinformatics/btac146. PubMed
- The UniProt Consortium (2023). UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Research, 51(D1), D523–D531. https://doi.org/10.1093/nar/gkac1052
- den Dunnen J. T. (2016). Sequence Variant Descriptions: HGVS Nomenclature and Mutalyzer. Current protocols in human genetics, 90, 7.13.1–7.13.19. https://doi.org/10.1002/cphg.2
- Freeman, P. J., Hart, R. K., Gretton, L. J., Brookes, A. J., & Dalgleish, R. (2018). VariantValidator: Accurate validation, mapping, and formatting of sequence variation descriptions. Human mutation, 39(1), 61–68. https://doi.org/10.1002/humu.23348
- Smigielski, E. M., Sirotkin, K., Ward, M., & Sherry, S. T. (2000). dbSNP: a database of single nucleotide polymorphisms. Nucleic acids research, 28(1), 352–355. https://doi.org/10.1093/nar/28.1.352
- Pawliczek, P., Patel, R. Y., Ashmore, L. R., Jackson, A. R., Bizon, C., Nelson, T., Powell, B., Freimuth, R. R., Strande, N., Shah, N., Riegel, B., Meeks, M., Levy, M. A., Kattman, B., Berg, J. S., & Harrison, S. M. (2018). ClinGen Allele Registry links information about genetic variants. Human mutation, 39(11), 1690–1701. https://doi.org/10.1002/humu.23637