DECIPHER - News

News

Updates to DECIPHER are released every six months via Bioconductor. Here is a list of changes in past releases.

Version 3.4

General changes

DetectRepeats()'s default parameters were retrained on an expanded empirical dataset.
FindSynteny() was improved and its default parameters were retrained.
Manual pages and vignettes were revised substantially in some cases.
Clusterize() now supports any measure of distance offered by DistanceMatrix().

Phylogenetics improvements

Treeline() is faster, especially for constructing maximum likelihood trees.
DistanceMatrix() supports amino acid models and more nucleotide models of evolution.
DistanceMatrix() determines maximum likelihood distances given a rate matrix and frequencies.
DistanceMatrix() allows non-uniform costs through specification of a substitution matrix.
DistanceMatrix() can now incorporate distance due to indels into any model of evolution.

Population genetics functionality

New InferDemography() function for fitting demographic histories to site frequency spectra.
New InferRecombination() function for fitting recombination parameters to correlation profiles.
New InferSelection() function for estimating Ka/Ks (dN/dS; omega) from aligned coding sequences.
New "Population Genetics Inference in R" vignette describing these functions.

Alignment improvements

AlignPairs() now outputs a consensus sequence along with aligned pairs when type is "sequences".
AlignPairs() now accepts QualityScaledXStringSet inputs, allowing for merging paired-end reads.
AlignProfiles() parameters were retrained for RNA inputs, affecting other alignment functions.

Version 3.2

General changes

DetectRepeats() is slightly more accurate and has retrained parameters.
Clusterize() is faster when using multiple processors.
Clusterize() allows masking repeats and low complexity regions.
FindSynteny() parameters were retrained.
There is a new as.dist() method for objects of class Synteny.

Phylogenetics improvements

Treeline() now supports balanced minimum evolution (i.e., method="ME"), which is accurate, fast, and scalable, and is the new default method.
The Treeline() function is faster, requires less memory, and has new arguments allowing greater control.
Treeline() now accepts an alignment and/or distance matrix for some methods.
Treeline() uses parsimony to perform ancestral state reconstruction unless method is "ML", in which case it still uses likelihood.
Examples in the vignette and manual page for Treeline() are revised and expanded.
ReadDendrogram() and WriteDendrogram() have `quote` arguments.
Cophenetic() is more scalable.
DistanceMatrix() has a new Poisson correction method.
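
As a rough illustration of the Poisson correction mentioned above, the Python sketch below applies the textbook formula d = -ln(1 - p), where p is the fraction of aligned sites that differ. It is not DECIPHER's implementation; the function names are hypothetical, and the handling of gaps is an assumption (gapped columns are simply skipped).

```python
import math


def p_distance(seq_a, seq_b):
    """Fraction of differing sites between two aligned sequences.

    Assumption: columns containing a gap ("-") in either sequence are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def poisson_corrected_distance(p):
    """Textbook Poisson correction: d = -ln(1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    return -math.log(1.0 - p)


# Two hypothetical aligned protein fragments differing at 2 of 10 sites.
p = p_distance("MKTAYIAKQR", "MKTAYLAKQK")
d = poisson_corrected_distance(p)
```

The corrected distance is always at least as large as p, because it accounts for multiple substitutions at the same site.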

Search improvements

SearchIndex() can iterate to find more remote homologs.
SearchIndex() performs compositional adjustment of the substitution matrix.
SearchIndex() parameters are retrained.

Alignment improvements

Align*() parameters for AAStringSets were retrained using multiple benchmarks to mitigate overfitting to a single benchmark.
Align*() parameters for RNAStringSets were retrained on a larger set of ncRNA alignments to further improve accuracy.
Some PredictHEC() arguments allow vectorized inputs and the defaults are retrained.

Version 3.0

General changes

AA_REDUCED contains more reduced amino acid alphabets.
BLOSUM and PAM substitution matrices are now included in DECIPHER.
PFASUM and MMLSUM substitution matrices can now be specified conveniently (e.g., "PFASUM50").
Browse*() functions now open new html files for each function call by default.
Browse*() functions now include a `title` argument that adds a title to the output.
DECIPHER no longer depends on the parallel package.
ScoreAlignment() can return column scores or the sum of column scores.
The "Getting started DECIPHERing" vignette is now more of a general overview.
Parameters perfectMatch and misMatch are trained to best match protein coding (nucleotide) to amino acid sequence alignments.

Sequence database improvements

DECIPHER is now SQL agnostic and works with other SQL drivers (e.g., RMariaDB, RPostgres).
DECIPHER now suggests, rather than depends on, RSQLite. This requires running library(RSQLite) separately, if desired.
DECIPHER v3 should be able to read databases built with v2 but not vice versa.
The Add2DB() function should be faster at adding/updating database fields.

Homology finding functionality

New IndexSeqs(subject, ...), SearchIndex(pattern, index, subject=NULL, ...), and AlignPairs(pattern, subject, pairs=NULL, ...) functions.
New vignette, "Searching Biological Sequences for Research".
There is a new `InvertedIndex` class with a print.InvertedIndex() function.
DECIPHER now uses AlignPairs() instead of pairwiseAlignment() from Biostrings.

Clusterize improvements

Clusterize() is faster for high distance cutoffs (i.e., low similarity) and more efficiently parallelized.
Clusterize() employs variable length k-mers when comparing sequences. This allows it to maintain sensitivity for sets of input sequences with highly variable lengths.
Clusterize() performs extension of k-mer matches when predicting anchors and estimating similarity.
Clusterize() uses a new reduced amino acid alphabet for rare k-mers.
Clusterize() allows up to 2^31 - 1 (2,147,483,647) input sequences given sufficient memory (and time).
Clusterize() better reports clustering effectiveness, typically resulting in lower estimates.
The "Upsize Your Clustering with Clusterize" vignette is expanded and includes an example of clustering both DNA strands.

DetectRepeats improvements

DetectRepeats() is (re)calibrated to make probability of discovery by chance approximately exp(-Score).
DetectRepeats() has multiple internal improvements to scoring accuracy.
DetectRepeats() uses a different reduced amino acid alphabet.
DetectRepeats() has a new `maxCopies` argument, which allows it to run in cases where there are (very) many repeats (e.g., the complete human genome).
DetectRepeats() corrects the substitution matrix for each sequence's background by default.
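
The calibration described in the first item of this list implies a simple conversion between a score and an approximate probability of discovery by chance. The Python sketch below only illustrates that relationship; the helper names are hypothetical and are not part of DECIPHER.

```python
import math


def chance_probability(score):
    """Approximate probability of discovery by chance, exp(-score)."""
    return math.exp(-score)


def score_for_probability(prob):
    """Score at which the chance probability is approximately `prob`."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must be in (0, 1]")
    return -math.log(prob)


# A score of 10 corresponds to a chance probability of roughly 4.5e-05.
prob_at_10 = chance_probability(10)

# A chance probability of about 1 in 1000 needs a score near 6.9.
score_at_1e3 = score_for_probability(0.001)
```

Under this reading, each additional score unit lowers the chance probability by a factor of e.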

FindSynteny improvements

FindSynteny()'s scoring formula is revised, adding `sepPower` and `gapPower` arguments.
FindSynteny() masks low complexity regions in addition to repeats.
FindSynteny() extends k-mer matches to both sides whenever possible.
FindSynteny() has all parameters (re)optimized using a synthetic orthology benchmark.
FindSynteny() uses a new reduced amino acid alphabet when useFrames=TRUE (the default).

Version 2.30

Alignment improvements

AlignSeqs() on RNAStringSet inputs will now compute single-sequence free energies to improve alignment accuracy.
Align*() parameters were retrained using Rfam seed (ncRNAs) and mTM-align (AAs) structural alignments.
Added MMLSUM amino acid substitution matrices, although PFASUM is still used by default.

Clusterize improvements

Clusterize() has a modified algorithm that improves accuracy for approximately the same speed.
Clusterize() has newly trained parameters that balance accuracy and speed.
The "Upsize Your Clustering with Clusterize" vignette has a new figure to help with understanding important parameters.

TreeLine improvements

TreeLine() is faster, more scalable, and more accurate by a modest amount.
TreeLine() now uses SIMD instructions if MAKEVARS is configured (by adding " -O3 -march=native" to PKG_CFLAGS). When enabled, SIMD is not portable and requires recompiling on every machine where used.
TreeLine() outputs a `substitutions` attribute when method="MP".

Version 2.28

General changes

PredictDBN() uses new empirical pseudoenergy rules for RNA folding.
New `deltaGrulesRNA` object containing pseudoenergy rules.
AA_REDUCED includes condensations of PFASUM40.
DetectRepeats() has a `useEmpirical=TRUE` argument controlling whether to use empirically-derived scores to improve detection of tandem repeats.
IdTaxa() is somewhat faster in a subset of classification scenarios.
MapCharacters() works on dendrograms with amino acid ancestral states, not just nucleotides.

Clusterize improvements

Clusterize() has many changes that make it faster, more scalable, and more accurate.
Clusterize() now has a `singleLinkage=FALSE` argument.
The "Upsize Your Clustering with Clusterize" vignette is slightly improved.

Version 2.26

General changes

Clusterize() is a new inexact clustering algorithm with linear time complexity, which replaces IdClusters().
New "Upsize Your Clustering with Clusterize" vignette.
DistanceMatrix() now has multiple methods ("overlap", "shortest", and "longest"). The `penalizeGapLetterMatches` argument now accepts NA, and there is a new `minCoverage` argument.
Cophenetic() is now faster.
IdLengths() is faster and allows type="AAStringSet".
DetectRepeats() uses empirical signals to improve accuracy when detecting tandem repeats.

Version 2.24

General changes

DistanceMatrix() has an option correction="F81" to correct for unequal state frequencies.
RemoveGaps() has an argument `includeMask=TRUE` that controls how masking ("+") characters are handled.
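
To show what an F81-style correction for unequal state frequencies looks like, the Python sketch below uses the standard formula d = -E ln(1 - p/E), with E = 1 - sum(pi_i^2) over the state frequencies pi_i. This is a sketch of the textbook formula rather than DECIPHER's code, and the function name is hypothetical.

```python
import math


def f81_distance(p, freqs):
    """F81-corrected distance from a raw proportion of differences `p`.

    `freqs` holds the state frequencies (e.g., of A, C, G, T), summing to 1.
    """
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("state frequencies must sum to 1")
    e = 1.0 - sum(f * f for f in freqs)
    if not 0.0 <= p < e:
        raise ValueError("p must be in [0, E) for the correction to be defined")
    return -e * math.log(1.0 - p / e)


# With equal frequencies E = 3/4, which reduces to the Jukes-Cantor distance.
uniform = f81_distance(0.1, [0.25, 0.25, 0.25, 0.25])

# Skewed frequencies lower E, so the same p yields a larger corrected distance.
skewed = f81_distance(0.1, [0.4, 0.1, 0.1, 0.4])
```

The comparison at the end shows why the correction matters: ignoring a skewed composition underestimates the distance.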

Alignment improvements

AlignProfiles() has a `standardize=TRUE` argument that scores alignments relative to their length rather than by absolute score.
Align*() functions have re-optimized parameters resulting in relatively small changes.
New ScoreAlignment() function to consolidate scoring alignments across different functions.

DetectRepeats improvements

DetectRepeats() scores solely using a substitution matrix with sum-of-(adjacent-)pairs scoring.
DetectRepeats() underwent parameter (re)optimization.
DetectRepeats() now scores secondary structures in addition to primary sequence.
There is a new vignette showing how to use DetectRepeats(), and new example files of human proteins with tandem repeats.

TreeLine function

New function TreeLine() to build exact phylogenetic trees, which was previously a feature of IdClusters().
TreeLine() now computes correct likelihoods, unlike the original method="ML" implementation in IdClusters().
TreeLine() returns a method="ML" tree (type="dendrogram") by default, unlike the type="clusters" default in IdClusters().
TreeLine() has a new method="MP" for constructing maximum parsimony trees using Sankoff parsimony with a costMatrix (by default binary, which is equivalent to Fitch parsimony).
TreeLine() will accept myXStringSet without `myDistMatrix` when method is "ML" or "MP".
TreeLine() can compute "ML" and "MP" trees for both nucleotides and amino acids.
TreeLine() automatically returns "support" (for "ML" and "MP") and "probability" (for "ML") values as attributes, providing confidences for internal nodes.
MODELS now includes all common time-reversible nucleotide models and 37 time-reversible amino acid models from the literature. It also allows empirical (rather than optimized) state frequencies to be used (with "+F"). It allows for an additional insertion/deletion (gap) state (with "+Indels").
TreeLine() has an argument `quadrature=FALSE` to specify whether the Laguerre quadrature is used to calculate the discrete Gamma ("+G#") distribution of rates across sites. The default is equal binning for equivalence with other tree building programs.
TreeLine() will perform ancestral state reconstruction if reconstruct=TRUE and also provide per site likelihoods (for method="ML") as attributes on each node. For "MP" trees, ancestral states come from the most parsimonious state, whereas for all other tree types the ancestral state is derived from the (marginal) likelihoods at each node.
TreeLine() performs automatic model selection among the provided models based on the `informationCriterion` specified by the user ("AICc" by default, or "BIC").
TreeLine() automatically determines the optimal number of processors when processors is set to a value greater than 1.
TreeLine() supports SIMD when compiled for native architectures that support AVX2 and FMA3 (Intel and AMD chips since ~2013).
TreeLine() has a `maxTime` argument that will return a tree at the next available opportunity after the elapsed time.
TreeLine() returns trees with attributes() showing their score, model parameters, and other information.
There is a new vignette showing how to grow trees with TreeLine().
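
The note above that Sankoff parsimony with a binary cost matrix is equivalent to Fitch parsimony can be checked on a toy example. The Python sketch below is an independent illustration for a single site on a small rooted binary tree; the nested-tuple tree encoding, the function names, and the example tree are assumptions made here and do not reflect how TreeLine() is implemented.

```python
STATES = "ACGT"


def sankoff(tree, cost):
    """Per-state minimum cost of a subtree (leaves are single-letter strings)."""
    if isinstance(tree, str):  # leaf: only the observed state is free
        return {s: (0.0 if s == tree else float("inf")) for s in STATES}
    left, right = (sankoff(child, cost) for child in tree)
    return {
        s: min(cost[s][t] + left[t] for t in STATES)
        + min(cost[s][t] + right[t] for t in STATES)
        for s in STATES
    }


def fitch(tree):
    """Fitch's algorithm: return (candidate state set, number of changes)."""
    if isinstance(tree, str):
        return {tree}, 0
    (left_set, left_n), (right_set, right_n) = (fitch(child) for child in tree)
    common = left_set & right_set
    if common:
        return common, left_n + right_n
    return left_set | right_set, left_n + right_n + 1


# Binary cost matrix: 0 for no change, 1 for any substitution.
binary_cost = {s: {t: (0 if s == t else 1) for t in STATES} for s in STATES}

# Hypothetical five-leaf tree for one alignment column.
tree = (("A", "C"), ("A", ("C", "C")))

sankoff_score = min(sankoff(tree, binary_cost).values())
fitch_score = fitch(tree)[1]
```

Both scores agree on this tree. Replacing `binary_cost` with non-uniform costs (for example, cheaper transitions than transversions) changes only the Sankoff result, which is the flexibility a cost matrix provides.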

Version 2.20

General changes

There is a new DetectRepeats() function that will detect tandem and/or interspersed repeats in a sequence.
FindSynteny() will allow the finding of syntenic regions between a genome and itself while blocking the trivial solution of exact matching.
Improvements to the speed of collapsing trees to multi-furcating in IdClusters().
MaskAlignment() now allows specification of whether to use a random (uniform) background rather than inferring the background from the input sequences when calculating entropy.

Non-coding RNA functionality

There is a new LearnNonCoding() function that will train a `NonCoding` model given a set of RNA sequences. This `NonCoding` model can be given to the new FindNonCoding() function along with a genome to find new instances of the RNA sequences.
New vignette named "The Double Life of RNA: Uncovering Non-Coding RNAs".
PredictDBN() now uses free energy predictions when calculating structure, improving accuracy when there is insufficient signal of compensatory mutations.

FindGenes improvements

FindGenes() now accepts an argument `includeGenes` to incorporate the output of FindNonCoding() into the protein coding gene calls.
FindGenes() now trains many codon models, rather than only a few, and uses them all to find genes. This improves the accuracy and consistency of results.
FindGenes() no longer fails to correctly find protein coding genes in high-GC genomes.
FindGenes() now extracts the Shine-Dalgarno motif automatically to accommodate genomes with a non-standard motif.
It is now possible to color genes by a property when plotting a Genes object using the `colorBy` argument.
Changes to "The Magic of Gene Calling" vignette to show how to incorporate non-coding RNAs and annotate genes.

Version 2.18

General changes

AmplifyDNA() outputs the name/index of the amplified sequence with its predicted amplicons.
IdClusters() is simplified to be completely k-mer based so it is faster (and more approximate).
IdClusters() has a new `root` argument allowing specification of an outgroup sequence as the root.
New `deltaHrulesRNA` and `deltaSrulesRNA` objects for predicting RNA folding from free energies.
ReadDendrogram() now numbers tree labels in alphabetical order of their labels upon dendrogram import.

Gene calling functionality

New gene finder, FindGenes(), that outputs objects of class `Genes`.
New methods for `Genes` objects, including `[`, `plot`, and `print`.
New ExtractGenes() function for pulling the genes or proteins out of a genome.
New WriteGenes() function for exporting gene predictions in gbk or gff format.
New vignette entitled, "The Magic of Gene Finding".
New genome, that of Chlamydia trachomatis, included for examples.
New set of named genes, belonging to the Planctobacteria phylum, included for examples.

IDTAXA improvements

Changes to IdTaxa() to make it faster and work better with amino acid sequences, including a `fullLength` argument to pre-filter classifications based on length.
Changes to LearnTaxa() to make it faster and automatically choose bigger values of K (k-mer length) by default.
AA_REDUCED includes 17 new reduced alphabets optimized for protein classification.
New section of the "Classifying Sequences" vignette to describe the functional classification of protein sequences.
Taxa objects can be subset by rank level and higher values of `threshold` than were used in IdTaxa().

Version 2.12

General changes

AlignTranslation() now accepts multiple genetic codes.
BrowseSeqs() now has an example for how to color codons by amino acid.
PredictDBN() now has an example of how to plot RNA arc diagrams.

Version 2.10

General changes

PredictDBN() has a new type of output ("evidence").
FindSynteny() default parameters are (re)optimized.
Includes the ability to plot pie charts of IdTaxa outputs with different "n" per group. Genus and species names are now italicized.
New default parameters for TrimDNA() trained on Illumina sequences.

Version 2.8

General changes

New reduced amino acid alphabets included in AA_REDUCED.
FindSynteny() default parameters are (re)optimized.
New option of searching for additional RNA secondary structure with PredictDBN(..., type="search").
DistanceMatrix() and IdClusters() now can return/use (respectively) objects of class "dist" and unlimited size.
Speed and efficiency improvements to IdClusters().
plot.Taxa() now makes prettier plots of objects of class "Taxa" "Test".
Improvements to FormGroups() to better work with FindChimeras().