Clusterize

This short example describes how to use Clusterize to cluster sequences, as described in:

ES Wright (2024) "Accurately clustering biological sequences in linear time by relatedness sorting." Nature Communications, doi:10.1038/s41467-024-47371-9.

For an in-depth tutorial on clustering, see the "Upsize Your Clustering with Clusterize" vignette, available from the Documentation page. Also, watch the video explaining Clusterize here.

How do I cluster sequences by similarity?

First, install DECIPHER and load the library in R. Next, set the "fas" variable to the path of the FASTA file containing the unaligned sequences (e.g., "~/mySeqs.fas"). Then choose a distance cutoff for clustering the sequences; a cutoff of 0.5, for example, clusters sequences that are less than 50% distant. Clusterize outputs a cluster number for each input sequence and prints an estimate of the clustering effectiveness.

# load the DECIPHER library in R
library(DECIPHER)

# specify the path to the FASTA file (in quotes)
fas <- "<<REPLACE WITH PATH TO FASTA FILE>>"

# load the sequences from the file
# change "AA" to "DNA" or "RNA" as needed
seqs <- readAAStringSet(fas)

# look at some of the sequences (optional)
seqs
AAStringSet object of length 18976:
          width seq             names
    [1]     567 MPYMGV...RRVPPK Seq1
    [2]     749 MRYIDD...MNQIES Seq2
    [3]     849 MLGILK...FGEKGT Seq3
    [4]     742 MLFSFS...IKEQNS Seq4
    [5]     499 MSSFTL...SAVSSL Seq5
    ...     ... ...
[18972]     927 MSRKVL...RGTDNE Seq18972
[18973]     465 MTFEER...GDDASF Seq18973
[18974]     502 MRTPKS...PHKTSV Seq18974
[18975]     527 MFFVPR...PGAAHS Seq18975
[18976]     475 MNRGRR...DLPARL Seq18976

# cluster the sequences
clusters <- Clusterize(seqs,
    cutoff=0.5, # < 50% distant
    minCoverage=0.5, # > 50% coverage
    processors=NULL) # use all CPUs
Partitioning sequences by 4-mer similarity:
|========================================| 100%
Time difference of 6.05 secs
Sorting by relatedness within 15809 groups:
iteration 7 of up to 24 (100.0% stability)
Time difference of 1.46 secs
Clustering sequences by 4-mer to 6-mer similarity:
|========================================| 100%
Time difference of 52.73 secs
Clusters via relatedness sorting: 86.8% (0.3% exclusively)
Clusters via rare 4-mers: 99.7% (13.2% exclusively)
Estimated clustering effectiveness: 99.2%

# view the cluster numbers
head(clusters)
     cluster
Seq1    6306
Seq2    1957
Seq3    3093
Seq4    4164
Seq5    7527
Seq6    1944

# compute cluster statistics
max(clusters) # number of clusters
[1] 12559
t <- table(clusters)
mean(t) # average cluster size
[1] 1.510948
tail(sort(t)) # biggest clusters
cluster
7451 8479 6757 2279 3414    6
  47   49   51   52   64  110
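
Once cluster numbers are available, a few more summaries can be computed in base R. The sketch below does not call Clusterize. It builds a small made-up data frame in the shape shown by head(clusters) above (sequence names as row names and a single "cluster" column), which is assumed here to be the general form of the result for a single cutoff. The variable names (sizes, members, representatives) are illustrative only.

```r
# Toy stand-in for the data frame returned by Clusterize():
# one row per sequence and a single integer column named "cluster".
# The sequence names and cluster numbers are made up for illustration.
clusters <- data.frame(
    cluster = c(2L, 1L, 2L, 3L, 2L, 1L),
    row.names = paste0("Seq", 1:6)
)

# tabulate cluster sizes (named "sizes" to avoid masking base::t)
sizes <- table(clusters$cluster)

# number of clusters and number of singleton clusters
n_clusters <- length(sizes)
n_singletons <- sum(sizes == 1)

# names of the sequences belonging to the largest cluster
largest <- names(sizes)[which.max(sizes)]
members <- rownames(clusters)[clusters$cluster == as.integer(largest)]

# one representative per cluster: the first sequence listed in each
representatives <- rownames(clusters)[!duplicated(clusters$cluster)]

print(sizes)
print(members)
print(representatives)
```

Taking the first sequence listed in each cluster is only one simple way to pick a representative; other criteria (for example, the longest sequence in the cluster) may suit a given analysis better.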