Clusterize
This short example describes how to use Clusterize to cluster sequences, as described in: ES Wright (2024) "Accurately clustering biological sequences in linear time by relatedness sorting." Nature Communications, doi:10.1038/s41467-024-47371-9.
How do I cluster sequences by similarity?
First, install DECIPHER and load the library in R. Next, set the "fas" variable to the path to the FASTA file of unaligned sequences (e.g., "~/mySeqs.fas"). Then choose a distance cutoff for clustering the sequences. Clusterize will output a cluster number for each input sequence and print an estimate of the clustering effectiveness.

    # load the DECIPHER library in R
    library(DECIPHER)

    # specify the path to the FASTA file (in quotes)
    fas <- "<<REPLACE WITH PATH TO FASTA FILE>>"

    # load the sequences from the file
    # change "AA" to "DNA" or "RNA" as needed
    seqs <- readAAStringSet(fas)

    # look at some of the sequences (optional)
    seqs

    # cluster the sequences
    clusters <- Clusterize(seqs,
        cutoff=0.5, # < 50% distant
        minCoverage=0.5, # > 50% coverage
        processors=NULL) # use all CPUs

    # view the cluster numbers
    head(clusters)

    # compute cluster statistics
    max(clusters) # number of clusters
    t <- table(clusters)
    mean(t) # average cluster size
    tail(sort(t)) # biggest clusters
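Once cluster numbers are available, a common next step is to group sequence names by cluster and pick one member of each. The following is a minimal base-R sketch of that step. It assumes the result is a data frame with one row per sequence, sequence names as row names, and a single column named "cluster"; the toy data frame below is a hypothetical stand-in for real Clusterize output, and the variable names (members, sizes, singletons, representatives) are illustrative only.

```r
# Toy stand-in for Clusterize output (hypothetical data, not real results):
# one row per sequence, row names are sequence names, one "cluster" column.
clusters <- data.frame(
  cluster = c(1L, 2L, 1L, 3L, 2L, 1L),
  row.names = c("seqA", "seqB", "seqC", "seqD", "seqE", "seqF")
)

# group sequence names by cluster number
members <- split(rownames(clusters), clusters$cluster)

# cluster sizes, largest first
sizes <- sort(lengths(members), decreasing = TRUE)

# clusters that contain only one sequence
singletons <- names(sizes)[sizes == 1]

# one arbitrary representative per cluster (simply the first member listed)
representatives <- vapply(members, `[`, character(1), 1)

print(sizes)
print(representatives)
```

Taking the first member as the representative is an arbitrary choice made for brevity; another criterion, such as the longest sequence in each cluster, may suit a given analysis better. If the sequence names are unique, the representatives vector could then be used to subset the original sequence set by name.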