Datasets for testing

AF091148.fsa is a tiny dataset with 180,704 nt in 1,403 seqs, from 103 to 137 nt (avg 129).

BioMarKs.fsa is a larger dataset with 119,117,805 nt in 312,503 seqs, from 2 to 532 nt (avg 381). It is compressed with gzip because of GitHub's 100 MB file size limit.

constaint.fsa.bz2 is a FASTA file for constraint checking, i.e. special cases that can break the code, such as reads that are 2,048 nt long. It is compressed with bzip2.

PR2-18S-rRNA-V4.fsa is the latest 18S rRNA V4 reference from PR2 (http://ssu-rrna.org/).

Rfam_9_1.fasta is a dataset with 33,931,362 nt in 192,445 seqs, from 20 to 1,250 nt (avg 176). It is Rfam release v9.1 (Gardner et al., 2009). It was used for USEARCH testing in Edgar's paper. Source: ftp://ftp.ebi.ac.uk/pub/databases/Rfam/9.1/Rfam.fasta.gz

Rfam_11_0.fasta is a dataset with 52,588,875 nt in 380,919 seqs, from 19 to 1,875 nt (avg 138). It is Rfam release v11.0 (Gardner et al., 2009). It was used for USEARCH testing (http://drive5.com/usearch/benchmark_rfam.html). Source: ftp://ftp.ebi.ac.uk/pub/databases/Rfam/11.0/Rfam.fasta.gz
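
The summary figures quoted above (total nt, number of seqs, shortest and longest sequence, average length) can be re-derived from any of these files. The following is a minimal sketch in Python, not the tool originally used to produce the numbers. The function names open_fasta and fasta_stats are hypothetical, and the sketch assumes plain FASTA input where header lines start with ">" and a sequence may span several lines.

```python
import bz2
import gzip
import sys


def open_fasta(path):
    """Open a FASTA file as text, choosing a decompressor from the extension."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    return open(path, "r")


def fasta_stats(handle):
    """Return (total_nt, n_seqs, min_len, max_len, rounded_avg) for a FASTA stream."""
    lengths = []
    current = None  # length of the sequence being read; None before the first header
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous sequence, if there was one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            # Sequence lines may be wrapped, so accumulate across lines.
            current += len(line)
    if current is not None:
        lengths.append(current)
    if not lengths:
        return (0, 0, 0, 0, 0)
    total = sum(lengths)
    return (total, len(lengths), min(lengths), max(lengths),
            round(total / len(lengths)))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (assumed invocation): python fasta_stats.py AF091148.fsa
    with open_fasta(sys.argv[1]) as fh:
        total, n, shortest, longest, avg = fasta_stats(fh)
    print(f"{total:,} nt in {n:,} seqs, from {shortest:,} to {longest:,} nt (avg {avg})")
```

Reading line by line keeps memory use small even for the larger files, since only one integer per sequence is stored.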