Calculate CAZy family sequence diversity (Tutorial)

This page is a tutorial that walks through exploring the sequence diversity in the CAZy family PL20.

Note

All output directories are created by cazomevolve.

1. Build a local CAZyme database

Build a local CAZyme database containing all CAZyme records from CAZy using cazy_webscraper.

cazy_webscraper dummyemail@domain.com \
    -o cazy/cazy.db
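
Before moving on, it can be useful to confirm that the database file was written and is readable. The following is a minimal sketch, assuming cazy/cazy.db is a SQLite file; the helper name list_tables is hypothetical and is not part of cazy_webscraper or cazomevolve. It makes no assumptions about the database schema and only lists whatever tables are present.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the names of all tables in a SQLite database file.

    Hypothetical helper for illustration; assumes the file is SQLite.
    """
    path = Path(db_path)
    if not path.is_file():
        # sqlite3.connect would silently create an empty file, so check first.
        raise FileNotFoundError(f"No database found at {path}")
    connection = sqlite3.connect(path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]
```

Calling list_tables("cazy/cazy.db") after step 1 should return a non-empty list if the database was populated.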

2. Get protein sequences

Download protein sequences from NCBI for all proteins listed in CAZy family PL20, and write out the protein sequences to a multisequence FASTA file in the directory pl20_seqs.

Specifically, cazomevolve coordinates cazy_webscraper to do this.

A multisequence FASTA file is created per CAZy family by cazomevolve get_fam_seqs, with the standard name format <FAM IN CAZY FORMAT>.seqs.fasta, e.g. PL20.seqs.fasta.

cazomevolve get_fam_seqs \
    dummyemail@domain.com \
    cazy/cazy.db \
    PL20 \
    pl20_seqs \
    -f
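
To check how many sequences were retrieved, the FASTA file can be inspected with a few lines of Python. This is a minimal sketch using only the standard library; the function name read_fasta_lengths is hypothetical, and the sketch assumes a conventional FASTA layout in which the record ID is the first whitespace-delimited token after ">".

```python
def read_fasta_lengths(fasta_path):
    """Map each FASTA record ID to its sequence length.

    Hypothetical helper for illustration. If an ID appears more than
    once, the last record with that ID wins.
    """
    lengths = {}
    current_id = None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                header = line[1:].split()
                current_id = header[0] if header else ""
                lengths[current_id] = 0
            elif current_id is not None:
                # Sequence lines may be wrapped, so accumulate their lengths.
                lengths[current_id] += len(line)
    return lengths
```

For example, len(read_fasta_lengths("pl20_seqs/PL20.seqs.fasta")) gives the number of records, which helps when deciding between BLASTP and DIAMOND in the next step.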

3.A Run BLASTP

Run an all-versus-all sequence comparison analysis using BLASTP from NCBI BLAST+.

cazomevolve run_fam_blast \
    pl20_seqs/PL20.seqs.fasta \
    blast_output/pl20_blastp

3.B Run DIAMOND

Alternatively, run an all-versus-all sequence comparison analysis using DIAMOND, which is significantly faster than BLAST and is recommended for large data sets (e.g. more than 1000 sequences).

Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods. 2015 Jan;12(1):59-60. doi: 10.1038/nmeth.3176. Epub 2014 Nov 17. PMID: 25402007.

Use the run_fam_diamond subcommand, which takes three inputs:

1. Path to the FASTA file of protein sequences
2. Path at which to create the DIAMOND database
3. Path to write out the DIAMOND output in tab format

cazomevolve run_fam_diamond \
    pl20_seqs/PL20.seqs.fasta \
    diamond/diamond.db \
    testing/diamond/diamond_output

Note

The DIAMOND database and DIAMOND output file do not have to be assigned to the same output directory.
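
The tab-separated output can be read with the Python standard library. The sketch below is illustrative only: it assumes the default 12-column tabular layout used by BLAST and DIAMOND, which may not match the columns cazomevolve actually writes, so the columns argument should be adjusted to the real file. The names ASSUMED_COLUMNS and best_identity_per_pair are hypothetical.

```python
import csv

# ASSUMPTION: default 12-column BLAST/DIAMOND tabular order.
# The file written by cazomevolve may use different columns.
ASSUMED_COLUMNS = (
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
)


def best_identity_per_pair(tsv_path, columns=ASSUMED_COLUMNS):
    """Return {(query, subject): highest percent identity} from a hit table.

    Hypothetical helper for illustration. Requires 'qseqid', 'sseqid'
    and 'pident' to be among the supplied column names.
    """
    best = {}
    with open(tsv_path, newline="") as handle:
        # Skip blank lines and '#' comment lines before parsing.
        rows = (
            line for line in handle
            if line.strip() and not line.startswith("#")
        )
        reader = csv.DictReader(rows, fieldnames=columns, delimiter="\t")
        for row in reader:
            pair = (row["qseqid"], row["sseqid"])
            pident = float(row["pident"])
            if pident > best.get(pair, -1.0):
                best[pair] = pident
    return best
```

Keeping only the best-scoring hit per query and subject pair gives one value per pair, which is a convenient starting point for a pairwise comparison.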

Visualise sequence diversity

Use the functions from the cazomevolve.seq_diversity.explore submodule in a Python script or Jupyter notebook to visualise and interrogate the sequence diversity data.
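
To illustrate the kind of data structure such visualisations are built on, the sketch below arranges pairwise scores into a square matrix. It does not use or reproduce the cazomevolve.seq_diversity.explore functions; the name identity_matrix and the input format (a dictionary keyed by query and subject IDs) are assumptions made for this example.

```python
def identity_matrix(pair_scores, fill=0.0):
    """Build a square matrix (list of lists) from {(query, subject): score}.

    Hypothetical helper for illustration. Rows are queries, columns are
    subjects, and pairs with no recorded hit are set to 'fill'.
    """
    # Collect every sequence ID seen as a query or a subject.
    ids = sorted({seq_id for pair in pair_scores for seq_id in pair})
    index = {seq_id: i for i, seq_id in enumerate(ids)}
    matrix = [[fill] * len(ids) for _ in ids]
    for (query, subject), score in pair_scores.items():
        matrix[index[query]][index[subject]] = score
    return ids, matrix
```

The returned ID list labels both the rows and the columns, so the matrix can be passed to a plotting library of choice as a heatmap.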

More information can be found on the `Explore sequence diversity in CAZy families`_ page.

An example Jupyter notebook can be found here and can be used as a template.