Calculate CAZy family sequence diversity (Usage)

cazomevolve can be used to explore the sequence diversity across an individual CAZy family, or a group of families.

This page summarises the subcommands (including their required and optional arguments) needed to run cazomevolve to explore the sequence diversity across a set of CAZy families of interest.

Note

All output directories are created by cazomevolve.

Example

An example Jupyter notebook demonstrating how to use cazomevolve to explore the sequence diversity in the CAZy family PL20 can be found in the `GitHub repository <https://github.com/HobnobMancer/cazomevolve/blob/master/cazomevolve/seq_diversity/explore_seq_diversity.ipynb>`_ for cazomevolve, and can be used as a template.

The contents of the notebook can also be viewed in this documentation.

1. Build a local CAZyme database

Prior to exploring the sequence diversity in a family of interest, a local CAZyme database needs to be created using cazy_webscraper.

See the cazy_webscraper documentation for full details of operation.

Build a local CAZyme database containing all CAZyme records from CAZy using cazy_webscraper.

cazy_webscraper dummyemail@domain.com \
    -o cazy/cazy.db
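
Once the command finishes, it can be worth confirming that the database file was written before moving on. The sketch below assumes the file is a SQLite database saved at the path passed to ``-o`` above, and simply lists whatever tables it contains without assuming any table names. The helper name ``list_tables`` is hypothetical and is not part of cazy_webscraper or cazomevolve.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the names of all tables in a SQLite database file."""
    path = Path(db_path)
    if not path.is_file():
        raise FileNotFoundError(f"No database found at {path}")
    # Open read-only so that inspecting the database cannot modify it
    conn = sqlite3.connect(path.resolve().as_uri() + "?mode=ro", uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    # ASSUMPTION: path matches the -o value passed to cazy_webscraper
    db = "cazy/cazy.db"
    if Path(db).is_file():
        print(list_tables(db))
```

An empty list would suggest that the build did not complete and should be rerun.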

2. Get protein sequences

Download the protein sequences from NCBI for all proteins listed in the CAZy families of interest using the get_fam_seqs subcommand. cazomevolve coordinates cazy_webscraper to do this.

positional arguments:

email: User email address (required by NCBI)
cazy: Path to the local CAZyme database created by cazy_webscraper
families: Families to retrieve, separated by single commas, e.g. 'GH1,PL2,CE3'
outdir: Path to the directory to write the FASTA file to

optional arguments:

-h, --help: show this help message and exit
-f, --force: Force file overwriting (default: False)
-n, --nodelete: Enable/disable deletion of existing files (default: False)
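
The arguments listed above can be assembled programmatically, for example when the same download is repeated for several sets of families. The following is a minimal sketch, not the package's own code: it assumes the subcommand is invoked through an executable named ``cazomevolve``, that the positional arguments are passed in the order listed above, and that family names follow the usual CAZy class prefixes (subfamilies are not handled). The function name and the output directory ``fam_seqs`` are hypothetical.

```python
import re
import shlex

# CAZy class prefixes: glycoside hydrolases, glycosyltransferases,
# polysaccharide lyases, carbohydrate esterases, auxiliary activities,
# and carbohydrate-binding modules
FAMILY_PATTERN = re.compile(r"(GH|GT|PL|CE|AA|CBM)\d+")


def build_get_fam_seqs_cmd(email, cazy_db, families, outdir,
                           force=False, nodelete=False):
    """Build the argument list for the get_fam_seqs subcommand."""
    bad = [fam for fam in families if not FAMILY_PATTERN.fullmatch(fam)]
    if bad:
        raise ValueError(f"Unrecognised CAZy family names: {bad}")
    # ASSUMPTION: the subcommand is invoked through a `cazomevolve` executable
    cmd = ["cazomevolve", "get_fam_seqs", email, str(cazy_db),
           ",".join(families), str(outdir)]  # single commas, no spaces
    if force:
        cmd.append("--force")
    if nodelete:
        cmd.append("--nodelete")
    return cmd


cmd = build_get_fam_seqs_cmd(
    "dummyemail@domain.com", "cazy/cazy.db", ["GH1", "PL2", "CE3"], "fam_seqs",
    force=True,
)
# Print the command for inspection; pass `cmd` to subprocess.run to execute it
print(shlex.join(cmd))
```

Building the command as a list keeps the family string free of stray spaces, which matters because the families must be separated by single commas.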

3. Run all-versus-all sequence analysis

Use BLASTP (from NCBI BLAST+) or DIAMOND to run an all-versus-all sequence comparison.

For large data sets of more than 1000 sequences, we recommend using DIAMOND, which is a significantly faster alternative to BLASTP.

Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods. 2015 Jan;12(1):59-60. doi: 10.1038/nmeth.3176. Epub 2014 Nov 17. PMID: 25402007.

To run the all-versus-all sequence comparison using BLASTP from NCBI BLAST+, use the run_fam_blast subcommand.

positional arguments:

fasta: Path to FASTA file of protein sequences
outfile: Path to write the output file

optional arguments:

-h, --help: show this help message and exit

For DIAMOND use the run_fam_diamond subcommand.

positional arguments:

fasta: Path to FASTA file of protein sequences
diamond_db: Path to create the DIAMOND database
outfile: Path to write the output file

optional arguments:

-h, --help: show this help message and exit
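
The choice between the two subcommands can be made from the number of sequences in the FASTA file. The sketch below counts FASTA header lines and builds the matching argument list, using DIAMOND above the 1000-sequence guideline given earlier. It assumes the subcommands are invoked through an executable named ``cazomevolve`` and that the positional arguments are passed in the order listed above; the function names and file names are hypothetical.

```python
import tempfile
from pathlib import Path


def count_fasta_records(fasta_path):
    """Count sequences in a FASTA file by counting header lines."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def build_comparison_cmd(fasta, outfile, diamond_db=None, threshold=1000):
    """Pick run_fam_blast or run_fam_diamond from the number of sequences."""
    n_seqs = count_fasta_records(fasta)
    # ASSUMPTION: subcommands are invoked through a `cazomevolve` executable
    if n_seqs > threshold:
        if diamond_db is None:
            raise ValueError("run_fam_diamond needs a path for the DIAMOND db")
        return ["cazomevolve", "run_fam_diamond",
                str(fasta), str(diamond_db), str(outfile)]
    return ["cazomevolve", "run_fam_blast", str(fasta), str(outfile)]


if __name__ == "__main__":
    # Tiny illustrative FASTA file; real input comes from get_fam_seqs
    with tempfile.TemporaryDirectory() as tmp:
        fasta = Path(tmp) / "PL20.fasta"
        fasta.write_text(">seq1\nMKT\n>seq2\nMKV\n")
        print(build_comparison_cmd(fasta, Path(tmp) / "PL20_blast.tsv"))
```

The threshold is exposed as a parameter because the 1000-sequence figure is a recommendation rather than a hard limit.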

4. Visualise sequence diversity

Use the functions from the cazomevolve.seq_diversity.explore submodule in a Python script or jupyter notebook to visualise and interrogate the sequence diversity data.

More information can be found on the :ref:`explore sequence diversity in CAZy families` page.

An example Jupyter notebook can be found here and can be used as a template.
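
To give a sense of the kind of interrogation involved, the sketch below summarises pairwise comparison results using only the Python standard library. It does not use the cazomevolve.seq_diversity.explore functions, whose names and signatures are not covered on this page. The three-column, tab-separated layout (query, subject, percent identity) is an assumption made for illustration and is not the documented output format of run_fam_blast or run_fam_diamond; both function names are hypothetical.

```python
import csv
from collections import defaultdict


def load_identity_matrix(tsv_path):
    """Read (query, subject, percent identity) rows into a nested dict."""
    matrix = defaultdict(dict)
    with open(tsv_path, newline="") as handle:
        # ASSUMPTION: exactly three tab-separated columns per row
        for query, subject, pident in csv.reader(handle, delimiter="\t"):
            score = float(pident)
            # Keep the best-scoring hit if a pair appears more than once
            if score > matrix[query].get(subject, -1.0):
                matrix[query][subject] = score
    return matrix


def mean_pairwise_identity(matrix):
    """Mean percent identity over all non-self pairs (0.0 if there are none)."""
    scores = [
        score
        for query, hits in matrix.items()
        for subject, score in hits.items()
        if query != subject
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

A low mean identity across a family points to high sequence diversity; the notebook referenced above remains the place to look for the package's own plotting and analysis functions.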