Calculate CAZy family sequence diversity (Usage)
cazomevolve can be used to explore the sequence diversity across an individual CAZy family, or
a group of families.
This page summarises the subcommands (including required and optional arguments), needed to coordinate
cazomevovle to explore the seqence diversity across a set of CAZy families of interest.
Note
All output directories are created by cazomevolve.
Example
An example jupyter notebook demonstrated how to use cazomevolve to explore the sequence diversity in the
CAZy family PL20 can be found in the `GitHub repository<https://github.com/HobnobMancer/cazomevolve/blob/master/cazomevolve/seq_diversity/explore_seq_diversity.ipynb>`_ for cazomevolve, and can be used as a template.
The contents of the notebook can also be viewed in this documentation.
1. Build a local CAZyme database
Prior to exploring the sequence diversity in a family of interest, a local CAZyme database
needs to be created using cazy_webscraper.
See the cazy_webscraper documentation for full details of operation.
Build a local CAZyme database containing all CAZyme records from CAZy using cazy_webscraper.
cazy_webscraper dummyemail@domain.com \
-o cazy/cazy.db
2. Get protein sequences
Download protein sequences from NCBI for all proteins listed in the CAZy families of interest using
get_fam_seqs subcomamnd. Specifically, cazomevolve coordinates cazy_webscraper to do this.
- positional arguments:
email User email address (Required by NCBI) cazy Path to local CAZyme db createed by cazy_webscraper families Families to retrieve, separated by single comma e.g ‘GH1,PL2,CE3’ outdir Path to dir to write out FASTA file
- optional arguments:
- -h, --help
show this help message and exit
- -f, --force
Force file over writting (default: False)
- -n, --nodelete
enable/disable deletion of exisiting files (default: False)
3. Run all-versus-all seq analysis
Use BLASTP (via NCBI+) or DIAMOND to run a all-vs-all sequence comparison.
For large data sets of +1000 sequences, we recommend using DIAMOND, which is a significantly faster version of BLAST.
Run an all-versus-all sequence comparison analysis using BLASTP from NCBI+.
Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods. 2015 Jan;12(1):59-60. doi: 10.1038/nmeth.3176. Epub 2014 Nov 17. PMID: 25402007.
For BLASTP use the run_fam_blast subcommand.
- positional arguments:
fasta Path to fasta file of protein seqs outfile Path to write out output file
- optional arguments:
- -h, --help
show this help message and exit
For DIAMOND use the run_fam_diamond subcommand.
- positional arguments:
fasta Path to fasta file of protein seqs diamond_db Path to create diamond DB outfile Path to write out output file
- optional arguments:
- -h, --help
show this help message and exit
Visualise sequence diversity
Use the functions from the cazomevolve.seq_diversity.explore submodule in a Python script
or jupyter notebook to visualise and interrogate the sequence diversity data.
More inforamtion can be found on the :ref:`explore sequence diversity in CAZy families`_. page
An example jupyter notebook can be found here and which can be used as a template.