Welcome to cazomevolve’s documentation!
cazomevolve
cazomevolve (“cazome-evolve”) is a Python3 package for the automated annotation and exploratory
analysis of the CAZyme complements (CAZomes) for a set of species and/or genomes of interest.
Carbohydrate Active enZymes (CAZymes) are a subset of proteins that generate, modify and/or degrade carbohydrates. They are pivotal in many biological processes, including energy metabolism, cell structure, signalling and pathogen recognition, and are therefore of significant biological and industrial interest.
CAZy (www.cazy.org) is the most comprehensive CAZyme database, grouping proteins by sequence similarity
into CAZy families. Therefore, CAZy family annotations correspond to presumed shared mechanism and
structural fold.
cazomevolve uses the CAZy family annotations to standardise how the range of CAZyme functions within a CAZome
is summarised and captured, which enables the comparison of CAZomes
across a set of genomes in order to identify groups of CAZy families of biological and/or industrial interest.
cazomevolve can also automate the process of exploring sequence diversity in an individual CAZy family
or across a set of CAZy families of interest. This enables the systematic identification of CAZymes that are potentially not yet functionally
or structurally represented in CAZy, as well as exploring relationships across CAZy families.
Use cazomevolve to explore:
CAZome sizes:
Compare the number of CAZymes and CAZy families
Calculate the proportion of the proteome encompassed by the CAZome
Compute the CAZy family to CAZyme ratio
CAZy class frequencies:
Calculate the number of CAZymes per CAZy class
Plot a proportional area plot of CAZy class frequency broken down by CAZy class and user-defined group delineations (e.g. by genus or species)
CAZy families:
Explore sequence diversity within a set of CAZy families:
* Run all-vs-all sequence comparison analyses
* Cluster the sequences by degree of sequence similarity
* Generate clustermaps of sequence identity, BLAST-score ratio, and coverage
Explore CAZy family frequencies:
* Compute the number of CAZymes per CAZy family
* Identify lineage or group specific families - e.g. genus or species specific families
* Identify core CAZomes - families that appear in all genomes
* Cluster genomes by CAZy family frequencies using hierarchical clustering:
    * Generate annotated clustermaps of CAZy family frequencies
    * Build a dendrogram using distances calculated from the CAZome composition
    * Construct tanglegrams to compare the CAZy family dendrogram to an ANI-based dendrogram or phylogenetic tree
Always co-occurring families:
Identify CAZy families that are always present in the CAZome together
Find lineage or group specific groups of co-occurring families
Construct an upset plot of co-occurring families
Calculate the number of genomes each group of co-occurring families appears in
Principal component analysis (PCA):
Use PCA to identify overall trends in the large and complex data set
Project genomes onto user-selected principal components (PCs)
Construct loadings plots to explore correlation between CAZy families and PCs
Identify relationships between groups of CAZy families
Explore associations between CAZy families and lineage, phenotype, and niche adaptation
Co-evolving CAZy families:
Generate the input file (a tab-delimited list of genomes and CAZy families) required by coinfinder (Whelan _et al._)
Optionally add taxonomic data to the tab-delimited list, to include taxa in the coinfinder output
Reconstruct phylogenetic trees to be used as input by coinfinder:
* Reconstruct a multi-gene phylogenetic tree using RAxML-NG
* Construct an ANI-based dendrogram
Use coinfinder to identify CAZy families that appear together in a genome more often than expected by lineage and chance
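As an illustration of the tab-delimited genome and CAZy family list mentioned above, the following is a minimal sketch using only the Python standard library. It is not the cazomevolve implementation: the genome accessions and annotations are hypothetical, and the column order (family, then genome) is an assumption that should be checked against the coinfinder documentation.

```python
import csv
import io

# Hypothetical CAZy family annotations per genome (genome accession -> families).
cazomes = {
    "GCF_000000001.1": ["GH1", "GH3", "PL1"],
    "GCF_000000002.1": ["GH1", "PL1", "PL1"],  # PL1 annotated on two proteins
}


def write_family_genome_pairs(cazomes, handle):
    """Write one 'family<TAB>genome' line per unique family-genome pair.

    The column order is an assumption made for this sketch.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for genome in sorted(cazomes):
        # Presence/absence only: duplicate family annotations are collapsed.
        for family in sorted(set(cazomes[genome])):
            writer.writerow([family, genome])


buffer = io.StringIO()
write_family_genome_pairs(cazomes, buffer)
print(buffer.getvalue(), end="")
```

Collapsing duplicates reflects that a co-occurrence analysis of this kind works on presence and absence; if family counts matter for a given analysis, a separate frequency table would be needed.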
Example of use and application
An example of using cazomevolve to explore relationships and compare the CAZomes of a diverse set of bacteria can be found within the
Supplementary Information Hobbs et al., _Pectobacteriaceae_ repository.
This repository houses the code, data and analyses augmented and completed during the exploration of over 700 _Pectobacteriaceae_ CAZomes, in which a potential association between the composition of the CAZome and the plant host range of these plant pathogens was identified.
Subcommand summary
cazomevolve is configured through a series of subcommands. Below is a list of the subcommands
included in cazomevolve (excluding required and optional arguments).
Explore CAZy family sequence diversity
get_fam_seqs - retrieve the protein sequences for all proteins in a CAZy family of interest.
run_fam_blast and run_fam_diamond - perform an all-versus-all sequence comparison analysis using BLAST or DIAMOND.
Use functions available in the cazomevolve.seq_diversity.explore module to explore the sequence diversity.
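To make the BLAST score ratio (BSR) used in the sequence diversity analysis concrete, here is a minimal sketch following the general definition from Rasko et al.: the bit score of a query-subject hit divided by the bit score of the query aligned against itself. The hit data and the function name are hypothetical, and this is not necessarily how the cazomevolve.seq_diversity.explore module computes it.

```python
# Hypothetical all-vs-all hits: (query, subject, bit score).
hits = [
    ("protA", "protA", 200.0),
    ("protA", "protB", 150.0),
    ("protB", "protB", 180.0),
    ("protB", "protA", 150.0),
]


def blast_score_ratios(hits):
    """Return {(query, subject): bit score / bit score of the query self-hit}."""
    self_scores = {query: score for query, subject, score in hits if query == subject}
    ratios = {}
    for query, subject, score in hits:
        self_score = self_scores.get(query)
        if not self_score:
            # No self-hit (or a zero score) means the ratio is undefined; skip it.
            continue
        ratios[(query, subject)] = score / self_score
    return ratios


bsr = blast_score_ratios(hits)
print(bsr[("protA", "protB")])  # 150 / 200
```

Note that the ratio is not symmetric: the protA-to-protB value differs from the protB-to-protA value whenever the two self-hit scores differ.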
Annotate CAZomes
Download genomes
Use download_acc_genomes to download the genomes for the genomic version accessions listed in a
plain text file.
Use download_genomes to download all genomes associated with a search term in the NCBI Assembly database.
For example, search with the term “_Pectobacteriaceae_” to retrieve all
_Pectobacteriaceae_ genomes in NCBI.
Retrieve CAZy annotations
Build a local CAZyme database containing all CAZyme records from the CAZy database using build_cazy_db.
Then extract the CAZy family annotations from the local CAZyme database for annotated protein sequences in a set of assembly files (specifically
the proteome FASTA files) using get_cazy_cazymes.
Get dbCAN predicted CAZy family annotations
Use run_dbcan to automate running dbCAN (versions 2, 3, or 4) over a set of proteome FASTA files. Parse
the output from dbCAN to extract the consensus CAZy family annotations (i.e. annotations that at least two
tools agree upon) using get_dbcan_cazymes.
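The consensus rule described above (keep an annotation only when at least two tools agree on it) can be sketched as follows. This is an illustration only and does not parse real dbCAN output: the protein IDs and predictions are hypothetical, and the tool labels are used simply as example keys.

```python
from collections import Counter

# Hypothetical per-tool CAZy family predictions for each protein.
predictions = {
    "prot1": {"HMMER": {"GH5"}, "DIAMOND": {"GH5"}, "Hotpep": {"GH9"}},
    "prot2": {"HMMER": {"PL1"}, "DIAMOND": set(), "Hotpep": set()},
    "prot3": {
        "HMMER": {"GH13", "CBM48"},
        "DIAMOND": {"GH13", "CBM48"},
        "Hotpep": {"GH13"},
    },
}


def consensus_families(tool_predictions, min_tools=2):
    """Return the families predicted by at least `min_tools` tools."""
    # Each tool contributes a set, so a family is counted once per tool.
    counts = Counter(fam for fams in tool_predictions.values() for fam in fams)
    return {fam for fam, n in counts.items() if n >= min_tools}


consensus = {prot: consensus_families(tools) for prot, tools in predictions.items()}
for protein, families in consensus.items():
    print(protein, sorted(families))
```

In this sketch a protein with no consensus family (such as prot2) yields an empty set; whether such proteins are dropped or reported is a choice for the downstream analysis.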
Explore and compare CAZomes
Import and use functions available in the cazomevolve.cazome.explore module to explore and compare:
CAZome sizes
CAZy class distributions
CAZy family frequencies
Lineage specific CAZy families
Groups of CAZy families that always co-occur
Perform Principal Component Analysis (PCA) to explore trends across the data set
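The quantities listed above can be illustrated with a small standard-library sketch. It does not use the cazomevolve.cazome.explore functions; all data and function names are hypothetical, and "always co-occurring" is interpreted here (as an assumption) as families sharing an identical presence/absence profile across the genomes.

```python
from collections import defaultdict

# Hypothetical annotations: genome -> {protein ID: set of CAZy families}.
annotations = {
    "genome_1": {"p1": {"GH1"}, "p2": {"GH1"}, "p3": {"PL1", "CBM5"}},
    "genome_2": {"p4": {"GH1"}, "p5": {"PL1", "CBM5"}},
    "genome_3": {"p6": {"GH1"}, "p7": {"GT2"}},
}
# Hypothetical total number of proteins per genome (proteome size).
proteome_sizes = {"genome_1": 30, "genome_2": 20, "genome_3": 40}


def cazome_summary(genome):
    """Summarise the CAZome size of one genome."""
    proteins = annotations[genome]
    families = set().union(*proteins.values())
    return {
        "cazymes": len(proteins),
        "families": len(families),
        "proteome_fraction": len(proteins) / proteome_sizes[genome],
        "family_to_cazyme_ratio": len(families) / len(proteins),
    }


def families_per_genome():
    """Return {genome: set of CAZy families present in that genome}."""
    return {g: set().union(*prots.values()) for g, prots in annotations.items()}


def core_cazome():
    """Families that appear in every genome."""
    return set.intersection(*families_per_genome().values())


def always_cooccurring():
    """Groups of two or more families with identical presence/absence profiles."""
    fam_sets = families_per_genome()
    profiles = defaultdict(set)
    for fam in set().union(*fam_sets.values()):
        profile = frozenset(g for g, fams in fam_sets.items() if fam in fams)
        profiles[profile].add(fam)
    return [fams for fams in profiles.values() if len(fams) > 1]


print(cazome_summary("genome_1"))
print(core_cazome())
print(always_cooccurring())
```

With this toy data, GH1 is the only core family, and PL1 and CBM5 form one co-occurring group because both are found in exactly the same two genomes.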
Installation
The easiest way to install cazomevolve is via PyPI.
pip install cazomevolve
Alternatively, cazomevolve can be installed from source:
git clone https://github.com/HobnobMancer/cazomevolve.git
pip install -e cazomevolve/.
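After installing by either route, the installed version can be checked from Python, which is also useful when recording the specific version used in an analysis. This is a minimal sketch using the standard library importlib.metadata module; the helper name installed_version is hypothetical and is not part of cazomevolve.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string if cazomevolve is installed, otherwise None.
print(installed_version("cazomevolve"))
```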
Documentation
Citing and citations
If you use cazomevolve in your work _please_ cite our work (including the provided DOI), as well as
the specific version of the tool you used. This is not only helpful to us as the developers to get our
work out into the world, but it is also essential for the reproducibility and integrity of scientific research.
Citation:
Hobbs, Emma E. M., Gloster, Tracey M., Pritchard, Leighton (2023) cazomevolve, _GitHub_. DOI: 10.5281/zenodo.6614827
cazomevolve depends on a number of tools. To recognise the contributions that the
authors and developers have made, please also cite the following:
CAZy:
cazomevolve uses the CAZy family classifications established and curated by the CAZy database.
Elodie Drula and others, The carbohydrate-active enzyme database: functions and literature, Nucleic Acids Research, Volume 50, Issue D1, 7 January 2022, Pages D571–D577, https://doi.org/10.1093/nar/gkab1045
cazy_webscraper:
Hobbs, E. E. M., Gloster, T. M., and Pritchard, L. (2022) ‘cazy_webscraper: local compilation and interrogation of comprehensive CAZyme datasets’, bioRxiv, https://doi.org/10.1101/2022.12.02.518825
For additional citations when using cazy_webscraper, see the cazy_webscraper documentation.
ncbi-genome-download:
Blin et al. (2017) ncbi-genome-download, https://github.com/kblin/ncbi-genome-download
dbCAN:
dbCAN version 2:
Zhang H, Yohe T, Huang L, Entwistle S, Wu P, Yang Z, Busk PK, Xu Y, Yin Y. dbCAN2: a meta server for automated carbohydrate-active enzyme annotation. Nucleic Acids Res. 2018 Jul 2;46(W1):W95-W101. doi: 10.1093/nar/gky418. PMID: 29771380; PMCID: PMC6031026.
dbCAN version 3:
If using dbCAN version 3, cite the publication for version 2 as well as eCAMI:
Xu J, Zhang H, Zheng J, Dovoedo P, Yin Y. eCAMI: simultaneous classification and motif identification for enzyme annotation. Bioinformatics. 2020 Apr 1;36(7):2068-2075. doi: 10.1093/bioinformatics/btz908. PMID: 31794006.
dbCAN version 4:
Zheng J, Ge Q, Yan Y, Zhang X, Huang L, Yin Y. dbCAN3: automated carbohydrate-active enzyme and substrate annotation. Nucleic Acids Res. 2023 May 1:gkad328. doi: 10.1093/nar/gkad328. Epub ahead of print. PMID: 37125649.
BLAST Score Ratio:
Rasko DA, Myers GS, Ravel J. Visualization of comparative genomic analyses by BLAST score ratio. BMC Bioinformatics. 2005 Jan 5;6:2. doi: 10.1186/1471-2105-6-2. PMID: 15634352; PMCID: PMC545078.
DIAMOND
Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat Methods. 2015 Jan;12(1):59-60. doi: 10.1038/nmeth.3176. Epub 2014 Nov 17. PMID: 25402007.
BLAST
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. Journal of molecular biology. 1990;215(3):403–10.
Explore CAZy family sequence diversity and CAZomes:
cazomevolve uses several packages to visualise and interrogate the dataset.
Waskom, M. L. (2021) seaborn: statistical data visualization, Journal of Open Source Software, 6(60), pp. 3021
Virtanen et al. (2020), SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python, Nature Methods, 17, pp.261-272
Pedregosa et al., (2011), Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, 12, pp.2825-2830.
Development and issues
If there are additional features you wish to be added, you have problems with the tool, or would like to contribute, please raise an issue at the GitHub repository.
Issues page: https://github.com/HobnobMancer/cazomevolve/issues.