Explore sequence diversity in CAZy families (Usage)

To visualise the data generated from an all-versus-all sequence alignemnt analysis (from DIAMOND or BLAST), cazomevolve includes several functions that can be imported into a Python script of jupyter notebook.

These functions, including their imports and use, are described below. http://www.cazy.org/PL1_characterized.html

You can find a tutorial for exploring sequence diversity in CAZy families on the next page.

Load and parse DIAMOND and BLAST output

First load the data into a dataframe, and calculate the BLAST Score Ratio (BSR).

The BSR is the bitscore normalised for protein length, and enables us to compare alignments of pairs of proteins of different lengths.

Rasko DA, Myers GS, Ravel J. Visualization of comparative genomic analyses by BLAST score ratio. BMC Bioinformatics. 2005 Jan 5;6:2. doi: 10.1186/1471-2105-6-2. PMID: 15634352; PMCID: PMC545078.

Import from cazomevolve.seq_diversity.explore.data.

Load the data by passing the path (as a string) to the DIAMOND/BLAST output file and the name of the family/families of interest (as a string) to the function load_data().

def load_data(data_file, fam):
    """Load the output data from BLASTP+/diamond into a pandas dataframe, and calculate the BLAST score ratio.

    :param data_file: str, path to the DIAMOND output file
    :param fam: str, name of CAZy family

    Return a pandas dataframe
    """

Optionally, remove redundnant proteins using the function remove_redundant_prots(). To identify proteins listed in the structure and characterised table in CAZy, see the section “Get CAZy family data” below.

def remove_redunant_prots(df, fam, candidates={}, structured_prots={}, characterised_prots={}):
    """Identify groups of identical proteins, and retaining only one member per group

    If providing candidates, structure_prots, and/or characterised all proteins in these dicts
    will be kept even if protein seqs are identical.

    :param df: pandas df containing data from DIAMOND
    :param fam: str, name of CAZy family - group name in the CANDIDATES df

    :param candidates: dict {fam: [prot accessions of proteins of interest]}
    :param structured_prots: dict {fam: [prot acc of proteins listed in the structure table in CAZy]}
    :param structured_prots: dict {fam: [prot acc of proteins listed in the characterised table in CAZy]}

    return df
    """

remove_redundant_prots() takes two positional arguments:

The dataframe of DIAMOND/BLAST data
The names of the families/family of interest

To retain all proteins of interest (i.e. candidates), and structurally and/or functionally characterised proteins, provides these proteins to the respective key words candidates, structured_prots (struturally characterised proteins), and characterised_prots (characterised proteins, i.e. functionally characterised proteins).

To automate identifying functionally and structurally characterised proteins, see the section ‘Get CAZy family data’ below.

Get CAZy family data

Use the following functions for retrieving data about the CAZy family are imported from the cazomevolve.seq_diversity.explore.cazy module.

Get CAZy family protein accessions

Import from cazomevolve.seq_diversity.explore.cazy.

The function get_cazy_proteins retrieves the protein IDs from a FASTA file of protein sequences.

def get_cazy_proteins(fasta_file):
    """Retrieving NCBI protein accessions from FASTA file of CAZy proteins

    :param fasta_file: Path to fasta file of CAZy fam protein seqs

    Return list of NCBI protein accessions
    """
    prot_accs = []

    for record in SeqIO.parse(fasta_file, "fasta"):
        prot_accs.append(record.id)

    return list(set(prot_accs))

Get CAZy characterised proteins

Get a list of NCBI protein accessions for proteins listed on the CAZy family’s ‘characterised’ and/or ‘structure’ tables using the function get_cazy_db_prots(), and chose whether to retrieved proteins from the ‘Characterised’ and/or ‘Structure’ table.

Call get_cazy_db_prots(), and provide it the name of the family of interst.

Note

get_cazy_db_prots() must be called individually for each CAZy family of interest, because the CAZy family of interest is used to compile the correct URL to scrape data from CAZy

Warning

Provide the family name in the standard CAZy family format, therefore all letters must be capitalised. E.g. ‘GH1’ not ‘Gh1’ or ‘gh1’

To retrieve protein IDs from the ‘characterised’ table for the CAZy family on the CAZy website, set characterised to true.

# retrieve data from the characterised table for family PL1
get_cazy_db_proteins("PL1", characterised=True)

To retrieve protein IDs from the ‘structure’ table for the CAZy family on the CAZy website, set structured to true.

# retrieve data from the structured table for family PL3
get_cazy_db_proteins("PL3", structured=True)

You can retrieve data from the structured and characterised tables at the same time by setting both keywords to true:

# retrieve data from the characterised and structured tables for family CE5
get_cazy_db_proteins("CE5", characterised=True, structured=True)

Import from cazomevolve.seq_diversity.explore.cazy.

def get_cazy_db_prots(cazy_family, characterised=False, structured=False):
    """Get the NCBI protein accessions for proteins in the structure or characterised tables
    from the CAZy website.

    :param cazy_family: str, name of CAZy family in CAZy format, e.g. GH1 not gh1
    :param characterised: bool, retrieved proteins listed as 'characterised' in CAZy
    :param structured: bool, retrieve proteins listed with structures in CAZy

    Return list of NCBI protein accessions or None if fails
    """

Add the returned lists to a dictionary keyed by CAZy family names and valued by list of associated protein IDs.

struc_prots = {}
struc_prots['PL1'] = get_cazy_db_proteins("PL1", structured=True)

Build plots

You can use cazomevolve to assist in building plots to explore the sequence diversity across a CAZy family of interest.

Clustermap

Clustermaps are a species type of heatmap, blocks in the heatmap that have a similar score to one another are clustered together. This can help indicate clusters of proteins with similar proteins sequences to one another.

cazomevolve using the Python package Seaborn to build clustermaps, which uses the Python package Scipy (version 1.10.0) hierarchical cluster method for clustering the data.

We recommend using the clustermap for the BLAST Score Ratios.

Import from cazomevolve.seq_diversity.explore.plot.

Use the function plot_clustermap() to build clustermaps. The function takes 3 positional arguments:

Dataframe of BLAST/DIAMOND output
Name of the CAZy family/families
Name of the column in the dataframe containing the data to be plotted, e.g. ‘BSR’

To add additional row colours, added dictionaries of candidate proteins, structurally characterised proteins, and functionally characterised proteins to the candidates={}, structured_prots={}, and characterised_prots={} key words respectively.

These dictionaries are keyed by the name of the CAZy family and valued by a list of protein IDs.

def plot_clustermap(
    df,
    fam,
    varaible,
    title=None,
    colour_scheme='rocket_r',
    fig_size=(25, 25),
    save_fig=None,
    dpi=100,
    annotate=False,
    char_only=False,
    candidates={}, structured_prots={}, characterised_prots={},
    palette_dict=PALETTE_DICT,
):
    """Plot a cluster map for the specified variable

    :param df: pandas dataframe
    :param fam: str, CAZy family of interest
    :param variable: df, name of column containing the variable to plot
    :param title: str, default none. Title of plot
    :param colour_scheme: str, default rocket_r, seaborn colour scheme for plot
    :param fig_size: tuple, len 2, default (25, 10)
    :param save_fig: str, path to save file, default none, don't save fig
    :param dpi: int, default 100, resolution of saved file image
    :param annotate: bool, add annotation of protein candidates, and functionally/structurally
        characteirsed proteins
    :param char_only: bool, if set to true, only plot proteins labelled as candidates or
        functionally/structurally characteirsed proteins
    :param candidates: dict {fam: [prot accessions of proteins of interest]}
    :param structured_prots: dict {fam: [prot acc of proteins listed in the structure table in CAZy]}
    :param characterised_prots: dict {fam: [prot acc of proteins listed in the characterised table in CAZy]}

    Return seaborn plot
    """

Heatmap

To generate a heatmap with proteins plotted in the same order as the clustermap generated by plot_clustermap but plotting a different variable (e.g. plotting the query coverage or percentage identity while listing the proteins in the same order as they appear in BLAST Score Ratio clustermap) use the function plot_heatmap_of_clustermap.

Import from cazomevolve.seq_diversity.explore.plot.

def plot_heatmap_of_clustermap(
    fig,
    df,
    fam,
    varaible,
    title=None,
    colour_scheme='rocket_r',
    fig_size=(25, 25),
    save_fig=None,
    dpi=100,
    annotate=False,
    char_only=False,
    candidates={}, structured_prots={}, characterised_prots={},
    palette_dict=PALETTE_DICT,
):
    """Generate a heatmap for the defined variable, with proteins plotted in the same order as the provided
    clustermap (fig)

    :param fig: seaborn clustergrid of entire family, default None, clustermap,
    :param df: pandas dataframe
    :param fam: str, CAZy family of interest
    :param variable: df, name of column containing the variable to plot
    :param title: str, default none. Title of plot
    :param colour_scheme: str, default rocket_r, seaborn colour scheme for plot
    :param fig_size: tuple, len 2, default (25, 10)
    :param save_fig: str, path to save file, default none, don't save fig
    :param dpi: int, default 100, resolution of saved file image
    :param annotate: bool, add annotation of protein candidates, and functionally/structurally
        characteirsed proteins
    :param char_only: bool, if set to true, only plot proteins labelled as candidates or
        functionally/structurally characteirsed proteins
    :param candidates: dict {fam: [prot accessions of proteins of interest]}
    :param structured_prots: dict {fam: [prot acc of proteins listed in the structure table in CAZy]}
    :param characterised_prots: dict {fam: [prot acc of proteins listed in the characterised table in CAZy]}

    Return nothing
    """

The default palette used to annotate, candidate, characterised and structurally characterised proteins is defined in PALETTE_DICT:

# define the colour palettes for annotating proteins
PALETTE = sns.color_palette(['#425df5', '#eb8913', '#19bfb4', '#db0d4e', '#15ab62', '#ffffff'])
PALETTE_DICT = {
    'cand': '#425df5',  # candidates
    'struct': '#eb8913',  # protein with structures in RCSB PDB
    'structCand': '#19bfb4',  # candidates with structures in RCSB PDB
    'func': '#db0d4e',  # candidates listed as 'characterised' in CAZy
    'funcCand': '#15ab62',  # proteins listed as 'characterised' in CAZy
    'nothing': '#ffffff',  # nothing to note about this protein
}