AAlogo.get_df_logo

AAlogo.get_df_logo(df_parts, labels=None, label_test=1, tmd_len=None, start_n=True, characters_to_ignore='.-', pseudocount=0.0)

Compute a sequence logo matrix for the provided sequence parts.

For each residue position, the relative frequency (or another encoding) of each amino acid is computed across all sequences. If variable-length target middle domain (TMD) sequences are provided, they are aligned to a uniform length via N- or C-terminal padding before computing the logo.

Added in version 1.1.0.

Parameters:

df_parts (pd.DataFrame, shape (n_samples, n_parts)) – Sequence parts DataFrame with at least one column from the standard parts: jmd_n, tmd, jmd_c. Must be a valid parts DataFrame.
labels (array-like, shape (n_samples,), optional) – Class labels for samples in df_parts. If provided, only samples with label_test are included in the logo computation.
label_test (int, default=1) – Class label of the test group to select from labels.
tmd_len (int, optional) – Fixed length (>=1) to align all TMD sequences. If None, the maximum TMD length in df_parts is used. Only relevant if tmd column is present.
start_n (bool, default=True) –
Alignment direction for variable-length TMDs:
- True: Align from N-terminus (C-terminal padding with gaps).
- False: Align from C-terminus (N-terminal padding with gaps).
characters_to_ignore (str, default='.-') – Characters excluded from the logo matrix computation.
pseudocount (float, default=0.0) – Pseudocount (>=0) added to all amino acid counts to avoid log(0) issues.

Returns:

df_logo – Logo matrix with residue positions as rows and amino acids as columns.

Return type:

pd.DataFrame, shape (n_positions, n_amino_acids)

See also

AAlogo.get_df_logo_info(): for per-position information content.
logomaker.alignment_to_matrix: the underlying matrix computation function.

Examples

The AAlogo.get_df_logo method computes a sequence logo matrix from a df_parts DataFrame using logomaker.alignment_to_matrix. Each row is a residue position and each column an amino acid. The encoding is controlled by the logo_type set when AAlogo is initialized.

import warnings
warnings.filterwarnings('ignore')
import aaanalysis as aa
import logomaker
import matplotlib.pyplot as plt

aa.plot_settings()

sf = aa.SequenceFeature()
df_seq = aa.load_dataset(name="DOM_GSEC", n=100)
labels = df_seq["label"].values
df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=["jmd_n", "tmd", "jmd_c"])
df_parts.head(3)

	jmd_n	tmd	jmd_c
entry
Q14802	NSPFYYDWHS	LQVGGLICAGVLCAMGIIIVMSA	KCKCKFGQKS
Q86UE4	LGLEPKRYPG	WVILVGTGALGLLLLFLLGYGWA	AACAGARKKR
Q969W9	FQSMEITELE	FVQIIIIVVVMMVMVVVITCLLS	HYKLSARSFI

logo_type controls how amino acid frequencies are encoded. It is set when AAlogo is initialized and passed directly to logomaker.alignment_to_matrix(to_type=logo_type):

'probability' (default): each position sums to 1
'counts': raw amino acid counts
'weight': log-odds weight matrix
'information': information content in bits

The effect is visible in the sum across amino acids at each position (the row sums of df_logo, printed as col_sum):

for logo_type in ["probability", "counts", "weight", "information"]:
    df_logo = aa.AAlogo(logo_type=logo_type).get_df_logo(df_parts=df_parts)
    col_sum = df_logo.sum(axis=1)  # sum across amino acids per position
    print(f"logo_type='{logo_type}': shape={df_logo.shape}, "
          f"col_sum min={col_sum.min():.3f}, max={col_sum.max():.3f}")

logo_type='probability': shape=(43, 20), col_sum min=1.000, max=1.000
logo_type='counts': shape=(43, 20), col_sum min=122.000, max=126.000
logo_type='weight': shape=(43, 20), col_sum min=-10171.195, max=-5.633
logo_type='information': shape=(43, 20), col_sum min=0.251, max=1.555
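The relationship between the four encodings can be sketched for a single position in plain Python. This is an illustration under stated assumptions, not logomaker's implementation: it assumes a uniform background over the 20 canonical amino acids, and the helper names (position_counts, to_probability, to_weight, to_information) and the toy alignment are hypothetical. logomaker may handle unobserved residues differently; the finite 'weight' sums printed above suggest it does not return -inf as this sketch does.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 canonical residues
BACKGROUND = 1 / len(AMINO_ACIDS)     # assumption: uniform background frequency


def position_counts(seqs, pos):
    """Count each amino acid at one alignment position."""
    observed = Counter(seq[pos] for seq in seqs)
    return {aa: observed.get(aa, 0) for aa in AMINO_ACIDS}


def to_probability(counts):
    """Relative frequencies; they sum to 1 at every position."""
    total = sum(counts.values())
    return {aa: c / total for aa, c in counts.items()}


def to_weight(prob):
    """Log-odds against the background; unobserved residues give -inf here."""
    return {aa: math.log2(p / BACKGROUND) if p > 0 else float("-inf")
            for aa, p in prob.items()}


def to_information(prob):
    """Scale each probability by the position's total information in bits."""
    total_bits = sum(p * math.log2(p / BACKGROUND) for p in prob.values() if p > 0)
    return {aa: p * total_bits for aa, p in prob.items()}


seqs = ["AL", "AV", "GL", "AL"]  # toy alignment, not taken from DOM_GSEC
counts = position_counts(seqs, pos=0)
prob = to_probability(counts)
weight = to_weight(prob)
info = to_information(prob)

print("counts sum:     ", sum(counts.values()))
print("probability sum:", round(sum(prob.values()), 6))
print("information sum:", round(sum(info.values()), 3), "bits")
```

Under these definitions the per-position sum equals the number of counted sequences for 'counts', 1 for 'probability', and the position's total information content for 'information', which matches the pattern of the sums printed above.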

df_parts is a DataFrame with at least one of the jmd_n, tmd, and jmd_c columns. For each row, the part columns are concatenated into a single sequence before the logo is computed. The number of positions in the result equals the total concatenated sequence length.

aal = aa.AAlogo()

# All three parts: n_positions = len(jmd_n) + len(tmd) + len(jmd_c)
df_parts_all = sf.get_df_parts(df_seq=df_seq, list_parts=["jmd_n", "tmd", "jmd_c"])
df_logo_all = aal.get_df_logo(df_parts=df_parts_all)

# TMD only: n_positions = len(tmd)
df_parts_tmd = sf.get_df_parts(df_seq=df_seq, list_parts=["tmd"])
df_logo_tmd = aal.get_df_logo(df_parts=df_parts_tmd)

print(f"jmd_n + tmd + jmd_c: {df_logo_all.shape[0]} positions")
print(f"tmd only:            {df_logo_tmd.shape[0]} positions")

jmd_n + tmd + jmd_c: 43 positions
tmd only:            23 positions
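The position counts above follow from joining the parts of each sample into one string. A minimal sketch, using the three rows shown by df_parts.head(3) as plain dictionaries; the helper concat_parts is hypothetical, and the N- to C-terminal join order (jmd_n, tmd, jmd_c) is an assumption:

```python
PARTS = ["jmd_n", "tmd", "jmd_c"]

# The three rows printed by df_parts.head(3), as plain dictionaries
rows = {
    "Q14802": {"jmd_n": "NSPFYYDWHS", "tmd": "LQVGGLICAGVLCAMGIIIVMSA", "jmd_c": "KCKCKFGQKS"},
    "Q86UE4": {"jmd_n": "LGLEPKRYPG", "tmd": "WVILVGTGALGLLLLFLLGYGWA", "jmd_c": "AACAGARKKR"},
    "Q969W9": {"jmd_n": "FQSMEITELE", "tmd": "FVQIIIIVVVMMVMVVVITCLLS", "jmd_c": "HYKLSARSFI"},
}


def concat_parts(row, parts=PARTS):
    """Join the selected parts of one sample (assumed N- to C-terminal order)."""
    return "".join(row[part] for part in parts)


full = {entry: concat_parts(row) for entry, row in rows.items()}
tmd_only = {entry: concat_parts(row, parts=["tmd"]) for entry, row in rows.items()}

for entry in rows:
    print(entry, len(full[entry]), len(tmd_only[entry]))
```

Each of these rows yields 43 characters for all three parts (10 + 23 + 10) and 23 for the TMD alone.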

labels filters rows of df_parts to only those where labels == label_test before computing the logo. This allows group-specific logos.

# Without labels: all samples in df_parts
df_logo_all = aal.get_df_logo(df_parts=df_parts)

# label_test=1: only positive samples
df_logo_pos = aal.get_df_logo(df_parts=df_parts, labels=labels, label_test=1)

# label_test=0: only negative samples
df_logo_neg = aal.get_df_logo(df_parts=df_parts, labels=labels, label_test=0)

n_pos = (labels == 1).sum()
n_neg = (labels == 0).sum()
print(f"No filter (n={len(df_parts)}): position 0 max prob = {df_logo_all.iloc[0].max():.4f}")
print(f"label=1   (n={n_pos}):  position 0 max prob = {df_logo_pos.iloc[0].max():.4f}")
print(f"label=0   (n={n_neg}):  position 0 max prob = {df_logo_neg.iloc[0].max():.4f}")

No filter (n=126): position 0 max prob = 0.1111
label=1   (n=63):  position 0 max prob = 0.1429
label=0   (n=63):  position 0 max prob = 0.1270
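The filtering step can be illustrated without the library. The following sketch uses a toy alignment and hypothetical helpers (select_by_label, top_residue); it only shows why a pooled logo and a group-specific logo can disagree at the same position:

```python
from collections import Counter


def select_by_label(seqs, labels, label_test=1):
    """Keep only the sequences whose label equals label_test."""
    return [seq for seq, label in zip(seqs, labels) if label == label_test]


def top_residue(seqs, pos=0):
    """Most frequent residue at a position and its relative frequency."""
    counts = Counter(seq[pos] for seq in seqs)
    residue, n = counts.most_common(1)[0]
    return residue, n / len(seqs)


# Toy data, not taken from DOM_GSEC
seqs = ["ALK", "AVK", "GLR", "GLK", "GLR", "SLK"]
labels = [1, 1, 1, 0, 0, 0]

print("no filter:", top_residue(seqs))
print("label=1:  ", top_residue(select_by_label(seqs, labels, label_test=1)))
print("label=0:  ", top_residue(select_by_label(seqs, labels, label_test=0)))
```

In this toy case G dominates position 0 in the pooled data, while A dominates among the samples with label 1.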

TMD sequences can vary in length. tmd_len truncates or pads all TMD sequences to this fixed length before concatenation. If None (default), the maximum TMD length across all samples is used — no truncation, only padding of shorter sequences.

The effect is visible in the number of positions in the output:

tmd_lengths = df_parts["tmd"].apply(len)
print(f"TMD lengths: min={tmd_lengths.min()}, max={tmd_lengths.max()}, "
      f"mean={tmd_lengths.mean():.1f}")
print()
for tmd_len in [None, tmd_lengths.max(), tmd_lengths.min()]:
    df_logo = aal.get_df_logo(df_parts=df_parts, tmd_len=tmd_len)
    jmd_len = 10 + 10  # jmd_n + jmd_c
    actual_tmd = df_logo.shape[0] - jmd_len
    print(f"tmd_len={str(tmd_len):>4}: total positions={df_logo.shape[0]}, TMD positions={actual_tmd}")

TMD lengths: min=18, max=23, mean=22.9

tmd_len=None: total positions=43, TMD positions=23
tmd_len=  23: total positions=43, TMD positions=23
tmd_len=  18: total positions=38, TMD positions=18

When tmd_len is smaller than the actual TMD length, sequences are truncated. start_n=True keeps the N-terminal end; start_n=False keeps the C-terminal end. When all TMDs have the same length, this parameter has no effect.

The difference is visible in which amino acids appear at the boundaries:

# Only relevant when tmd_len < actual TMD length
tmd_len_short = tmd_lengths.min()  # truncate to shortest TMD

df_logo_n = aal.get_df_logo(df_parts=df_parts, tmd_len=tmd_len_short, start_n=True)
df_logo_c = aal.get_df_logo(df_parts=df_parts, tmd_len=tmd_len_short, start_n=False)

# The most frequent amino acid at the first TMD position differs
jmd_n_len = df_parts["jmd_n"].apply(len).max()
first_tmd_pos = jmd_n_len  # 0-indexed
print(f"First TMD position (index {first_tmd_pos}):")
print(f"  start_n=True  (N-anchor): top AA = {df_logo_n.iloc[first_tmd_pos].idxmax()}")
print(f"  start_n=False (C-anchor): top AA = {df_logo_c.iloc[first_tmd_pos].idxmax()}")

First TMD position (index 10):
  start_n=True  (N-anchor): top AA = L
  start_n=False (C-anchor): top AA = V
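The padding and truncation behavior described for tmd_len and start_n can be sketched as a small helper. This is an assumption-based illustration of the documented behavior, not the library's code; align_tmd is a hypothetical name, and the gap character '-' follows the description of padding gaps on this page:

```python
def align_tmd(tmd, tmd_len, start_n=True, gap="-"):
    """Bring one TMD sequence to exactly tmd_len characters (tmd_len >= 1).

    start_n=True  anchors at the N-terminus: keep the first residues,
                  pad with gaps on the C-terminal side.
    start_n=False anchors at the C-terminus: keep the last residues,
                  pad with gaps on the N-terminal side.
    """
    if len(tmd) >= tmd_len:
        # Truncate: keep the anchored end
        return tmd[:tmd_len] if start_n else tmd[len(tmd) - tmd_len:]
    pad = gap * (tmd_len - len(tmd))
    return tmd + pad if start_n else pad + tmd


print(align_tmd("LQVGG", 3, start_n=True))   # truncation, N-anchored
print(align_tmd("LQVGG", 3, start_n=False))  # truncation, C-anchored
print(align_tmd("LQV", 5, start_n=True))     # padding on the C-terminal side
print(align_tmd("LQV", 5, start_n=False))    # padding on the N-terminal side
```

With C-terminal anchoring, the residue at the first TMD position is taken from further into the original sequence, which is why the top amino acid at that position can change.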

characters_to_ignore lists the characters excluded from the logo computation. It is passed directly to logomaker.alignment_to_matrix. The default is '.-' (dots and gaps). Gaps introduced by tmd_len padding are always '-', so they are ignored by default.

# Count gap characters in the sequences
all_seqs = df_parts.apply("".join, axis=1)
n_gaps = all_seqs.str.count("-").sum()
print(f"Gap characters in df_parts: {n_gaps}")
print()

# Default: '-' (including padding gaps) is ignored, so the logo has no '-' column
df_logo_default = aal.get_df_logo(df_parts=df_parts, characters_to_ignore=".-")
# Nothing ignored: padding gaps are counted and '-' becomes a column
df_logo_no_ignore = aal.get_df_logo(df_parts=df_parts, characters_to_ignore="")

print(f"characters_to_ignore='.-': columns = {list(df_logo_default.columns[:5])}...")
print(f"characters_to_ignore='':  columns = {list(df_logo_no_ignore.columns[:5])}...")
print(f"'-' in columns (default):   {'-' in df_logo_default.columns}")
print(f"'-' in columns (no ignore): {'-' in df_logo_no_ignore.columns}")

Gap characters in df_parts: 0

characters_to_ignore='.-': columns = [np.str_('A'), np.str_('C'), np.str_('D'), np.str_('E'), np.str_('F')]...
characters_to_ignore='':  columns = [np.str_('-'), np.str_('A'), np.str_('C'), np.str_('D'), np.str_('E')]...
'-' in columns (default):   False
'-' in columns (no ignore): True
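What ignoring a character means for a single position can be sketched as follows. The helper column_counts and the toy alignment are hypothetical, and this is not logomaker's implementation:

```python
from collections import Counter


def column_counts(seqs, pos, characters_to_ignore=".-"):
    """Count characters at one position, skipping any ignored characters."""
    return Counter(seq[pos] for seq in seqs if seq[pos] not in characters_to_ignore)


seqs = ["AL", "G-", "AL"]  # toy alignment; the second sequence has a gap

default = column_counts(seqs, pos=1)                              # '-' skipped
no_ignore = column_counts(seqs, pos=1, characters_to_ignore="")   # '-' counted

print(dict(default))
print(dict(no_ignore))
print(default["L"] / sum(default.values()))      # frequency of L, gaps ignored
print(no_ignore["L"] / sum(no_ignore.values()))  # frequency of L, gaps counted
```

Ignoring gaps removes the '-' column and also shrinks the denominator at padded positions. This is likely why the 'counts' sums in the logo_type example range from 122 to 126 rather than being constant.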

pseudocount is added to all amino acid counts before computing the logo. It is passed directly to logomaker.alignment_to_matrix. The default is 0.0. A non-zero pseudocount smooths the distribution: amino acids that were not observed at a position get a small non-zero probability, and the maximum probability at any position decreases.

for pseudocount in [0.0, 0.1, 0.5, 1.0]:
    df_logo = aal.get_df_logo(df_parts=df_parts, pseudocount=pseudocount)
    max_prob = df_logo.max().max()  # highest single AA probability across all positions
    min_prob = df_logo[df_logo > 0].min().min()  # lowest non-zero probability
    print(f"pseudocount={pseudocount}: max_prob={max_prob:.4f}, min_nonzero_prob={min_prob:.4f}")

pseudocount=0.0: max_prob=0.3413, min_nonzero_prob=0.0079
pseudocount=0.1: max_prob=0.3367, min_nonzero_prob=0.0008
pseudocount=0.5: max_prob=0.3199, min_nonzero_prob=0.0037
pseudocount=1.0: max_prob=0.3014, min_nonzero_prob=0.0068
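One smoothing rule that reproduces the printed values is (count + pseudocount) / (n_sequences + 20 * pseudocount). The sketch below applies it under two assumptions that are inferred from the output rather than documented: the most frequent residue occurs 43 times among 126 sequences, and the pseudocount is added once per amino acid. The helper smoothed_probability is hypothetical.

```python
N_AMINO_ACIDS = 20


def smoothed_probability(count, n_seqs, pseudocount=0.0, n_symbols=N_AMINO_ACIDS):
    """Additive smoothing: add the pseudocount to every symbol's count."""
    return (count + pseudocount) / (n_seqs + pseudocount * n_symbols)


# Assumed counts (inferred from the printed probabilities, not documented)
MAX_COUNT, N_SEQS = 43, 126

for pc in [0.0, 0.1, 0.5, 1.0]:
    p_max = smoothed_probability(MAX_COUNT, N_SEQS, pc)
    p_unseen = smoothed_probability(0, N_SEQS, pc)
    print(f"pseudocount={pc}: max={p_max:.4f}, unseen={p_unseen:.4f}")
```

Under this rule, the drop in min_nonzero_prob from 0.0079 to 0.0008 has a simple explanation: without a pseudocount the lowest non-zero value is a single observation (1/126), whereas with a pseudocount the unobserved residues become the lowest non-zero entries.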