aaanalysis.AAlogo.get_df_logo

AAlogo.get_df_logo(df_parts=None, labels=None, label_test=1, tmd_len=None, start_n=True, characters_to_ignore='.-', pseudocount=0.0)

Compute a sequence logo matrix for the provided sequence parts.

For each residue position, the relative frequency (or another encoding) of each amino acid is computed across all sequences. If variable-length TMD sequences are provided, they are aligned to a uniform length via N- or C-terminal padding before computing the logo.

Parameters:
  • df_parts (pd.DataFrame, shape (n_samples, n_parts)) – Sequence parts DataFrame with at least one column from the standard parts: jmd_n, tmd, jmd_c. Must be a valid parts DataFrame.

  • labels (array-like, shape (n_samples,), optional) – Class labels for samples in df_parts. If provided, only samples with label_test are included in the logo computation.

  • label_test (int, default=1) – Class label of the test group to select from labels.

  • tmd_len (int, optional) – Fixed length (>=1) to which all TMD sequences are padded or truncated. If None, the maximum TMD length in df_parts is used. Only relevant if the tmd column is present.

  • start_n (bool, default=True) –

    Alignment direction for variable-length TMDs:

    • True: Align from N-terminus (C-terminal padding with gaps).

    • False: Align from C-terminus (N-terminal padding with gaps).

  • characters_to_ignore (str, default='.-') – Characters excluded from the logo matrix computation.

  • pseudocount (float, default=0.0) – Pseudocount (>=0) added to all amino acid counts to avoid log(0) issues.

Returns:

df_logo – Logo matrix with residue positions as rows and amino acids as columns.

Return type:

pd.DataFrame, shape (n_positions, n_amino_acids)

See also

  • AAlogo.get_df_logo_info(): for per-position information content.

  • logomaker.alignment_to_matrix: the underlying matrix computation function.

Examples

The AAlogo.get_df_logo method computes a sequence logo matrix from a df_parts DataFrame using logomaker.alignment_to_matrix. Each row is a residue position and each column an amino acid. The encoding is controlled by the logo_type set at initialization.

import warnings
warnings.filterwarnings('ignore')
import aaanalysis as aa
import logomaker
import matplotlib.pyplot as plt

aa.plot_settings()

sf = aa.SequenceFeature()
df_seq = aa.load_dataset(name="DOM_GSEC", n=100)
labels = df_seq["label"].values
df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=["jmd_n", "tmd", "jmd_c"])
df_parts.head(3)
        jmd_n       tmd                      jmd_c
entry
Q14802  NSPFYYDWHS  LQVGGLICAGVLCAMGIIIVMSA  KCKCKFGQKS
Q86UE4  LGLEPKRYPG  WVILVGTGALGLLLLFLLGYGWA  AACAGARKKR
Q969W9  FQSMEITELE  FVQIIIIVVVMMVMVVVITCLLS  HYKLSARSFI
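
To make the layout of the logo matrix concrete (one row per residue position, one column per amino acid), the following is a minimal standard-library sketch of a position-wise frequency matrix. It is not the AAlogo or logomaker implementation; the helper name frequency_matrix and the fixed 20-letter alphabet are assumptions made for illustration, and the input is the jmd_n part of the three entries shown above.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumption: 20 canonical residues as columns


def frequency_matrix(seqs):
    """Return one {amino_acid: relative frequency} dict per alignment position."""
    n_positions = len(seqs[0])
    if any(len(s) != n_positions for s in seqs):
        raise ValueError("all sequences must have the same length")
    matrix = []
    for pos in range(n_positions):
        counts = Counter(seq[pos] for seq in seqs)
        total = sum(counts[aa] for aa in AMINO_ACIDS)
        matrix.append({aa: counts[aa] / total for aa in AMINO_ACIDS})
    return matrix


# jmd_n parts of the three entries shown above
seqs = ["NSPFYYDWHS", "LGLEPKRYPG", "FQSMEITELE"]
matrix = frequency_matrix(seqs)
print(len(matrix), len(matrix[0]))   # 10 20 (positions, amino acids)
print(round(matrix[-1]["S"], 3))     # last position holds S, G, E -> 0.333
```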

logo_type controls how amino acid frequencies are encoded. It is passed directly to logomaker.alignment_to_matrix(to_type=logo_type):

  • 'probability' (default): each position sums to 1

  • 'counts': raw amino acid counts

  • 'weight': log-odds weight matrix

  • 'information': information content in bits

The effect is visible in the per-position sums across amino acids (named col_sum in the code):

for logo_type in ["probability", "counts", "weight", "information"]:
    df_logo = aa.AAlogo(logo_type=logo_type).get_df_logo(df_parts=df_parts)
    col_sum = df_logo.sum(axis=1)  # sum across amino acids per position
    print(f"logo_type='{logo_type}': shape={df_logo.shape}, "
          f"col_sum min={col_sum.min():.3f}, max={col_sum.max():.3f}")
logo_type='probability': shape=(43, 20), col_sum min=1.000, max=1.000
logo_type='counts': shape=(43, 20), col_sum min=122.000, max=126.000
logo_type='weight': shape=(43, 20), col_sum min=-10171.195, max=-5.633
logo_type='information': shape=(43, 20), col_sum min=0.251, max=1.555
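
The four encodings can be related to each other for a single position. The sketch below uses textbook definitions with a uniform background of 1/20; both the background and the exact information convention (each probability scaled by the position's relative entropy) are assumptions, and logomaker's handling may differ. In particular, this sketch returns -inf for unobserved residues, whereas the finite 'weight' sums printed above indicate that the library treats zero counts differently.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode_position(residues, pseudocount=0.0):
    """Return counts, probabilities, weights and information for one position."""
    observed = Counter(residues)
    counts = {aa: observed[aa] + pseudocount for aa in AMINO_ACIDS}
    total = sum(counts.values())
    prob = {aa: counts[aa] / total for aa in AMINO_ACIDS}
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    # Log-odds weight; set to -inf here when a residue was never observed
    weight = {aa: math.log2(p / background) if p > 0 else -math.inf
              for aa, p in prob.items()}
    # Relative entropy of the position with respect to the background (bits)
    total_info = sum(p * math.log2(p / background)
                     for p in prob.values() if p > 0)
    info = {aa: p * total_info for aa, p in prob.items()}
    return counts, prob, weight, info


# Hypothetical residues observed at one position in eight sequences
counts, prob, weight, info = encode_position("LLLLVVIA")
print(round(sum(prob.values()), 6))   # 1.0
print(round(sum(info.values()), 3))   # log2(20) - 1.75 = 2.572
```

Under these assumptions the per-position sum is 1 for 'probability', the number of counted residues for 'counts', and between 0 and log2(20) bits for 'information'.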

df_parts is a DataFrame with at least one of the jmd_n, tmd, and jmd_c columns. For each row, the columns are concatenated into a single sequence before the logo is computed, so the number of positions in the result equals the total concatenated sequence length.

aalogo = aa.AAlogo()

# All three parts: n_positions = len(jmd_n) + len(tmd) + len(jmd_c)
df_parts_all = sf.get_df_parts(df_seq=df_seq, list_parts=["jmd_n", "tmd", "jmd_c"])
df_logo_all = aalogo.get_df_logo(df_parts=df_parts_all)

# TMD only: n_positions = len(tmd)
df_parts_tmd = sf.get_df_parts(df_seq=df_seq, list_parts=["tmd"])
df_logo_tmd = aalogo.get_df_logo(df_parts=df_parts_tmd)

print(f"jmd_n + tmd + jmd_c: {df_logo_all.shape[0]} positions")
print(f"tmd only:            {df_logo_tmd.shape[0]} positions")
jmd_n + tmd + jmd_c: 43 positions
tmd only:            23 positions
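
The position counts above follow from plain string concatenation. As a minimal sketch (the helper concat_parts is hypothetical and not part of aaanalysis), using the three rows from the df_parts.head(3) output:

```python
# Rows copied from the df_parts.head(3) output above
parts = {
    "Q14802": {"jmd_n": "NSPFYYDWHS", "tmd": "LQVGGLICAGVLCAMGIIIVMSA", "jmd_c": "KCKCKFGQKS"},
    "Q86UE4": {"jmd_n": "LGLEPKRYPG", "tmd": "WVILVGTGALGLLLLFLLGYGWA", "jmd_c": "AACAGARKKR"},
    "Q969W9": {"jmd_n": "FQSMEITELE", "tmd": "FVQIIIIVVVMMVMVVVITCLLS", "jmd_c": "HYKLSARSFI"},
}


def concat_parts(row, list_parts=("jmd_n", "tmd", "jmd_c")):
    """Join the selected parts of one entry in N-to-C order."""
    return "".join(row[part] for part in list_parts)


full = {entry: concat_parts(row) for entry, row in parts.items()}
tmd_only = {entry: concat_parts(row, ("tmd",)) for entry, row in parts.items()}
print({len(s) for s in full.values()})      # {43}
print({len(s) for s in tmd_only.values()})  # {23}
```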

labels filters rows of df_parts to only those where labels == label_test before computing the logo. This allows group-specific logos.

# Without labels: all samples
df_logo_all = aalogo.get_df_logo(df_parts=df_parts)

# label_test=1: only positive samples
df_logo_pos = aalogo.get_df_logo(df_parts=df_parts, labels=labels, label_test=1)

# label_test=0: only negative samples
df_logo_neg = aalogo.get_df_logo(df_parts=df_parts, labels=labels, label_test=0)

n_pos = (labels == 1).sum()
n_neg = (labels == 0).sum()
print(f"No filter (n={len(df_parts)}): position 0 max prob = {df_logo_all.iloc[0].max():.4f}")
print(f"label=1   (n={n_pos}):  position 0 max prob = {df_logo_pos.iloc[0].max():.4f}")
print(f"label=0   (n={n_neg}):  position 0 max prob = {df_logo_neg.iloc[0].max():.4f}")
No filter (n=126): position 0 max prob = 0.1111
label=1   (n=63):  position 0 max prob = 0.1429
label=0   (n=63):  position 0 max prob = 0.1270
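
The filtering step itself amounts to a boolean selection. A minimal sketch, assuming the selection is a simple equality test against label_test (the helper select_by_label and the label values are hypothetical):

```python
def select_by_label(seqs, labels=None, label_test=1):
    """Keep only the sequences whose label equals label_test."""
    if labels is None:
        return list(seqs)  # no labels given: use every sequence
    if len(seqs) != len(labels):
        raise ValueError("seqs and labels must have the same length")
    return [seq for seq, label in zip(seqs, labels) if label == label_test]


seqs = ["NSPFYYDWHS", "LGLEPKRYPG", "FQSMEITELE"]
labels = [1, 0, 1]  # hypothetical labels for illustration

pos = select_by_label(seqs, labels, label_test=1)
neg = select_by_label(seqs, labels, label_test=0)
print(len(pos), len(neg))  # 2 1
```

Because each group has fewer sequences than the full set, a single residue carries more weight in a group-specific logo, which is one reason the maximum probabilities above differ between the filtered and unfiltered logos.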

TMD sequences can vary in length. tmd_len truncates or pads all TMD sequences to this fixed length before concatenation. If None (default), the maximum TMD length across all samples is used, so nothing is truncated and only shorter sequences are padded.

The effect is visible in the number of positions in the output:

tmd_lengths = df_parts["tmd"].apply(len)
print(f"TMD lengths: min={tmd_lengths.min()}, max={tmd_lengths.max()}, "
      f"mean={tmd_lengths.mean():.1f}")
print()
for tmd_len in [None, tmd_lengths.max(), tmd_lengths.min()]:
    df_logo = aalogo.get_df_logo(df_parts=df_parts, tmd_len=tmd_len)
    jmd_len = 10 + 10  # jmd_n + jmd_c
    actual_tmd = df_logo.shape[0] - jmd_len
    print(f"tmd_len={str(tmd_len):>4}: total positions={df_logo.shape[0]}, TMD positions={actual_tmd}")
TMD lengths: min=18, max=23, mean=22.9

tmd_len=None: total positions=43, TMD positions=23
tmd_len=  23: total positions=43, TMD positions=23
tmd_len=  18: total positions=38, TMD positions=18

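start_n sets the anchor used when a TMD does not match tmd_len. When tmd_len is smaller than the actual TMD length, sequences are truncated: start_n=True keeps the N-terminal end; start_n=False keeps the C-terminal end. Shorter TMDs are padded with gaps on the opposite side. When every TMD already has length tmd_len, this parameter has no effect.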

The difference is visible in which amino acids appear at the boundaries:

# Only relevant when tmd_len < actual TMD length
tmd_len_short = tmd_lengths.min()  # truncate to shortest TMD

df_logo_n = aalogo.get_df_logo(df_parts=df_parts, tmd_len=tmd_len_short, start_n=True)
df_logo_c = aalogo.get_df_logo(df_parts=df_parts, tmd_len=tmd_len_short, start_n=False)

# The most frequent amino acid at the first TMD position differs
jmd_n_len = df_parts["jmd_n"].apply(len).max()
first_tmd_pos = jmd_n_len  # 0-indexed
print(f"First TMD position (index {first_tmd_pos}):")
print(f"  start_n=True  (N-anchor): top AA = {df_logo_n.iloc[first_tmd_pos].idxmax()}")
print(f"  start_n=False (C-anchor): top AA = {df_logo_c.iloc[first_tmd_pos].idxmax()}")
First TMD position (index 10):
  start_n=True  (N-anchor): top AA = L
  start_n=False (C-anchor): top AA = V
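
The interplay of tmd_len and start_n can be sketched with string slicing and padding. This is an illustration of the behavior described above, not the library's code; the helper fit_tmd is hypothetical, and '-' is used as the gap character as stated for tmd_len padding.

```python
def fit_tmd(tmd, tmd_len, start_n=True, gap="-"):
    """Truncate or gap-pad a TMD sequence to exactly tmd_len residues."""
    if tmd_len < 1:
        raise ValueError("tmd_len must be >= 1")
    if start_n:
        # Anchor at the N-terminus: keep the first residues, pad at the C-terminus
        return tmd[:tmd_len].ljust(tmd_len, gap)
    # Anchor at the C-terminus: keep the last residues, pad at the N-terminus
    return tmd[-tmd_len:].rjust(tmd_len, gap)


tmd = "LQVGGLICAGVLCAMGIIIVMSA"  # TMD of Q14802 (23 residues)
print(fit_tmd(tmd, 18, start_n=True))    # LQVGGLICAGVLCAMGII
print(fit_tmd(tmd, 18, start_n=False))   # LICAGVLCAMGIIIVMSA
print(fit_tmd("LIVA", 6, start_n=True))  # LIVA-- (toy sequence, padded)
print(fit_tmd("LIVA", 6, start_n=False)) # --LIVA
```

With truncation, the first TMD position holds a different residue depending on the anchor (L versus L at original position 6 in this entry), which is why the top amino acid at that position can change across the whole dataset.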

characters_to_ignore lists the characters excluded from the logo computation. It is passed directly to logomaker.alignment_to_matrix. The default is '.-' (dots and gaps). Gaps introduced by tmd_len padding are always '-', so they are ignored by default.

# Count gap characters in the sequences
all_seqs = df_parts.apply("".join, axis=1)
n_gaps = all_seqs.str.count("-").sum()
print(f"Gap characters in df_parts: {n_gaps}")
print()

# With gaps ignored (default): columns for '-' absent from logo
df_logo_default = aalogo.get_df_logo(df_parts=df_parts, characters_to_ignore=".-")
df_logo_no_ignore = aalogo.get_df_logo(df_parts=df_parts, characters_to_ignore="")

print(f"characters_to_ignore='.-': columns = {list(df_logo_default.columns[:5])}...")
print(f"characters_to_ignore='':  columns = {list(df_logo_no_ignore.columns[:5])}...")
print(f"'-' in columns (default):   {'-' in df_logo_default.columns}")
print(f"'-' in columns (no ignore): {'-' in df_logo_no_ignore.columns}")
Gap characters in df_parts: 0

characters_to_ignore='.-': columns = [np.str_('A'), np.str_('C'), np.str_('D'), np.str_('E'), np.str_('F')]...
characters_to_ignore='':  columns = [np.str_('-'), np.str_('A'), np.str_('C'), np.str_('D'), np.str_('E')]...
'-' in columns (default):   False
'-' in columns (no ignore): True
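
Although df_parts itself contains no gaps, padding shorter TMDs introduces '-' characters, which is why a '-' column appears once nothing is ignored. The following sketch mimics that effect on toy data; the helpers logo_columns and count_position and the padded sequences are hypothetical, and the sorted column order is an assumption that matches the output above.

```python
from collections import Counter


def logo_columns(seqs, characters_to_ignore=".-"):
    """Return the sorted characters that would become logo columns."""
    return sorted(set("".join(seqs)) - set(characters_to_ignore))


def count_position(seqs, pos, characters_to_ignore=".-"):
    """Count characters at one position, skipping the ignored ones."""
    return Counter(s[pos] for s in seqs if s[pos] not in characters_to_ignore)


padded = ["LIVA--", "LLVAGA", "AIVLG-"]  # hypothetical gap-padded sequences
print(logo_columns(padded))        # ['A', 'G', 'I', 'L', 'V']
print(logo_columns(padded, ""))    # ['-', 'A', 'G', 'I', 'L', 'V']
print(count_position(padded, 5))   # Counter({'A': 1})
```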

pseudocount is added to all amino acid counts before the logo is computed. It is passed directly to logomaker.alignment_to_matrix. The default is 0.0. A non-zero pseudocount smooths the distribution: amino acids that were not observed at a position get a non-zero probability, and the maximum probability at any position decreases.

for pseudocount in [0.0, 0.1, 0.5, 1.0]:
    df_logo = aalogo.get_df_logo(df_parts=df_parts, pseudocount=pseudocount)
    max_prob = df_logo.max().max()  # highest single AA probability across all positions
    min_prob = df_logo[df_logo > 0].min().min()  # lowest non-zero probability
    print(f"pseudocount={pseudocount}: max_prob={max_prob:.4f}, min_nonzero_prob={min_prob:.4f}")
pseudocount=0.0: max_prob=0.3413, min_nonzero_prob=0.0079
pseudocount=0.1: max_prob=0.3367, min_nonzero_prob=0.0008
pseudocount=0.5: max_prob=0.3199, min_nonzero_prob=0.0037
pseudocount=1.0: max_prob=0.3014, min_nonzero_prob=0.0068
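
The printed values are consistent with standard additive smoothing over 20 amino acids, p = (n + pseudocount) / (N + 20 * pseudocount). The sketch below reproduces them under that assumption; the count of 43 out of 126 is inferred from max_prob=0.3413 rather than read from the library, and the helper smoothed_probability is hypothetical.

```python
def smoothed_probability(count, total, pseudocount=0.0, n_chars=20):
    """Additive smoothing: add the pseudocount to each of n_chars counts."""
    return (count + pseudocount) / (total + n_chars * pseudocount)


for pc in [0.0, 0.1, 0.5, 1.0]:
    top = smoothed_probability(43, 126, pc)     # most frequent residue (inferred count)
    unseen = smoothed_probability(0, 126, pc)   # residue never observed at the position
    print(f"pseudocount={pc}: top={top:.4f}, unseen={unseen:.4f}")
# pseudocount=0.0: top=0.3413, unseen=0.0000
# pseudocount=0.1: top=0.3367, unseen=0.0008
# pseudocount=0.5: top=0.3199, unseen=0.0037
# pseudocount=1.0: top=0.3014, unseen=0.0068
```

This also explains why min_nonzero_prob drops from 0.0079 (a single observation, 1/126) to 0.0008 once a pseudocount is used: the smallest non-zero entries are then the previously unobserved residues.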

© Copyright 2026, Stephan Breimann.