aaanalysis.AAlogo.get_df_logo
- AAlogo.get_df_logo(df_parts=None, labels=None, label_test=1, tmd_len=None, start_n=True, characters_to_ignore='.-', pseudocount=0.0)
Compute a sequence logo matrix for the provided sequence parts.
For each residue position, the relative frequency (or another encoding) of each amino acid is computed across all sequences. If variable-length TMD sequences are provided, they are aligned to a uniform length via N- or C-terminal padding before computing the logo.
- Parameters:
df_parts (pd.DataFrame, shape (n_samples, n_parts)) – Sequence parts DataFrame with at least one column from the standard parts: jmd_n, tmd, jmd_c. Must be a valid parts DataFrame.
labels (array-like, shape (n_samples,), optional) – Class labels for samples in df_parts. If provided, only samples whose label equals label_test are included in the logo computation.
label_test (int, default=1) – Class label of the test group to select from labels.
tmd_len (int, optional) – Fixed length (>=1) to align all TMD sequences. If None, the maximum TMD length in df_parts is used. Only relevant if the tmd column is present.
start_n (bool, default=True) – Alignment direction for variable-length TMDs:
    True: Align from N-terminus (C-terminal padding with gaps).
    False: Align from C-terminus (N-terminal padding with gaps).
characters_to_ignore (str, default='.-') – Characters excluded from the logo matrix computation.
pseudocount (float, default=0.0) – Pseudocount (>=0) added to all amino acid counts to avoid log(0) issues.
- Returns:
df_logo – Logo matrix with residue positions as rows and amino acids as columns.
- Return type:
pd.DataFrame, shape (n_positions, n_amino_acids)
See also
AAlogo.get_df_logo_info(): for per-position information content.
logomaker.alignment_to_matrix: the underlying matrix computation function.
Examples
The AAlogo.get_df_logo method computes a sequence logo matrix from a df_parts DataFrame using logomaker.alignment_to_matrix. Each row is a residue position, each column an amino acid. The encoding is controlled by the logo_type set at initialization.

import warnings
warnings.filterwarnings('ignore')
import aaanalysis as aa
import logomaker
import matplotlib.pyplot as plt

aa.plot_settings()
sf = aa.SequenceFeature()
df_seq = aa.load_dataset(name="DOM_GSEC", n=100)
labels = df_seq["label"].values
df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=["jmd_n", "tmd", "jmd_c"])
df_parts.head(3)

entry   jmd_n       tmd                      jmd_c
Q14802  NSPFYYDWHS  LQVGGLICAGVLCAMGIIIVMSA  KCKCKFGQKS
Q86UE4  LGLEPKRYPG  WVILVGTGALGLLLLFLLGYGWA  AACAGARKKR
Q969W9  FQSMEITELE  FVQIIIIVVVMMVMVVVITCLLS  HYKLSARSFI

The logo_type parameter, set when AAlogo is initialized, controls how amino acid frequencies are encoded. It is passed directly to logomaker.alignment_to_matrix(to_type=logo_type):
    'probability' (default): each position sums to 1
    'counts': raw amino acid counts
    'weight': log-odds weight matrix
    'information': information content in bits
The effect is visible in the sum over all amino acids at each position (printed as col_sum):

for logo_type in ["probability", "counts", "weight", "information"]:
    df_logo = aa.AAlogo(logo_type=logo_type).get_df_logo(df_parts=df_parts)
    col_sum = df_logo.sum(axis=1)  # sum across amino acids per position
    print(f"logo_type='{logo_type}': shape={df_logo.shape}, "
          f"col_sum min={col_sum.min():.3f}, max={col_sum.max():.3f}")

logo_type='probability': shape=(43, 20), col_sum min=1.000, max=1.000
logo_type='counts': shape=(43, 20), col_sum min=122.000, max=126.000
logo_type='weight': shape=(43, 20), col_sum min=-10171.195, max=-5.633
logo_type='information': shape=(43, 20), col_sum min=0.251, max=1.555
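To make the 'probability' and 'information' sums above more concrete, here is a minimal sketch of the conventional information-content calculation for a single position, assuming a uniform background over the 20 standard amino acids. The helper `position_information` is hypothetical and is not the logomaker implementation, which may differ in background handling and numerical smoothing.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def position_information(column, alphabet=AMINO_ACIDS):
    """Return (probabilities, information in bits) for one alignment column.

    Hypothetical helper for illustration; assumes a uniform background.
    """
    counts = Counter(c for c in column if c in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("column contains no residues from the alphabet")
    probs = {aa: counts[aa] / total for aa in alphabet}
    background = 1.0 / len(alphabet)
    # Relative entropy (in bits) against the uniform background;
    # residues with zero probability contribute nothing.
    info = sum(p * math.log2(p / background) for p in probs.values() if p > 0)
    return probs, info

probs, info = position_information("LLLLVVIA")
print(f"sum of probabilities = {sum(probs.values()):.3f}")
print(f"information content  = {info:.3f} bits")
```

Under these assumptions a fully conserved position carries log2(20) bits and a uniformly distributed one carries 0 bits, which is why the 'information' sums vary by position while the 'probability' sums are always 1.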
df_parts is a DataFrame with at least one of the jmd_n, tmd, jmd_c columns. For each sample (row), the columns are concatenated into a single sequence before the logo is computed. The number of positions in the result equals the total concatenated sequence length.

aalogo = aa.AAlogo()

# All three parts: n_positions = len(jmd_n) + len(tmd) + len(jmd_c)
df_parts_all = sf.get_df_parts(df_seq=df_seq, list_parts=["jmd_n", "tmd", "jmd_c"])
df_logo_all = aalogo.get_df_logo(df_parts=df_parts_all)

# TMD only: n_positions = len(tmd)
df_parts_tmd = sf.get_df_parts(df_seq=df_seq, list_parts=["tmd"])
df_logo_tmd = aalogo.get_df_logo(df_parts=df_parts_tmd)

print(f"jmd_n + tmd + jmd_c: {df_logo_all.shape[0]} positions")
print(f"tmd only: {df_logo_tmd.shape[0]} positions")

jmd_n + tmd + jmd_c: 43 positions
tmd only: 23 positions
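The relationship between the selected parts and the number of positions can be sketched without aaanalysis. The toy sequences and the helper `relative_frequencies` below are hypothetical; the sketch only illustrates per-sample concatenation followed by per-position counting, not the library's internal code.

```python
from collections import Counter

# Toy parts (hypothetical sequences, not taken from DOM_GSEC)
parts = [
    {"jmd_n": "NSP", "tmd": "LQVG", "jmd_c": "KC"},
    {"jmd_n": "LGL", "tmd": "WVIL", "jmd_c": "AA"},
    {"jmd_n": "FQS", "tmd": "LVQI", "jmd_c": "HY"},
]

def relative_frequencies(parts, list_parts=("jmd_n", "tmd", "jmd_c")):
    """Concatenate the selected parts per sample, then compute the
    relative frequency of each residue at every position."""
    seqs = ["".join(p[name] for name in list_parts) for p in parts]
    n_positions = len(seqs[0])
    if any(len(s) != n_positions for s in seqs):
        raise ValueError("sequences must have equal length")
    matrix = []
    for pos in range(n_positions):
        counts = Counter(s[pos] for s in seqs)
        matrix.append({aa: n / len(seqs) for aa, n in counts.items()})
    return matrix

freq_all = relative_frequencies(parts)
freq_tmd = relative_frequencies(parts, list_parts=("tmd",))
print(f"all parts: {len(freq_all)} positions")
print(f"tmd only:  {len(freq_tmd)} positions")
```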
labels filters the rows of df_parts to those where labels == label_test before computing the logo. This allows group-specific logos.

# Without labels: all samples
df_logo_all = aalogo.get_df_logo(df_parts=df_parts)

# label_test=1: only positive samples
df_logo_pos = aalogo.get_df_logo(df_parts=df_parts, labels=labels, label_test=1)

# label_test=0: only negative samples
df_logo_neg = aalogo.get_df_logo(df_parts=df_parts, labels=labels, label_test=0)

n_pos = (labels == 1).sum()
n_neg = (labels == 0).sum()
print(f"No filter (n={len(df_parts)}): position 0 max prob = {df_logo_all.iloc[0].max():.4f}")
print(f"label=1 (n={n_pos}): position 0 max prob = {df_logo_pos.iloc[0].max():.4f}")
print(f"label=0 (n={n_neg}): position 0 max prob = {df_logo_neg.iloc[0].max():.4f}")

No filter (n=126): position 0 max prob = 0.1111
label=1 (n=63): position 0 max prob = 0.1429
label=0 (n=63): position 0 max prob = 0.1270
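The filtering step itself is simple to picture. The sketch below uses hypothetical toy data and hypothetical helper names (`select_by_label`, `top_residue_frequency`) to show how restricting the samples to one label changes the frequencies at a position; it is an illustration, not the library's implementation.

```python
from collections import Counter

# Toy data (hypothetical): one short sequence and one class label per sample
seqs = ["LAV", "GAV", "LIV", "LAV", "GIV"]
toy_labels = [1, 0, 1, 1, 0]

def select_by_label(seqs, labels, label_test=1):
    """Keep only the sequences whose label equals label_test."""
    if len(seqs) != len(labels):
        raise ValueError("seqs and labels must have the same length")
    return [s for s, y in zip(seqs, labels) if y == label_test]

def top_residue_frequency(seqs, pos=0):
    """Return the most frequent residue at a position and its relative frequency."""
    residue, n = Counter(s[pos] for s in seqs).most_common(1)[0]
    return residue, n / len(seqs)

print("no filter:", top_residue_frequency(seqs))
print("label=1:  ", top_residue_frequency(select_by_label(seqs, toy_labels, 1)))
print("label=0:  ", top_residue_frequency(select_by_label(seqs, toy_labels, 0)))
```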
TMD sequences can vary in length. tmd_len truncates or pads all TMD sequences to this fixed length before concatenation. If None (default), the maximum TMD length across all samples is used, so nothing is truncated and only shorter sequences are padded.

The effect is visible in the number of positions in the output:

tmd_lengths = df_parts["tmd"].apply(len)
print(f"TMD lengths: min={tmd_lengths.min()}, max={tmd_lengths.max()}, "
      f"mean={tmd_lengths.mean():.1f}")
print()
for tmd_len in [None, tmd_lengths.max(), tmd_lengths.min()]:
    df_logo = aalogo.get_df_logo(df_parts=df_parts, tmd_len=tmd_len)
    jmd_len = 10 + 10  # jmd_n + jmd_c
    actual_tmd = df_logo.shape[0] - jmd_len
    print(f"tmd_len={str(tmd_len):>4}: total positions={df_logo.shape[0]}, TMD positions={actual_tmd}")

TMD lengths: min=18, max=23, mean=22.9

tmd_len=None: total positions=43, TMD positions=23
tmd_len=  23: total positions=43, TMD positions=23
tmd_len=  18: total positions=38, TMD positions=18
When tmd_len is smaller than the actual TMD length, sequences are truncated. start_n=True keeps the N-terminal end; start_n=False keeps the C-terminal end. When all TMDs have the same length, this parameter has no effect.

The difference is visible in which amino acids appear at the boundaries:

# Only relevant when tmd_len < actual TMD length
tmd_len_short = tmd_lengths.min()  # truncate to shortest TMD
df_logo_n = aalogo.get_df_logo(df_parts=df_parts, tmd_len=tmd_len_short, start_n=True)
df_logo_c = aalogo.get_df_logo(df_parts=df_parts, tmd_len=tmd_len_short, start_n=False)

# The most frequent amino acid at the first TMD position differs
jmd_n_len = df_parts["jmd_n"].apply(len).max()
first_tmd_pos = jmd_n_len  # 0-indexed
print(f"First TMD position (index {first_tmd_pos}):")
print(f"  start_n=True (N-anchor): top AA = {df_logo_n.iloc[first_tmd_pos].idxmax()}")
print(f"  start_n=False (C-anchor): top AA = {df_logo_c.iloc[first_tmd_pos].idxmax()}")

First TMD position (index 10):
  start_n=True (N-anchor): top AA = L
  start_n=False (C-anchor): top AA = V
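The padding and truncation behavior described for tmd_len and start_n can be sketched for a single sequence as follows. The helper `align_tmd` is hypothetical and only mirrors the behavior stated in this section (anchor at the N-terminus or C-terminus, pad with '-'); the actual implementation in aaanalysis may differ.

```python
def align_tmd(seq, tmd_len, start_n=True, gap="-"):
    """Truncate or pad a sequence to exactly tmd_len characters.

    start_n=True anchors at the N-terminus: keep the first tmd_len residues
    and pad on the C-terminal side. start_n=False anchors at the C-terminus:
    keep the last tmd_len residues and pad on the N-terminal side.
    """
    if tmd_len < 1:
        raise ValueError("tmd_len must be >= 1")
    if start_n:
        return seq[:tmd_len].ljust(tmd_len, gap)
    return seq[-tmd_len:].rjust(tmd_len, gap)

# Truncation: the two anchors keep different ends of the sequence
print(align_tmd("ABCDEF", 4, start_n=True))
print(align_tmd("ABCDEF", 4, start_n=False))

# Padding: gaps are added on the side opposite to the anchor
print(align_tmd("ABC", 5, start_n=True))
print(align_tmd("ABC", 5, start_n=False))
```

With truncation, the residue at the first aligned TMD position differs between the two anchors, which is consistent with the different top amino acids reported at index 10.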
characters_to_ignore lists the characters excluded from the logo computation. It is passed directly to logomaker.alignment_to_matrix. Default is '.-' (gaps and dots). Gaps introduced by tmd_len padding are always '-', so they are ignored by default.

# Count gap characters in the input sequences
all_seqs = df_parts.apply("".join, axis=1)
n_gaps = all_seqs.str.count("-").sum()
print(f"Gap characters in df_parts: {n_gaps}")
print()

# Default: '.' and '-' are ignored, so the logo has no '-' column
df_logo_default = aalogo.get_df_logo(df_parts=df_parts, characters_to_ignore=".-")
# Nothing ignored: padding gaps added during TMD alignment appear as a '-' column
df_logo_no_ignore = aalogo.get_df_logo(df_parts=df_parts, characters_to_ignore="")

print(f"characters_to_ignore='.-': columns = {list(df_logo_default.columns[:5])}...")
print(f"characters_to_ignore='': columns = {list(df_logo_no_ignore.columns[:5])}...")
print(f"'-' in columns (default): {'-' in df_logo_default.columns}")
print(f"'-' in columns (no ignore): {'-' in df_logo_no_ignore.columns}")

Gap characters in df_parts: 0

characters_to_ignore='.-': columns = [np.str_('A'), np.str_('C'), np.str_('D'), np.str_('E'), np.str_('F')]...
characters_to_ignore='': columns = [np.str_('-'), np.str_('A'), np.str_('C'), np.str_('D'), np.str_('E')]...
'-' in columns (default): False
'-' in columns (no ignore): True
pseudocount is added to all amino acid counts before computing the logo. It is passed directly to logomaker.alignment_to_matrix. Default is 0.0. A non-zero pseudocount smooths the distribution: amino acids not observed at a position get a small non-zero probability, and the maximum probability at any position decreases.

for pseudocount in [0.0, 0.1, 0.5, 1.0]:
    df_logo = aalogo.get_df_logo(df_parts=df_parts, pseudocount=pseudocount)
    max_prob = df_logo.max().max()  # highest single AA probability across all positions
    min_prob = df_logo[df_logo > 0].min().min()  # lowest non-zero probability
    print(f"pseudocount={pseudocount}: max_prob={max_prob:.4f}, min_nonzero_prob={min_prob:.4f}")

pseudocount=0.0: max_prob=0.3413, min_nonzero_prob=0.0079
pseudocount=0.1: max_prob=0.3367, min_nonzero_prob=0.0008
pseudocount=0.5: max_prob=0.3199, min_nonzero_prob=0.0037
pseudocount=1.0: max_prob=0.3014, min_nonzero_prob=0.0068
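The printed values are consistent with simple additive smoothing, p = (count + pseudocount) / (n + 20 * pseudocount). The sketch below checks this arithmetic under three assumptions that are inferred from the output rather than taken from the logomaker source: 126 sequences contribute to the position, the alphabet has 20 amino acids, and the most frequent residue occurs 43 times (0.3413 ≈ 43/126). The helper name `smoothed_probability` is hypothetical.

```python
def smoothed_probability(count, n_seqs, pseudocount, n_chars=20):
    """Additive smoothing: add the pseudocount to every character count,
    then renormalize over all n_chars characters."""
    return (count + pseudocount) / (n_seqs + n_chars * pseudocount)

for pc in [0.0, 0.1, 0.5, 1.0]:
    p_max = smoothed_probability(43, 126, pc)    # most frequent residue (assumed count)
    p_unseen = smoothed_probability(0, 126, pc)  # residue never observed at the position
    print(f"pseudocount={pc}: max={p_max:.4f}, unseen={p_unseen:.4f}")
```

Under these assumptions, the lowest non-zero probability for pseudocount > 0 belongs to unobserved residues, which would explain why it drops from 0.0079 (a single observation, 1/126) to 0.0008 at pseudocount=0.1 and then rises again as the pseudocount grows.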