aaanalysis.AAlogo.get_df_logo_info
- AAlogo.get_df_logo_info(df_parts=None, labels=None, label_test=1, tmd_len=None, start_n=True, characters_to_ignore='.-', pseudocount=0.0)
Compute per-position information content (in bits) from sequence parts.
Information content is always computed with the 'information' logo type, regardless of the
logo_type set during initialization. The result reflects sequence conservation: higher values indicate stronger conservation at that position.
- Parameters:
df_parts (pd.DataFrame, shape (n_samples, n_parts)) – Sequence parts DataFrame with at least one column from the standard parts: jmd_n, tmd, jmd_c.
labels (array-like, shape (n_samples,), optional) – Class labels for samples in df_parts. If provided, only samples with label_test are included.
label_test (int, default=1) – Class label of the test group to select from labels.
tmd_len (int, optional) – Fixed length (>=1) to align all TMD sequences. If None, the maximum TMD length in df_parts is used.
start_n (bool, default=True) – Alignment direction for variable-length TMDs.
characters_to_ignore (str, default='.-') – Characters excluded from the logo matrix computation.
pseudocount (float, default=0.0) – Pseudocount (>=0) added to all amino acid counts.
- Returns:
df_logo_info – Per-position information content in bits, with index named ‘pos’. Values range from 0 (no conservation) to log2(20) ≈ 4.322 (fully conserved).
- Return type:
pd.Series, shape (n_positions,)
See also
AAlogo.get_conservation(): to summarize the per-position scores into a single value.
AAlogo.get_df_logo(): for the full logo matrix.
Examples
The AAlogo.get_df_logo_info method computes per-position information content (in bits) by calling get_df_logo_ internally with logo_type='information' hardcoded, then summing values across amino acids per position (df_logo.sum(axis=1)).
The result is always in bits regardless of the logo_type set at AAlogo initialization. Values range from 0 (no conservation) to log2(20) ≈ 4.322 (fully conserved).

import warnings
warnings.filterwarnings('ignore')
import aaanalysis as aa
import matplotlib.pyplot as plt
import numpy as np
aa.plot_settings()

# Load the example dataset and split each sequence into its parts
sf = aa.SequenceFeature()
df_seq = aa.load_dataset(name="DOM_GSEC", n=100)
labels = df_seq["label"].values
df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=["jmd_n", "tmd", "jmd_c"])
aalogo = aa.AAlogo()
get_df_logo_info hardcodes logo_type='information' internally. The logo_type set at initialization is completely ignored. All four initializations produce identical results:

results = {}
for logo_type in ["probability", "counts", "weight", "information"]:
    df_info = aa.AAlogo(logo_type=logo_type).get_df_logo_info(df_parts=df_parts)
    results[logo_type] = df_info
    print(f"logo_type='{logo_type}': mean={df_info.mean():.6f} bits")

# Confirm all are identical
ref = results["probability"]
all_equal = all(results[lt].equals(ref) for lt in ["counts", "weight", "information"])
print(f"\nAll results identical: {all_equal}")

logo_type='probability': mean=0.800787 bits
logo_type='counts': mean=0.800787 bits
logo_type='weight': mean=0.800787 bits
logo_type='information': mean=0.800787 bits

All results identical: True
For df_parts, the behavior is the same as in get_df_logo: the part columns are concatenated into sequences. The result is a pd.Series with index name 'pos' and one value per position.

df_logo_info = aalogo.get_df_logo_info(df_parts=df_parts)
print(f"Type: {type(df_logo_info).__name__}")
print(f"Index name: '{df_logo_info.index.name}'")
print(f"n_positions: {len(df_logo_info)}")
print(f"Range: [{df_logo_info.min():.4f}, {df_logo_info.max():.4f}] bits")
df_logo_info.head()

Type: Series
Index name: 'pos'
n_positions: 43
Range: [0.2508, 1.5553] bits

pos
0    0.250782
1    0.297012
2    0.358037
3    0.322563
4    0.383782
dtype: float64
The result equals df_logo.sum(axis=1) when computed with logo_type='information':

df_logo_info_manual = aa.AAlogo(logo_type="information").get_df_logo(df_parts=df_parts).sum(axis=1)
print(f"Matches get_df_logo('information').sum(axis=1): "
      f"{np.allclose(df_logo_info.values, df_logo_info_manual.values)}")

Matches get_df_logo('information').sum(axis=1): True
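
To make the "sum across amino acids per position" step concrete, here is a minimal standard-library sketch of per-position information content. It is not the library's implementation: it assumes a 20-letter amino acid alphabet and a uniform background, which is consistent with the stated upper bound of log2(20) bits. The names column_information, per_position_information, and toy_sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed 20-letter alphabet


def column_information(column):
    """Information content (bits) of one column against a uniform background."""
    counts = Counter(ch for ch in column if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    background = 1.0 / len(AMINO_ACIDS)
    # Sum p * log2(p / background); residues with zero count contribute nothing
    return sum(
        (n / total) * math.log2((n / total) / background)
        for n in counts.values()
    )


def per_position_information(sequences):
    """One information value per position for equal-length sequences."""
    return [column_information(col) for col in zip(*sequences)]


toy_sequences = ["ALLA", "ALGA", "ALVC"]  # hypothetical toy alignment
info = per_position_information(toy_sequences)
print([round(v, 3) for v in info])
```

In this toy alignment the first two positions are fully conserved and reach log2(20) bits, while the mixed positions score lower.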
labels and label_test filter df_parts to rows where labels == label_test before computing. This allows comparing per-position conservation between groups.

df_info_pos = aalogo.get_df_logo_info(df_parts=df_parts, labels=labels, label_test=1)
df_info_neg = aalogo.get_df_logo_info(df_parts=df_parts, labels=labels, label_test=0)
print("Mean conservation (bits):")
print(f" All samples (n={len(df_parts)}): {df_logo_info.mean():.4f}")
print(f" Positive (n={(labels==1).sum()}): {df_info_pos.mean():.4f}")
print(f" Negative (n={(labels==0).sum()}): {df_info_neg.mean():.4f}")
print()

# Most conserved position per group
print("Most conserved position:")
print(f" Positive: position {df_info_pos.idxmax()} ({df_info_pos.max():.4f} bits)")
print(f" Negative: position {df_info_neg.idxmax()} ({df_info_neg.max():.4f} bits)")

Mean conservation (bits):
 All samples (n=126): 0.8008
 Positive (n=63): 1.1192
 Negative (n=63): 0.7748

Most conserved position:
 Positive: position 26 (1.8965 bits)
 Negative: position 22 (1.5067 bits)
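
The group selection described above amounts to keeping the rows whose label equals label_test. A minimal sketch with plain lists follows; select_group and the toy data are hypothetical, and the library may implement the filter differently (for example with a boolean mask on the DataFrame).

```python
def select_group(rows, labels, label_test=1):
    """Keep only the rows whose label equals label_test."""
    if len(rows) != len(labels):
        raise ValueError("rows and labels must have the same length")
    return [row for row, label in zip(rows, labels) if label == label_test]


rows = ["AAAA", "CCCC", "GGGG", "TTTT"]  # hypothetical sequences
toy_labels = [1, 0, 1, 0]                # hypothetical class labels
positives = select_group(rows, toy_labels, label_test=1)
negatives = select_group(rows, toy_labels, label_test=0)
print(positives, negatives)
```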
tmd_len fixes the TMD length before computing, which changes the number of positions in the output.

tmd_lengths = df_parts["tmd"].apply(len)
for tmd_len in [None, tmd_lengths.max(), tmd_lengths.min()]:
    df_info = aalogo.get_df_logo_info(df_parts=df_parts, tmd_len=tmd_len)
    print(f"tmd_len={str(tmd_len):>4}: {len(df_info)} positions, mean={df_info.mean():.4f} bits")

tmd_len=None: 43 positions, mean=0.8008 bits
tmd_len=  23: 43 positions, mean=0.8008 bits
tmd_len=  18: 38 positions, mean=0.7900 bits
start_n only has an effect when tmd_len is shorter than the actual TMD length (truncation); it determines which end of the TMD is kept. When all TMDs have the same length, results are identical.

tmd_len_short = tmd_lengths.min()  # truncate to shortest
df_info_n = aalogo.get_df_logo_info(df_parts=df_parts, tmd_len=tmd_len_short, start_n=True)
df_info_c = aalogo.get_df_logo_info(df_parts=df_parts, tmd_len=tmd_len_short, start_n=False)
print(f"tmd_len={tmd_len_short} (truncation active):")
print(f" start_n=True (N-anchor): mean={df_info_n.mean():.4f}, max={df_info_n.max():.4f} bits")
print(f" start_n=False (C-anchor): mean={df_info_c.mean():.4f}, max={df_info_c.max():.4f} bits")
print(f" Results differ: {not df_info_n.equals(df_info_c)}")

tmd_len=18 (truncation active):
 start_n=True (N-anchor): mean=0.7900, max=1.5553 bits
 start_n=False (C-anchor): mean=0.7801, max=1.5657 bits
 Results differ: True
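
The difference between the two anchors can be pictured as plain string slicing. The sketch below covers truncation only and rests on the assumption that start_n=True keeps the N-terminal residues and start_n=False keeps the C-terminal ones; truncate_tmd and the toy strings are hypothetical and do not reflect how the library handles TMDs shorter than tmd_len.

```python
def truncate_tmd(tmd, tmd_len, start_n=True):
    """Cut a TMD string to tmd_len residues, anchored at the N- or C-terminus."""
    if tmd_len < 1:
        raise ValueError("tmd_len must be >= 1")
    # N-anchor keeps the first residues, C-anchor keeps the last residues
    return tmd[:tmd_len] if start_n else tmd[-tmd_len:]


tmds = ["ABCDEFG", "HIJKL"]  # hypothetical TMDs of length 7 and 5
n_anchored = [truncate_tmd(t, 5, start_n=True) for t in tmds]
c_anchored = [truncate_tmd(t, 5, start_n=False) for t in tmds]
print(n_anchored)
print(c_anchored)
```

Only the longer string changes between the two calls, which mirrors the statement that the anchor matters only when truncation takes place.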
pseudocount smooths the amino acid distribution before computing information content. Higher values reduce the maximum information content (conservation appears lower).

for pseudocount in [0.0, 0.1, 0.5, 1.0]:
    df_info = aalogo.get_df_logo_info(df_parts=df_parts, pseudocount=pseudocount)
    print(f"pseudocount={pseudocount}: mean={df_info.mean():.4f}, max={df_info.max():.4f} bits")

pseudocount=0.0: mean=0.8008, max=1.5553 bits
pseudocount=0.1: mean=0.7635, max=1.4819 bits
pseudocount=0.5: mean=0.6581, max=1.2794 bits
pseudocount=1.0: mean=0.5623, max=1.0972 bits
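
Why a pseudocount lowers the scores can be seen on a single fully conserved column. The sketch below assumes the pseudocount is added to each of 20 amino acid counts before normalizing and that the background is uniform; conserved_information and the column size of 100 are hypothetical, so the numbers it prints are not comparable to the dataset output above.

```python
import math

N_AA = 20  # size of the assumed amino acid alphabet


def conserved_information(n_sequences, pseudocount=0.0):
    """Information (bits) of a fully conserved column after smoothing."""
    counts = [n_sequences] + [0] * (N_AA - 1)
    smoothed = [c + pseudocount for c in counts]
    total = sum(smoothed)
    probs = [s / total for s in smoothed]
    # p * log2(p * N_AA) equals p * log2(p / (1 / N_AA)) for a uniform background
    return sum(p * math.log2(p * N_AA) for p in probs if p > 0)


values = {pc: conserved_information(100, pc) for pc in [0.0, 0.1, 0.5, 1.0]}
for pc, bits in values.items():
    print(f"pseudocount={pc}: {bits:.3f} bits")
```

Without smoothing the column reaches log2(20) bits; each larger pseudocount spreads probability over the unobserved residues and lowers the value.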
characters_to_ignore lists the characters excluded before computing information content. Excluding gap characters (default '.-') prevents gaps from contributing to conservation scores.

for chars in [".-", ""]:
    df_info = aalogo.get_df_logo_info(df_parts=df_parts, characters_to_ignore=chars)
    print(f"characters_to_ignore='{chars}': mean={df_info.mean():.4f}, max={df_info.max():.4f} bits")

characters_to_ignore='.-': mean=0.8008, max=1.5553 bits
characters_to_ignore='': mean=0.8628, max=1.6257 bits
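
The effect of ignoring characters can be illustrated on the symbol frequencies of one column. This is a sketch under the assumption that ignored characters are dropped before counting; how the library treats characters that are kept is not stated here, and column_frequencies and the toy column are hypothetical.

```python
from collections import Counter


def column_frequencies(column, characters_to_ignore=".-"):
    """Relative frequency of each symbol in a column, skipping ignored characters."""
    kept = [ch for ch in column if ch not in characters_to_ignore]
    counts = Counter(kept)
    total = len(kept)
    return {ch: n / total for ch, n in counts.items()} if total else {}


column = ["L", "L", "-", "-"]                   # hypothetical column with two gaps
with_ignore = column_frequencies(column)        # gaps dropped before counting
without_ignore = column_frequencies(column, "") # gaps counted as a symbol
print(with_ignore)
print(without_ignore)
```

With gaps ignored the column looks fully conserved; with gaps counted, the gap symbol takes half of the probability mass, so the two settings can yield different scores.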