aaanalysis.AAlogo.get_df_logo_info

AAlogo.get_df_logo_info(df_parts=None, labels=None, label_test=1, tmd_len=None, start_n=True, characters_to_ignore='.-', pseudocount=0.0)

Compute per-position information content (in bits) from sequence parts.

Information content is computed using the information logo type regardless of the logo_type set during initialization. The result reflects sequence conservation: higher values indicate stronger conservation at that position.

Parameters:
  • df_parts (pd.DataFrame, shape (n_samples, n_parts)) – Sequence parts DataFrame with at least one column from the standard parts: jmd_n, tmd, jmd_c.

  • labels (array-like, shape (n_samples,), optional) – Class labels for samples in df_parts. If provided, only samples whose label equals label_test are included.

  • label_test (int, default=1) – Class label of the test group to select from labels.

  • tmd_len (int, optional) – Fixed length (>=1) to align all TMD sequences. If None, the maximum TMD length in df_parts is used.

  • start_n (bool, default=True) – Alignment direction for variable-length TMDs. If True, TMDs are anchored at the N-terminus; if False, at the C-terminus.

  • characters_to_ignore (str, default='.-') – Characters excluded from the logo matrix computation.

  • pseudocount (float, default=0.0) – Pseudocount (>=0) added to all amino acid counts.

Returns:

df_logo_info – Per-position information content in bits, with index named ‘pos’. Values range from 0 (no conservation) to log2(20) ≈ 4.322 (fully conserved).

Return type:

pd.Series, shape (n_positions,)

Examples

The AAlogo.get_df_logo_info method computes per-position information content (in bits) by calling get_df_logo internally with logo_type='information' hardcoded, then summing the values across amino acids at each position (df_logo.sum(axis=1)).

The result is always in bits regardless of the logo_type set at AAlogo initialization. Values range from 0 (no conservation) to log2(20) ≈ 4.322 (fully conserved).
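As a rough illustration of where these bounds come from, the sketch below computes per-position information content as log2(20) minus the Shannon entropy of the observed amino acid frequencies. It is a pure-Python approximation that assumes a uniform background over 20 amino acids and no small-sample correction; the helper information_content is hypothetical and not part of aaanalysis, and the library's exact computation may differ.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 canonical amino acids


def information_content(sequences, ignore=".-"):
    """Return information content (bits) per position of equal-length sequences.

    Hypothetical helper, assuming IC = log2(20) - Shannon entropy.
    """
    n_positions = len(sequences[0])
    max_bits = math.log2(len(AMINO_ACIDS))
    result = []
    for pos in range(n_positions):
        # Count residues in this column, skipping ignored characters
        counts = Counter(seq[pos] for seq in sequences if seq[pos] not in ignore)
        total = sum(counts.values())
        if total == 0:
            # Column contains only ignored characters
            result.append(0.0)
            continue
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        result.append(max_bits - entropy)
    return result


seqs = ["ALLV", "ALIV", "AGLV", "AKMV"]
ic = information_content(seqs)
print([round(v, 3) for v in ic])
```

Fully conserved columns (positions 0 and 3 in the toy alignment) reach the upper bound, while a column with all 20 amino acids at equal frequency would score 0.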

import warnings
warnings.filterwarnings('ignore')
import aaanalysis as aa
import matplotlib.pyplot as plt
import numpy as np

aa.plot_settings()

sf = aa.SequenceFeature()
df_seq = aa.load_dataset(name="DOM_GSEC", n=100)
labels = df_seq["label"].values
df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=["jmd_n", "tmd", "jmd_c"])

aalogo = aa.AAlogo()

get_df_logo_info hardcodes logo_type='information' internally. The logo_type set at initialization is completely ignored. All four initializations produce identical results:

results = {}
for logo_type in ["probability", "counts", "weight", "information"]:
    df_info = aa.AAlogo(logo_type=logo_type).get_df_logo_info(df_parts=df_parts)
    results[logo_type] = df_info
    print(f"logo_type='{logo_type}': mean={df_info.mean():.6f} bits")

# Confirm all are identical
ref = results["probability"]
all_equal = all(results[lt].equals(ref) for lt in ["counts", "weight", "information"])
print(f"\nAll results identical: {all_equal}")
logo_type='probability': mean=0.800787 bits
logo_type='counts': mean=0.800787 bits
logo_type='weight': mean=0.800787 bits
logo_type='information': mean=0.800787 bits

All results identical: True

df_parts is handled as in get_df_logo: the part columns are concatenated into one sequence per sample. The result is a pd.Series with index name 'pos' and one value per position.

df_logo_info = aalogo.get_df_logo_info(df_parts=df_parts)
print(f"Type:       {type(df_logo_info).__name__}")
print(f"Index name: '{df_logo_info.index.name}'")
print(f"n_positions: {len(df_logo_info)}")
print(f"Range:      [{df_logo_info.min():.4f}, {df_logo_info.max():.4f}] bits")
df_logo_info.head()
Type:       Series
Index name: 'pos'
n_positions: 43
Range:      [0.2508, 1.5553] bits
pos
0    0.250782
1    0.297012
2    0.358037
3    0.322563
4    0.383782
dtype: float64

The result equals df_logo.sum(axis=1) when computed with logo_type='information':

df_logo_info_manual = aa.AAlogo(logo_type="information").get_df_logo(df_parts=df_parts).sum(axis=1)
print(f"Matches get_df_logo('information').sum(axis=1): "
      f"{np.allclose(df_logo_info.values, df_logo_info_manual.values)}")
Matches get_df_logo('information').sum(axis=1): True
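The reason the row sum recovers the information content can be sketched in a few lines, assuming the common sequence-logo convention in which each residue's height is its probability multiplied by the column's information content. The helper logo_heights and the toy column are hypothetical and only illustrate the arithmetic, not the library's internals.

```python
import math


def logo_heights(probabilities, n_symbols=20):
    """Scale each residue probability by the column's information content.

    Hypothetical helper; assumes a uniform background over n_symbols.
    """
    entropy = -sum(p * math.log2(p) for p in probabilities.values() if p > 0)
    info = math.log2(n_symbols) - entropy
    return {aa: p * info for aa, p in probabilities.items()}


column = {"L": 0.5, "I": 0.25, "V": 0.25}  # residue probabilities at one position
heights = logo_heights(column)
total = sum(heights.values())

# Because the probabilities sum to 1, the heights sum to the information content
print(heights)
print(total)
```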

When labels is provided, df_parts is filtered to the rows where labels == label_test before the information content is computed. This allows comparing per-position conservation between groups.

df_info_pos = aalogo.get_df_logo_info(df_parts=df_parts, labels=labels, label_test=1)
df_info_neg = aalogo.get_df_logo_info(df_parts=df_parts, labels=labels, label_test=0)

print("Mean conservation (bits):")
print(f"  All samples (n={len(df_parts)}):  {df_logo_info.mean():.4f}")
print(f"  Positive    (n={(labels==1).sum()}): {df_info_pos.mean():.4f}")
print(f"  Negative    (n={(labels==0).sum()}): {df_info_neg.mean():.4f}")
print()
# Most conserved position per group
print("Most conserved position:")
print(f"  Positive: position {df_info_pos.idxmax()} ({df_info_pos.max():.4f} bits)")
print(f"  Negative: position {df_info_neg.idxmax()} ({df_info_neg.max():.4f} bits)")
Mean conservation (bits):
  All samples (n=126):  0.8008
  Positive    (n=63): 1.1192
  Negative    (n=63): 0.7748

Most conserved position:
  Positive: position 26 (1.8965 bits)
  Negative: position 22 (1.5067 bits)

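tmd_len fixes the TMD length before the information content is computed, which changes the number of positions in the output. Passing the maximum TMD length is equivalent to the default tmd_len=None.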

tmd_lengths = df_parts["tmd"].apply(len)
for tmd_len in [None, tmd_lengths.max(), tmd_lengths.min()]:
    df_info = aalogo.get_df_logo_info(df_parts=df_parts, tmd_len=tmd_len)
    print(f"tmd_len={str(tmd_len):>4}: {len(df_info)} positions, mean={df_info.mean():.4f} bits")
tmd_len=None: 43 positions, mean=0.8008 bits
tmd_len=  23: 43 positions, mean=0.8008 bits
tmd_len=  18: 38 positions, mean=0.7900 bits
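The position counts above are consistent with the output length being the sum of the part lengths. The arithmetic below assumes JMD-N and JMD-C lengths of 10 residues each, which is inferred from the printed output (43 - 23 = 20) rather than stated in this section; n_positions is a hypothetical helper.

```python
def n_positions(tmd_len, jmd_n_len=10, jmd_c_len=10):
    """Number of logo positions for concatenated jmd_n + tmd + jmd_c parts.

    The JMD lengths of 10 are an assumption inferred from the example output.
    """
    return jmd_n_len + tmd_len + jmd_c_len


for tmd_len in (23, 18):
    print(f"tmd_len={tmd_len}: {n_positions(tmd_len)} positions")
```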

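start_n only has an effect when tmd_len is shorter than the actual TMD length, so that the TMD is truncated. It determines which end of the TMD is kept: start_n=True keeps the N-terminal end, start_n=False the C-terminal end. Without truncation, for example when all TMDs already have length tmd_len, both settings give identical results.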

tmd_len_short = tmd_lengths.min()  # truncate to shortest
df_info_n = aalogo.get_df_logo_info(df_parts=df_parts, tmd_len=tmd_len_short, start_n=True)
df_info_c = aalogo.get_df_logo_info(df_parts=df_parts, tmd_len=tmd_len_short, start_n=False)

print(f"tmd_len={tmd_len_short} (truncation active):")
print(f"  start_n=True  (N-anchor): mean={df_info_n.mean():.4f}, max={df_info_n.max():.4f} bits")
print(f"  start_n=False (C-anchor): mean={df_info_c.mean():.4f}, max={df_info_c.max():.4f} bits")
print(f"  Results differ: {not df_info_n.equals(df_info_c)}")
tmd_len=18 (truncation active):
  start_n=True  (N-anchor): mean=0.7900, max=1.5553 bits
  start_n=False (C-anchor): mean=0.7801, max=1.5657 bits
  Results differ: True
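The truncation behaviour can be pictured with plain string slicing. This is a minimal sketch, assuming that the N-anchor keeps the first tmd_len residues and the C-anchor keeps the last tmd_len residues; truncate_tmd is a hypothetical helper and does not reproduce how the library handles TMDs shorter than tmd_len.

```python
def truncate_tmd(tmd, tmd_len, start_n=True):
    """Truncate a TMD sequence to tmd_len residues (hypothetical helper).

    start_n=True keeps the N-terminal end, start_n=False the C-terminal end.
    Sequences that are already short enough are returned unchanged.
    """
    if len(tmd) <= tmd_len:
        return tmd
    return tmd[:tmd_len] if start_n else tmd[-tmd_len:]


tmd = "LLVAGIILWFAVGT"  # toy sequence, not taken from DOM_GSEC
n_anchor = truncate_tmd(tmd, tmd_len=10, start_n=True)
c_anchor = truncate_tmd(tmd, tmd_len=10, start_n=False)
print(n_anchor)  # first 10 residues
print(c_anchor)  # last 10 residues
```

The two results differ only because the sequence is longer than tmd_len, which mirrors the condition under which start_n changes the output.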

pseudocount smooths the amino acid distribution before the information content is computed. Higher values reduce the information content (conservation appears lower), as the decreasing mean and maximum below show.

for pseudocount in [0.0, 0.1, 0.5, 1.0]:
    df_info = aalogo.get_df_logo_info(df_parts=df_parts, pseudocount=pseudocount)
    print(f"pseudocount={pseudocount}: mean={df_info.mean():.4f}, max={df_info.max():.4f} bits")
pseudocount=0.0: mean=0.8008, max=1.5553 bits
pseudocount=0.1: mean=0.7635, max=1.4819 bits
pseudocount=0.5: mean=0.6581, max=1.2794 bits
pseudocount=1.0: mean=0.5623, max=1.0972 bits
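The downward trend follows from how smoothing pulls each column toward a uniform distribution. The sketch below adds the pseudocount to the count of each of the 20 amino acids before normalizing, then computes log2(20) minus the Shannon entropy. It assumes a uniform background and no small-sample correction; column_information and the toy column are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 canonical amino acids


def column_information(column, pseudocount=0.0):
    """Information content (bits) of one alignment column (hypothetical helper)."""
    counts = Counter(column)
    # Add the pseudocount to every amino acid, observed or not
    smoothed = [counts.get(aa, 0) + pseudocount for aa in AMINO_ACIDS]
    total = sum(smoothed)
    if total == 0:
        return 0.0
    probs = [c / total for c in smoothed if c > 0]
    entropy = -sum(p * math.log2(p) for p in probs)
    return math.log2(len(AMINO_ACIDS)) - entropy


column = "LLLLLLLLIV"  # toy column: mostly leucine
values = [column_information(column, pc) for pc in (0.0, 0.1, 0.5, 1.0)]
for pc, value in zip((0.0, 0.1, 0.5, 1.0), values):
    print(f"pseudocount={pc}: {value:.4f} bits")
```

With few sequences the effect is large, because the pseudocounts make up a bigger share of the total counts in each column.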

characters_to_ignore lists the characters that are excluded before the information content is computed. Excluding gap characters (default '.-') prevents gaps from contributing to conservation scores.

for chars in [".-", ""]:
    df_info = aalogo.get_df_logo_info(df_parts=df_parts, characters_to_ignore=chars)
    print(f"characters_to_ignore='{chars}': mean={df_info.mean():.4f}, max={df_info.max():.4f} bits")
characters_to_ignore='.-': mean=0.8008, max=1.5553 bits
characters_to_ignore='': mean=0.8628, max=1.6257 bits