Docstring Style Guide
This is the single source of truth for how docstrings are written in
AAanalysis. The mature classes CPP,
AAclust, and dPULearn are the gold
standard — every public symbol should read as if written by the same hand.
The guide is enforced by an internal checker
(.claude/skills/docstrings/scripts/check_docstrings.py); run it
before and after touching any docstring. All other docstring notes in the
repository (the always-on .claude/rules/docstrings.md, the
docstrings skill, and the Docstring Style section of
CONTRIBUTING) are thin pointers to this page and define no rules of their
own.
Style basis: numpydoc + PEP 257. Sections appear in the numpydoc order
Parameters → Returns → Raises → Notes → Warnings → See Also → References →
Examples.
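The section order above lends itself to a mechanical check. The following is a minimal sketch and not the project's checker: the helper names `find_sections` and `sections_in_order` are hypothetical, and a section is assumed to be a known header line followed by a dashed underline.

```python
import re

# Section order as listed in this guide.
SECTION_ORDER = ["Parameters", "Returns", "Raises", "Notes",
                 "Warnings", "See Also", "References", "Examples"]


def find_sections(docstring):
    """Return the numpydoc section headers of a docstring in order of appearance."""
    lines = [line.strip() for line in docstring.splitlines()]
    found = []
    for header, underline in zip(lines, lines[1:]):
        # A section header is a known name followed by a dashed underline.
        if header in SECTION_ORDER and re.fullmatch(r"-{3,}", underline):
            found.append(header)
    return found


def sections_in_order(docstring):
    """Return True if the sections follow the order given in SECTION_ORDER."""
    ranks = [SECTION_ORDER.index(name) for name in find_sections(docstring)]
    return ranks == sorted(ranks)
```

A docstring that places See Also before Notes would make `sections_in_order` return `False`; sections that are simply absent do not count as a violation.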
Class docstrings
    class AAclust(Wrapper):
        """
        Amino Acid clustering (**AAclust**) class: a k-optimized clustering wrapper
        for selecting redundancy-reduced sets of numerical scales [Breimann24a]_.

        <One expanded paragraph of purpose; optionally a ``*``-bulleted breakdown.>

        .. versionadded:: 0.1.0

        Attributes
        ----------
        labels_ : array-like, shape (n_samples,)
            Cluster labels in the order of samples in ``X``.
        """
        def __init__(self, model_class=KMeans, verbose=True, random_state=None):
            """
            Parameters
            ----------
            random_state : int, optional
                The seed used by the random number generator. If a positive
                integer, results of stochastic processes are consistent, enabling
                reproducibility. If ``None``, stochastic processes are truly random.

            Notes
            -----
            * <cross-cutting caveats>

            See Also
            --------
            * :class:`AAclustPlot`: the respective plotting class.

            Examples
            --------
            .. include:: examples/aaclust.rst
            """
Invariants:
* Summary is a noun phrase (<Full Name> (**ACRONYM**) class ...) on the line after a blank first line, present tense; not an imperative verb.
* Citations are the exception, not the default. A class earns a [Key]_ citation only when it is an important class that a specific reference describes: its own paper, or the project paper [Breimann25]_ for the core γ-secretase CPP / dPULearn / TreeModel algorithms it covers. Most classes, including every data-prep / utility / helper class (preprocessors, loaders, ``NumericalFeature``, ``AAlogo``, …), carry no citation, and that is correct, not a gap. Never add one to satisfy a checker note. Verify before adding, and never invent one: the key must be defined in references.rst (the checker’s CITATION-UNDEFINED flags typo’d / fabricated keys) and the cited work must actually describe this class. CLASS-NO-CITATION is advisory: it is a reminder to confirm an important class isn’t missing its citation, not a prompt to cite utilities; a wrong citation is worse than none.
* .. versionadded:: follows the prose, before any section.
* The class docstring carries only Attributes (scikit-learn _-suffixed fit-state), documented as name_ : type, shape (...). Stateless classes omit it.
* Parameters belong in __init__, never in the class docstring. __init__ always has a docstring, ordered Parameters → Notes → See Also → Examples.
* Plot pairs are reciprocal: a Plot class summary reads Plotting class for :class:`<X>` ... [Key]_. and the logic/plot classes link each other under See Also.
Method & function docstrings
    def run(self, labels=None, ...):
        """
        Perform Comparative Physicochemical Profiling (CPP): creation and two-step
        filtering of interpretable sequence-based features.

        Parameters
        ----------
        labels : array-like, shape (n_samples,)
            Class labels for samples in sequence DataFrame (typically, test=1,
            reference=0).

        Returns
        -------
        df_feat : pd.DataFrame, shape (n_features, n_feature_info)
            Feature DataFrame; output-column schema as ``*`` bullets in Notes.

        See Also
        --------
        * :meth:`CPP.eval`: evaluate the resulting feature set.

        Examples
        --------
        .. include:: examples/cpp_run.rst
        """
Invariants:
* Summary is a verb phrase (imperative/present); no →/+/ arrow shorthand.
* Summary + description. The one-line summary is followed (after a blank line) by a short plain-language description before the first section: what it does in simple words, the cited tool/method [Key]_ if it integrates one, and the key :role: cross-references. The same holds for classes (the expanded paragraph after the noun-phrase summary). Trivial one-line accessors may keep just the summary; the checker’s SUMMARY-ONLY is advisory, a prompt to add the description where it helps. Write the description as natural flowing prose; do not prefix it with a bold rhetorical label that names the meta-idea (**Mental model.**, **When to use.**, **What it returns.**). State the content directly. (Bold for genuine emphasis or a structural *-bullet label in Notes is fine; this rule is only about replacing the explanatory paragraph with a bold heading.)
* Expand abbreviations on first use. The first time an abbreviation or acronym appears in a docstring, spell it out with the short form in parentheses, e.g. Command Line Interface (CLI), Position Weight Matrix (PWM), Find Individual Motif Occurrences (FIMO), and use the short form afterwards. Each docstring is self-contained, so re-introduce the term in every docstring that uses it. The bold (**ACRONYM**) in a class summary is the class-level form of this rule. Domain terms that have a .. glossary:: entry may instead be linked with :term:. Universally standard forms (e.g. DNA, 3D, ID, CPU, PDB) need no expansion.
* Returns is named (name : type) and matches the returned variable. Two type-only idioms are allowed: a bare class name (scikit-learn self-return, e.g. AAclust) and a polymorphic X or Y.
* Fixed-option parameters use Literal, not str. When a parameter accepts a closed, finite set of string options (the values a check_str_options / membership check validates against), type-hint it in the signature as typing.Literal["a", "b", ...] rather than a bare str, so the allowed set is self-documenting and IDE-checkable. Spell the members as inline string literals: Literal cannot reference the ut.X constants, even though the runtime validator still uses them. If the parameter also accepts None (default None or accept_none=True), wrap as Optional[Literal[...]] (never put None inside Literal); if it also accepts non-string values, use Union[Literal[...], <other>]. In the docstring, type the parameter with numpydoc set notation matching the members (name : {'remove', 'keep', 'gap'}, default='remove'), not name : str. Open or large sets (e.g. system font names) stay str.
* Explain each option as a nested bullet. When the options are not self-evident from their names, follow the set-notation type line with a short lead-in and a nested ``-``-bullet list: one bullet per option, each naming the option value (in double backticks) followed by : and a concise gloss, e.g. an option 'remove' documented as “drop sequences with non-canonical amino acids”. Keep the gloss to what the option is / when to pick it; defer fuller behaviour to Notes and point there with (see Notes). These per-option enumerations are the exception to the ``*``-bullet rule below: option sub-lists under a parameter use - (matching name / non_canonical_aa / logo_type), while Notes and See Also section bullets use *.
* The section header is Warnings, never Warns.
* List items use * bullets, not -.
* Every public method ends with Examples → exactly one .. include:: examples/<name>.rst (no other path or extension).
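The Literal and nested-bullet rules can be illustrated together. This is a sketch under stated assumptions: `clean_sequences` is a hypothetical function, not part of the AAanalysis API, and the behaviour of the 'keep' and 'gap' options is assumed for illustration (only the 'remove' gloss comes from this guide). The Examples section is omitted because the sketch has no notebook to include.

```python
from typing import List, Literal, get_args

CANONICAL_AA = set("ACDEFGHIKLMNPQRSTVWY")

# Members are spelled as inline string literals, not as constants.
NonCanonicalOption = Literal["remove", "keep", "gap"]


def clean_sequences(list_seq: List[str],
                    non_canonical_aa: NonCanonicalOption = "remove") -> List[str]:
    """
    Handle sequences that contain non-canonical amino acids.

    Parameters
    ----------
    list_seq : list of str
        Protein sequences.
    non_canonical_aa : {'remove', 'keep', 'gap'}, default='remove'
        How non-canonical amino acids are handled:

        - ``remove``: drop sequences with non-canonical amino acids.
        - ``keep``: leave all sequences unchanged (assumed behaviour).
        - ``gap``: replace each non-canonical amino acid with ``-`` (assumed behaviour).

    Returns
    -------
    list_seq : list of str
        Processed sequences.

    Raises
    ------
    ValueError
        If ``non_canonical_aa`` is not one of the allowed options.
    """
    options = get_args(NonCanonicalOption)
    # The runtime check mirrors the closed option set declared in the type hint.
    if non_canonical_aa not in options:
        raise ValueError(f"'non_canonical_aa' should be one of {options}")
    if non_canonical_aa == "keep":
        return list(list_seq)
    if non_canonical_aa == "remove":
        return [seq for seq in list_seq if set(seq) <= CANONICAL_AA]
    return ["".join(aa if aa in CANONICAL_AA else "-" for aa in seq)
            for seq in list_seq]
```

Deriving the runtime option tuple from the Literal with `typing.get_args` keeps the type hint and the validator from drifting apart; whether the project validates this way is not stated here.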
Recurring parameters (DRY)
Parameters that appear in many signatures share one baseline sentence; method-specific behaviour is a suffix, never a replacement. Describe the structure first, the use second.
    df_seq        DataFrame containing an ``entry`` column with unique protein
                  identifiers and a ``sequence`` column with full protein sequences.
    labels        Class labels for samples in sequence DataFrame (typically,
                  test=1, reference=0).
    n_jobs        Number of CPU cores (>=1) used for multiprocessing. If ``None``,
                  the number is optimized automatically. If ``-1``, all cores are
                  used. Overridden by ``options['n_jobs']`` when set.
    random_state  The seed used by the random number generator. If a positive
                  integer, results are reproducible; if ``None``, truly random.
To keep these from drifting, define each baseline once and inject it with a
pandas/seaborn-style mechanism (a _shared_docs dict + a @doc(...) /
Substitution decorator) rather than re-typing it. This is the adopted target;
where injection is not yet wired, copy the baseline verbatim.
A docstring is self-contained — document every parameter as its own entry.
DRY means reusing the same sentence, not collapsing parameters into a single
cross-method reference. Never write a lumped entry like
labels, n_filter, n_jobs, ... : See :meth:`run`. Same semantics. — repeat each
parameter (with its baseline sentence) so the reader never has to open another
method to learn what an argument does. The checker’s PARAM-UNDOCUMENTED flags
the signature parameters such a lump leaves effectively undocumented.
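The effect of a lumped entry can be shown with a small signature-versus-docstring comparison. This is a sketch in the spirit of PARAM-UNDOCUMENTED, not the checker's actual logic; `undocumented_params` and both example functions are hypothetical.

```python
import inspect
import re


def undocumented_params(func):
    """Return signature parameters that lack their own ``name : type`` entry."""
    docstring = inspect.getdoc(func) or ""
    # After dedenting, entry lines start at column 0, e.g. "labels : array-like".
    documented = set(re.findall(r"^(\w+) : ", docstring, flags=re.MULTILINE))
    params = [name for name in inspect.signature(func).parameters
              if name not in ("self", "cls")]
    return [name for name in params if name not in documented]


def eval_lumped(labels=None, n_jobs=None):
    """
    Evaluate features.

    Parameters
    ----------
    labels, n_jobs : See :meth:`run`. Same semantics.
    """


def eval_full(labels=None, n_jobs=None):
    """
    Evaluate features.

    Parameters
    ----------
    labels : array-like, shape (n_samples,)
        Class labels for samples in sequence DataFrame (typically, test=1,
        reference=0).
    n_jobs : int, optional
        Number of CPU cores (>=1) used for multiprocessing.
    """
```

Here `undocumented_params(eval_lumped)` reports both parameters, while `undocumented_params(eval_full)` reports none.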
Cross-references (See Also)
* Roles: :class: (classes), :meth: (methods), :func: (top-level / scikit-learn functions), :ref: (usage-principles pages), :term: (glossary terms, see below).
* See Also is a *-bulleted list; each entry is ``* :role:`Target`: gloss.`` with a single colon and the gloss after ``: ``. No bare ``name : desc`` numpydoc entries and no `` : `` (space-colon-space).
* Every cross-reference must resolve. A :class: / :meth: / :func: target (in See Also or inline prose) must name a real public symbol, e.g. CPP, CPP.run_num, aaanalysis.combine_dict_nums. Watch capitalization (AAlogo, not AALogo) and method names on the right class. The checker’s XREF-UNRESOLVED flags an internal target that does not resolve; external refs (pandas.DataFrame) are left alone.
* Order multi-layer links by documentation layer. When a See Also (or inline prose) links out to other documentation layers, reference them in the order Usage Principles → Tables → Tutorials. Add an external-library reference only when absolutely necessary.
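The entry format for See Also can be expressed as a regular expression. The pattern below is an illustrative sketch, not the checker's implementation; `SEE_ALSO_ENTRY` and `malformed_see_also` are hypothetical names, and the check covers only the format, not whether the target resolves.

```python
import re

# "* :role:`Target`: gloss" with a single colon directly after the closing backtick.
SEE_ALSO_ENTRY = re.compile(r"^\* :(?:class|meth|func|ref|term):`[^`]+`: \S")


def malformed_see_also(entries):
    """Return the See Also entries that do not follow the house format."""
    return [entry for entry in entries if not SEE_ALSO_ENTRY.match(entry.strip())]


entries = [
    "* :class:`AAclustPlot`: the respective plotting class.",
    "* :meth:`CPP.eval`: evaluate the resulting feature set.",
    "AAclustPlot : the respective plotting class.",            # bare numpydoc entry
    "* :class:`AAclustPlot` : the respective plotting class.",  # space-colon-space
    "- :class:`CPP`: wrong bullet character.",
]
```

Calling `malformed_see_also(entries)` returns the last three entries and accepts the first two.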
Citations
* Cite inline with [Key]_ only. Never inline a full .. [Key] Author, Year, ... reference, a raw URL, or (Author et al. Year) free text.
* All bibliography entries live in docs/source/index/references.rst, grouped by topic. Key format: single first-author [AuthorYY] ([Song12]), two-author [FirstSecondYY] ([ElkanNoto08]), same author/year gets a trailing letter ([Breimann24a]).
* Pick the few most relevant references per symbol: 1–2 per major method, plus the project paper ([Breimann25]_) at the class level only for the classes that paper actually describes (the core CPP / dPULearn / TreeModel pipeline). It is not a default stamp: a class the cited work does not cover carries no class-level citation (see the class-summary rule above).
* Every external tool / method AAanalysis integrates must be cited and explained. When a method wraps or runs an external tool (DSSP, Chainsaw, Merizo, AFragmenter, MEME / FIMO, cd-hit, mmseqs2, logomaker, SHAP, …), name it with a [Key]_ citation (its paper, defined in references.rst; verify it exists, never a bare repo URL) and describe what the tool does in one plain sentence. Example: 'chainsaw' ([Wells24]_): a fully-convolutional neural network that predicts domain boundaries from a PDB / CIF structure.
* Never reference internal decision records (ADRs) in docstrings. ADRs (docs/adr/) are internal to the development process; docstrings are user-facing. If an ADR documents a user-visible choice (why a parameter exists, how an algorithm was selected), extract the why as plain language in the docstring and never cite the ADR number. (Developer-facing notes such as CONTEXT.md may cite ADRs; user docstrings may not.)
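Verifying that a cited key is defined can be scripted. The sketch below is in the spirit of CITATION-UNDEFINED but is not the checker's code; the function names are hypothetical, and the key pattern (letters, two digits, optional trailing letter) is an assumption derived from the key format described above.

```python
import re

# Inline citation such as [Breimann24a]_ (letters, two-digit year, optional letter).
CITATION = re.compile(r"\[([A-Za-z]+\d{2}[a-z]?)\]_")
# Bibliography definition line such as ".. [Breimann24a] ...".
DEFINITION = re.compile(r"^\.\. \[([^\]]+)\]", flags=re.MULTILINE)


def defined_keys(references_rst):
    """Return the set of citation keys defined in a references file's text."""
    return set(DEFINITION.findall(references_rst))


def undefined_citations(docstring, keys):
    """Return cited keys that are missing from the defined keys, sorted."""
    return sorted(set(CITATION.findall(docstring)) - set(keys))


# Placeholder bibliography text; the reference bodies are deliberately elided.
references_rst = (".. [Breimann24a] <reference text>\n"
                  ".. [Breimann25] <reference text>\n")
docstring = "Clustering wrapper [Breimann24a]_ used in the pipeline [Breiman25]_."
```

With these inputs, `undefined_citations(docstring, defined_keys(references_rst))` reports the misspelled key `Breiman25`. A check like this cannot tell whether the cited work actually describes the symbol; that part stays a manual review.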
Versioning & deprecation
* .. versionadded:: X.Y.Z (true first-release version) on every public class and function; .. versionchanged:: X.Y.Z when behaviour changes.
* Parameter-level directives: when a parameter or option is added/changed after its class, annotate it inside the parameter description:

      return_stats : bool, default=False
          If ``True``, also return the filter-funnel statistics.

          .. versionadded:: 1.1.0

* Deprecation uses .. deprecated:: X.Y.Z in the docstring plus a DeprecationWarning shim (see api-stability); a renamed/removed public symbol keeps a one-minor-release shim before removal.
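A deprecation shim of the kind described above might look as follows. This is a generic sketch using the standard `warnings` module; `old_name` and `new_name` are hypothetical symbols, and the project's own shim (see api-stability) may be structured differently.

```python
import warnings


def new_name(x):
    """
    Double a value (hypothetical replacement function).

    .. versionadded:: X.Y.Z
    """
    return 2 * x


def old_name(x):
    """
    Double a value.

    .. deprecated:: X.Y.Z
        Use :func:`new_name` instead.
    """
    # stacklevel=2 attributes the warning to the caller, not to this shim.
    warnings.warn("'old_name' is deprecated; use 'new_name' instead.",
                  DeprecationWarning, stacklevel=2)
    return new_name(x)
```

The shim keeps the old call working while delegating to the new symbol, so behaviour stays identical during the transition release.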
Class abbreviations
Every public class has one canonical abbreviation, used identically as the
example-notebook instance variable (aac = aa.AAclust()) and the
example-notebook filename stem (examples/feature_engineering/aac_fit.ipynb).
This keeps the API, the example notebooks, and the rendered docs in lock-step.
The registry below is the single source of truth and is enforced by
tests/unit/api_tests/test_class_abbreviation_registry.py (every public class
is registered; every <abbr> = aa.<Class>() and notebook filename matches).
Rules:
* AA* classes keep the aa prefix; acronyms stay whole (CPP → cpp); the established public spelling is kept (dPULearn → dpul).
* A plot pair is the base abbreviation plus _plot (CPPPlot → cpp_plot).
* Legacy/incumbency wins. Existing short forms are kept (aac, aal); the aa prefix is enforced where missing (AAWindowSampler → aaws); and when two classes would collide the newer one takes the longer form, so SeqMut stays seqmut, leaving sm free for ShapModel.
* A class instance is always named the bare abbreviation: cpp = aa.CPP(...), never cpp_res / cpp_dom. If you build the same class repeatedly (e.g. one CPP per prediction level), reassign the bare name and let the outputs carry the qualifier (df_feat_res, X_res). A <abbr>_<qualifier> instance name is allowed only for a genuinely concurrent second instance that cannot be restructured (aaws_strict beside aaws), never an unrelated word.
Class | Abbr. | Extra
|---|---|---|
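The instance-naming rule can be checked with a short scan over example code. The sketch below is not the registry test referenced above: `ABBREVIATIONS` is a partial, illustrative mapping containing only pairs named in the rules, and `misnamed_instances` is a hypothetical helper. It flags every non-bare name, so an allowed concurrent instance such as aaws_strict would still need a human decision.

```python
import re

# Partial, illustrative registry: only pairs named in the rules above.
ABBREVIATIONS = {
    "AAclust": "aac",
    "CPP": "cpp",
    "CPPPlot": "cpp_plot",
    "dPULearn": "dpul",
    "AAWindowSampler": "aaws",
    "SeqMut": "seqmut",
}

# Matches assignments of the form "<name> = aa.<Class>(".
INSTANCE = re.compile(r"^\s*(\w+)\s*=\s*aa\.(\w+)\(", flags=re.MULTILINE)


def misnamed_instances(code):
    """Return (name, class, expected) for instances not named by the bare abbreviation."""
    flagged = []
    for name, cls in INSTANCE.findall(code):
        expected = ABBREVIATIONS.get(cls)
        # Classes missing from this partial registry are skipped, not flagged.
        if expected is not None and name != expected:
            flagged.append((name, cls, expected))
    return flagged


snippet = "aac = aa.AAclust()\ncpp_res = aa.CPP(...)\ndpul = aa.dPULearn()\n"
```

For the snippet above, `misnamed_instances(snippet)` returns `[("cpp_res", "CPP", "cpp")]`.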
Output / data-object names
The objects passed between steps have canonical names too — most are defined in
the project glossary (CONTEXT.md). Use these consistently so a snippet reads
the same everywhere; this table is a reference, not a test-enforced gate (only
the class-instance names above are checked).
Variable |
Object (producer) |
|---|---|
|
sequence frame ( |
|
AAontology scales / scale categories ( |
|
assembled parts ( |
|
split specification ( |
|
feature frame, canonical schema ( |
|
feature matrix ( |
|
evaluation results ( |
|
feature positions ( |
|
sequence-logo frames ( |
|
mutation impact ( |
|
PU frame ( |
Qualifiers belong on the data level. A variant of a data object takes a
<name>_<qualifier> suffix (df_feat_res, X_res, df_cat_selected,
df_top15) — used only when you actually have a variant, not stamped onto
every example. Class instances stay the bare abbreviation (see above).
Label parameter names
Each labeling concept has one canonical parameter name, used consistently across the classes a user combines in one workflow. Several names look similar but name different concepts — keep them distinct rather than collapsing them.
Concept |
Canonical name |
Notes |
|---|---|---|
Contrast markers (the two groups being compared) |
|
The positive/test group vs the reference group of a contrast —
|
Single labeling (1D) |
|
One per-sample class-label vector, shape |
List of labelings (2D, multi-dataset) |
|
Several labelings stacked, shape |
Target-class selector (a single class to attribute) |
|
|
Examples & verification
* Examples are authored as notebooks under examples/<subpackage>/ and pulled in with .. include:: examples/<name>.rst (converted at docs-build time).
* Keep them small, seeded, and deterministic. Example notebooks and tutorials are executed in CI (pytest --nbmake examples/ tutorials/) so they cannot rot; tiny self-contained snippets may additionally use >>> doctests run with --doctest-modules.
* Commit notebooks with their executed outputs. The docs render the stored cell outputs (nbsphinx_execute = 'never' in conf.py), and create_notebooks_docs.py converts each notebook to .rst from those saved outputs. A cell with no saved output renders no figure or table on Read the Docs, even though the blocking CI (which only checks that cells run) stays green, so the gap is invisible until you look at the built page. After editing any example/tutorial cell (including programmatic NotebookEdit, which clears the cell’s outputs), re-run the whole notebook and save it with outputs, e.g. jupyter nbconvert --to notebook --execute --inplace examples/<subpackage>/<name>.ipynb, then confirm the figures are embedded before committing.
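Because a notebook file is plain JSON, unexecuted cells can be spotted before committing. The following sketch is a hypothetical pre-commit helper, not an existing project script; it assumes the nbformat 4 layout, in which each code cell stores an `execution_count` that is null until the cell has been run.

```python
import json


def unexecuted_code_cells(notebook):
    """Return indices of non-empty code cells that were saved without being run."""
    missing = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        # "source" may be a single string or a list of lines.
        source = "".join(cell.get("source", ""))
        if source.strip() and cell.get("execution_count") is None:
            missing.append(index)
    return missing


def load_notebook(path):
    """Load a notebook file as a plain dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Minimal in-memory notebook: cell 1 was executed, cell 2 had its outputs cleared.
example_notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["Concept first."]},
        {"cell_type": "code", "source": ["x = 1\n", "x"], "execution_count": 1,
         "outputs": [{"output_type": "execute_result"}]},
        {"cell_type": "code", "source": "x + 1", "execution_count": None,
         "outputs": []},
    ]
}
```

For `example_notebook`, the helper returns `[2]`. An executed cell can still legitimately have no output (an import, for example), so this only catches cells that were never run; confirming that figures are embedded remains a visual check.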
Notebook content & structure
An example notebook teaches the symbol, it does not merely call it. Use this order:
1. Concept first (opening markdown cell). In natural prose before any code, explain what the method/class does, when and why to use it, and what it returns. Write it as flowing text; do not prefix the paragraphs with bold rhetorical labels (**Mental model.**, **When to use.**, **What it returns.**). State the content directly.
2. Minimal example (code cell). The smallest seeded, runnable call, with output.
3. Parameter walkthrough (markdown + code). Introduce every public parameter, with one cell per parameter group: parameters that belong together share a cell (e.g. jmd_n_len / jmd_c_len, or a family of max_* thresholds). Each group gets a short markdown note (what it controls) and a code cell showing its effect on the result. No parameter may be left uncovered.
4. Show output so the docs render it; keep every cell small, seeded, deterministic. The notebook must be committed with its executed outputs (figures + tables); see Examples & verification above.
Glossary cross-links
Domain terms (df_seq, dict_num, pseudo-scale, entry, …) are
defined once in a Sphinx .. glossary:: (sourced from the project glossary) and
referenced from docstrings via :term:`dict_num` so a reader can click through
to the canonical definition.
Math
Render formulas with .. math:: (or inline :math:`...`) rather than ASCII,
e.g. in the metrics functions (AUC*, BIC, KLD).
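As an illustration of the math directives, the sketch below documents the generic textbook form of the Bayesian Information Criterion. `bic_generic` is a hypothetical function written for this example; it is not the AAanalysis metric, whose definition may differ. The raw string prefix keeps the backslashes in the formula intact.

```python
import math


def bic_generic(log_likelihood, n_params, n_samples):
    r"""
    Compute the Bayesian Information Criterion (BIC) in its generic form.

    Parameters
    ----------
    log_likelihood : float
        Maximized log-likelihood of the model.
    n_params : int
        Number of estimated parameters.
    n_samples : int or float
        Number of observations (> 0).

    Returns
    -------
    bic : float
        BIC value for the model.

    Notes
    -----
    * The criterion is computed as

      .. math::

          \mathrm{BIC} = k \ln(n) - 2 \ln(\hat{L})

      where :math:`k` is the number of parameters, :math:`n` the number of
      observations, and :math:`\ln(\hat{L})` the maximized log-likelihood.
    """
    bic = n_params * math.log(n_samples) - 2.0 * log_likelihood
    return bic
```

Without the `r` prefix, sequences such as `\h` in `\hat` would trigger invalid-escape warnings, which is a common reason formulas fail to render as intended.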
Conformance checklist
A docstring is house-style if every applicable item holds. The right column is
the code emitted by the internal checker. The checker separates defects
(hard violations — the run fails only on these) from advisory notes
(CLASS-NO-CITATION — never fails, since utility classes legitimately omit a
citation), and skips UNDER CONSTRUCTION stubs entirely (a class whose summary
starts UNDER CONSTRUCTION, or a method whose body is just
raise NotImplementedError). 0 defect(s) therefore means the convention is
satisfied for every implemented public symbol.
Rule |
Checker code |
|---|---|
Class summary is a noun phrase (not a verb) |
|
Class summary ends with a |
|
Every |
|
|
|
|
|
|
|
Public symbol has a docstring |
|
Method summary has no arrow shorthand |
|
|
|
Public method ends with an |
|
Recurring params reuse the baseline sentence |
|
Citations use |
|
|
|
|
|
|
|
|
|
Every |
|
|
|
A body that raises documents a |
|
Summary is followed by a plain-language description (advisory) |
|
The build itself is the final gate: cd docs && make html (ideally with
SPHINXOPTS="-W") must finish without warnings — broken section underlines,
inline-literal RST errors, and unresolved .. include:: targets surface only
there, not in the structural checker.
Tooling
# audit the whole package (or a single file)
python .claude/skills/docstrings/scripts/check_docstrings.py aaanalysis/
# auto-fix the safe mechanical drift (Warns→Warnings, See-Also colon spacing)
python .claude/skills/docstrings/scripts/check_docstrings.py --fix aaanalysis/
--fix applies only the mechanical subset; every other finding is an
author-side fix against the templates above.
API reference order
The API reference (docs/source/api.rst) is ordered to read as the analysis
pipeline, not alphabetically. A newly integrated public symbol must be slotted
in by these rules:
1. Sections follow the workflow: Data Handling → Sequence Analysis → Feature Engineering → PU Learning → Explainable AI → Protein Design → Utility Functions (data in → analyse → engineer features → model → design → helpers).
2. Within a section, follow data flow: inputs/loaders first, then the processing classes, then combiners/outputs (Data Handling = loaders → preprocessors → combine_dict_nums).
3. Parallel-modality families go sequence → structure → embedding → annotation (the sequence-to-structure-to-function logic), e.g. the preprocessors: SequencePreprocessor, StructurePreprocessor, EmbeddingPreprocessor, AnnotationPreprocessor.
4. A logic class is immediately followed by its ``*Plot`` pair, and close variants form one contiguous family: AAclust → AAclustPlot; the CPP family CPP → CPPGrid → CPPPlot; dPULearn → dPULearnPlot.
5. Core before pro where modality does not dictate otherwise (TreeModel before ShapModel).
6. Group functions by kind in Utility Functions: the comp_* metrics together, then display/options, then the plot_* helpers.
7. Every public symbol appears in ``api.rst`` exactly once: __all__ (incl. the commented pro entries) and api.rst must match. The checker’s API-INDEX-MISSING / API-INDEX-STALE enforces this coverage; placement within the section (rules 1–6) is a human call.
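The coverage rule amounts to a set comparison between the public names and the entries listed in the API reference. The sketch below is hypothetical and not the checker's implementation; how the names are extracted from `__all__` and from api.rst is left out, and the example lists are illustrative.

```python
from collections import Counter


def api_index_diff(public_names, api_entries):
    """Compare public names with API reference entries."""
    public, listed = set(public_names), set(api_entries)
    counts = Counter(api_entries)
    return {
        # Public symbols with no entry in the API reference.
        "missing": sorted(public - listed),
        # Entries that no longer correspond to a public symbol.
        "stale": sorted(listed - public),
        # Entries listed more than once.
        "duplicated": sorted(name for name, n in counts.items() if n > 1),
    }


# Illustrative inputs only; not the real contents of __all__ or api.rst.
public_names = ["AAclust", "AAclustPlot", "CPP", "CPPPlot"]
api_entries = ["AAclust", "AAclustPlot", "CPP", "CPP", "OldClass"]
```

For these inputs, `api_index_diff` reports `CPPPlot` as missing, `OldClass` as stale, and `CPP` as duplicated. Ordering within a section is not something a comparison like this can judge.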