cogent3.core.alignment.SequenceCollection

class SequenceCollection(*, seqs_data: SeqsDataABC, moltype: c3_moltype.MolType[Any], info: dict[str, Any] | InfoClass | None = None, source: PathType | None = None, annotation_db: AnnotationDbABC | list[AnnotationDbABC] | None = None, name_map: Mapping[str, str] | None = None, is_reversed: bool = False)

A container of unaligned sequences.

Attributes
annotation_db

the annotation database for the collection

modified

collection is a modification of underlying storage

name_map

returns mapping of seq names to parent seq names

names

returns the names of the sequences in the collection

num_seqs

the number of sequences in the collection

seqs

iterable of sequences in the collection

storage

the unaligned sequence storage instance of the collection

Methods

add_feature(*[, seqid, parent_id, strand])

add feature on named sequence

add_seqs(seqs, **kwargs)

Returns new collection with additional sequences.

apply_pssm([pssm, path, background, ...])

scores sequences using the specified pssm

copy_annotations(seq_db)

copy annotations into attached annotation db

count_ambiguous_per_seq()

Counts of ambiguous characters per sequence.

count_kmers([k, use_hook])

return kmer counts for each sequence

counts([motif_length, include_ambiguity, ...])

counts of motifs

counts_per_seq([motif_length, ...])

counts of motifs per sequence

degap([storage_backend])

returns collection sequences without gaps or missing characters.

distance_matrix([calc])

Estimated pairwise distance between sequences

dotplot([name1, name2, window, threshold, ...])

make a dotplot between two sequences.

drop_duplicated_seqs()

returns self without duplicated sequences

duplicated_seqs()

returns the names of duplicated sequences

entropy_per_seq([motif_length, ...])

Returns the Shannon entropy per sequence.

from_rich_dict(data)

returns a new instance from a rich dict

get_ambiguous_positions()

Returns dict of seq:{position:char} for ambiguous chars.

get_features(*[, seqid, biotype, name, ...])

yields Feature instances

get_identical_sets([mask_degen])

returns sets of names for sequences that are identical

get_lengths([include_ambiguity, allow_gap])

returns sequence lengths as a dict of {seqid: length}

get_motif_probs([alphabet, ...])

Return a dictionary of motif probs, calculated as the averaged frequency across sequences.

get_seq(seqname[, copy_annotations])

Return a Sequence object for the specified seqname.

get_seq_names_if(f[, negate])

Returns list of names of seqs where f(seq) is True.

get_similar(target, min_similarity, ...)

Returns new SequenceCollection containing sequences similar to target.

get_translation([gc, incomplete_ok, ...])

translate sequences from nucleic acid to protein

has_annotation_db()

returns True if self has annotation db

has_terminal_stop([gc, strict])

Returns True if any sequence has a terminal stop codon.

is_ragged()

returns True if sequences are of different lengths

iter_seqs([seq_order])

Iterates over sequences in the collection, in order.

make_feature(*, feature, **kwargs)

create a feature on named sequence, or on the collection itself

pad_seqs([pad_length])

Returns copy in which sequences are padded with the gap character to same length.

probs_per_seq([motif_length, ...])

return frequency array of motifs per sequence

rc()

Returns the reverse complement of all sequences in the collection.

renamed_seqs(renamer)

Returns new collection with renamed sequences.

replace_annotation_db(value[, check])

public interface to assigning the annotation_db

reverse_complement()

Returns the reverse complement of all sequences in the collection.

set_repr_policy([num_seqs, num_pos, ...])

specify policy for repr(self)

strand_symmetry([motif_length])

returns dict of strand symmetry test results per seq

take_seqs(names[, negate, copy_annotations])

Returns new collection containing only specified seqs.

take_seqs_if(f[, negate])

Returns new collection containing seqs where f(seq) is True.

to_dict()

Return a dictionary of sequences.

to_dna()

returns copy of self as a collection of DNA moltype seqs

to_fasta([block_size])

Return collection in Fasta format.

to_html([name_order, wrap, limit, colors, ...])

returns html with embedded styles for sequence colouring

to_json()

returns json formatted string

to_moltype(moltype)

returns copy of self with changed moltype

to_rich_dict()

returns a json serialisable dict

to_rna()

returns copy of self as a collection of RNA moltype seqs

trim_stop_codons([gc, strict])

Removes any terminal stop codons from the sequences

write(filename[, format_name])

Write the sequences to a file, preserving order of sequences.

Notes

Should be constructed using make_unaligned_seqs().