schemarecomb.ParentSequences
- class schemarecomb.ParentSequences(records, pdb_structure=None, auto_align=False, prealigned=False)
Parent protein sequences for recombinant library design.
This class sets up the data needed to run recombinant design algorithms, e.g. reading in and aligning parental sequences and finding a PDB structure or additional parents. Instances of this class are passed into functions further down the schemarecomb pipeline.
Note
The first sequence (records[0]) has special importance, as it’s used to find PDB structures or additional parents. For parent sequence sets with high enough identity (~60%), this should generally not be an issue, but you may get different results by changing the order of the sequences.
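To check whether a parent set falls in the identity range mentioned in this note, the identity of each parent to the first one can be computed from aligned strings. The following is a minimal sketch, not schemarecomb's own calculation: `percent_identity` and `identities_to_first` are hypothetical helper names, and skipping columns where both sequences have a gap is one common convention among several.

```python
from typing import List, Sequence


def percent_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Fraction of compared alignment columns with identical residues.

    Assumption: columns where both sequences are gaps are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap and b == gap:
            continue  # shared gap column carries no information
        compared += 1
        if a == b:
            matches += 1
    return matches / compared if compared else 0.0


def identities_to_first(aligned: Sequence[str]) -> List[float]:
    """Identity of every other aligned sequence to the first one."""
    first = aligned[0]
    return [percent_identity(first, other) for other in aligned[1:]]


# Toy aligned sequences, made up for illustration only.
aligned = ["MKV-LA", "MKVQLA", "MRV-LG"]
print(identities_to_first(aligned))
```

Reordering `aligned` changes which sequence acts as the reference, which is why the order of the records can change the results.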
- Parameters
records (list[SeqRecord]) – Parental amino acid sequences for schemarecomb calculations.
auto_align (bool) – If True, the records are aligned upon initialization.
prealigned (bool) – If True, the records are already aligned upon initialization. auto_align and prealigned must not both be True.
pdb_structure (Optional[_PDBStructure]) – Protein Data Bank structure that represents the three-dimensional structure of the aligned parent sequences. Will be automatically renumbered if either auto_align or prealigned is True.
- Attributes
records (list[SeqRecord]) – Parental amino acid sequences for schemarecomb calculations. BioPython SeqRecords contain sequence metadata such as the name of the sequence. The (unaligned) ith sequence may be obtained as a Python string with “str(parents[i].seq)”. Changing this attribute will delete the alignment attribute.
alignment (list[tuple[str, …]]) – Alignment of records, with the aligned strings in the same order as the source records. Present if and only if the instance is aligned. To set this attribute, call the align or new_alignment methods, which will renumber the pdb_structure attribute if present.
p0_aligned (str) – records[0] aligned, calculated from alignment. Present if and only if instance is aligned.
pdb_structure (schemarecomb.PDBStructure) – Protein Data Bank structure that represents the three-dimensional structure of the aligned parent sequences.
Examples
To start, you need a FASTA file with at least one parent sequence. For these examples, this file is “bgl3_sequence.fasta”, which contains the His-tagged beta-glucosidase sequence associated with the Protein Data Bank (PDB) entry “1GNX”.
The easiest, but slowest, way to use this class is to let it handle everything through web services. For example, the following script will build a six-parent alignment with roughly 70% identity between the parents and choose the PDB structure closest to the parents.
>>> getfixture('bgl3_mock_namespace')
>>> from schemarecomb import ParentSequences
>>> fn = 'tests/fixtures/bgl3_1-parent/bgl3_p0.fasta'
>>> parents = ParentSequences.from_fasta(fn)
>>> parents.obtain_seqs(6, 0.7)  # BLAST takes about 10 minutes.
>>> # [sr.name for sr in parents.records]
>>> parents.align()  # MUSCLE takes about a minute.
>>> parents.get_PDB()  # BLAST takes about 10 minutes.
>>> len(parents.records)
6
>>> # The following output is shortened for clarity.
>>> parents.p0_aligned
'----------------------MHHHHHHMVPAAQQ...WYAEVARTGVLPTA'
>>> parents.p0_aligned == parents.pdb_structure.renumbering_seq
True
You can skip the slow web queries if you already have a FASTA with the aligned parents and the PDB structure you want to use:
>>> from schemarecomb import ParentSequences
>>> from schemarecomb import PDBStructure
>>> pdb_fn = 'tests/fixtures/bgl3_full/1GNX.pdb'
>>> parents_fn = 'tests/fixtures/bgl3_full/bgl3_sequences_aln.fasta'
>>> pdb = PDBStructure.from_pdb_file(pdb_fn)
>>> parents = ParentSequences.from_fasta(
...     parents_fn,
...     pdb_structure=pdb,
...     prealigned=True
... )
>>> len(parents.records)
6
>>> parents.p0_aligned
'MHHHHHHMVPAAQQTAMA...RTGVLPTA-----'
>>> parents.p0_aligned == parents.pdb_structure.renumbering_seq
True
You can also save or load a ParentSequences as a JSON:
>>> from schemarecomb import ParentSequences
>>> from schemarecomb import PDBStructure
>>> tempdir = getfixture('tmpdir')  # pytest jargon, ignore this.
>>> pdb_fn = 'tests/fixtures/bgl3_full/1GNX.pdb'
>>> parents_fn = 'tests/fixtures/bgl3_full/bgl3_sequences_aln.fasta'
>>> pdb = PDBStructure.from_pdb_file(pdb_fn)
>>> parents = ParentSequences.from_fasta(
...     parents_fn,
...     pdb_structure=pdb,
...     prealigned=True
... )
>>> parents_fn = tempdir / 'parents.json'
>>> parents_json = parents.to_json()
>>> with open(parents_fn, 'w') as f:
...     f.write(parents_json)
...
339501
>>> with open(parents_fn, 'r') as f:
...     parents_json2 = f.read()
...
>>> parents2 = ParentSequences.from_json(parents_json2)
>>> # parents and parents2 are the same.
>>> parents.alignment == parents2.alignment
True
- add_from_candidates(candidate_sequences, num_final_sequences, desired_identity=None)
Add new parent sequences from list of candidates.
Finds the set of sequences in candidate_sequences that gives the smallest maximum difference between desired_identity and the calculated identity between any two sequences in the set.
- Parameters
candidate_sequences (list[SeqRecord]) – Sequences to choose from.
num_final_sequences (int) – Number of desired parent sequences. After the call, len(self.records) should equal this value.
desired_identity (Optional[float]) – Desired identity of new sequences compared to the first parent sequence, as a fraction between 0.0 and 1.0. Default: if there are already multiple parent sequences, the average identity between the first parent and the other sequences. Otherwise, 0.7 (70% identity).
- Raises
ValueError – if num_final_sequences is not greater than len(self.records) or if desired_identity is a float and not between 0.0 and 1.0, exclusive.
- Return type
None
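The selection criterion described for add_from_candidates (smallest maximum difference between desired_identity and any pairwise identity) can be illustrated with a brute-force search. This is only a sketch under stated assumptions, not the library's implementation: `choose_candidates` and `identity` are hypothetical names, sequences are plain equal-length strings rather than SeqRecords, and the pairwise comparison is assumed to include the existing parents.

```python
from itertools import combinations
from typing import List, Sequence


def identity(a: str, b: str) -> float:
    """Toy identity: fraction of matching positions in equal-length strings."""
    if len(a) != len(b):
        raise ValueError("this toy identity needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def choose_candidates(
    parents: Sequence[str],
    candidates: Sequence[str],
    num_final_sequences: int,
    desired_identity: float,
) -> List[str]:
    """Pick candidates minimizing the worst deviation from desired_identity."""
    if num_final_sequences <= len(parents):
        raise ValueError("num_final_sequences must exceed the current parent count")
    if not 0.0 < desired_identity < 1.0:
        raise ValueError("desired_identity must be between 0.0 and 1.0, exclusive")

    n_new = num_final_sequences - len(parents)
    best_combo = None
    best_score = float("inf")
    # Exhaustive search: fine for a handful of candidates, too slow for many.
    for combo in combinations(candidates, n_new):
        full_set = list(parents) + list(combo)
        worst = max(
            abs(identity(a, b) - desired_identity)
            for a, b in combinations(full_set, 2)
        )
        if worst < best_score:
            best_combo, best_score = combo, worst
    if best_combo is None:
        raise ValueError("not enough candidates to reach num_final_sequences")
    return list(best_combo)


# Made-up sequences for illustration only.
parents = ["AAAAAAAAAA"]
candidates = ["AAAAAAABBB", "AAAAAAAAAB", "BBBBBBBBBB", "AAAAAAACCC"]
chosen = choose_candidates(parents, candidates, 3, 0.7)
print(chosen)
```

The number of combinations grows quickly with the candidate pool, so a real implementation would likely need a heuristic or pre-filtering step.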
- align(run_locally=False)
Use the MUSCLE web service to align the records.
Sets the alignment attribute to the resulting alignment. The run_locally parameter is currently experimental and probably shouldn’t be used.
- Return type
None
- classmethod from_fasta(fasta_fn, **kwargs)
Construct instance from FASTA file.
- Parameters
fasta_fn (str) – Filename of FASTA file, including relative path.
**kwargs – Additional keyword args for __init__. For example, you can specify “auto_align=True” in this constructor.
- Return type
ParentSequences
- Returns
ParentSequences instance constructed from input FASTA file.
- classmethod from_json(in_json)
Construct instance from JSON.
- Parameters
in_json (str) – JSON-formatted string representing a ParentSequences.
- Return type
ParentSequences
- Returns
ParentSequences instance created from in_json.
- get_PDB()
Find a PDB structure for the parents using BLAST and the PDB.
The best structure is found by using BLAST to download candidate PDB sequences; the candidate with the largest minimum identity to the parents is then selected and set as the pdb_structure attribute.
- Parameters
parent_aln – Parent alignment used to query the PDB. Note that parent_aln[0] is used in the query and PDBStructure alignment.
- Raises
ValueError – If no matching PDB structure could be found.
- Return type
None
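The "largest minimum identity" rule described for get_PDB is a max-min selection: each candidate is scored by its worst identity to any parent, and the candidate with the best worst case wins. The sketch below illustrates only that rule, assuming the BLAST hits are already available as a dict. `pick_best_structure`, `identity`, and the candidate labels are hypothetical, and the toy identity works only on equal-length strings, whereas real sequences would need to be aligned first.

```python
from typing import Dict, Sequence


def identity(a: str, b: str) -> float:
    """Toy identity: fraction of matching positions in equal-length strings."""
    if len(a) != len(b):
        raise ValueError("this toy identity needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def pick_best_structure(parents: Sequence[str], candidates: Dict[str, str]) -> str:
    """Return the label of the candidate with the largest minimum identity."""
    if not candidates:
        # Mirrors the documented ValueError when nothing matches.
        raise ValueError("no matching PDB structure could be found")

    def worst_case(label: str) -> float:
        # A candidate is only as good as its least similar parent.
        return min(identity(candidates[label], parent) for parent in parents)

    return max(candidates, key=worst_case)


# Made-up parents and candidate sequences; labels are not real PDB IDs.
parents = ["AAAAAAAAAA", "AAAAAAAABB"]
candidates = {"cand_A": "AAAAAAAAAA", "cand_B": "AAAAAAAAAB"}
best = pick_best_structure(parents, candidates)
print(best)
```

Here `cand_A` matches the first parent perfectly but is only 80% identical to the second, while `cand_B` is 90% identical to both, so the max-min rule prefers `cand_B`.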
- new_alignment(aligned_sequences)
Add aligned sequences from records to instance.
Sets the alignment attribute to the input alignments.
- Parameters
aligned_sequences (list[str]) – Aligned sequences in the same order as the records attribute.
- Raises
ValueError – If provided sequences do not match the sequences in the records attribute.
- Return type
None
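Before calling new_alignment, it can help to confirm that each aligned string reduces to the matching record once gaps are removed, since a mismatch raises ValueError. The following is a hedged sketch of such a pre-check, not the library's own validation: `check_alignment` is a hypothetical helper, records are plain strings here, and "-" is assumed to be the gap character.

```python
from typing import Sequence


def check_alignment(
    records: Sequence[str], aligned: Sequence[str], gap: str = "-"
) -> None:
    """Raise ValueError if aligned does not correspond to records."""
    if len(records) != len(aligned):
        raise ValueError("need exactly one aligned sequence per record")
    if len({len(seq) for seq in aligned}) > 1:
        raise ValueError("aligned sequences must all have the same length")
    for i, (raw, aln) in enumerate(zip(records, aligned)):
        # Removing gaps must give back the original, unaligned sequence.
        if aln.replace(gap, "") != raw:
            raise ValueError(f"aligned sequence {i} does not match record {i}")


# Made-up sequences for illustration only.
records = ["MKVLA", "MKVQLA"]
aligned = ["MKV-LA", "MKVQLA"]
check_alignment(records, aligned)  # passes silently
```

With real data, the unaligned strings would come from `str(record.seq)` for each SeqRecord.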
- obtain_seqs(num_final_sequences, desired_identity=None)
Adds new sequences with BLAST.
- Parameters
num_final_sequences (int) – Number of desired parent sequences. After the call, len(self.records) should equal this value.
desired_identity (Optional[float]) – Desired identity of new sequences compared to the first parent sequence, as a fraction between 0.0 and 1.0. Default: if there are already multiple parent sequences, the average identity between the first parent and the other sequences. Otherwise, 0.7 (70% identity).
- Raises
ValueError – if num_final_sequences is not greater than len(self.records) or if desired_identity is a float and not between 0.0 and 1.0, exclusive.
- Return type
None
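The default for desired_identity described above (average identity to the first parent, falling back to 0.7 for a single parent) can be sketched as follows. This is an illustration of the documented rule, not schemarecomb's code: `default_desired_identity` and `identity` are hypothetical names, and the toy identity assumes equal-length, already aligned strings.

```python
from typing import Sequence


def identity(a: str, b: str) -> float:
    """Toy identity: fraction of matching positions in equal-length strings."""
    if len(a) != len(b):
        raise ValueError("this toy identity needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def default_desired_identity(
    aligned_parents: Sequence[str], fallback: float = 0.7
) -> float:
    """Average identity of the other parents to the first, else the fallback."""
    if len(aligned_parents) < 2:
        return fallback  # a single parent gives nothing to average over
    first = aligned_parents[0]
    values = [identity(first, other) for other in aligned_parents[1:]]
    return sum(values) / len(values)


# Made-up sequences for illustration only.
print(default_desired_identity(["AAAAAAAAAA"]))
print(default_desired_identity(["AAAAAAAAAA", "AAAAAAAABB", "AAAAAAAAAB"]))
```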
- to_json()
Convert instance to a JSON-formatted string.
- Return type
str
- Returns
Instance converted to a JSON string.