5 Dec 2011 15:48
Re: slicing in Bio.PDB.Chain.__getitem__() ?
Hongbo Zhu 朱宏博 <macrozhu <at> gmail.com>
2011-12-05 14:48:25 GMT
2011-12-05 14:48:25 GMT
On Mon, Dec 5, 2011 at 2:50 PM, Peter Cock <p.j.a.cock <at> googlemail.com>wrote: > On Mon, Dec 5, 2011 at 1:38 PM, Hongbo Zhu 朱宏博 <macrozhu <at> gmail.com> wrote: > > > >> Perhaps I misunderstood - I would not want to allow the syntax > >> mychain[(' ', 2, ' '): (' ', 40, ' ')] which is unclear, rather only > allow > >> the user to use mychain[2:41] which requires Python counting. > > > > But even in mychain[2:41], the 2 and 41 should be residue sequence > number. > > Then it is consistent with the current acceptable syntax mychain, > where 2 > > also refers to a sequence number. At the moment, BioPython also > > accepts mychain[(' ', 2, ' ')]. So I think mychain[(' ', 2, ' '): (' ', > 40, > > ' ')] would be just a nature extension of mychain[(' ', 2, ' ')]. > > > > According to the source code, mychain is considered an abbreviation of > > mychain[(' ', 2, ' ')]. Internally, mychain will be translated to > > mychain[(' ', 2, ' ')] by function Bio.PDB.Chain.__translate_id(). So if > > mychain[2:4] would be allowed, internally it would also > > be first translated to mychain[(' ', 2, ' '): (' ', 40, ' ')]. So in my > > point of view, mychain[2:4] is just an abbreviation for mychain[(' ', 2, > ' > > '): (' ', 40, ' ')], just like mychain is a short version of > mychain[(' > > ',2,' ')]. > > > > hongbo > > I've never really liked these strange tuple IDs, which are usually > but not always full of empty values. I understand some of > the corner cases they handle, but they are very complicated. > This seems to be the problem of PDB. I don't know how other packages handle the issue. In addition, I once proposed to remove the HETERO-flag in the residue ID. http://biopython.org/pipermail/biopython-dev/2011-January/008640.html It is only retained for the backwards compatibility with PDB files before remediation in 2007. Removing only HETERO-flag does not solve the problem totally, but to some extent (say, around 50%). I think, one possible solution is to treat residue ID always as a string '%4d%s' % (resnum, icode) instead of a tuple composed of resnum plus icode (if we do not consider HETERO-flag). This way, biotpyhon also serves to remind users that icode is indispensable for uniquely locating a residue even if icode is empty. But this would lead to formidable API change. > You cannot assume 2 will map to (' ', 2, ' ') in general - this > is what the _translate_id method handles. Consider the case > where you have sliced the Chain as discussed, since the > first two elements have been removed, that mapping will shift. > I am afraid I don't concur. As a matter of fact, the mapping are internally fine-tuned by comparing the input residue ID to a list of all residue IDs so the correct index values are obtained for the slicing: res_id_list = [r.id for r in self.get_iterator()] start_index = res_id_list.index(self._translate_id(start)) stop_index = res_id_list.index(self._translate_id(stop)) return self.get_list()[start_index:stop_index:step] It is like mychain, this 2 will be first translated to (' ',2,' ') and this residue ID is searched in the dictionary indexed by all residues IDs to locate the correct residue. > We definitely would need a test case covering non-trivial > ID tuples (e.g. using insertion codes), and tests slicing a > previously sliced Chain. > PDB entry 1h4w is a good example with icode and the sequence of chain A starts with resnum 16. def get_slice(self, start, end, step=None): """Return a slice of the chain from start to end (including end position) Arguments: o start - (string, int, string) or int o end - (string, int, string) or int o step - None or int """ res_id_list = [r.id for r in self.get_iterator()] start_index = res_id_list.index(self._translate_id(start)) end_index = res_id_list.index(self._translate_id(end)) return self.get_list()[start_index:end_index+1:step] In : mychain.get_slice(182,189) Out: [<Residue CYS het= resseq=182 icode= >, <Residue VAL het= resseq=183 icode= >, <Residue GLY het= resseq=184 icode= >, <Residue PHE het= resseq=184 icode=A>, <Residue LEU het= resseq=185 icode= >, <Residue GLU het= resseq=186 icode= >, <Residue GLY het= resseq=187 icode= >, <Residue GLY het= resseq=188 icode= >, <Residue LYS het= resseq=188 icode=A>, <Residue ASP het= resseq=189 icode= >] > Peter > -- -- Hongbo _______________________________________________ Biopython-dev mailing list Biopython-dev <at> lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython-dev