30 Jun 11:12
Re: Bioperl-l Digest, Vol 74, Issue 25
Paola Bisignano <paola.bisignano <at> gmail.com>
2009-06-30 09:12:49 GMT
Hi,
I need a little help parsing a file. I looked through the BioPerl
modules, but there are so many that I don't know where to start. I
found modules for many databases and web sites, but none for my
favorite, PDBsum, so I have parsed a lot of things on my own, even
though I am new to Perl. Now I need help: I have to parse a FASTA
output file produced by a sequence alignment search, and I need to
extract the aligned sequences only for the PDB entries in my list.
My FASTA file looks like this:
Query: /ebi/research/thornton/tmp/sas307986/seq.fasta
1>>>Sequence 3e7e:A - 333 aa
Library: /ebi/research/thornton/www/databases/html/pdbsum/data/pdblib
17840403 residues in 79353 sequences
opt E()
< 20 286 0:===
22 1 0:= one = represents 135 library sequences
24 1 0:=
26 0 2:*
28 21 18:*
30 36 109:*
32 237 421:== *
34 956 1140:========*
36 1924 2342:=============== *
38 3591 3871:=========================== *
40 4904 5400:===================================== *
42 6750 6600:================================================*=
44 7145 7281:=====================================================*
46 8047 7416:======================================================*=====
.........
>>2np8:A (159 aa)
initn: 125 init1: 72 opt: 136 Z-score: 168.6 bits: 38.5 E(): 0.011
Smith-Waterman score: 136; 26.0% identity (57.1% similar) in 154 aa
overlap (59-204:13-153)
10 20 30 40 50 60
Sequen NFIVGNPWDDKLIFKLLSGLSKPVSSYPNTFEWQCKLPAIKPKTEFQLGSKLVYVHHLLG
::
2np8:A QWALEDFEIGRPLG
10
70 80 90 100 110
Sequen EGAFAQVYEATQNKQKFVL--KVQKPANPWEFYIGTQLMER--LKPSMQH-MFMKFYSAH
.: :..:: : ....::.: :: :. . . :: .. .. ..: ....:.
2np8:A KGKFGNVYLAREKQSKFILALKVLFKAQLEKAGVEHQLRREVEIQSHLRHPNILRLYG--
20 30 40 50 60 70
120 130 140 150 160 170
Sequen LFQNGS--VLVGELYSYGTLLNAINLYKNTPEKVMPQGLVISFAMRMLYMIEQVHDCEII
:.... :. : ::. .. .. :. . .. .. . :. ..:
2np8:A YFHDATRVYLILEYAPLGTVYRELQKLSKFDEQR-----TATYITELANALSYCHSKRVI
80 90 100 110 120
180 190 200 210 220 230
Sequen HGDIKPDNFILGNGFLEQSAG-LALIDLGQSIDMKLFPKGTIFTAKCETSGFQCVEMLSN
: ::::.:..:: ::: : . :.: :.
2np8:A HRDIKPENLLLG------SAGELKIADFGWSVHAPSSR
130 140 150
240 250 260 270 280 290
Sequen KPWNYQIDYFGVAATVYCMLFGTYMKVKNEGGECKPEGLFRRLPHLDMWNEFFHVMLNIP
300 310 320 330
Sequen DCHHLPSLDLLRQKLKKVFQQHYTNKIRALRNRLIVLLLEC
>>2ojg:A (337 aa)
initn: 85 init1: 53 opt: 140 Z-score: 168.1 bits: 39.5 E(): 0.012
Smith-Waterman score: 140; 20.3% identity (56.2% similar) in 217 aa
overlap (46-252:1-204)
10 20 30 40 50 60
Sequen NFIVGNPWDDKLIFKLLSGLSKPVSSYPNTFEWQCKLPAIKPKTEFQLGSKLVYVHHLLG
:..: . . . .. :
2ojg:A FDVGPRYTNLSYI-G
10
70 80 90 100 110
Sequen EGAFAQVYEATQNKQKFVLKVQKPANPWEFYIGTQ-LMERLKPSMQHMFMKFYSAHLFQN
:::...: : .: .: . ..: .:.: : ....: ....: ...
2ojg:A EGAYGMVCSAYDNVNKVRVAIKK-ISPFEHQTYCQRTLREIK-----ILLRFRHENIIGI
20 30 40 50 60
120 130 140 150 160 170
Sequen GSVL-------VGELYSYGTLLNAINLYKNTPEKVMPQGLVISFAMRMLYMIEQVHDCEI
.... . ..: :... .::: . . . . : ...: .. .:. ..
2ojg:A NDIIRAPTIEQMKDVYIVQDLMET-DLYKLLKTQHLSNDHICYFLYQILRGLKYIHSANV
70 80 90 100 110 120
180 190 200 210 220 230
Sequen IHGDIKPDNFILGNGFLEQSAGLALIDLGQS-IDMKLFPKGTIFTAKCETSGFQCVE-ML
.: :.::.:..:.. . : . :.: . . . ..: : .. : ::
2ojg:A LHRDLKPSNLLLNT-----TCDLKICDFGLARVADPDHDHTGFLTEYVATRWYRAPEIML
130 140 150 160 170 180
240 250 260 270 280 290
Sequen SNKPWNYQIDYFGVAATVYCMLFGTYMKVKNEGGECKPEGLFRRLPHLDMWNEFFHVMLN
..: .. .:: ..:. . ::
2ojg:A NSKGYTKSIDIWSVGCILAEMLSNRPIFPGKHYLDQLNHILGILGSPSQEDLNCIINLKA
190 200 210 220 230 240
300 310 320 330
Sequen IPDCHHLPSLDLLRQKLKKVFQQHYTNKIRALRNRLIVLLLEC
2ojg:A RNYLLSLPHKNKVPWNRLFPNADSKALDLLDKMLTFNPHKRIEVEQALAHPYLEQYYDPS
250 260 270 280 290 300
2ojg:A DEPIAEAPFKFELDDLPKEKLKELIFEETARFQPG
310 320 330
>>2oji:A (344 aa)
initn: 85 init1: 53 opt: 140 Z-score: 168.0 bits: 39.5 E(): 0.012
Smith-Waterman score: 140; 20.3% identity (56.2% similar) in 217 aa
overlap (46-252:5-208)
10 20 30 40 50 60
Sequen NFIVGNPWDDKLIFKLLSGLSKPVSSYPNTFEWQCKLPAIKPKTEFQLGSKLVYVHHLLG
:..: . . . .. :
2oji:A RGQVFDVGPRYTNLSYI-G
10
70 80 90 100 110
Sequen EGAFAQVYEATQNKQKFVLKVQKPANPWEFYIGTQ-LMERLKPSMQHMFMKFYSAHLFQN
:::...: : .: .: . ..: .:.: : ....: ....: ...
2oji:A EGAYGMVCSAYDNVNKVRVAIKK-ISPFEHQTYCQRTLREIK-----ILLRFRHENIIGI
20 30 40 50 60 70
120 130 140 150 160 170
Sequen GSVL-------VGELYSYGTLLNAINLYKNTPEKVMPQGLVISFAMRMLYMIEQVHDCEI
.... . ..: :... .::: . . . . : ...: .. .:. ..
2oji:A NDIIRAPTIEQMKDVYIVQDLMET-DLYKLLKTQHLSNDHICYFLYQILRGLKYIHSANV
80 90 100 110 120 130
180 190 200 210 220 230
Sequen IHGDIKPDNFILGNGFLEQSAGLALIDLGQS-IDMKLFPKGTIFTAKCETSGFQCVE-ML
.: :.::.:..:.. . : . :.: . . . ..: : .. : ::
2oji:A LHRDLKPSNLLLNT-----TCDLKICDFGLARVADPDHDHTGFLTEYVATRWYRAPEIML
140 150 160 170 180
240 250 260 270 280 290
Sequen SNKPWNYQIDYFGVAATVYCMLFGTYMKVKNEGGECKPEGLFRRLPHLDMWNEFFHVMLN
..: .. .:: ..:. . ::
2oji:A NSKGYTKSIDIWSVGCILAEMLSNRPIFPGKHYLDQLNHILGILGSPSQEDLNCIINLKA
190 200 210 220 230 240
300 310 320 330
Sequen IPDCHHLPSLDLLRQKLKKVFQQHYTNKIRALRNRLIVLLLEC
2oji:A RNYLLSLPHKNKVPWNRLFPNADSKALDLLDKMLTFNPHKRIEVEQALAHPYLEQYYDPS
250 260 270 280 290 300
2oji:A DEPIAEAPFKFDMELDDLPKEKLKELIFEETARFQPGY
310 320 330 340
.......
This is only part of the file. Suppose, for example, that I want only
two of these alignments: are there modules that can parse this? I have
tried to parse it with regular expressions, but without success.
If anyone has suggestions for modules or anything else, I'll be very
happy to learn.
Thanks,
Paola