Home
Reading
Searching
Subscribe
Sponsors
Statistics
Posting
Contact
Spam
Lists
Links
About
Hosting
Filtering
Features Download
Marketing
Archives
FAQ
Blog
 
Gmane
From: Tychele Turner <tturne18 <at> jhmi.edu>
Subject: Re: [Biopython] [Samtools-help] biopython question
Newsgroups: gmane.comp.python.bio.general
Date: Monday 9th April 2012 20:44:55 UTC (over 4 years ago)
Thanks Monica! I will look into your program.

Tychele

________________________________
From: Monica Britton [[email protected]]
Sent: Saturday, April 07, 2012 5:55 PM
To: Tychele Turner
Cc: Mic; Wibowo Arindrarto; samtools-help
Subject: Re: [Samtools-help] [Biopython] biopython question

Hi Tychele:

If your primer is always at the beginning of each sequence, you could treat
it as a barcode. We have a program to cleave barcodes from fastq sequences
that would fit your purpose (see https://github.com/ucdavis-bioinformatics/sabre).

Monica Britton

On Sat, Apr 7, 2012 at 8:05 AM, Tychele Turner
> wrote:
Hi Mic,

I just saw your message regarding Mark Duplicates and the script Bow and I
were discussing which recognizes and cleaves primers.

First off, I'm familiar with Mark Duplicates from Picard and I do use it
for exome data.

However, in this instance I was looking at sequences coming from short
amplicon sequencing. In this instance, marking duplicates is not
appropriate because most of the reads will be duplicates due to the nature
of the bench experiment (in contrast to shotgun sequencing where your
looking at random fragments in which PCR artifacts arise in the PCR steps
post-shearing). In my short amplicon sequence data, the read will start
with the primer sequence and then extend to be a total length of 100
nucleotides. For this reason, I wanted to use a script which could
recognize the primer and ultimately cleave that primer from the read so it
would not go into the rest of the pipeline which would ultimately go to a
variant calling program.

As for your last point of sending other software which cut adapters that's
fine but I'm not cutting adapters I'm looking for primer sequences and
cleaving those. Also, I thought that if Biopython already has such a nice
setup to do this I would use that especially since python is quite
efficient at this task. Hope this helps.

Tychele



From: Mic [[email protected]]
Sent: Friday, April 06, 2012 5:59 AM
To: Wibowo Arindrarto
Cc: samtools-help; [email protected]
Subject: Re: [Samtools-help] [Biopython] biopython question

Hi Bow,
You can remove duplicates in the input file or create a new output file.
With the following commands you create an output file with no duplicates:

$ samtools fixmate t.paired.sorted.bam t.paired.sorted.SamFix.bam

$ java -Xmx8g -jar MarkDuplicates.jar REMOVE_DUPLICATES=true
ASSUME_SORTED=true MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1024  
INPUT=t.paired.sorted.bam OUTPUT=t.paired.sorted.rmdulp.bam
METRICS_FILE=t.paired.sorted.bam.metrics

Are adapters and fragments the same?

I found the following software for adapter:
* TagDust - eliminate artifactual sequence from NGS data
http://www.biomedcentral.com/1471-2164/12/382
http://bioinformatics.oxfordjournals.org/content/25/21/2839.full
* FAR: http://sourceforge.net/apps/mediawiki/theflexibleadap/index.php?
title=Main_Page
* Trimmomatic: http://www.usadellab.org/cms/index.php?page=trimmomatic
* http://code.google.com/p/cutadapt/
* https://github.com/vsbuffalo/scythe
* http://code.google.com/p/biopieces/wiki/find_adaptor

Thank you in advance.

Cheers,


On Fri, Apr 6, 2012 at 12:20 PM, Wibowo Arindrarto
<[email protected]> wrote:
Hi Mic,

I'm not familiar with picard, but it seems that this program detects
whole duplicate molecules instead of detecting whether a primer is
present in sequences (which may or may not be duplicates). Plus, it
doesn't do any removal ~ it only flags them. So I don't think the two
are comparable.

cheers,
Bow

On Fri, Apr 6, 2012 at 04:06, Mic
> wrote:
> What is the difference to remove primer from the fastq file rather to use
> markDuplicates http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates
on
> an alignment?
>
> Would both ways deliver the same results?
>
> Thank you in advance.
>
>
> On Thu, Apr 5, 2012 at 8:26 AM, Wibowo Arindrarto
<[email protected]>
> wrote:
>>
>> Hi Tychele,
>>
>> Glad to hear that and thanks for attaching the code as well :).
>>
>> Just one more heads up on the code, the trimming function assumes that
>> for any record sequence, there is only one matching primer sequence at
>> most. If by any random chance a sequence begins with two or more
>> primer sequences, then it will only trim the first primer sequence. So
>> if you still see some primer sequences left in the trimmed sequences,
>> this could be the case and you'll need to modify the code.
>>
>> However, that seems unlikely ~ the current code should suffice.
>>
>> cheers,
>> Bow
>>
>>
>> On Thu, Apr 5, 2012 at 00:12, Tychele Turner
> wrote:
>> > Hi Bow,
>> >
>> > Thank you! This works great. I have attached the final code to the
email
>> > in case it may benefit others.
>> >
>> > Tychele
>> >
>> >
>> > ________________________________________
>> > From: Wibowo Arindrarto
[[email protected]]
>> > Sent: Wednesday, April 04, 2012 2:05 PM
>> > To: Tychele Turner
>> > Cc: [email protected]
>> > Subject: Re: [Biopython] biopython question
>> >
>> > Hi Tychele,
>> >
>> > If I understood correctly, you have a list of primers stored in a file
>> > and you want to trim those primer sequences off your fastq sequences,
>> > correct? One way I could think of is to first store the primers in a
>> > list (since they will be used repeatedly to check every single fastq
>> > sequence).
>> >
>> > Here's the code:
>> >
>> > from Bio import SeqIO
>> >
>> > def trim_primers(records, 'primer_file_name'):
>> >
>> >    # read the primers
>> >    primer_list = []
>> >    with open('primer_file_name', 'r') as source:
>> >      for line in source:
>> >        primer_list.append(line.strip())
>> >
>> >    for record in records:
>> >        # list to check if the sequence begins with any of the primers
>> >        check = [record.seq.startswith(x) for x in primer_list]
>> >        # if any of the primer is present in the beginning of the
>> > sequence, then we trim it off
>> >        if any(check):
>> >            # get index of primer that matches the beginning
>> >            idx = check.index(True)
>> >            len_primer = len(primer_list[idx])
>> >            yield record[len_primer:]
>> >        # otherwise just return the whole record
>> >        else:
>> >            yield record
>> >
>> > and then, you can use the function like so:
>> >
>> > original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
>> > trimmed_reads = trim_primers(original_reads, 'primer_file_name')
>> > count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
>> > print "Saved %i reads" % count
>> >
>> > I haven't tested the function, but I suppose that should do the trick.
>> >
>> > Hope that helps :),
>> > Bow
>> >
>> >
>> > On Wed, Apr 4, 2012 at 18:55, Tychele Turner
> wrote:
>> >> Hi,
>> >>
>> >> I have a question regarding one of the biopython capabilities. I
would
>> >> like to trim primers off the end of reads in a fastq file and I found
>> >> wonderful documentation of how to do this on your website as follows:
>> >>
>> >> from Bio import SeqIO
>> >> def trim_primers(records, primer):
>> >>    """Removes perfect primer sequences at start of reads.
>> >>
>> >>    This is a generator function, the records argument should
>> >>    be a list or iterator returning SeqRecord objects.
>> >>    """
>> >>    len_primer = len(primer) #cache this for later
>> >>    for record in records:
>> >>        if record.seq.startswith(primer):
>> >>            yield record[len_primer:]
>> >>        else:
>> >>            yield record
>> >>
>> >> original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
>> >> trimmed_reads = trim_primers(original_reads, "GATGACGGTGT")
>> >> count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
>> >> print "Saved %i reads" % count
>> >>
>> >>
>> >>
>> >>
>> >> My question is: Is there a way to loop through a primer file for
>> >> instance instead of looking for only
>> >>
>> >> 'GATGACGGTGT' every primer would be checked and subsequently removed
>> >> from the start of its respective read.
>> >>
>> >> Primer file structured as:
>> >> GATGACGGTGT
>> >> GATGACGGTGA
>> >> GATGACGGCCT
>> >>
>> >> If you have any suggestions it would be greatly appreciated. Thanks.
>> >>
>> >> Tychele
>> >>
>> >>
>> >> _______________________________________________
>> >> Biopython mailing list  - 
[email protected]
>> >> http://lists.open-bio.org/mailman/listinfo/biopython
>> >
>>
>> _______________________________________________
>> Biopython mailing list  - 
[email protected]
>> http://lists.open-bio.org/mailman/listinfo/biopython
>
>


------------------------------------------------------------------------------
For Developers, A Lot Can Happen In A Second.
Boundary is the first to Know...and Tell You.
Monitor Your Applications in Ultra-Fine Resolution. Try it FREE!
http://p.sf.net/sfu/Boundary-d2dvs2
_______________________________________________
Samtools-help mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/samtools-help




--
Monica Britton
Bioinformatics Analyst
Genome Center and Bioinformatics Core Facility
University of California, Davis
[email protected]

_______________________________________________
Biopython mailing list  -  [email protected]
http://lists.open-bio.org/mailman/listinfo/biopython
 
CD: 14ms