Tychele Turner | 9 Apr 22:44 2012

Re: [Biopython] [Samtools-help] biopython question

Thanks Monica! I will look into your program.

Tychele

________________________________
From: Monica Britton [mtbritton <at> ucdavis.edu]
Sent: Saturday, April 07, 2012 5:55 PM
To: Tychele Turner
Cc: Mic; Wibowo Arindrarto; samtools-help
Subject: Re: [Samtools-help] [Biopython] biopython question

Hi Tychele:

If your primer is always at the beginning of each sequence, you could treat it as a barcode. We have a program
to cleave barcodes from fastq sequences that would fit your purpose (see https://github.com/ucdavis-bioinformatics/sabre).

Monica Britton

On Sat, Apr 7, 2012 at 8:05 AM, Tychele Turner <tturne18 <at> jhmi.edu<mailto:tturne18 <at> jhmi.edu>> wrote:
Hi Mic,

I just saw your message regarding Mark Duplicates and the script Bow and I were discussing which recognizes
and cleaves primers.

First off, I'm familiar with Mark Duplicates from Picard and I do use it for exome data.

However, in this instance I was looking at sequences coming from short amplicon sequencing. In this
instance, marking duplicates is not appropriate because most of the reads will be duplicates due to the
nature of the bench experiment (in contrast to shotgun sequencing where your looking at random fragments
in which PCR artifacts arise in the PCR steps post-shearing). In my short amplicon sequence data, the read
will start with the primer sequence and then extend to be a total length of 100 nucleotides. For this
reason, I wanted to use a script which could recognize the primer and ultimately cleave that primer from
the read so it would not go into the rest of the pipeline which would ultimately go to a variant calling program.

As for your last point of sending other software which cut adapters that's fine but I'm not cutting adapters
I'm looking for primer sequences and cleaving those. Also, I thought that if Biopython already has such a
nice setup to do this I would use that especially since python is quite efficient at this task. Hope this helps.

Tychele

From: Mic [mictadlo <at> gmail.com<mailto:mictadlo <at> gmail.com>]
Sent: Friday, April 06, 2012 5:59 AM
To: Wibowo Arindrarto
Cc: samtools-help; biopython <at> biopython.org<mailto:biopython <at> biopython.org>
Subject: Re: [Samtools-help] [Biopython] biopython question

Hi Bow,
You can remove duplicates in the input file or create a new output file. With the following commands you
create an output file with no duplicates:

$ samtools fixmate t.paired.sorted.bam t.paired.sorted.SamFix.bam

$ java -Xmx8g -jar MarkDuplicates.jar REMOVE_DUPLICATES=true ASSUME_SORTED=true
MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1024   INPUT=t.paired.sorted.bam
OUTPUT=t.paired.sorted.rmdulp.bam METRICS_FILE=t.paired.sorted.bam.metrics

Are adapters and fragments the same?

I found the following software for adapter:
* TagDust - eliminate artifactual sequence from NGS data
http://www.biomedcentral.com/1471-2164/12/382
http://bioinformatics.oxfordjournals.org/content/25/21/2839.full
* FAR: http://sourceforge.net/apps/mediawiki/theflexibleadap/index.php?
title=Main_Page
* Trimmomatic: http://www.usadellab.org/cms/index.php?page=trimmomatic
* http://code.google.com/p/cutadapt/
* https://github.com/vsbuffalo/scythe
* http://code.google.com/p/biopieces/wiki/find_adaptor

Thank you in advance.

Cheers,

On Fri, Apr 6, 2012 at 12:20 PM, Wibowo Arindrarto
<w.arindrarto <at> gmail.com<mailto:w.arindrarto <at> gmail.com>> wrote:
Hi Mic,

I'm not familiar with picard, but it seems that this program detects
whole duplicate molecules instead of detecting whether a primer is
present in sequences (which may or may not be duplicates). Plus, it
doesn't do any removal ~ it only flags them. So I don't think the two
are comparable.

cheers,
Bow

On Fri, Apr 6, 2012 at 04:06, Mic <mictadlo <at> gmail.com<mailto:mictadlo <at> gmail.com>> wrote:
> What is the difference to remove primer from the fastq file rather to use
> markDuplicates http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates on
> an alignment?
>
> Would both ways deliver the same results?
>
> Thank you in advance.
>
>
> On Thu, Apr 5, 2012 at 8:26 AM, Wibowo Arindrarto <w.arindrarto <at> gmail.com<mailto:w.arindrarto <at> gmail.com>>
> wrote:
>>
>> Hi Tychele,
>>
>> Glad to hear that and thanks for attaching the code as well :).
>>
>> Just one more heads up on the code, the trimming function assumes that
>> for any record sequence, there is only one matching primer sequence at
>> most. If by any random chance a sequence begins with two or more
>> primer sequences, then it will only trim the first primer sequence. So
>> if you still see some primer sequences left in the trimmed sequences,
>> this could be the case and you'll need to modify the code.
>>
>> However, that seems unlikely ~ the current code should suffice.
>>
>> cheers,
>> Bow
>>
>>
>> On Thu, Apr 5, 2012 at 00:12, Tychele Turner <tturne18 <at> jhmi.edu<mailto:tturne18 <at> jhmi.edu>> wrote:
>> > Hi Bow,
>> >
>> > Thank you! This works great. I have attached the final code to the email
>> > in case it may benefit others.
>> >
>> > Tychele
>> >
>> >
>> > ________________________________________
>> > From: Wibowo Arindrarto [w.arindrarto <at> gmail.com<mailto:w.arindrarto <at> gmail.com>]
>> > Sent: Wednesday, April 04, 2012 2:05 PM
>> > To: Tychele Turner
>> > Cc: biopython <at> biopython.org<mailto:biopython <at> biopython.org>
>> > Subject: Re: [Biopython] biopython question
>> >
>> > Hi Tychele,
>> >
>> > If I understood correctly, you have a list of primers stored in a file
>> > and you want to trim those primer sequences off your fastq sequences,
>> > correct? One way I could think of is to first store the primers in a
>> > list (since they will be used repeatedly to check every single fastq
>> > sequence).
>> >
>> > Here's the code:
>> >
>> > from Bio import SeqIO
>> >
>> > def trim_primers(records, 'primer_file_name'):
>> >
>> >    # read the primers
>> >    primer_list = []
>> >    with open('primer_file_name', 'r') as source:
>> >      for line in source:
>> >        primer_list.append(line.strip())
>> >
>> >    for record in records:
>> >        # list to check if the sequence begins with any of the primers
>> >        check = [record.seq.startswith(x) for x in primer_list]
>> >        # if any of the primer is present in the beginning of the
>> > sequence, then we trim it off
>> >        if any(check):
>> >            # get index of primer that matches the beginning
>> >            idx = check.index(True)
>> >            len_primer = len(primer_list[idx])
>> >            yield record[len_primer:]
>> >        # otherwise just return the whole record
>> >        else:
>> >            yield record
>> >
>> > and then, you can use the function like so:
>> >
>> > original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
>> > trimmed_reads = trim_primers(original_reads, 'primer_file_name')
>> > count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
>> > print "Saved %i reads" % count
>> >
>> > I haven't tested the function, but I suppose that should do the trick.
>> >
>> > Hope that helps :),
>> > Bow
>> >
>> >
>> > On Wed, Apr 4, 2012 at 18:55, Tychele Turner <tturne18 <at> jhmi.edu<mailto:tturne18 <at> jhmi.edu>> wrote:
>> >> Hi,
>> >>
>> >> I have a question regarding one of the biopython capabilities. I would
>> >> like to trim primers off the end of reads in a fastq file and I found
>> >> wonderful documentation of how to do this on your website as follows:
>> >>
>> >> from Bio import SeqIO
>> >> def trim_primers(records, primer):
>> >>    """Removes perfect primer sequences at start of reads.
>> >>
>> >>    This is a generator function, the records argument should
>> >>    be a list or iterator returning SeqRecord objects.
>> >>    """
>> >>    len_primer = len(primer) #cache this for later
>> >>    for record in records:
>> >>        if record.seq.startswith(primer):
>> >>            yield record[len_primer:]
>> >>        else:
>> >>            yield record
>> >>
>> >> original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
>> >> trimmed_reads = trim_primers(original_reads, "GATGACGGTGT")
>> >> count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
>> >> print "Saved %i reads" % count
>> >>
>> >>
>> >>
>> >>
>> >> My question is: Is there a way to loop through a primer file for
>> >> instance instead of looking for only
>> >>
>> >> 'GATGACGGTGT' every primer would be checked and subsequently removed
>> >> from the start of its respective read.
>> >>
>> >> Primer file structured as:
>> >> GATGACGGTGT
>> >> GATGACGGTGA
>> >> GATGACGGCCT
>> >>
>> >> If you have any suggestions it would be greatly appreciated. Thanks.
>> >>
>> >> Tychele
>> >>
>> >>
>> >> _______________________________________________
>> >> Biopython mailing list  -  Biopython <at> lists.open-bio.org<mailto:Biopython <at> lists.open-bio.org>
>> >> http://lists.open-bio.org/mailman/listinfo/biopython
>> >
>>
>> _______________________________________________
>> Biopython mailing list  -  Biopython <at> lists.open-bio.org<mailto:Biopython <at> lists.open-bio.org>
>> http://lists.open-bio.org/mailman/listinfo/biopython
>
>

------------------------------------------------------------------------------
For Developers, A Lot Can Happen In A Second.
Boundary is the first to Know...and Tell You.
Monitor Your Applications in Ultra-Fine Resolution. Try it FREE!
http://p.sf.net/sfu/Boundary-d2dvs2
_______________________________________________
Samtools-help mailing list
Samtools-help <at> lists.sourceforge.net<mailto:Samtools-help <at> lists.sourceforge.net>
https://lists.sourceforge.net/lists/listinfo/samtools-help

--
Monica Britton
Bioinformatics Analyst
Genome Center and Bioinformatics Core Facility
University of California, Davis
mtbritton <at> ucdavis.edu<mailto:mtbritton <at> ucdavis.edu>

_______________________________________________
Biopython mailing list  -  Biopython <at> lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython


Gmane