Noah Dowell | 21 Feb 01:01 2013

Sorry for cross post: Trouble creating ShortReadQ object and using writeFastq

Hello All,

I am trying to do some preprocessing quality analysis on PacBio reads.  To make any plots of quality score or
nucleotide frequency one should only look at reads of the same length.  One output from PacBio filtering of
the raw data is a FASTQ file.  The nature of the PacBio reads is that they lack uniformity in their length
dimension.   I am pretty sure the file is okay because the readSeqFile() function in the QRQC package is able
to read the data in and the FASTQSummary object can be used to make plots.  Given the non-uniform length of
reads though those plots show artifacts.  As far as can tell the FASTQSummary object can not be manipulated
like a ShortRead or BioStrings object.    

So I turn to the ShortRead package.  I have also successfully used the readFastq() function in  ShortRead
package to read in Illumina reads of different length and then preprocess or slice-n-dice those reads.  I
am getting the following error though when I try this on PacBio fastq files:
#####################################################################################
> dirPath = "/Users/ndowell/Documents/PacBio/130204_NLD1_XL_12cells_255/017249_12cells_filtered/data/"
> pat1 = "filteredSTD_12cells.fastq"
> seqs1 = readFastq(dirPath = dirPath, pattern = pat1, file = "filteredSTD_12cells.fastq")
Error: Input/Output
 file(s):
   /Users/ndowell/Documents/PacBio/130204_NLD1_XL_12cells_255/017249_12cells_filtered/data//filteredSTD_12cells.fastq
 message: line too long /Users/ndowell/Documents/PacBio/130204_NLD1_XL_12cells_255/017249_12cells_filtered/data//filteredSTD_12cells.fastq:17521

#####################################################################################

So I tried to manually build a ShortReadQ file using the following (there are generally 4 distinct lines in a
FASTQ file):
#workaround to error with too long lines
seqTemp <- readLines("filteredSTD_12cells.fastq")
seqs <- DNAStringSet(seqTemp[c(FALSE, TRUE, FALSE, FALSE)])
ids <- BStringSet(seqTemp[c(TRUE, FALSE, FALSE, FALSE)])
qual <- BStringSet(seqTemp[c(FALSE, FALSE, FALSE, TRUE)])

SeqClean <- ShortReadQ(sread=seqs, quality = qual, id = ids)

#This worked!!:
length(SeqClean) #484232
summary(width(SeqClean))
Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
50    1368    2153    2785    3567   26940 

#But I get this error when I try to write the file (changing withIds to TRUE or full to TRUE gives same error):
> writeFastq(SeqClean, filepath = "cleanReads.fastq", format = "fastq", mode = "w", full = FALSE,
withIds = FALSE)
Error in function (classes, fdef, mtable)  : 
 unable to find an inherited method for function ‘writeFastq’ for signature ‘"ShortReadQ", "missing"’

#I think something is not quite right with my construction of the ShortReadQ object and specifically the
read IDs
#I wonder if it is because it is slotting the ids into the wrong place:
> head(ids)
 A BStringSet instance of length 6
   width seq
[1]    72  <at> m130205_030957_42152_c100453962550000001523068903201342_s1_p0/21/0_5104
[2]    75  <at> m130205_030957_42152_c100453962550000001523068903201342_s1_p0/53/2430_3193
[3]    75  <at> m130205_030957_42152_c100453962550000001523068903201342_s1_p0/53/3237_6580
[4]    75  <at> m130205_030957_42152_c100453962550000001523068903201342_s1_p0/53/6626_9948
[5]    72  <at> m130205_030957_42152_c100453962550000001523068903201342_s1_p0/54/0_4929
[6]    74  <at> m130205_030957_42152_c100453962550000001523068903201342_s1_p0/81/276_2420
#####################################################################################

My questions:
1. Is there any way to leverage the ability to slice-n-dice a FASTQ data set in the ShortRead package and then
write to a new FASTQ file?

2 Should I just try to add in names to my sequences manually (to the ShortReadQ object) to alleviate the
'mssing' error above?

3. Is there any desire from the developers to merge the id() and names() functions?

Any help/hints would be appreciated.  I apologize for the lack of a toy-reproducible example; I was
unsuccessful in my attempt to create a toy quality score.  Thank you in advance for your assistance and time.

Best,
Noah 

#####################################################################################

> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] qrqc_1.12.0          testthat_0.7         xtable_1.7-0         brew_1.0-6           biovizBase_1.6.2     ggplot2_0.9.3       
[7] reshape_0.8.4        plyr_1.8             org.Hs.eg.db_2.8.0   RSQLite_0.11.2       DBI_0.2-5            AnnotationDbi_1.20.3
[13] Biobase_2.18.0       BiocInstaller_1.8.3  ShortRead_1.16.3     latticeExtra_0.6-24  RColorBrewer_1.0-5  
Rsamtools_1.10.2    
[19] lattice_0.20-13      Biostrings_2.26.3    GenomicRanges_1.10.6 IRanges_1.16.5       BiocGenerics_0.4.0  

loaded via a namespace (and not attached):
[1] biomaRt_2.14.0         bitops_1.0-4.2         BSgenome_1.26.1        cluster_1.14.3         colorspace_1.2-1      
[6] dichromat_2.0-0        digest_0.6.2           evaluate_0.4.3         GenomicFeatures_1.10.1 gtable_0.1.2          
[11] Hmisc_3.10-1           hwriter_1.3            labeling_0.1           MASS_7.3-23            munsell_0.4           
[16] parallel_2.15.2        proto_0.3-10           RCurl_1.95-3           reshape2_1.2.2         rtracklayer_1.18.2    
[21] scales_0.2.3           stats4_2.15.2          stringr_0.6.2          tools_2.15.2           XML_3.95-0.1          
[26] zlibbioc_1.4.0        

_______________________________________________
Bioconductor mailing list
Bioconductor@...
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


Gmane