14 Nov 2012 16:51
Re: How to annotate genomic coordinates
Dear Valerie,
*I've tried your last approach in the following way:*
library(Mus.musculus)
library(TxDb.Mmusculus.UCSC.mm9.knownGene)
library(VariantAnnotation)
txdb <- TxDb.Mmusculus.UCSC.mm9.knownGene
y <- read.delim("/path_to_file/ids_noM.txt", sep=".", header=FALSE, as.is
=TRUE)
gr<- GRanges(seqnames=y[,1], ranges=IRanges(start=y[,2], width = 20))
#gr <- GRanges(c("chr1", "chr17", "chr16"), IRanges(c(41245606, 31245606,
11248806), width = 20))
loc <- locateVariants(query=gr, subject=txdb, region=AllVariants())
head(loc)
idx <- loc[ ,c("QUERYID", "GENEID")]
idx <- idx[!duplicated(idx)] #no duplicar ids
head(idx)
geneid <- idx[!is.na(idx$GENEID)] #Remove NA GENEID's and call select().
genetable <- select(Mus.musculus, keytype="ENTREZID",
keys=geneid$GENEID, cols=c("GENENAME", "ENSEMBL"))
head(genetable)
id <- paste(as.character(seqnames(geneid)), ".", start(geneid), sep="")
data.frame(id, genetable)
* and I get this error*:
*Error in data.frame(id, genetable) :
arguments imply differing number of rows: 180912, 12962*
It seems we are losing rows in the process,but I have no clue where or why
this happens...
Please forgive me for being still so unable to catch R file processing
dynamics properly.
Thank you again
2012/11/14 Valerie Obenchain <vobencha@...>
> Hello,
>
> Are you familiar with the R help pages? For any function or class object
> you can type ? followed by the name.
>
> ?locateVariants
> ?select
>
> The page for locateVariants explains each of the the metadata columns in
> the result. One of the columns is 'QUERYID'. This id represents the row in
> your original query (i.e., the GRanges you entered as the 'query'
> argument). The page also explains that the result has one row for each
> range-transcript match.
>
> gr <- GRanges(c("chr1", "chr17", "chr16"), IRanges(c(41245606, 31245606,
> 11248806), width = 20))
>
> loc <- locateVariants(query=gr, subject=txdb, region=AllVariants())
>
> The first range hit an intergenic region and we therefore have PRECEDEID
> and FOLLOWID. The second range hit a coding region in a single transcript
> and the third range hit an intron in two different transcripts. Because the
> third range hit 2 transcripts we have 2 rows of data, each has a unique
> TXID but the same GENEID.
>
> > loc
> GRanges with 4 ranges and 7 metadata columns:
>
> seqnames ranges strand | LOCATION QUERYID TXID
> <Rle> <IRanges> <Rle> | <factor> <integer> <integer>
> [1] chr1 [41245606, 41245625] * | intergenic 1 <NA>
> [2] chr17 [31245606, 31245625] * | coding 2 47716
> [3] chr16 [11248806, 11248825] * | intron 3 46316
> [4] chr16 [11248806, 11248825] * | intron 3 46317
>
> CDSID GENEID PRECEDEID FOLLOWID
> <integer> <character> <character> <character>
> [1] <NA> <NA> 211798 381339
> [2] 184246 11307 <NA> <NA>
> [3] <NA> 14852 <NA> <NA>
> [4] <NA> 14852 <NA> <NA>
>
>
> Pull out the unique combinations of GENEID and QUERYID.
>
> idx <- loc[ ,c("QUERYID", "GENEID")]
> idx <- idx[!duplicated(idx)]
> > idx
> GRanges with 3 ranges and 2 metadata columns:
> seqnames ranges strand | QUERYID GENEID
> <Rle> <IRanges> <Rle> | <integer> <character>
> [1] chr1 [41245606, 41245625] * | 1 <NA>
> [2] chr17 [31245606, 31245625] * | 2 11307
> [3] chr16 [11248806, 11248825] * | 3 14852
>
> Remove NA GENEID's and call select().
> geneid <- idx[!is.na(idx$GENEID)]
> genetable <- select(Mus.musculus, keytype="ENTREZID", keys=geneid$GENEID,
> cols=c("GENENAME", "ENSEMBL"))
> > genetable
>
> ENTREZID ENSEMBL
> 1 11307 ENSMUSG00000024030
> 2 14852 ENSMUSG00000062203
>
> GENENAME
> 1 ATP-binding cassette, sub-family G (WHITE), member 1
> 2 G1 to S phase transition 1
>
> You can use the QUERYID to map back to an original list of identifiers. If
> you just need 'chromosome.start' information you can extract that from the
> ranges in geneid. Note that the QUERYID maps back to the original 'query'
> but those same ranges are also included in the result.
> id <- paste(as.character(seqnames(**geneid)), ".", start(geneid), sep="")
> > data.frame(id, genetable)
> id ENTREZID ENSEMBL
> 1 chr17.31245606 11307 ENSMUSG00000024030
> 2 chr16.11248806 14852 ENSMUSG00000062203
>
> GENENAME
> 1 ATP-binding cassette, sub-family G (WHITE), member 1
> 2 G1 to S phase transition 1
>
>
> Valerie
>
>
>
> On 11/14/12 01:00, José Luis Lavín wrote:
>
>> Dear Valerie,
>>
>> I have an issue with the results of your approach. Now I get a table with
>> the following fields:
>>
>> ENTREZID ENSEMBL GENENAME
>>
>> 1 216285 ENSMUSG00000036602 ALX homeobox 1
>>
>>
>> But the Identifier I converted to obtain the previously mentioned data is
>> not present in that table (eg. chr10.100837476) , so there's nothing I can
>> do with this results table because I can't correlate my previous
>> chr.location type identifiers to the results of the table.
>> I guess I've done something wrong somewhere in the code I modified so
>> that I missed my "chr.coordinate" ids column. So I'll paste the code here
>> again to see if any of you can tell me where I lost that column.
>>
>> source("http://bioconductor.**org/biocLite.R<http://bioconductor.org/biocLite.R>
>> ")
>> biocLite('Mus.musculus')
>> biocLite('TxDb.Mmusculus.UCSC.**mm9.knownGene')
>>
>> biocLite('VariantAnnotation')
>> library(Mus.musculus)
>> library(TxDb.Mmusculus.UCSC.**mm9.knownGene)
>>
>> txdb <- TxDb.Mmusculus.UCSC.mm9.**knownGene
>> y <- read.delim("/path_to_dir/ids_**noM.txt", sep=".", header=FALSE,
>> as.is <http://as.is>=TRUE)
>>
>> probes<- GRanges(seqnames=y[,1], ranges=IRanges(start=y[,2], width=1))
>>
>> library(VariantAnnotation)
>> loc <- locateVariants(query=probes, subject=txdb, region=AllVariants())
>> head(loc)
>>
>> entrezid <- loc$GENEID
>> entrezid2<-select(Mus.**musculus, keytype="ENTREZID", keys=entrezid,
>> cols=c("GENENAME", "ENSEMBL"))
>>
>> write.table(entrezid2, file = "Annotated_Ids.csv",quote = FALSE, sep =
>> ",", eol = "\n")
>>
>> Thank you again for your advice
>>
>>
>> 2012/11/14 Hervé Pagès <hpages@... <mailto:hpages@...>>
>>
>>
>> Hi all,
>>
>>
>> On 11/12/2012 10:24 AM, Valerie Obenchain wrote:
>>
>> Hi Jose,
>>
>> Here is a slightly different approach.
>>
>> library(Mus.musculus)
>> library(TxDb.Mmusculus.UCSC.**mm9.knownGene)
>> txdb <- TxDb.Mmusculus.UCSC.mm9.**knownGene
>>
>> ## I assume you've figured out how to read your data into a
>> GRanges.
>> ## Here we use a small example.
>> gr <- GRanges(seq = "chr17", IRanges(start = 31245606, width =
>> 20))
>>
>> ## The locateVariants() function in the VariantAnnotation
>> package will
>> ## give you the GENEIDs for all ranges in 'query'. If the
>> range does not
>> ## fall in a gene (i.e., it is intergenic), then the PRECEDEID and
>> ## FOLLOWID are provided. The LOCATION column indicates what
>> ## region the range fell in. See ?locateVariants for details
>> on the
>> ## different regions and the ability to set 'upstream' and
>> 'downstream'
>> ## values for promoter regions.
>> library(VariantAnnotation)
>> loc <- locateVariants(query=gr, subject=txdb,
>> region=AllVariants())
>>
>> > loc
>> GRanges with 1 range and 7 metadata columns:
>> seqnames ranges strand | LOCATION
>> QUERYID TXID
>> <Rle> <IRanges> <Rle> | <factor> <integer> <integer>
>> [1] chr17 [31245606, 31245625] * | coding
>> 1 47716
>> CDSID GENEID PRECEDEID FOLLOWID
>> <integer> <character> <character> <character>
>> [1] 184246 11307 <NA> <NA>
>>
>>
>> Nice! So IIUC locateVariants() is not intrinsically for variants but
>> seems to be generally useful for "locating" any set of genomic
>> positions. Shouldn't we have this (or something similar) in
>> GenomicFeatures (probably with a different name)?
>>
>> Cheers,
>> H.
>>
>>
>>
>>
>> ## Use the select() function to map these GENEID's to the other
>> ## values you are interested in. The GENEID's from
>> locateVariants()
>> ## are Entrez ID's so we use those as our 'keytype'.
>> entrezid <- loc$GENEID
>> select(Mus.musculus, keytype="ENTREZID", keys=entrezid,
>> cols=c("GENENAME", "ENSEMBL"))
>> > select(Mus.musculus, keytype="ENTREZID", keys=entrezid,
>> cols=c("GENENAME", "ENSEMBL"))
>> ENTREZID ENSEMBL
>> 1 11307 ENSMUSG00000024030
>> GENENAME
>> 1 ATP-binding cassette, sub-family G (WHITE), member 1
>>
>>
>> Valerie
>>
>>
>>
>> On 11/12/12 06:59, José Luis Lavín wrote:
>>
>> Hello all,
>>
>> First of all I want to thank everybody that gave me advice
>> on this
>> subject.
>> trying to follow the advice, I added some modifications
>> mixing codes from
>> Tim, Harris and James , but it seems I got lost somewhere
>> in between...
>> I'm sorry for bothering you all again, but I'm afraid I
>> need some more
>> help.
>>
>> I need to read my ids.txt file and then iterate use each
>> line of id
>> (chr.position) to perform the annotation process with it.
>> I thought of a
>> "for" loop to achieve it, but I do not seem to catch the
>> essence of R
>> processes and I guess I miss my tryout.
>> Please have a look at my "disaster" and help me with it If
>> such thing is
>> possible...
>>
>> biocLite('Mus.musculus')
>> require(Mus.musculus)
>> require(TxDb.Mmusculus.UCSC.**mm9.knownGene)
>> txdb<- transcriptsBy(TxDb.Mmusculus.**UCSC.mm9.knownGene,
>> 'gene')
>> egid<- names(txdb)
>> name<- unlist(mget(egid, org.Mm.egSYMBOL, ifnotfound=NA))
>> length(name) == length(egid) ## TRUE, OK to use
>> esbl<- unlist(mget(egid, org.Mm.egENSEMBL, ifnotfound=NA))
>> length(esbl) == length(egid) ## FALSE, do not use
>>
>> #read input table
>> coordTable<- read.delim(/path_to_dir/ids.**txt, sep=".",
>> header=FALSE,
>> as.is <http://as.is>
>>
>> =TRUE)
>>
>> for(i in 1:length(coordTable))
>> {
>> probes<- GRanges(seq= "y[,1]", IRanges(start = y[,2],
>> width = 1),
>> strand='*')
>> }
>>
>> genome(probes)<- 'mm9' ## prevents some stoopid mistakes
>>
>> geneBodyProbes<- findOverlaps(probes, txdb)
>> geneBodyProbes
>>
>> write.table
>> (geneBodyProbes,file="/path_**to_dir/trans_id.tsv",quote=**
>> FALSE,sep="\t")
>>
>> ## Hits of length 1
>> ## queryLength: 1
>> ## subjectLength: 21761
>> ## queryHits subjectHits
>> ##<integer> <integer>
>> ## 1 1 1641
>> ##
>>
>> name[subjectHits(**geneBodyProbes)]
>>
>> ## 11307 # egid
>> ## "Abcg1" # name
>> ##
>>
>> org.Mm.egCHRLOC[['11307']]
>>
>> ## 17
>> ## 31057694
>> ##
>>
>> org.Mm.egENSEMBL[['11307']]
>>
>> ## [1] "ENSMUSG00000024030"
>>
>> promotersByGene<- flank(txdb, 1500) # or however many
>> bases you want
>> promotersByGene<- flank(promotersByGene, 200, FALSE) # a
>> little extra
>> promoterProbes<- findOverlaps(probes, promotersByGene)
>> promoterProbes
>>
>> write.table
>> (promoterProbes,file="/path_**to_dir/promo_trans_id.tsv",**
>> quote=FALSE,sep="\t")
>>
>>
>> ## Hits of length 0
>> ## queryLength: 1
>> ## subjectLength: 21761
>> ##
>> ## read the GRanges and GenomicFeatures vignettes for more
>>
>> write.table
>> (geneBodyProbes,file="/path_**to_dir/trans_id.tsv",quote=**
>> FALSE,sep="\t")
>>
>>
>> *Thanks in advance*
>>
>> JL
>>
>> 2012/11/8 Harris A. Jaffee<hj@... <mailto:hj@...>>
>>
>>
>> On the elementary end of all this...
>>
>> If the sites are on a *file*, one per line, you could do
>>
>> y<- read.delim(filename, sep=".",
>> header=FALSE, as.is <http://as.is>=TRUE)
>>
>>
>> etc.
>>
>> On Nov 8, 2012, at 11:38 AM, James W. MacDonald wrote:
>>
>> Hi Jose,
>>
>> On 11/8/2012 10:28 AM, José Luis Lavín wrote:
>>
>> Dear James,
>>
>> To answer your questions swiftly, the features
>> are methylation sites
>>
>> (that's why I only have a coordinate instead of having
>> a Start/End
>> pair) in
>> mouse (mm9) genome and I have a list of those in the
>> following format:
>>
>> chr17.31246506
>>
>> How could I read the list so that I can input
>> the "chr" and the
>>
>> "coordinate" parameters into the expression you
>> recommended?
>>
>> First you need to split your data to chr and pos.
>>
>> chr.pos<- do.call("rbind", strsplit(<your vector
>> of chr17.pos data>,
>>
>> "\\."))
>>
>> Note that your vector of chr.pos data may have
>> been in as a factor, so
>>
>> you may need to wrap with an as.character(). If you
>> don't know what I
>> mean
>> by that, you should first read 'An Introduction to R' (
>> http://cran.r-project.org/doc/**manuals/R-intro.html<http://cran.r-project.org/doc/manuals/R-intro.html>
>> ).
>> That will be time
>> well spent.
>>
>> x<- GRanges(seq = "chr17", IRanges(start =
>> 31245606, width = 1))
>>
>> Both the seqnames and ranges argument to GRanges()
>> can be based on
>>
>> vectors. So if you had a matrix (called 'y') like
>>
>> chr16 3423543
>> chr3 403992
>> chr18 3494202
>>
>> then you can do
>>
>> x<- GRanges(y[,1], IRanges(start = y[,2], width = 1))
>>
>> see ?GRanges for more info.
>>
>> As a side note, IIRC, methylation sites are not in
>> general found in
>>
>> exons, but are more likely to be found upstream of a
>> given gene. You
>> might
>> then want to use fiveUTRsByTranscript() rather than
>> exonsBy(). See
>> ?exonsBy
>> after loading the GenomicFeatures package.
>>
>> I 'm lookin forward to obtain a table where my
>> "coordinate-based Id"
>>
>> appears in a column, and the gene_name and if
>> possible, the
>> Entrez/Ensembl
>> Ids in two other columns :
>>
>> Coordinate Gene_name Entrez_ID Ensembl_ID
>>
>> Is it easy to do this in R?
>>
>> Of course! Everything is easy to do in R. You
>> should see my sweet R
>>
>> toast 'n coffee maker ;-D
>>
>> But you should note that the fiveUTRByTranscript()
>> function is
>>
>> transcript based (obvs), and so you will have multiple
>> transcripts per
>> gene. This makes things much more difficult, as a
>> given methylation site
>> may overlap multiple transcripts of the same gene. So
>> that makes it
>> hard to
>> merge your original data with the overlapping transcripts.
>>
>> You could do something like
>>
>> ex<-
>> fiveUTRsByTranscript(TxDb.**
>> Mmusculus.UCSC.mm10.knownGene,
>> use.names
>>
>> = TRUE)
>>
>> ex2<- do.call("rbind", sapply(ex[ex %in% x],
>> elementMetadata))$exon_id
>>
>> then you could use unique Gene IDs thusly
>>
>> my.trans<- select(Mus.musculus,
>> unique(as.character(ex2)),
>>
>> c("CHR","CHRLOC","ENTREZID","**ENSEMBL"), "ENTREZID")
>>
>> That should at least give you a start. See where
>> you can go on your
>> own,
>>
>> and let us know if you get stuck.
>>
>> Best,
>>
>> Jim
>>
>>
>>
>> I'm still really new to R and I lack many
>> concepts you may consider
>>
>> basic... I'm awfully sorry
>>
>> Best
>>
>> JL
>>
>> 2012/11/8 James W. MacDonald<jmacdon@...
>> <mailto:jmacdon@...><**mailto:jmacdon@...
>>
>> <mailto:jmacdon@...>>>
>>
>> Hi Jose,
>>
>>
>> On 11/8/2012 8:19 AM, José Luis Lavín wrote:
>>
>> Dear Bioconductor list,
>>
>> I write you this email asking for a
>> Bioconductor module that
>> allows me to
>> annotate genomic coordinates and get
>> different GeneIds.
>> I'll show you an example of what I'm
>> referring to:
>>
>> I have this data:
>> Chromosome coordinate
>> chr17 31246506
>>
>>
>> It depends on what that coordinate is. Is
>> it the start of a
>> transcript? A SNP? Do you really just have
>> a single coordinate, or
>> do you have a range? What species are we
>> talking about here?
>>
>> Depending on what your data are, you might
>> want to use either one
>> of the TxDb packages, or a SNPlocs
>> package. There really isn't
>> much to go on here. If I assume this is a
>> coordinate that one
>> might think is within an exon, and if I
>> further assume you are
>> working with H. sapiens, I could suggest
>> something like
>>
>> library(TxDb.Hsapiens.UCSC.**hg19.knownGene)
>> ex<-
>> exonsBy(TxDb.Hsapiens.UCSC.**hg19.knownGene,
>> "gene")
>>
>> x<- GRanges(seq = "chr17", IRanges(start =
>> 31245606, width = 1))
>>
>> ex[ex %in% x]
>>
>> or maybe more appropriately
>>
>> names(ex)[ex %in% x]
>>
>> which will give you the Gene ID, and you
>> can go from there using
>> the org.Hs.eg.db package.
>>
>> If however, your coordinate isn't in an
>> exon, but might be in a
>> UTR, you can look at ?exonsBy to see what
>> other sequences you can
>> extract to compare with.
>>
>> If these are SNPs, then you can look at
>> the help pages for the
>> relevant SNPlocs package.
>>
>> Best,
>>
>> Jim
>>
>>
>>
>> which can also be written this way by
>> the program that yielded
>> the result:
>> chr17.31246506
>>
>> And I need to convert this data into a
>> gene name and known
>> gene Ids, such
>> as:
>>
>> Gene name Entrez_ID Ensembl_ID
>>
>> Tff3 NM_011575 20050
>> Can you please advice me about a
>> module able to perform this
>> ID conversion
>> using a list of "chr17.31246506" type
>> coordinates as input?
>>
>> Thanks in advance
>>
>> With best wishes
>>
>>
>>
>>
>> ______________________________**_________________
>> Bioconductor mailing list
>> Bioconductor@...
>> <mailto:Bioconductor <at> r-**project.org<Bioconductor <at> r-project.org>
>> ><mailto:Bioconduct**or@... <Bioconductor@...>
>>
>> <mailto:Bioconductor <at> r-**project.org<Bioconductor <at> r-project.org>
>> >>
>> https://stat.ethz.ch/mailman/**
>> listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>> Search the archives:
>>
>> http://news.gmane.org/gmane.**
>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>
>>
>> -- James W. MacDonald, M.S.
>> Biostatistician
>> University of Washington
>> Environmental and Occupational Health Sciences
>> 4225 Roosevelt Way NE, # 100
>> Seattle WA 98105-6099
>>
>>
>>
>>
>> --
>> --
>> Dr. José Luis Lavín Trueba
>>
>> Dpto. de Producción Agraria
>> Grupo de Genética y Microbiología
>> Universidad Pública de Navarra
>> 31006 Pamplona
>> Navarra
>> SPAIN
>>
>> --
>> James W. MacDonald, M.S.
>> Biostatistician
>> University of Washington
>> Environmental and Occupational Health Sciences
>> 4225 Roosevelt Way NE, # 100
>> Seattle WA 98105-6099
>>
>> ______________________________**_________________
>> Bioconductor mailing list
>> Bioconductor@...
>> <mailto:Bioconductor <at> r-**project.org<Bioconductor <at> r-project.org>
>> >
>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>> Search the archives:
>>
>> http://news.gmane.org/gmane.**
>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>
>>
>>
>>
>>
>> ______________________________**_________________
>> Bioconductor mailing list
>> Bioconductor@... <mailto:Bioconductor <at> r-**
>> project.org <Bioconductor@...>>
>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>> Search the archives:
>> http://news.gmane.org/gmane.**science.biology.informatics.**
>> conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>
>>
>> ______________________________**_________________
>> Bioconductor mailing list
>> Bioconductor@... <mailto:Bioconductor <at> r-**project.org<Bioconductor@...>
>> >
>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>> Search the archives:
>> http://news.gmane.org/gmane.**science.biology.informatics.**
>> conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>
>>
>> -- Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages@... <mailto:hpages@...>
>> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>>
>>
>>
>>
>>
>> --
>> --
>> Dr. José Luis Lavín Trueba
>>
>> Dpto. de Producción Agraria
>> Grupo de Genética y Microbiología
>> Universidad Pública de Navarra
>> 31006 Pamplona
>> Navarra
>> SPAIN
>>
>
>
--
--
--
Dr. José Luis Lavín Trueba
Dpto. de Producción Agraria
Grupo de Genética y Microbiología
Universidad Pública de Navarra
31006 Pamplona
Navarra
SPAIN
[[alternative HTML version deleted]]
_______________________________________________ Bioconductor mailing list Bioconductor@... https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
RSS Feed