For the impatient: you can convert a probe-Ensembl map like the following [1]
probe2ensembl <- tibble::tribble(
    ~ID, ~Ensembl,
    "200064_at", "ENSG00000096384",
    "200066_at", "ENSG00000113141",
    "200068_s_at", "ENSG00000127022",
    "200069_at", "ENSG00000075856",
    "200071_at", "ENSG00000119953",
    "200076_s_at", "ENSG00000105700",
    "200077_s_at", "ENSG00000104904",
    "200078_s_at", "ENSG00000117410",
    "200082_s_at", "ENSG00000171863 /// ENSG00000183405 /// ENSG00000213326",
    "200084_at", "ENSG00000110696"
)
probe2ensembl
# A tibble: 10 × 2
   ID          Ensembl
   <chr>       <chr>
 1 200064_at   ENSG00000096384
 2 200066_at   ENSG00000113141
 3 200068_s_at ENSG00000127022
 4 200069_at   ENSG00000075856
 5 200071_at   ENSG00000119953
 6 200076_s_at ENSG00000105700
 7 200077_s_at ENSG00000104904
 8 200078_s_at ENSG00000117410
 9 200082_s_at ENSG00000171863 /// ENSG00000183405 /// ENSG00000213326
10 200084_at   ENSG00000110696
to a probe-symbol map [2]
probe2_symbol <- probe2ensembl %>% melt_map("ID", "Ensembl", " /// ") %>%
    dplyr::mutate("symbol" = as_symbol_from_ensembl(Ensembl)) %>%
    cast_map("ID", "symbol", " /// ")
probe2_symbol
# A tibble: 10 × 2
   ID          symbol
   <chr>       <chr>
 1 200064_at   HSP90AB1
 2 200066_at   IK
 3 200068_s_at CANX
 4 200069_at   SART3
 5 200071_at   SMNDC1
 6 200076_s_at KXD1
 7 200077_s_at OAZ1
 8 200078_s_at ATP6V0B
 9 200082_s_at RPS7 /// RPS7P11
10 200084_at   C11orf58
Other common IDs, such as Entrez gene ID, UniGene ID, and RefSeq accession, are also supported.
probe2id <- tibble::tribble(
~probe, ~id,
"probe1", "id1 | id2",
"probe2", "id3"
)
probe2symbol <- tibble::tribble(
~probe, ~symbol,
"probe1", "symbol1 | symbol2",
"probe2", "symbol3"
)
To master this package, you need to understand the problem it aims to solve. That is, to turn something like
probe2id
# A tibble: 2 × 2
  probe  id
  <chr>  <chr>
1 probe1 id1 | id2
2 probe2 id3
to
probe2symbol
# A tibble: 2 × 2
  probe  symbol
  <chr>  <chr>
1 probe1 symbol1 | symbol2
2 probe2 symbol3
probe2id_easy <- tibble::tribble(
~probe, ~id,
"probe4", "id4",
"probe5", "id5",
"probe6", "id6"
)
probe2symbol_easy <- tibble::tribble(
~probe, ~symbol,
"probe4", "symbol4",
"probe5", "symbol5",
"probe6", "symbol6"
)
To point out the key difficulty, let’s contrast it with a simpler one — to turn something like
probe2id_easy
# A tibble: 3 × 2
  probe  id
  <chr>  <chr>
1 probe4 id4
2 probe5 id5
3 probe6 id6
to
probe2symbol_easy
# A tibble: 3 × 2
  probe  symbol
  <chr>  <chr>
1 probe4 symbol4
2 probe5 symbol5
3 probe6 symbol6
That’s quite easy: you just need an id-symbol map,
id2symbol <- tibble::tibble(
    id = paste0("id", 1:6),
    symbol = paste0("symbol", 1:6)
) %>% dplyr::sample_frac()
id2symbol
# A tibble: 6 × 2
  id    symbol
  <chr> <chr>
1 id6   symbol6
2 id5   symbol5
3 id2   symbol2
4 id4   symbol4
5 id3   symbol3
6 id1   symbol1
and use the following code [3]
dplyr::transmute(probe2id_easy, probe, symbol = id2symbol$symbol[match(id, id2symbol$id)])
# A tibble: 3 × 2
  probe  symbol
  <chr>  <chr>
1 probe4 symbol4
2 probe5 symbol5
3 probe6 symbol6
In the above code, we map probe to symbol in three steps:
1. dplyr::transmute preserves the probe2id_easy$probe - probe2id_easy$id relationship by position
2. match() finds the probe2id_easy$id - id2symbol$id relationship by value
3. [] finds the id2symbol$id - id2symbol$symbol relationship by position
Let’s walk through a concrete example, "probe4" to "symbol4":
1. 1st element of probe2id_easy$probe -> 1st element of probe2id_easy$id: "probe4" is the 1st element of probe2id_easy$probe, so we look for the 1st element of probe2id_easy$id, "id4".
2. "id4" in probe2id_easy$id -> "id4" in id2symbol$id: "id4" is the 1st element of probe2id_easy$id, so we look for the 1st element of the match() result (which gives the positions of probe2id_easy$id in id2symbol$id, c(4, 2, 1) for the shuffled id2symbol printed above). We get 4, so we look for the 4th element of id2symbol$id, which is exactly "id4".
3. 4th element of id2symbol$id -> 4th element of id2symbol$symbol: finally, "id4" is the 4th element of id2symbol$id, so we look for the 4th element of id2symbol$symbol, "symbol4".
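The three steps can be replayed in base R. This is a minimal sketch that hard-codes the row order printed for id2symbol above (the vignette shuffles it randomly), so that "id4" sits in row 4; it uses plain data frames instead of tibbles.

```r
# Toy data mirroring the tables above; id2symbol uses the shuffled row order
# printed earlier so that the positions are reproducible.
probe2id_easy <- data.frame(
    probe = c("probe4", "probe5", "probe6"),
    id    = c("id4", "id5", "id6"),
    stringsAsFactors = FALSE
)
id2symbol <- data.frame(
    id     = c("id6", "id5", "id2", "id4", "id3", "id1"),
    symbol = c("symbol6", "symbol5", "symbol2", "symbol4", "symbol3", "symbol1"),
    stringsAsFactors = FALSE
)

# match(): for each probe2id_easy$id, its position in id2symbol$id (by value)
pos <- match(probe2id_easy$id, id2symbol$id)
pos
# 4 2 1

# []: pick the symbols sitting at those positions (by position)
symbol <- id2symbol$symbol[pos]

# probe and symbol stay aligned row by row, as dplyr::transmute guarantees
result <- data.frame(probe = probe2id_easy$probe, symbol = symbol,
                     stringsAsFactors = FALSE)
result
```

Because the lookup goes through match(), the row order of id2symbol does not matter; only the pairing of id and symbol within each row does.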
Back to the original problem: it’s fairly easy to “replace” "id4" with "symbol4", "id5" with "symbol5", etc. (thanks to match()). But how can you “replace” the "id1" and "id2" inside "id1 | id2"?
That is exactly what we face in the 9th row of probe2ensembl.
probe2ensembl %>% dplyr::slice(9)
# A tibble: 1 × 2
  ID          Ensembl
  <chr>       <chr>
1 200082_s_at ENSG00000171863 /// ENSG00000183405 /// ENSG00000213326
If you think it’s a piece of cake, you may have some misunderstandings:
- A computer is very foolish: it can’t convert "id1 | id2" to "symbol1 | symbol2" the way you can, without even thinking. In a program, the direct approach is to search for "id1" in "id1 | id2" and replace it with "symbol1" if found, then search for "id2", "id3", etc. This performs very poorly.
- As for replacing every "id" with "symbol": I use id2symbol just to simplify the problem. The id-symbol map in the real world is usually something like:
hgnc::ensembl2symbol %>% dplyr::sample_n(3)
# A tibble: 3 × 2
  ensembl         symbol
  <chr>           <chr>
1 ENSG00000252496 RNA5SP33
2 ENSG00000257730 LSM6P2
3 ENSG00000227418 PCGEM1
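To make the cost of the direct search-and-replace approach concrete, here is a minimal base R sketch. The helper naive_replace() is hypothetical and written only for illustration; it is not part of the package.

```r
# Hypothetical helper, for illustration only: replace every known id inside
# each cell by scanning the whole id-symbol map, one id at a time.
naive_replace <- function(cells, ids, symbols) {
    for (i in seq_along(ids)) {
        # \\b (word boundary) keeps "id1" from matching inside "id10"
        pattern <- paste0("\\b", ids[i], "\\b")
        cells <- gsub(pattern, symbols[i], cells, perl = TRUE)
    }
    cells
}

cells   <- c("id1 | id2", "id3")
ids     <- paste0("id", 1:6)
symbols <- paste0("symbol", 1:6)

replaced <- naive_replace(cells, ids, symbols)
replaced
```

Every id in the map triggers one regular-expression pass over every cell, so the work grows with (number of ids) × (number of cells). That is tolerable for six toy ids but not for a real id-symbol map, where almost all of those passes find nothing.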
Inspired by reshape2, I choose to melt the wide map
probe2id_wide <- probe2id
probe2id_wide
# A tibble: 2 × 2
  probe  id
  <chr>  <chr>
1 probe1 id1 | id2
2 probe2 id3
to a long map.
probe2id_long <- probe2id_wide %>% melt_map("probe", "id", " \\| ")
probe2id_long
# A tibble: 3 × 2
  probe id
  <chr> <chr>
1 probe1 id1
2 probe1 id2
3 probe2 id3
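Conceptually, melting splits each cell on the separator and repeats the key once per piece. The base R sketch below illustrates that idea. melt_sketch() is a hypothetical stand-in, not the package's melt_map(), and it assumes the separator is given as a regular expression, which is why "|" is escaped.

```r
# Hypothetical base-R stand-in for the melting step; not the package's code.
melt_sketch <- function(wide, key, value, sep_pattern) {
    # split every cell of the value column on the separator (a regex)
    parts <- strsplit(wide[[value]], sep_pattern)
    long <- data.frame(
        rep(wide[[key]], lengths(parts)),  # repeat each key once per piece
        unlist(parts),
        stringsAsFactors = FALSE
    )
    names(long) <- c(key, value)
    long
}

probe2id_wide <- data.frame(
    probe = c("probe1", "probe2"),
    id    = c("id1 | id2", "id3"),
    stringsAsFactors = FALSE
)
probe2id_long <- melt_sketch(probe2id_wide, "probe", "id", " \\| ")
probe2id_long
```

After this step every row holds exactly one id, so the match()-based lookup from the simpler problem applies unchanged.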
Then map id to symbol to get a new long map, in the same way we solved the simpler problem above.
probe2symbol_long <- probe2id_long %>%
    dplyr::transmute(probe, symbol = id2symbol$symbol[match(id, id2symbol$id)])
probe2symbol_long
# A tibble: 3 × 2
  probe  symbol
  <chr>  <chr>
1 probe1 symbol1
2 probe1 symbol2
3 probe2 symbol3
Finally cast to a new wide map
probe2symbol_wide <- probe2symbol_long %>% cast_map("probe", "symbol", " /// ")
# now it matches probe2symbol, except that the separator is " /// " instead of " | "
probe2symbol_wide
# A tibble: 2 × 2
  probe  symbol
  <chr>  <chr>
1 probe1 symbol1 /// symbol2
2 probe2 symbol3
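Casting is the reverse of melting: group the long map by key and paste each group's values back together. The sketch below shows one way to do this in base R. cast_sketch() is a hypothetical helper for illustration, not the package's cast_map().

```r
# Hypothetical base-R stand-in for the casting step; not the package's code.
cast_sketch <- function(long, key, value, sep) {
    # keep keys in order of first appearance
    keys <- unique(long[[key]])
    groups <- split(long[[value]], factor(long[[key]], levels = keys))
    wide <- data.frame(
        keys,
        # collapse each group's values into one separator-delimited string
        vapply(groups, paste, character(1), collapse = sep, USE.NAMES = FALSE),
        stringsAsFactors = FALSE
    )
    names(wide) <- c(key, value)
    wide
}

probe2symbol_long <- data.frame(
    probe  = c("probe1", "probe1", "probe2"),
    symbol = c("symbol1", "symbol2", "symbol3"),
    stringsAsFactors = FALSE
)
probe2symbol_wide <- cast_sketch(probe2symbol_long, "probe", "symbol", " /// ")
probe2symbol_wide
```

Here the separator is a literal string used for pasting, not a pattern, so it needs no escaping.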
In short, A-B wide map -> A-B long map -> A-C long map -> A-C wide map.
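The whole chain can be sketched end to end in base R. The example adds a made-up id, "id9", that is missing from the map, and drops it before casting. Dropping unmatched ids is an assumption here, chosen to mirror the note that Ensembl IDs without a HUGO symbol are discarded; the package may handle this differently.

```r
id2symbol <- data.frame(
    id     = paste0("id", 1:6),
    symbol = paste0("symbol", 1:6),
    stringsAsFactors = FALSE
)
# "id9" is a made-up id with no entry in id2symbol
wide <- data.frame(
    probe = c("probe1", "probe2"),
    id    = c("id1 | id2 | id9", "id3"),
    stringsAsFactors = FALSE
)

# A-B wide map -> A-B long map
parts <- strsplit(wide$id, " \\| ")
long <- data.frame(
    probe = rep(wide$probe, lengths(parts)),
    id    = unlist(parts),
    stringsAsFactors = FALSE
)

# A-B long map -> A-C long map (unmatched ids become NA and are dropped)
long$symbol <- id2symbol$symbol[match(long$id, id2symbol$id)]
long <- long[!is.na(long$symbol), ]

# A-C long map -> A-C wide map
wide2 <- aggregate(symbol ~ probe, data = long, FUN = paste, collapse = " /// ")
wide2
```

One consequence of this design is that a probe whose ids are all unmatched disappears from the result entirely, and aggregate() returns the rows sorted by probe rather than in their original order.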
In the above discussion, I abstracted away many details to focus on the core idea. Things get more complicated in the real world:
- […] ",", "、", "\\\"), so I use reverse match.
- [You may wonder where] id2symbol comes from (at least it doesn’t fall from the sky). Actually, I melt hgnc_complete_set.txt.gz to create entrez2symbol, ensembl2symbol, etc.
- [I use] symbol = id2symbol$symbol[match(id, id2symbol$id)] in the above code for universality, but it looks quite obscure even though I have explained it in great detail. Thus I added some syntactic sugar (?as_symbol), so that you can see the simple and beautiful code at the beginning.

Armed with the above weapons, the package can serve as the workhorse of rGEO, transforming any user-supplied GPL file into a standard chip file ready for GSEA.
[1] How I created probe2ensembl:
# read probe annotation from a GSE SOFT file
soft_table <- system.file("extdata/GSE19161_family.soft.gz", package = "rGEO") %>%
    rGEO::parse_gse_soft(verbose = FALSE) %>% {.$table}
# subset part of the table for the purpose of demonstration
probe2ensembl <- soft_table %>% dplyr::select(1, 9) %>% dplyr::slice(41:50)
[2] Ensembl IDs that have no corresponding HUGO symbol are discarded.
[3] I deliberately shuffle the rows of id2symbol to show that id2symbol only needs to provide the correct relationship between id and symbol, i.e., it doesn’t need to be in the same order as probe2id_easy.