You can find raw data at https://www.ncbi.nlm.nih.gov/geo/browse/ 1, or use the following handy tools:
gse_soft_ftp('GSE51280')
#> [1] "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE51nnn/GSE51280/soft/GSE51280_family.soft.gz"
gse_matrix_ftp('GSE51280')
#> [1] "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE51nnn/GSE51280/matrix/GSE51280_series_matrix.txt.gz"
Then you can read them:
# expression matrix file
read_gse_matrix(system.file('extdata/GSE51280_series_matrix.txt.gz', package = 'rGEO'))
#> # A tibble: 123 × 25
#> ID_REF GSM1241791 GSM1241792 GSM1241793 GSM1241794 GSM1241795 GSM1241796
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 20 -3.88 -4.00 -3.56 -2.54 -3.98 -4.20
#> 2 21 -8.82 -9.61 -8.94 -5.44 -8.36 -8.95
#> 3 22 -3.65 -5.80 -1.00 -3.54 -2.81 -5.58
#> 4 23 -4.92 -7.61 -5.88 -4.60 -4.00 -4.75
#> 5 24 -3.51 -4.12 -3.92 -4.30 -4.31 -5.04
#> 6 25 -3.36 -3.90 -5.60 -2.52 -4.19 -4.12
#> # … with 117 more rows, and 18 more variable: GSM1241797 <dbl>, …
# SOFT (chip annotation) file
read_gse_soft(system.file('extdata/GSE51280_family.soft.gz', package = 'rGEO'))
#> # A tibble: 121 × 2
#> ID_REF symbol
#> <chr> <chr>
#> 1 100 NDC80
#> 2 101 NDRG1
#> 3 102 NEUROG1
#> 4 103 NT5E
#> 5 104 NUF2
#> 6 106 PGR
#> # … with 115 more rows
Let’s go a little deeper to see how this works.
The package mainly contains functions for
1. telling the ftp path to GSE raw data: see gds_ftp()
and
gpl_soft_ftp()
1. reading these files, including expression matrix file and SOFT (chip
annotation) file, see below.
reading expression matrix file is quiet straightforward, you just
need to locate where the information lies, and
readr::read_tsv()
would take care of the rest for you.
read_gse_matrix
function (matrix_file)
{if (!file.exists(matrix_file))
return(NULL)
<- readr::read_lines(matrix_file)
matrix_raw <- stringr::str_which(matrix_raw, "series_matrix_table")
matrix_boundary if (diff(matrix_boundary) == 2L)
return(NULL)
seq(matrix_boundary[1] + 1, matrix_boundary[2] -
matrix_raw[1)] %>% I() %>% readr::read_tsv(T, readr::cols(.default = "c")) %>%
rm_problematic_row() %>% {
suppressWarnings(dplyr::mutate_at(., -1, as.double))
}
}<bytecode: 0x57527bd41108>
<environment: namespace:rGEO>
SOFT (chip annotation) file is a lot more harder to deal with. In short, you have to dig useful information first:
parse_gse_soft(system.file('extdata/GSE51280_family.soft.gz', package = 'rGEO'), F)
#> $accession
#> [1] "GPL17590"
#>
#> $info
#> # A tibble: 5 × 2
#> name description
#> <chr> <chr>
#> 1 ID ""
#> 2 GENE_SYMBOL "Gene Symbol"
#> 3 GB_ACC "Genebank accession number LINK_PRE:\"http://www.ncbi.nlm.nih.gov…
#> 4 Class Name "Gene Class"
#> 5 SPOT_ID ""
#>
#> $table
#> # A tibble: 142 × 5
#> ID GENE_SYMBOL GB_ACC `Class Name` SPOT_ID
#> <chr> <chr> <chr> <chr> <chr>
#> 1 1 POS_A(128) NA Positive CONTROL
#> 2 2 POS_B(32) NA Positive CONTROL
#> 3 3 POS_C(8) NA Positive CONTROL
#> 4 4 POS_D(2) NA Positive CONTROL
#> 5 5 POS_E(0.5) NA Positive CONTROL
#> 6 6 POS_F(0.125) NA Positive CONTROL
#> # … with 136 more rows
and convert that to a uniform format. Almost the whole package does
the latter, refer to vignette('probe2symbol')
for more
details.
By the way, for both parse_gse_soft()
and
read_gse_soft()
, you can add verbose = T
to
make the process more clear.
The format of SOFT (chip annotation) file is really a terrible messy, or rather, aganist humanity. Even though I have developed this handy package, I never want to touch it again.
The central problem is that we don’t know each probe corresponding to which gene.
The annotation (TSV) contains various information, such as Entrez Gene ID、Ensembl gene ID、INSDC accession、RefSeq accession (nucleotide), so I developed hgnc to universely convert them to HUGO gene symbol. That is actually a fairly easy task, what real drives one crazy is the messy column names. Given the following snippet 2, how can you make sure which columns exactly store Entrez Gene ID? ……
EntrezGeneID: EntrezGene ID
ORF: Entrez Gene identifier for the gene represented
GeneID: Entrez Gene identification number
UniGene_ID: UniGene Database Accession Number (http://www.ncbi.nlm.nih.gov/sites/entrez?db=unigene)
ORF: Entrez GENE ID
Entrez_Gene_ID: Entrez gene ID
GENE: Entrez GeneID of sequence used to design cDNA probe
Gene_ID: id in NCBI Entrez Gene database