quick start

You can find raw data at https://www.ncbi.nlm.nih.gov/geo/browse/ 1, or use the following handy tools:

gse_soft_ftp('GSE51280')
#> [1] "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE51nnn/GSE51280/soft/GSE51280_family.soft.gz"

gse_matrix_ftp('GSE51280')
#> [1] "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE51nnn/GSE51280/matrix/GSE51280_series_matrix.txt.gz"

Then you can read them:

# expression matrix file
read_gse_matrix(system.file('extdata/GSE51280_series_matrix.txt.gz', package = 'rGEO'))
#> # A tibble: 123 × 25
#>   ID_REF GSM1241791 GSM1241792 GSM1241793 GSM1241794 GSM1241795 GSM1241796
#>   <chr>       <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
#> 1 20          -3.88      -4.00      -3.56      -2.54      -3.98      -4.20
#> 2 21          -8.82      -9.61      -8.94      -5.44      -8.36      -8.95
#> 3 22          -3.65      -5.80      -1.00      -3.54      -2.81      -5.58
#> 4 23          -4.92      -7.61      -5.88      -4.60      -4.00      -4.75
#> 5 24          -3.51      -4.12      -3.92      -4.30      -4.31      -5.04
#> 6 25          -3.36      -3.90      -5.60      -2.52      -4.19      -4.12
#> # … with 117 more rows, and 18 more variable: GSM1241797 <dbl>, …

# SOFT (chip annotation) file
read_gse_soft(system.file('extdata/GSE51280_family.soft.gz', package = 'rGEO'))
#> # A tibble: 121 × 2
#>   ID_REF symbol 
#>   <chr>  <chr>  
#> 1 100    NDC80  
#> 2 101    NDRG1  
#> 3 102    NEUROG1
#> 4 103    NT5E   
#> 5 104    NUF2   
#> 6 106    PGR    
#> # … with 115 more rows

A little deeper

Let’s go a little deeper to see how this works.

The package mainly contains functions for
1. telling the ftp path to GSE raw data: see gds_ftp() and gpl_soft_ftp()
1. reading these files, including expression matrix file and SOFT (chip annotation) file, see below.

expression matrix file

reading expression matrix file is quiet straightforward, you just need to locate where the information lies, and readr::read_tsv() would take care of the rest for you.

read_gse_matrix
function (matrix_file) 
{
    if (!file.exists(matrix_file)) 
        return(NULL)
    matrix_raw <- readr::read_lines(matrix_file)
    matrix_boundary <- stringr::str_which(matrix_raw, "series_matrix_table")
    if (diff(matrix_boundary) == 2L) 
        return(NULL)
    matrix_raw[seq(matrix_boundary[1] + 1, matrix_boundary[2] - 
        1)] %>% I() %>% readr::read_tsv(T, readr::cols(.default = "c")) %>% 
        rm_problematic_row() %>% {
        suppressWarnings(dplyr::mutate_at(., -1, as.double))
    }
}
<bytecode: 0x57527bd41108>
<environment: namespace:rGEO>

SOFT (chip annotation) file

SOFT (chip annotation) file is a lot more harder to deal with. In short, you have to dig useful information first:

parse_gse_soft(system.file('extdata/GSE51280_family.soft.gz', package = 'rGEO'), F)
#> $accession
#> [1] "GPL17590"
#> 
#> $info
#> # A tibble: 5 × 2
#>   name        description                                                       
#>   <chr>       <chr>                                                             
#> 1 ID          ""                                                                
#> 2 GENE_SYMBOL "Gene Symbol"                                                     
#> 3 GB_ACC      "Genebank accession number LINK_PRE:\"http://www.ncbi.nlm.nih.gov…
#> 4 Class Name  "Gene Class"                                                      
#> 5 SPOT_ID     ""                                                                
#> 
#> $table
#> # A tibble: 142 × 5
#>   ID    GENE_SYMBOL  GB_ACC `Class Name` SPOT_ID
#>   <chr> <chr>        <chr>  <chr>        <chr>  
#> 1 1     POS_A(128)   NA     Positive     CONTROL
#> 2 2     POS_B(32)    NA     Positive     CONTROL
#> 3 3     POS_C(8)     NA     Positive     CONTROL
#> 4 4     POS_D(2)     NA     Positive     CONTROL
#> 5 5     POS_E(0.5)   NA     Positive     CONTROL
#> 6 6     POS_F(0.125) NA     Positive     CONTROL
#> # … with 136 more rows

and convert that to a uniform format. Almost the whole package does the latter, refer to vignette('probe2symbol') for more details.

By the way, for both parse_gse_soft() and read_gse_soft(), you can add verbose = T to make the process more clear.

Innovation

The format of SOFT (chip annotation) file is really a terrible messy, or rather, aganist humanity. Even though I have developed this handy package, I never want to touch it again.

The central problem is that we don’t know each probe corresponding to which gene.

The annotation (TSV) contains various information, such as Entrez Gene ID、Ensembl gene ID、INSDC accession、RefSeq accession (nucleotide), so I developed hgnc to universely convert them to HUGO gene symbol. That is actually a fairly easy task, what real drives one crazy is the messy column names. Given the following snippet 2, how can you make sure which columns exactly store Entrez Gene ID? ……

EntrezGeneID: EntrezGene ID
ORF: Entrez Gene identifier for the gene represented
GeneID: Entrez Gene identification number
UniGene_ID: UniGene Database Accession Number (http://www.ncbi.nlm.nih.gov/sites/entrez?db=unigene)
ORF: Entrez GENE ID
Entrez_Gene_ID: Entrez gene ID
GENE: Entrez GeneID of sequence used to design cDNA probe
Gene_ID: id in NCBI Entrez Gene database

  1. search for GSE***, download files from the Download family section at the bottom.↩︎

  2. the left part is column name, and the right part is explanation↩︎