Read sequences from a fasta file. — read

read_fasta() reads all sequences contained in a fasta file into a tibble (see Value). See FASTA format for the requirements of input file.

read_fasta(file, per_line = FALSE, unalign = FALSE)

Arguments

file: string. Path to the fasta file.
per_line: logical scalar. Whether sequences keep in one line or might span multiple lines. Specify FALSE if not sure, refer to Details.
unalign: logical scalar. Whether unalign aligned sequences, i.e. remove '-', see Details.

Value

tibble

name: character. sequence header (without '>')
seq: character. sequence itself, I named it seq instead of sequence to save 62.5% of time (and everyone using R should take it for granted that seq means sequence)

Details

Previously, I implemented fasta as a named character (bioinfor::read_fasta()). But I found it more and more inconvenient as I used it. Many times when you manipulate the character, the name got lost. Finally, I decided to reimplement it as a tibble. That why this package come into being.

Actually, file can be anything as long as it's recognized by readr::read_lines().

If per_line isn't TRUE, read_fasta() will check it by detecting '>' and count line number. This (implemented by stringr::str_detect()) takes almost the same time as reading file content (implemented by readr::read_lines()). So specify TRUE if you are sure to save time. But this checking doesn't waste any time when sequences indeed span multiple lines. Additionly, a file contains odd number of lines will cause pre_line=TURE to be ignored

For unalign, when deal with large aligned file with linebreak containing many '-'s, implement unalign (stringr::str_replace_all() in read_fasta()) is about 15% faster than do that externally, i.e. after read_fasta() returns.

FASTA format

The minimum requirement is that it can only contain sequences (can use any character except for '>', at least at the begining) and their headers (begin with '>', one line). Comments are not supported.

Basically, c('>na', 'me', 'ATCG') (name span two lines) and c('>name', '>ATCG') (sequence begins with '>') are most common (and maybe the only) errors. Expect that, almost anything is acceptable (though your file may not be recongnized by other program):

name contains '>', see crazy example 1.
sequence contains '>' (can't be at the begining), see crazy example 2.
empty name, see crazy example 3.
empty sequence, see crazy example 4. But it must occupy an empty line if you want to scape check for pre_line (refer to details).
very long name, see crazy example 5. It can contain thousands of thousands characters, so long as you have enough memory (and disk if you are really mad).
don't you think the above is enough? We can't make friends anymore. (Never tell me you have tried crazy example 6)

In most situations, you won't have a file like that, unless you deliberately create one.

Test files

test1.fasta big file with linebreak.
test12.fasta same file without.
test2.fasta big aligned file with linebreak.

Examples

system.file('extdata', 'example.fasta', package = 'rutil') |> read_fasta()
#> # A tibble: 62 × 2
#>    name                            seq                                          
#>    <chr>                           <chr>                                        
#>  1 Acanthisitta_chloris_Jarvis     MVKLGSNLNDKNNKQPSNEDGFQTVPLITPLEVNHLQFPAPEKV…
#>  2 Amazona_aestiva_protein         MLAQGKGPQAQCSSQCSSHDASQSHCPPQLPASAGPRHHPAPPH…
#>  3 Anas_platyrhynchos_Jarvis       MVKLGSNLNDKNSKPPSGEDGFQTVPLITPLEVNHLQFPAPEKV…
#>  4 Anser_cygnoides_protein         MCEVSLGVACCVARNLGVKPRKPIRQGKVSQTGQKSSSHYPCDN…
#>  5 Antrostomus_carolinensis_Jarvis QVIVKTRTEYQPDQKNKGKLRVPKIAEFTVSFTDGVTERLKVTI…
#>  6 Apaloderma_vittatum_Jarvis      MVKLGSNLSDKSSKQPPNEDGFQTVPLITPLEVNHLQFPAPEKV…
#>  7 Aptenodytes_forsteri_Jarvis     MVKLGSNLNDKNNKQPSNEDGFQTVPLITPLEVNHLQFPAPEKV…
#>  8 Apteryx_australis_protein       VTILISLALAFLACIVFLVVYKAFTYDHSCPDGFVYKHKRCIPA…
#>  9 Aquila_chrysaetos_protein       MVKLGSNLNDKNSKQPSNEDGFQTVPLITPLEVNHLQFPAPEKV…
#> 10 Balearica_regulorum_Jarvis      MVKLGSNLNDKNNKQPSNEDGFQTVPLITPLEVNHLQFPAPEKV…
#> # … with 52 more rows

system.file('extdata', 'aligned-multiline.fasta', package = 'rutil') |> read_fasta()
#> # A tibble: 61 × 2
#>    name                            seq                                          
#>    <chr>                           <chr>                                        
#>  1 Acanthisitta_chloris_Jarvis     --------------------------------------------…
#>  2 Amazona_aestiva_protein         MRFWGGGWTLRXT-SEMGNXS----------VQRMRE-------…
#>  3 Anas_platyrhynchos_Jarvis       --------------------------------------------…
#>  4 Anser_cygnoides_protein         --------------------------------------------…
#>  5 Antrostomus_carolinensis_Jarvis --------------------------------------------…
#>  6 Apaloderma_vittatum_Jarvis      --------------------------------------------…
#>  7 Aptenodytes_forsteri_Jarvis     --------------------------------------------…
#>  8 Apteryx_australis_protein       --------------------------------------------…
#>  9 Aquila_chrysaetos_protein       --------------------------------------------…
#> 10 Balearica_regulorum_Jarvis      --------------------------------------------…
#> # … with 51 more rows

system.file('extdata', 'aligned-multiline.fasta', package = 'rutil') |> read_fasta(unalign = TRUE)
#> # A tibble: 61 × 2
#>    name                            seq                                          
#>    <chr>                           <chr>                                        
#>  1 Acanthisitta_chloris_Jarvis     AGVLALWALTTHGMYIQDFWRTWLRGLRFFLAVGIFFCVVALVA…
#>  2 Amazona_aestiva_protein         MRFWGGGWTLRXTSEMGNXSVQRMREGPGGVLALWALITHVMYV…
#>  3 Anas_platyrhynchos_Jarvis       RAGVLALWALITHVMYVQDYWRTWLKGLRFFLFIGILFSALSVV…
#>  4 Anser_cygnoides_protein         MYVQDYWRTWLKGLRFFLFIGILFSALSVVGFCTFLVLAITKHQ…
#>  5 Antrostomus_carolinensis_Jarvis AGVLALWALITHVMYVQDYWRTWLKGLRFFLFIGILFSALSVVG…
#>  6 Apaloderma_vittatum_Jarvis      RAGVLALWALTTHVMYVQDYWRTWLKGLRFFLFIGILFSALSAV…
#>  7 Aptenodytes_forsteri_Jarvis     AGVLALWALITHVMYVQDYWRTWLKGLRFFLFIGILFSALSVVG…
#>  8 Apteryx_australis_protein       MHLSFPLESDYWRTWLKGLRFFLFIGILFSALSVVGFCTFLVLA…
#>  9 Aquila_chrysaetos_protein       MFSGCIQPFPEWLLMFLCVLALWALITHVMYVQDYWRTWLKGLR…
#> 10 Balearica_regulorum_Jarvis      AGVLALWALITHVMYVQDYWRTWLKGLRFFLFIGILFSALSVVG…
#> # … with 51 more rows


# crazy examples
read_fasta('>na>me\nATCG')
#> # A tibble: 1 × 2
#>   name  seq  
#>   <chr> <chr>
#> 1 na>me ATCG 

read_fasta('>name\nAT>CG')
#> # A tibble: 1 × 2
#>   name  seq  
#>   <chr> <chr>
#> 1 name  AT>CG

read_fasta('>\nATCG')
#> # A tibble: 1 × 2
#>   name  seq  
#>   <chr> <chr>
#> 1 ""    ATCG 

read_fasta('>name\n')
#> # A tibble: 1 × 2
#>   name  seq  
#>   <chr> <chr>
#> 1 name  ""   

read_fasta(paste0(c('>', rep('x', 10000), '\nATCG'), collapse = ''))
#> # A tibble: 1 × 2
#>   name                                                                     seq  
#>   <chr>                                                                    <chr>
#> 1 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx… ATCG 

read_fasta(I('>name'))
#> # A tibble: 1 × 2
#>   name  seq  
#>   <chr> <chr>
#> 1 name  ""