read_fasta()
reads all sequences contained in a fasta file
into a tibble (see Value). See FASTA format for the requirements of
input file.
read_fasta(file, per_line = FALSE, unalign = FALSE)
string. Path to the fasta file.
logical scalar. Whether sequences keep in one line or might
span multiple lines. Specify FALSE
if not sure, refer to Details.
logical scalar. Whether unalign aligned sequences, i.e. remove '-', see Details.
tibble
name: character. sequence header (without '>')
seq: character. sequence itself, I named it seq
instead of sequence
to
save 62.5% of time (and everyone using R should take it for granted that seq
means sequence)
Previously, I implemented fasta as a named character
(bioinfor::read_fasta()
). But I found it more and more inconvenient as I
used it. Many times when you manipulate the character, the name got lost.
Finally, I decided to reimplement it as a tibble. That why this package
come into being.
Actually, file
can be anything as long as it's recognized by
readr::read_lines()
.
If per_line
isn't TRUE
, read_fasta()
will check it by detecting '>'
and count line number. This (implemented by stringr::str_detect()
) takes
almost the same time as reading file content (implemented by
readr::read_lines()
). So specify TRUE
if you are sure to save time. But
this checking doesn't waste any time when sequences indeed span multiple
lines. Additionly, a file contains odd number of lines will cause
pre_line=TURE
to be ignored
For unalign
, when deal with large aligned file with linebreak containing
many '-'s, implement unalign (stringr::str_replace_all()
in
read_fasta()
) is about 15% faster than do that externally, i.e. after
read_fasta()
returns.
The minimum requirement is that it can only contain sequences (can use any character except for '>', at least at the begining) and their headers (begin with '>', one line). Comments are not supported.
Basically, c('>na', 'me', 'ATCG')
(name span two lines) and c('>name', '>ATCG')
(sequence begins with '>') are most common (and maybe the only)
errors. Expect that, almost anything is acceptable (though your file may
not be recongnized by other program):
name contains '>', see crazy example 1.
sequence contains '>' (can't be at the begining), see crazy example 2.
empty name, see crazy example 3.
empty sequence, see crazy example 4. But it must occupy an empty line if
you want to scape check for pre_line
(refer to details).
very long name, see crazy example 5. It can contain thousands of thousands characters, so long as you have enough memory (and disk if you are really mad).
don't you think the above is enough? We can't make friends anymore. (Never tell me you have tried crazy example 6)
In most situations, you won't have a file like that, unless you deliberately create one.
test1.fasta big file with linebreak.
test12.fasta same file without.
test2.fasta big aligned file with linebreak.
system.file('extdata', 'example.fasta', package = 'rutil') |> read_fasta()
#> # A tibble: 62 × 2
#> name seq
#> <chr> <chr>
#> 1 Acanthisitta_chloris_Jarvis MVKLGSNLNDKNNKQPSNEDGFQTVPLITPLEVNHLQFPAPEKV…
#> 2 Amazona_aestiva_protein MLAQGKGPQAQCSSQCSSHDASQSHCPPQLPASAGPRHHPAPPH…
#> 3 Anas_platyrhynchos_Jarvis MVKLGSNLNDKNSKPPSGEDGFQTVPLITPLEVNHLQFPAPEKV…
#> 4 Anser_cygnoides_protein MCEVSLGVACCVARNLGVKPRKPIRQGKVSQTGQKSSSHYPCDN…
#> 5 Antrostomus_carolinensis_Jarvis QVIVKTRTEYQPDQKNKGKLRVPKIAEFTVSFTDGVTERLKVTI…
#> 6 Apaloderma_vittatum_Jarvis MVKLGSNLSDKSSKQPPNEDGFQTVPLITPLEVNHLQFPAPEKV…
#> 7 Aptenodytes_forsteri_Jarvis MVKLGSNLNDKNNKQPSNEDGFQTVPLITPLEVNHLQFPAPEKV…
#> 8 Apteryx_australis_protein VTILISLALAFLACIVFLVVYKAFTYDHSCPDGFVYKHKRCIPA…
#> 9 Aquila_chrysaetos_protein MVKLGSNLNDKNSKQPSNEDGFQTVPLITPLEVNHLQFPAPEKV…
#> 10 Balearica_regulorum_Jarvis MVKLGSNLNDKNNKQPSNEDGFQTVPLITPLEVNHLQFPAPEKV…
#> # … with 52 more rows
system.file('extdata', 'aligned-multiline.fasta', package = 'rutil') |> read_fasta()
#> # A tibble: 61 × 2
#> name seq
#> <chr> <chr>
#> 1 Acanthisitta_chloris_Jarvis --------------------------------------------…
#> 2 Amazona_aestiva_protein MRFWGGGWTLRXT-SEMGNXS----------VQRMRE-------…
#> 3 Anas_platyrhynchos_Jarvis --------------------------------------------…
#> 4 Anser_cygnoides_protein --------------------------------------------…
#> 5 Antrostomus_carolinensis_Jarvis --------------------------------------------…
#> 6 Apaloderma_vittatum_Jarvis --------------------------------------------…
#> 7 Aptenodytes_forsteri_Jarvis --------------------------------------------…
#> 8 Apteryx_australis_protein --------------------------------------------…
#> 9 Aquila_chrysaetos_protein --------------------------------------------…
#> 10 Balearica_regulorum_Jarvis --------------------------------------------…
#> # … with 51 more rows
system.file('extdata', 'aligned-multiline.fasta', package = 'rutil') |> read_fasta(unalign = TRUE)
#> # A tibble: 61 × 2
#> name seq
#> <chr> <chr>
#> 1 Acanthisitta_chloris_Jarvis AGVLALWALTTHGMYIQDFWRTWLRGLRFFLAVGIFFCVVALVA…
#> 2 Amazona_aestiva_protein MRFWGGGWTLRXTSEMGNXSVQRMREGPGGVLALWALITHVMYV…
#> 3 Anas_platyrhynchos_Jarvis RAGVLALWALITHVMYVQDYWRTWLKGLRFFLFIGILFSALSVV…
#> 4 Anser_cygnoides_protein MYVQDYWRTWLKGLRFFLFIGILFSALSVVGFCTFLVLAITKHQ…
#> 5 Antrostomus_carolinensis_Jarvis AGVLALWALITHVMYVQDYWRTWLKGLRFFLFIGILFSALSVVG…
#> 6 Apaloderma_vittatum_Jarvis RAGVLALWALTTHVMYVQDYWRTWLKGLRFFLFIGILFSALSAV…
#> 7 Aptenodytes_forsteri_Jarvis AGVLALWALITHVMYVQDYWRTWLKGLRFFLFIGILFSALSVVG…
#> 8 Apteryx_australis_protein MHLSFPLESDYWRTWLKGLRFFLFIGILFSALSVVGFCTFLVLA…
#> 9 Aquila_chrysaetos_protein MFSGCIQPFPEWLLMFLCVLALWALITHVMYVQDYWRTWLKGLR…
#> 10 Balearica_regulorum_Jarvis AGVLALWALITHVMYVQDYWRTWLKGLRFFLFIGILFSALSVVG…
#> # … with 51 more rows
# crazy examples
read_fasta('>na>me\nATCG')
#> # A tibble: 1 × 2
#> name seq
#> <chr> <chr>
#> 1 na>me ATCG
read_fasta('>name\nAT>CG')
#> # A tibble: 1 × 2
#> name seq
#> <chr> <chr>
#> 1 name AT>CG
read_fasta('>\nATCG')
#> # A tibble: 1 × 2
#> name seq
#> <chr> <chr>
#> 1 "" ATCG
read_fasta('>name\n')
#> # A tibble: 1 × 2
#> name seq
#> <chr> <chr>
#> 1 name ""
read_fasta(paste0(c('>', rep('x', 10000), '\nATCG'), collapse = ''))
#> # A tibble: 1 × 2
#> name seq
#> <chr> <chr>
#> 1 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx… ATCG
read_fasta(I('>name'))
#> # A tibble: 1 × 2
#> name seq
#> <chr> <chr>
#> 1 name ""