Extract Fasta in list parallel GO

This software extracts Fasta sequences whose names match a list of keywords. It is similar to Extract Fasta in list but is written in Go for multi-core computers. Sequences are read in groups of 100 (by default), and each group is processed by one goroutine. The number of goroutines is limited only by the number of CPU cores.
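The batching scheme described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual source: the names record, readBatches and extract are hypothetical, and the fixed pool of workers fed through a channel is an assumed design.

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
	"sync"
)

// record is one Fasta entry: the header line without '>' and its sequence.
type record struct {
	name string
	seq  string
}

// readBatches parses Fasta text and sends records in groups of batchSize.
// Note: bufio.Scanner has a default line-length limit (64 KiB per line).
func readBatches(sc *bufio.Scanner, batchSize int, out chan<- []record) {
	defer close(out)
	batch := make([]record, 0, batchSize)
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, ">") {
			// A new header starts a new record; ship the batch first if it is full.
			if len(batch) == batchSize {
				out <- batch
				batch = make([]record, 0, batchSize)
			}
			batch = append(batch, record{name: line[1:]})
		} else if len(batch) > 0 {
			// Sequence lines are appended to the most recent record.
			batch[len(batch)-1].seq += line
		}
	}
	if len(batch) > 0 {
		out <- batch
	}
}

// extract returns the sorted names of records whose name contains a keyword.
// Each batch is handled by exactly one of the worker goroutines (assumed design).
func extract(fasta string, keywords []string, batchSize, workers int) []string {
	batches := make(chan []record)
	go readBatches(bufio.NewScanner(strings.NewReader(fasta)), batchSize, batches)

	var mu sync.Mutex
	var wg sync.WaitGroup
	var hits []string
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for batch := range batches {
				for _, r := range batch {
					for _, k := range keywords {
						if strings.Contains(r.name, k) {
							mu.Lock()
							hits = append(hits, r.name)
							mu.Unlock()
							break // one matching keyword is enough
						}
					}
				}
			}
		}()
	}
	wg.Wait()
	sort.Strings(hits) // workers finish in any order, so sort for stable output
	return hits
}

func main() {
	fasta := ">GEN1 first\nACGT\n>GEN2 second\nTTGA\n>GEN11 third\nCCAA\n"
	for _, name := range extract(fasta, []string{"GEN1"}, 2, 4) {
		fmt.Println(name)
	}
}
```

Larger batches mean fewer channel operations per sequence, which is one plausible reason the batch size is exposed as a tunable setting.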

This software is recommended for very large keyword lists and big Fasta files. In this case, it is about 20 times faster than Extract Fasta in list on 35 cores, but it needs more RAM.

Manual :

The software was compiled for 64-bit Linux.

1- install Perl (a free programming language) and GNU parallel.

2- unzip the software

3- copy your Fasta files into the “fasta” directory.

4- copy your reference lists into the “lists” directory, one item per line.

The lines of the reference list are interpreted as plain strings, and the search is performed only in the sequence name (it is not possible to search within the sequence itself). For example :

GEN1 matches GEN1, GEN11, GEN12 etc.

(GEN1) matches (GEN1)
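The matching rule illustrated by these examples behaves like a plain substring test on the sequence name. A minimal sketch, assuming substring semantics (the function name matchesAny is hypothetical, and skipping empty list lines is an assumption, not documented behaviour):

```go
package main

import (
	"fmt"
	"strings"
)

// matchesAny reports whether the sequence name contains any list entry
// as a plain substring. No regular expressions are involved, so
// parentheses in an entry are matched literally.
func matchesAny(name string, entries []string) bool {
	for _, e := range entries {
		// Assumption: empty entries are skipped, since an empty string
		// is a substring of every name.
		if e != "" && strings.Contains(name, e) {
			return true
		}
	}
	return false
}

func main() {
	names := []string{"GEN1", "GEN11", "GEN12", "(GEN1)", "GEN2"}
	for _, n := range names {
		fmt.Printf("%-7s GEN1:%v (GEN1):%v\n", n,
			matchesAny(n, []string{"GEN1"}),
			matchesAny(n, []string{"(GEN1)"}))
	}
}
```

Under this reading, a short entry such as GEN1 is broad, while adding surrounding characters such as the parentheses narrows the match to names that contain them.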

5- edit parallel_extract_conf.txt to set the number of CPUs for parallel processing.

buffer=100 is the number of Fasta sequences per goroutine. The optimal buffer size depends on the length of the list and on disk and CPU speed; 100 is a good starting point.

n_threads=35 is the maximum number of goroutines

nb_cpu=8 is the maximum number of CPUs that GNU parallel uses to process all the combinations of ( lists X fasta ) files in parallel

caution : the total max number of threads is n_threads x nb_cpu (35 x 8 = 280 with the values above)
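To check the worst-case thread count before a run, the settings can be read and multiplied. The sketch below assumes that parallel_extract_conf.txt holds one key=value pair per line, as the settings are written above; the function name parseConf is hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseConf reads key=value lines into a map of integers.
// Assumption: one integer setting per line, blank lines allowed.
func parseConf(text string) (map[string]int, error) {
	conf := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		key, val, ok := strings.Cut(line, "=")
		if !ok {
			return nil, fmt.Errorf("malformed line: %q", line)
		}
		n, err := strconv.Atoi(strings.TrimSpace(val))
		if err != nil {
			return nil, fmt.Errorf("bad value for %s: %v", key, err)
		}
		conf[strings.TrimSpace(key)] = n
	}
	return conf, sc.Err()
}

func main() {
	conf, err := parseConf("buffer=100\nn_threads=35\nnb_cpu=8\n")
	if err != nil {
		panic(err)
	}
	// Each of the nb_cpu GNU parallel jobs may start up to n_threads goroutines.
	fmt.Println("worst-case threads:", conf["n_threads"]*conf["nb_cpu"])
}
```

If the product is far above the number of physical cores, lowering nb_cpu or n_threads is likely to reduce memory pressure without costing much speed.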

6- run the software with the command : perl parallel_extract_fasta-0.3.pl

The command line for non-parallel execution is :

USAGE : ./read_fasta-028 -t 8 -r 100 -l lists/test.txt -f fasta/noncoding.fasta

Here -t 8 sets 8 goroutines, -r 100 sets 100 sequences per goroutine, -l gives the list file and -f gives the Fasta file.
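For readers who want to wrap or reimplement this command line, the four flags could be parsed with Go's standard flag package as sketched below. This is an illustration only and not the source of read_fasta-028: the options struct, the parseArgs function and the default of 8 for -t are assumptions (only the default of 100 sequences per group is stated in this manual).

```go
package main

import (
	"flag"
	"fmt"
)

// options mirrors the four flags shown in the USAGE line.
type options struct {
	threads int    // -t : number of goroutines
	buffer  int    // -r : sequences per goroutine
	list    string // -l : keyword list file
	fasta   string // -f : Fasta file
}

// parseArgs parses the arguments that follow the program name.
func parseArgs(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("read_fasta", flag.ContinueOnError)
	fs.IntVar(&o.threads, "t", 8, "number of goroutines (default is an assumption)")
	fs.IntVar(&o.buffer, "r", 100, "sequences per goroutine")
	fs.StringVar(&o.list, "l", "", "keyword list file")
	fs.StringVar(&o.fasta, "f", "", "Fasta file")
	err := fs.Parse(args)
	return o, err
}

func main() {
	o, err := parseArgs([]string{"-t", "8", "-r", "100",
		"-l", "lists/test.txt", "-f", "fasta/noncoding.fasta"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d goroutines, %d sequences each, list=%s, fasta=%s\n",
		o.threads, o.buffer, o.list, o.fasta)
}
```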

7- processed files are in the “results” directory.

8- search log files are in the “log” directory.