created by Tiffany_at_Broad
on 2018-06-15
You've likely used one of the read_X functions in your WDL and have surpassed the default limits set in FireCloud's [Cromwell](https://software.broadinstitute.org/firecloud/documentation/article?id=10959 "Cromwell") instance. In practice, when Cromwell reads a file that exceeds the size limit, it immediately stops reading and fails the workflow with an error message. If you want to know why these limits were introduced, read this blog post (link coming soon).
Limits
Workarounds
If you are using read_lines() with a large file of filenames and hitting this error, the best workaround is to split the large file by line count into multiple small files, scatter over the array of small files, and get each filename by reading the contents of its small file. The same approach can be applied to other read_X errors.
Here are two example WDLs for inspiration:
Option 1
```wdl
workflow w {
  File fileOfFilenames # 1GB in size

  # Split large file into small individual files
  call splitFile { input: largeFile = fileOfFilenames }

  scatter (f in splitFile.tiny_files) {
    String fileName = read_string(f)
  }

  Array[String] filenames = fileName
}

task splitFile {
  File largeFile
  command {
    mkdir sandbox
    split -l 1 ${largeFile} sandbox/
  }
  output {
    Array[File] tiny_files = glob("sandbox/*")
  }
  runtime {
    docker: "ubuntu:latest"
  }
}
```
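To see what the splitFile task's command does before running the workflow, you can try it locally. This is a minimal shell sketch; the filenames and paths are made up for illustration:

```shell
# Mimic splitFile: break a file of filenames into one-line files.
mkdir -p sandbox
printf 'gs://bucket/a.bam\ngs://bucket/b.bam\ngs://bucket/c.bam\n' > filenames.txt

# -l 1 writes one input line per output file; the prefix "sandbox/"
# yields sandbox/aa, sandbox/ab, sandbox/ac (split's default suffixes).
split -l 1 filenames.txt sandbox/

cat sandbox/aa   # prints: gs://bucket/a.bam
```

Each tiny file now holds exactly one filename, so read_string() on it stays far below any size limit.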
Option 2
```wdl
workflow use_file_of_filenames {
  File file_of_filenames

  call count_filenames_in_file { input: file_of_filenames = file_of_filenames }

  scatter (index in range(count_filenames_in_file.count)) {
    call operate_on_file {
      input:
        file_of_filenames = file_of_filenames,
        file_index = index
    }
  }
}

task count_filenames_in_file {
  File file_of_filenames
  command {
    wc -l < ${file_of_filenames}
  }
  output {
    Int count = read_int(stdout())
  }
}

task operate_on_file {
  File file_of_filenames
  Int file_index
  command {
    # 1: Get the appropriate file name from the list
    # 2: Operate on that file as a URL
  }
}
```
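Step 1 of operate_on_file is left to you, but one common way to fill it in is with sed, which can print a single line by number. A sketch, assuming the same illustrative filenames as before (note that file_index is 0-based while sed line numbers are 1-based):

```shell
# Stand-ins for the task's inputs.
file_of_filenames=filenames.txt
file_index=1

printf 'gs://bucket/a.bam\ngs://bucket/b.bam\ngs://bucket/c.bam\n' > "$file_of_filenames"

# Step 1: grab line (file_index + 1) from the list.
filename=$(sed -n "$((file_index + 1))p" "$file_of_filenames")
echo "$filename"   # prints: gs://bucket/b.bam

# Step 2 would then operate on "$filename" as a URL, e.g. localize or stream it.
```

Because each task reads only one line, the full 1GB file is never passed through a read_X function.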
Updated on 2018-06-15