collate

This command takes as input a directory containing a RAD file (created by running alevin with the --justAlign and/or --sketch flags), as well as the directory generated as the result of running the generate-permit-list command of alevin-fry, and it will produce an output RAD file that is collated by (corrected) cellular barcode. The collated RAD file can then be quantified with the alevin-fry quant command. It also takes two other arguments (described below) that dictate how the collation and filtering will be performed.

-r, --rad-dir <rad-dir> : The directory containing the RAD file to be collated. This is the same directory on which you previously ran generate-permit-list, and which was produced by running alevin with the --justAlign and/or --sketch flags.
-i, --input-dir <input-dir> : The input directory. This is the directory that was the output of generate-permit-list. This directory contains information computed by the generate-permit-list command that will allow successful collation and barcode correction. This is also the directory where the collated RAD file will be output.
--compress : This optional flag tells alevin-fry to compress the output collated RAD file. The file will be compressed using the Snappy compression format (via the excellent snap crate). If this option is passed, the output file will be written to map.collated.rad.sz rather than map.collated.rad, and the compression status of the file will be recorded in collate.json in the output directory. Note: The choice to use compression or not has no effect on the final result or the correctness of the output, but it may have some moderate performance implications. Specifically, it is potentially worth using this flag if you want to minimize disk space and you are using a sufficiently large number of threads (compression happens in parallel, so with enough threads the compressed RAD file can be generated as quickly as the uncompressed one). However, because some internal buffers must be duplicated during parallel compression, the collate step can use a bit more memory if run with the --compress flag, though the memory usage should still be small and stable over different sized inputs. There can also be an effect on quantification speed, since the collated RAD file will be decompressed on the fly during quantification. This effect should be small, because Snappy decompresses very fast, and decompression will only be the limiting factor if you are using a simple resolution strategy (e.g. naive or cr-like) and many quantification threads.
-m, --max-records <max-records> : The maximum number of read records to keep in memory at once during collation. The collate command passes over the input RAD file multiple times, each time collecting the records associated with a set of (corrected) cellular barcodes so that they can be written out in collated format to the output RAD file. This parameter determines (approximately) how many records will be held in memory at once, and therefore determines the memory usage of the collate command. The larger the value, the faster the collation process will be, since fewer passes are made. The smaller the value, the lower the memory usage will be, at the cost of more passes. The default value is 30,000,000. Note that this bound is approximate, because a specific barcode will never be split across multiple collation passes. The algorithm employed is to collect the reads associated with different cellular barcodes in the current pass until the number of reads to be collected first exceeds this value.
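The pass-planning rule described for --max-records can be illustrated with a small sketch. This is not the alevin-fry implementation (which is written in Rust); the function name plan_collation_passes, the toy barcodes, and the use of an in-memory dictionary of per-barcode read counts are all assumptions made for illustration. It only shows why the limit is approximate: a pass closes once its read total first exceeds the limit, and a barcode is never split across passes.

```python
from typing import Dict, List


def plan_collation_passes(
    reads_per_barcode: Dict[str, int], max_records: int
) -> List[List[str]]:
    """Group barcodes into passes (hypothetical helper, illustration only).

    Barcodes are added to the current pass until the pass's read total
    first exceeds max_records; the pass is then closed. A barcode is
    never split, so a pass may hold somewhat more than max_records reads.
    """
    if max_records <= 0:
        raise ValueError("max_records must be positive")

    passes: List[List[str]] = []
    current: List[str] = []
    current_reads = 0

    for barcode, n_reads in reads_per_barcode.items():
        current.append(barcode)
        current_reads += n_reads
        if current_reads > max_records:
            # Limit exceeded for the first time: close this pass.
            passes.append(current)
            current = []
            current_reads = 0

    if current:
        # Remaining barcodes form the final (possibly small) pass.
        passes.append(current)
    return passes


# Toy read counts per corrected barcode (made-up values).
example_counts = {"AAAC": 4, "AAAG": 3, "AACA": 5, "AACC": 2, "AACG": 1}
example_passes = plan_collation_passes(example_counts, max_records=6)
print(example_passes)
```

With a limit of 6, the first two passes in this toy input each hold 7 reads, slightly above the limit, which mirrors the trade-off above: a larger limit means fewer passes over the input but more records in memory per pass.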

output

The collate command writes all of the files it creates to the directory specified by -i. It will write a file named map.collated.rad (or map.collated.rad.sz if run with the --compress flag), one named unmapped_bc_count_collated.bin, and one named collate.json.
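As a rough illustration of how the options and output files fit together, the sketch below assembles a collate command line from the flags documented on this page and lists the files one would expect to find afterwards. The helper names build_collate_command and expected_collate_outputs, and the directory names alevin_out and permit_out, are hypothetical; only the flags and file names come from the text above.

```python
from pathlib import Path
from typing import List, Optional


def build_collate_command(
    rad_dir: str,
    input_dir: str,
    compress: bool = False,
    max_records: Optional[int] = None,
) -> List[str]:
    """Assemble an argument list for alevin-fry collate (hypothetical helper)."""
    cmd = ["alevin-fry", "collate", "-r", rad_dir, "-i", input_dir]
    if compress:
        cmd.append("--compress")
    if max_records is not None:
        # When omitted, the tool's own default (30,000,000) applies.
        cmd += ["-m", str(max_records)]
    return cmd


def expected_collate_outputs(input_dir: str, compress: bool = False) -> List[Path]:
    """Return the files collate is documented to write into the -i directory."""
    rad_name = "map.collated.rad.sz" if compress else "map.collated.rad"
    base = Path(input_dir)
    return [
        base / rad_name,
        base / "unmapped_bc_count_collated.bin",
        base / "collate.json",
    ]


# Placeholder directory names, chosen only for this example.
cmd = build_collate_command("alevin_out", "permit_out", compress=True, max_records=10_000_000)
outputs = expected_collate_outputs("permit_out", compress=True)
print(" ".join(cmd))
print([p.name for p in outputs])
```

The argument list could be passed to subprocess.run(cmd, check=True) on a system where alevin-fry is installed, and the returned paths checked with Path.exists() once the command has finished.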