atac

The atac command exposes the functionality of alevin-fry for processing RAD files containing scATAC-seq data. It sets the mode of alevin-fry and itself takes one of several sub-commands (generate-permit-list and sort being the primary ones).

generate-permit-list (atac)

This command takes as input an output directory containing a RAD file (created by piscem), and it determines what cell barcodes should be associated with “true” cells, which should be corrected to some “true” barcode, and which should simply be ignored / discarded.

This command has the following required arguments: the path to an input directory (--input), the path to an output directory (--output-dir, which will be created if it doesn’t exist), and the path to the barcode permit-list file (--unfiltered-pl). The permit-list argument works as follows:

  • --unfiltered-pl <plist>: This option accepts as an argument a list of possible barcodes for the sample. For example, this is the flag you should use if you wish to provide an “external permit list”, like the 10x v2 or 10x v3 permit lists. Unlike with the --valid-bc flag, the list passed to this argument is the set of all possible barcodes for the technology being processed, and it is likely that most of the barcodes in the file do not correspond to cells present in this particular sample. When using this argument, you may also pass the --min-reads argument to set the minimum frequency with which a barcode must be seen in order to be retained. The algorithm first passes over the input records (mapped reads) and counts how many times each barcode in the unfiltered permit list occurs exactly. Any barcode occurring >= min-reads times is considered a present cell. Subsequently, every barcode that did not match a present cell is searched (at an edit distance of up to 1) against the barcodes determined to correspond to present cells. If an initially non-matching barcode has a unique neighbor among the barcodes for present cells, it is corrected to that barcode. If it has no 1-edit neighbor, or if it has 2 or more 1-edit neighbors among that list (i.e. its correction would be ambiguous), the record is discarded.
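The two-pass procedure described above can be illustrated with a minimal Python sketch. This is not the alevin-fry implementation: the function names (one_substitution_neighbors, build_permit_map) are hypothetical, barcodes are treated as plain strings over ACGT, and "edit distance of up to 1" is simplified here to a single substitution.

```python
from collections import Counter

ALPHABET = "ACGT"  # assumption: barcodes contain only these four bases


def one_substitution_neighbors(barcode):
    """Yield every barcode that differs from `barcode` at exactly one position."""
    for i, original in enumerate(barcode):
        for base in ALPHABET:
            if base != original:
                yield barcode[:i] + base + barcode[i + 1:]


def build_permit_map(observed, unfiltered_permit_list, min_reads):
    """Return (present, permit_map) for a list of observed barcodes.

    observed: one barcode per mapped record (hypothetical input shape).
    unfiltered_permit_list: set of all possible barcodes for the technology.
    min_reads: minimum exact-match count for a barcode to count as present.
    """
    observed = list(observed)

    # Pass 1: count exact matches against the unfiltered permit list.
    exact_counts = Counter(bc for bc in observed if bc in unfiltered_permit_list)
    present = {bc for bc, n in exact_counts.items() if n >= min_reads}

    # Present barcodes map to themselves.
    permit_map = {bc: bc for bc in present}

    # Pass 2: try to rescue every other barcode via a unique 1-edit neighbor.
    for bc in set(observed) - present:
        candidates = {n for n in one_substitution_neighbors(bc) if n in present}
        if len(candidates) == 1:
            permit_map[bc] = candidates.pop()
        # Zero candidates or an ambiguous correction: the barcode is dropped.

    return present, permit_map
```

Barcodes absent from the returned map correspond to records that would be discarded. A real implementation would typically operate on bit-packed barcodes rather than strings, but the decision logic is the same.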

output

The generate-permit-list command outputs a number of different files in the output directory. Not all files are relevant to users of alevin-fry, but the files are described here.

  1. The file bin_lens.bin is a binary file that records the lengths of the bins used for creating temporary files for sorting.

  2. The file bin_recs.bin is a binary file that encodes where records should be routed during the sorting phase.

  3. The file permit_freq.bin is a binary file that encodes information about the frequency of occurrence of different barcodes in the permit list.

  4. The file permit_map.bin is a binary file (a serde-serialized HashMap) that maps each barcode in the input RAD file that is within an edit distance of 1 of some true barcode to the true barcode to which it corrects. This allows the collate command to group together all of the read records corresponding to the same corrected barcode.

  5. The file generate_permit_list.json is a JSON file containing information about the run of the command.

sort (atac)

This command takes as input the directory containing the original RAD file (created by piscem) and the output directory generated by the generate-permit-list command above. It parses the input RAD file, buckets and then sorts the records by genomic location, and produces a globally sorted BED file for downstream analysis. The process is highly multi-threaded, and the number of threads can be chosen with the --threads option. The output BED file is compressed if the --compress flag is passed to the sort command. The output of the sort command is described below.
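To illustrate why bucketing followed by per-bucket sorting yields a globally sorted result, here is a small Python sketch. It is only a conceptual model, not how alevin-fry is implemented: the record layout (chrom, start, end, barcode), the bin_size parameter, and the function names are all assumptions made for this example.

```python
from collections import defaultdict


def bucket_and_sort(records, chrom_order, bin_size=1_000_000):
    """Yield records in global (chromosome, start, end) order.

    records: iterable of (chrom, start, end, barcode) tuples (assumed layout).
    chrom_order: list giving the desired order of chromosomes.
    bin_size: width in bases of each genomic bucket (hypothetical value).
    """
    rank = {chrom: i for i, chrom in enumerate(chrom_order)}

    # Route each record to the bucket covering its chromosome and start bin.
    buckets = defaultdict(list)
    for rec in records:
        chrom, start = rec[0], rec[1]
        buckets[(rank[chrom], start // bin_size)].append(rec)

    # Buckets cover disjoint, ordered genomic ranges, so each one can be
    # sorted independently (in parallel in a real tool) and then emitted
    # in bucket order to obtain a globally sorted stream.
    for key in sorted(buckets):
        for rec in sorted(buckets[key], key=lambda r: (r[1], r[2])):
            yield rec


def to_bed_line(rec):
    """Format a record as one tab-separated BED-style line."""
    return "\t".join(str(field) for field in rec)
```

Because no record in a later bucket can precede a record in an earlier one, concatenating the sorted buckets needs no final merge step, which is what makes this strategy easy to parallelize.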

output

The sort command outputs the following files:

  1. The sort.json file is a JSON file containing information about how the sort command was run.

  2. The map.bed file (or map.bed.gz if the --compress flag was passed) contains the output in BED format, which can be provided to a peak caller like MACS.
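For quick inspection of the output outside a peak caller, a reader along the following lines may be useful. It is a sketch under stated assumptions: the file is taken to be tab-separated with the three standard BED fields (chromosome, start, end) first, and any further columns are passed through unparsed because their layout is not specified here. The function name read_bed_intervals is hypothetical.

```python
import gzip


def read_bed_intervals(path):
    """Yield (chrom, start, end, extra_fields) from a BED or gzipped BED file.

    Assumes tab-separated lines whose first three columns are the standard
    BED fields; remaining columns are returned as a list of strings.
    """
    # Choose the opener from the file extension (.gz means gzip-compressed).
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            # Skip blank lines and optional BED header or comment lines.
            if not line or line.startswith(("#", "track", "browser")):
                continue
            fields = line.split("\t")
            yield fields[0], int(fields[1]), int(fields[2]), fields[3:]
```

The same function handles both map.bed and map.bed.gz, so downstream scripts do not need to know whether --compress was used.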