generate-permit-list

This command takes as input an alevin output directory containing a RAD file (created by running alevin with the --rad and/or --sketch flags), and it determines which cell barcodes should be associated with “true” cells, which should be corrected to some “true” barcode, and which should simply be ignored or discarded. This command has 4 required arguments: the path to an input directory (--input); the path to an output directory (--output-dir), which will be created if it doesn’t exist; the expected orientation of properly mapped reads (--expected-ori), where ‘fw’ filters out alignments to the reverse-complement strand, ‘rc’ filters out alignments to the forward strand, and ‘both’ or ‘either’ does not filter any alignments; and one of the following mutually exclusive options, which determines how the “true” barcodes are decided:

  • --knee-distance: This flag uses the distance method from the whitelist command of UMI-tools to attempt to automatically determine the number of true barcodes. Briefly, this method first counts the number of reads associated with each barcode, and then sorts the barcodes in descending order by their associated read count. It then constructs the cumulative distribution function (CDF) from this sorted list of frequencies. Finally, it applies an iterative algorithm that attempts to determine the optimal number of barcodes to include by looking for a “knee” or “elbow” in the CDF graph. The algorithm treats each barcode as a point on the CDF whose x-coordinate is the barcode’s rank divided by the total number of barcodes (i.e. its normalized rank) and whose y-coordinate is the (normalized) cumulative frequency achieved at this barcode. It then computes the distance of this point from the line x=y (defined by the start and end of the CDF). The initial knee is predicted as the point that has the maximum distance from the x=y line. The algorithm is iterative because, in experiments with many low-quality barcodes, a single pass may predict too many valid barcodes. Thus, the algorithm is run repeatedly, each time considering a prefix of the CDF from index 0 through 5 times the previous knee’s index. Once two consecutive iterations return the same knee point, the algorithm terminates.

  • --force-cells <ncells>: This option will count the number of reads associated with each barcode, and sort the barcodes in descending order of frequency. Then, it will consider the first <ncells> barcodes to be valid. Any barcode whose read count is greater than or equal to that of the <ncells>-th barcode will be considered part of the permit list; all others will not (but will be considered for correction to this permit list).

  • --valid-bc <bcfile>: This option will read the provided file <bcfile> and treat it as an explicitly-provided list of true, filtered barcodes (i.e. a list of barcodes believed to belong to a set of high-confidence cells truly present in the given sample). Barcodes appearing in this list will be considered to correspond to true and filtered cells, and other barcodes will be corrected to this list. This flag is not designed to perform unfiltered quantification (i.e. correcting to a list of all possible barcodes generated by a technology, e.g. the 10x v3 permit list). To correct against an unfiltered permit list, you should use the --unfiltered-pl flag described below (which is currently in beta).

  • --unfiltered-pl <plist>: This option accepts as an argument a list of possible barcodes for the sample. For example, this is the flag you should use if you wish to provide an “external permit list”, like the 10x v2 or 10x v3 permit lists. Unlike with the --valid-bc flag, the list passed to this argument is the set of all possible barcodes for the technology being processed, and it is likely that most of the barcodes in the file do not correspond to cells present in this particular sample. When using this argument, you may also pass the --min-reads argument to set the minimum frequency with which a barcode must be seen in order to be retained. The algorithm used here will pass over the input records (mapped reads) and count how many times each of the barcodes in the unfiltered permit list occurs exactly. Any barcode occurring >= min-reads times will be considered a present cell. Subsequently, all barcodes that did not match a present cell will be searched (at an edit distance of up to 1) against the barcodes determined to correspond to present cells. If an initially non-matching barcode has a unique neighbor among the barcodes for present cells, it will be corrected to that barcode, but if it has no 1-edit neighbor, or if it has 2 or more 1-edit neighbors among that list (i.e. its correction would be ambiguous), then the record is discarded.

  • --min-reads <threshold>: This flag is meant to be used (and is currently only applied) in conjunction with --unfiltered-pl. Any barcodes from the provided permit list that have >= <threshold> exact occurrences in the input file will be deemed present cells and will be passed on to subsequent phases of quantification. Barcodes occurring fewer than <threshold> times will be corrected against the set of present cells using the procedure described above.

  • --expect-cells <ncells>: This option uses the provided <ncells> as a hint, and tries to choose a robust cutoff around this value. The functionality of this option corresponds approximately to what you would get from passing the flag --soloCellFilter <ncells> 0.99 10 to STARsolo.
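
The iterative knee search described for --knee-distance can be illustrated with a small Python sketch. This is a toy model written from the description above, not the alevin-fry or UMI-tools implementation: the function names (knee_index, find_knee) and the max_iters safety cap are hypothetical, the input is assumed to be a plain list of per-barcode read counts, and the reference line is taken to be the chord joining the first and last CDF points.

```python
from itertools import accumulate


def knee_index(sorted_counts):
    """Index of the CDF point farthest from the chord joining its first and last points.

    `sorted_counts` must already be in descending order and sum to > 0.
    """
    n = len(sorted_counts)
    total = sum(sorted_counts)
    xs = [i / n for i in range(n)]                        # normalized rank
    ys = [c / total for c in accumulate(sorted_counts)]   # normalized cumulative frequency
    dx, dy = xs[-1] - xs[0], ys[-1] - ys[0]
    best_i, best_d = 0, -1.0
    for i in range(n):
        # Numerator of the point-to-line distance; the denominator is the same
        # for every point, so it does not affect which point is farthest.
        d = abs(dy * (xs[i] - xs[0]) - dx * (ys[i] - ys[0]))
        if d > best_d:
            best_i, best_d = i, d
    return best_i


def find_knee(counts, max_iters=100):
    """Repeat the knee search on a shrinking prefix until the knee stops moving.

    `max_iters` is a hypothetical safety cap, not part of the description above.
    """
    sorted_counts = sorted(counts, reverse=True)
    prev = knee_index(sorted_counts)
    for _ in range(max_iters):
        # Consider the prefix from index 0 through 5 * (previous knee index).
        prefix = sorted_counts[: prev * 5 + 1]
        cur = knee_index(prefix)
        if cur == prev:
            break
        prev = cur
    return prev


# Toy data: 10 high-count barcodes among 990 barcodes seen once each.
example_counts = [1] * 990 + [1000] * 10
example_knee = find_knee(example_counts)
```

In this sketch the returned value is a 0-based rank in the sorted list; treating ranks 0 through the knee as the retained barcodes is an assumption of the example.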

output

The generate-permit-list command outputs a number of different files in the output directory. Not all files are relevant to users of alevin-fry, but the files are described here.

  1. The file all_freq.bin is a binary file that records, for each distinct barcode in the input RAD file, the number of read records that were tagged with this barcode.

  2. The file permit_freq.bin is a binary file that lists, for each barcode in the input RAD file that is determined to be a true barcode, the number of read records associated with this barcode.

  3. The file permit_map.bin is a binary file (a serde serialized HashMap) that maps each barcode in the input RAD file that is within an edit distance of 1 of some true barcode to the true barcode to which it corrects. This allows the collate command to group together all of the read records corresponding to the same corrected barcode.

  4. The file generate_permit_list.json is a JSON file containing information about the run of the command (currently, just the expected orientation).
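
To make the role of the correction map concrete, here is a toy, in-memory Python sketch of how a map like the one stored in permit_map.bin could be built under the --unfiltered-pl rules described earlier, and then used to group records by corrected barcode. It does not read or write any alevin-fry file and is not the tool's implementation: all function names are hypothetical, barcodes are assumed to be fixed-length strings over ACGT, and "edit distance of up to 1" is simplified to a single substitution.

```python
from collections import Counter

ALPHABET = "ACGT"  # assumption: barcodes contain only these bases


def one_substitution_neighbors(barcode):
    """Yield every barcode that differs from `barcode` at exactly one position."""
    for i, base in enumerate(barcode):
        for alt in ALPHABET:
            if alt != base:
                yield barcode[:i] + alt + barcode[i + 1:]


def build_permit_map(observed, permit_list, min_reads):
    """Map each observed barcode to the present-cell barcode it corrects to.

    Barcodes with no present neighbor, or with more than one, are left out
    of the map, which models discarding their records.
    """
    permit = set(permit_list)
    exact = Counter(bc for bc in observed if bc in permit)
    present = {bc for bc, n in exact.items() if n >= min_reads}
    permit_map = {bc: bc for bc in present}  # present cells map to themselves
    for bc in set(observed) - present:
        hits = [nb for nb in one_substitution_neighbors(bc) if nb in present]
        if len(hits) == 1:  # a unique neighbor: the correction is unambiguous
            permit_map[bc] = hits[0]
    return permit_map


def count_by_corrected(observed, permit_map):
    """Count records per corrected barcode, skipping uncorrectable ones."""
    return dict(Counter(permit_map[bc] for bc in observed if bc in permit_map))


permit_list = ["AAAA", "AAAT", "CCCC", "GGGG"]
observed = ["AAAA"] * 3 + ["CCCC"] * 3 + ["AAAC", "ACCC", "GGGG", "TTTT"]
permit_map = build_permit_map(observed, permit_list, min_reads=2)
grouped = count_by_corrected(observed, permit_map)
```

Here "GGGG" is on the permit list but falls below min_reads, so it is treated like any other non-present barcode and dropped because it has no present neighbor.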