resolveLightChains - Define subgroups within clones based on light chain rearrangements
resolveLightChains resolves light chain V and J subgroups within a clone.
resolveLightChains( data, nproc = 1, minseq = 1, locus = "locus", heavy = "IGH", id = "sequence_id", seq = "sequence_alignment", clone = "clone_id", cell = "cell_id", v_call = "v_call", j_call = "j_call", junc_len = "junction_length", nolight = "missing" )
- data: a tibble containing heavy and light chain sequences with clone_id
- nproc: number of cores for parallelization
- minseq: minimum number of sequences per clone
- locus: name of column containing locus values
- heavy: value of heavy chains in locus column. All other values will be treated as light chains
- id: name of the column containing sequence identifiers.
- seq: name of the column containing observed DNA sequences. All sequences in this column must be multiple aligned.
- clone: name of the column containing the identifier for the clone. All entries in this column should be identical.
- cell: name of the column containing identifier for cells.
- v_call: name of the column containing V-segment allele assignments. All entries in this column should be identical to the gene level.
- j_call: name of the column containing J-segment allele assignments. All entries in this column should be identical to the gene level.
- junc_len: name of the column containing the length of the junction as a numeric value. All entries in this column should be identical for any given clone.
- nolight: string to use to indicate a missing light chain
a tibble containing the same data as the input, but with the column clone_subgroup added. This column identifies subgroups within clones that have distinct light chain V and J genes, with at most one light chain per cell.
1. Make temporary array containing light chain clones
2. Enumerate all possible V, J, and junction length combinations
3. Determine which combination is the most frequent
4. Assign sequences with that combination to clone t
5. Copy those sequences to return array
6. Remove all cells with that combination from temp array
7. Repeat 1-6 until the temporary array is empty.

If there is more than one rearrangement with the same V/J in the same cell, pick the one with the most non-ambiguous characters. Cells with missing light chains are grouped with the subgroup that has the closest matching heavy chain (by Hamming distance); if ties are present, the largest subgroup is chosen, then the subgroup with the lowest index.
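The greedy loop described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the helper name assign_subgroups is hypothetical, frequency is counted per sequence row, ties between equally frequent combinations fall to the first combination in sorted order, and a duplicate light chain in the same cell is dropped by position rather than by counting non-ambiguous characters. Cells without a light chain are not handled here.

```r
# Hypothetical helper (not part of the package): greedy subgroup assignment
# for the light chain rows of a single clone.
assign_subgroups <- function(light) {
  # One key per V / J / junction-length combination
  light$combo <- paste(light$v_call, light$j_call,
                       light$junction_length, sep = "|")
  temp <- light      # step 1: temporary table of light chains
  pieces <- list()   # collects the rows that will be returned
  t <- 1L            # current subgroup index
  while (nrow(temp) > 0) {
    counts <- table(temp$combo)                # step 2: enumerate combinations
    top <- names(counts)[which.max(counts)]    # step 3: most frequent one
    hit <- temp[temp$combo == top, , drop = FALSE]
    # Assumption: keep the first row per cell so each cell has one light chain
    hit <- hit[!duplicated(hit$cell_id), , drop = FALSE]
    hit$clone_subgroup <- t                    # step 4: assign to subgroup t
    pieces[[t]] <- hit                         # step 5: copy to the result
    # step 6: remove every row of the cells that were just assigned
    temp <- temp[!temp$cell_id %in% hit$cell_id, , drop = FALSE]
    t <- t + 1L                                # step 7: repeat
  }
  do.call(rbind, pieces)
}

# Toy input (made-up values): cell c3 carries two different light chains
light <- data.frame(
  cell_id = c("c1", "c2", "c3", "c3", "c4"),
  v_call = c("IGKV1-5", "IGKV1-5", "IGKV1-5", "IGLV2-14", "IGLV2-14"),
  j_call = c("IGKJ1", "IGKJ1", "IGKJ1", "IGLJ2", "IGLJ2"),
  junction_length = c(33, 33, 33, 36, 36),
  stringsAsFactors = FALSE
)
res <- assign_subgroups(light)
print(res[, c("cell_id", "combo", "clone_subgroup")])
```

In this toy input, the kappa combination is the most frequent, so cells c1, c2, and c3 form subgroup 1. Removing those cells also discards the second light chain of c3, leaving c4 to form subgroup 2.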
Outputs of the function are:
1. clone_subgroup, which identifies the light chain VJ rearrangement that the sequence belongs to within its clone.
2. clone_subgroup_id, which combines the clone_id variable and the clone_subgroup variable with a "_".
3. vj_cell, which combines the vj_gene and vj_alt_cell columns with a ",".
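As a rough illustration of how the two combined columns relate to their source columns, the following base R sketch builds them with paste. The values are made up, including the format of the vj_gene and vj_alt_cell entries, and the function's own handling of missing values is not shown.

```r
# Toy values only; real tables come from the function's output.
df <- data.frame(
  clone_id = c("10", "10", "27"),
  clone_subgroup = c(1, 2, 1),
  vj_gene = c("IGKV1-5:IGKJ1", "IGLV2-14:IGLJ2", "IGKV3-20:IGKJ2"),
  vj_alt_cell = c("IGLV2-14:IGLJ2", "IGKV1-5:IGKJ1", "IGLV1-44:IGLJ3"),
  stringsAsFactors = FALSE
)

# clone_id and clone_subgroup joined by "_"
df$clone_subgroup_id <- paste(df$clone_id, df$clone_subgroup, sep = "_")

# vj_gene and vj_alt_cell joined by ","
df$vj_cell <- paste(df$vj_gene, df$vj_alt_cell, sep = ",")

print(df[, c("clone_subgroup_id", "vj_cell")])
```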