resolveLightChains - Define subgroups within clones based on light chain rearrangements

Description

resolveLightChains resolves light chain V and J subgroups within a clone.

Usage

resolveLightChains(
data,
nproc = 1,
minseq = 1,
locus = "locus",
heavy = "IGH",
id = "sequence_id",
seq = "sequence_alignment",
clone = "clone_id",
cell = "cell_id",
v_call = "v_call",
j_call = "j_call",
junc_len = "junction_length",
nolight = "missing"
)

Arguments

data
a tibble containing heavy and light chain sequences with clone_id
nproc
number of cores for parallelization
minseq
minimum number of sequences per clone
locus
name of column containing locus values
heavy
value of heavy chains in locus column. All other values will be treated as light chains
id
name of the column containing sequence identifiers.
seq
name of the column containing observed DNA sequences. All sequences in this column must come from the same multiple sequence alignment.
clone
name of the column containing the identifier for the clone. All entries in this column should be identical.
cell
name of the column containing cell identifiers.
v_call
name of the column containing V-segment allele assignments. All entries in this column should be identical to the gene level.
j_call
name of the column containing J-segment allele assignments. All entries in this column should be identical to the gene level.
junc_len
name of the column containing the length of the junction as a numeric value. All entries in this column should be identical for any given clone.
nolight
string to use to indicate a missing light chain

Value

a tibble containing the same data as the input, but with the column clone_subgroup added. This column contains subgroups within clones that contain distinct light chain V and J genes, with at most one light chain per cell.

Details

  1. Make temporary array containing light chain clones
  2. Enumerate all possible V, J, and junction length combinations
  3. Determine which combination is the most frequent
  4. Assign sequences with that combination to clone t
  5. Copy those sequences to return array
  6. Remove all cells with that combination from temp array
  7. Repeat 1-6 until the temporary array is empty.

If there is more than one rearrangement with the same V/J in the same cell, pick the one with the most non-ambiguous characters. Cells with missing light chains are grouped with the subgroup whose heavy chain is the closest match (Hamming distance); ties are broken by the largest subgroup, then by the lowest subgroup index.
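
The numbered steps above can be illustrated with a minimal base R sketch. This is not the package's implementation: the toy data, the column names v_gene and j_gene, and the helper assign_subgroups are all hypothetical. The sketch also assumes that frequency is counted per sequence, that ties go to the first combination in alphabetical order, and that the first light chain listed for a cell is the one kept.

```r
# Hypothetical light chain records for a single clone (toy data).
light <- data.frame(
  cell_id = c("c1", "c2", "c3", "c3", "c4"),
  v_gene = c("IGKV1-5", "IGKV1-5", "IGKV1-5", "IGLV2-14", "IGLV2-14"),
  j_gene = c("IGKJ1", "IGKJ1", "IGKJ1", "IGLJ2", "IGLJ2"),
  junction_length = c(33, 33, 33, 36, 36),
  stringsAsFactors = FALSE
)

# Hypothetical helper: greedy assignment of light chain subgroups.
assign_subgroups <- function(light) {
  temp <- light                          # step 1: temporary copy
  result <- light[0, ]                   # empty return table
  result$clone_subgroup <- integer(0)
  subgroup <- 1L
  while (nrow(temp) > 0) {               # step 7: loop until temp is empty
    # step 2: label every V, J, junction length combination
    combo <- paste(temp$v_gene, temp$j_gene, temp$junction_length, sep = ":")
    # step 3: most frequent combination (assumption: ties go to the first name)
    counts <- table(combo)
    top <- names(counts)[which.max(counts)]
    # step 4: assign matching sequences to the current subgroup,
    # keeping one light chain per cell (assumption: the first one listed)
    hit <- temp[combo == top, ]
    hit <- hit[!duplicated(hit$cell_id), ]
    hit$clone_subgroup <- subgroup
    result <- rbind(result, hit)         # step 5: copy to the return table
    # step 6: drop every remaining row from the cells just assigned
    temp <- temp[!temp$cell_id %in% hit$cell_id, ]
    subgroup <- subgroup + 1L
  }
  result
}

subgroups <- assign_subgroups(light)
print(subgroups[, c("cell_id", "v_gene", "j_gene", "clone_subgroup")])
```

In this toy clone, cell c3 carries two light chains. Because the kappa rearrangement is the most frequent, c3 is placed in subgroup 1 and its lambda chain is discarded when the cell is removed from the temporary table, leaving c4 alone in subgroup 2.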

Outputs of the function are:

  1. clone_subgroup, which identifies the light chain VJ rearrangement that a sequence belongs to within its clone.
  2. clone_subgroup_id, which combines the clone_id variable and the clone_subgroup variable with a “_”.
  3. vj_cell, which combines the vj_gene and vj_alt_cell columns with a “,”.
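
As a rough illustration of how the combined columns are composed, the following base R sketch builds clone_subgroup_id and vj_cell by hand. All values are made up for illustration, and the package may handle special cases (such as missing vj_alt_cell entries) differently from this plain paste.

```r
# Hypothetical values; only the way the columns are joined is illustrated.
clones <- data.frame(
  clone_id = c("12", "12", "40"),
  clone_subgroup = c(1, 2, 1),
  vj_gene = c("IGKV1-5:IGKJ1", "IGLV2-14:IGLJ2", "IGKV3-20:IGKJ4"),
  vj_alt_cell = c("IGLV2-14:IGLJ2", "IGKV1-5:IGKJ1", "IGLV1-44:IGLJ3"),
  stringsAsFactors = FALSE
)

# clone_id and clone_subgroup joined by "_"
clones$clone_subgroup_id <- paste(clones$clone_id, clones$clone_subgroup, sep = "_")

# vj_gene and vj_alt_cell joined by ","
clones$vj_cell <- paste(clones$vj_gene, clones$vj_alt_cell, sep = ",")

print(clones$clone_subgroup_id)
# "12_1" "12_2" "40_1"
```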