DUMfound was developed as a technique for leveraging an annotated cell type atlas (generated with LIRIpipe) to complete cell type deconvolution for quantified ISH data. In this walkthrough, we will dissect the deconvolve_spatial function, the centerpiece of the pipeline.

Method overview

Loading data

DUMfound loads several objects from a fully annotated atlas directory or from user-specified locations.

Cell type annotations

By default, the method loads annotations from a pre-generated reference file, specifically “/nfs/turbo/umms-welchjd/BRAIN_initiative/BICCN_integration_Analyses/Base_Reference_Files/Reference_Annotations.RDS”. The annotation.level parameter can be used to specify if the annotations should be for neuronal or non-neuronal (1), the general type (2) or the specific type (3). However, by passing either the location of another file containing annotations, or a named factor vector to the known.annotations parameter, an alternative source can be selected.Alternatively, if known.annotations = NULL, then high resolution annotations are pulled from the results files for analyses 2, 4, and 5.

Expression data

DUMfound loads expression data from all provided RNA datasets, both single cell and single nucleus.

Spatial data

DUMfound loads spatial data corresponding to the spatial.data.name slot, meaning it must be loaded into the altas with the save_spatial_data function in advance.

Preprocessing

Sampling

To ensure that the method performs similarly on common and rare cell types, single cell data is downsampled, such that the maximum number of cells included from any one annotated type is n.cells.

Expression data

Sparse expression data is directly loaded from H5 files within the analysis 1 directory. The matrices are immediately subset to the cells selected during sampling, the genes shared with the spatial data, and are normalized.

Gene selection

To select genes, a Kruskal Wallis one-way nonparametric ANOVA is used. This method allows for selection of the genes that vary most between cell types, the most informative to utilize in deconvolution.. It is applied for each gene for each dataset across cell types, returning a chi-square statistic. To determine what genes are used for the deconvolution, a binary search technique is used, determining the set of genes above a given quantile threshold for all single cell datasets. The deconv.gene.num parameter is the target number of genes selected, with gene.num.tol being the maximum difference between the number of genes requested and ultimately selected. From current understanding, gene.num.tol < 10 is feasible without requiring excessive computational effort.

After this step, a file, gene_selection_output.RDS, is saved to the deconvolution directory, containing both the chi-square values calculated and the genes selected.

ANLS gene signature calculation

Gene signature calculation is completed with expression data corresponding to the sampled set of cells and the selected genes.

First, one-hot matrices of cell type assignment are constructed for each expression dataset. For dataset \(i\) of \(k\) datasets, the corresponding \(H_{indic,i}\) will be \(n_i\) by \(m\), where \(n_i\) is the number of cells sampled from that dataset and \(m\) is the total number of cell types. \(W\) and \(V_i\) are randomly initialized, such that they are all \(m\) by \(p\) matrices, where \(p\) is the number of genes selected.

We minimize the following objective.

\[\arg \min_{W\geq 0, V\geq 0} \sum_{i=1}^k\begin{Vmatrix} E_i-H_{indic,i}(W+V_i) \end{Vmatrix}^2_F + \lambda \sum_{i=1}^k\begin{Vmatrix} H_{indic,i}V_i \end{Vmatrix}^2_F\]

From this point, \(W\) and \(V_i\) are iteratively calculated, such that

\[W = \arg \min_{W \geq 0}\begin{Vmatrix} \begin{pmatrix} H_{indic,1}\\ \vdots \\ H_{indic,k} \end{pmatrix} W - \begin{pmatrix} E_1 -H_{indic,1}V_1\\ \vdots \\ E_k -H_{indic,k}V_k \end{pmatrix} \end{Vmatrix}^2_F\]

and

\[V_i = \arg \min_{V_i \geq 0}\begin{Vmatrix} \begin{pmatrix} H_{indic,i}\\ \sqrt{\lambda}H_{indic,i} \end{pmatrix}V_i - \begin{pmatrix} E_i -H_{indic,i}W\\ 0_{n_i\times m} \end{pmatrix} \end{Vmatrix}^2_F\]

lambda, the regularization factor, should be kept relatively small, with the default of 1 providing good results in the vast majority of cases. The alternating least squares algorithm completes updates while the difference between objective function evaluations between iterations is less than thresh and the number of iterations completed is less than max.iters

At completion, new \(H_i\) are calculated, such that

\[ H_i = \arg \min_{H_i \geq 0}\begin{Vmatrix} \begin{pmatrix} W^T+V_i^T\\ \sqrt{\lambda}V_i^T \end{pmatrix}H_{indic,i}^T - \begin{pmatrix} E_i^T1\\ 0_{n_i\times m} \end{pmatrix} \end{Vmatrix}^2_F \]

I have found that utilization of the calculated \(H_i\) may be useful if substituted for the \(H_i\) calculated with iNMF in cases where cell types are difficult to separate, but I have not developed this application sufficiently.

After this step, a file, gene_signature_output.RDS, is saved to the deconvolution directory, containing the calculated \(W\), \(V\), \(H\), and one hot matrices.

Deconvolution

After learning cell type signatures, the spatial data is subset to the selected genes, transposed, scaled not centered for each gene, and transposed again. After imposing nonnegativity on the transformed data, \(H_{deconv}\) is found by NMF regression, such that

\[H_{deconv} = \arg \min_{H \geq 0}\begin{Vmatrix} E_{spatial}-HW \end{Vmatrix})\]

A second matrix, \(H_{prop}\), is calculated by scaling the rows of \(H_{deconv}\) such that the rows sum to 1.

After this step, a file, deconvolution_output.RDS, is saved to the deconvolution directory, containing the calculated \(H_{deconv}\) and \(H_{prop}\)

Using deconvolve_spatial

The deconvolve_spatial function takes a large number of parameters. Here, they are listed and overviewed.