DUMfound was developed as a technique for leveraging an annotated cell type atlas (generated with LIRIpipe) to complete cell type deconvolution for quantified ISH data. In this walkthrough, we will dissect the deconvolve_spatial function, the centerpiece of the pipeline.

Method overview

Loading data

DUMfound loads several objects from a fully annotated atlas directory or from user-specified locations.

Cell type annotations

By default, the method loads annotations from a pre-generated reference file, specifically “/nfs/turbo/umms-welchjd/BRAIN_initiative/BICCN_integration_Analyses/Base_Reference_Files/Reference_Annotations.RDS”. The annotation.level parameter can be used to specify if the annotations should be for neuronal or non-neuronal (1), the general type (2) or the specific type (3). However, by passing either the location of another file containing annotations, or a named factor vector to the known.annotations parameter, an alternative source can be selected.Alternatively, if known.annotations = NULL, then high resolution annotations are pulled from the results files for analyses 2, 4, and 5.

Expression data

DUMfound loads expression data from all provided RNA datasets, both single cell and single nucleus.

Spatial data

DUMfound loads spatial data corresponding to the spatial.data.name slot, meaning it must be loaded into the altas with the save_spatial_data function in advance.

Preprocessing

Sampling

To ensure that the method performs similarly on common and rare cell types, single cell data is downsampled, such that the maximum number of cells included from any one annotated type is n.cells.

Expression data

Sparse expression data is directly loaded from H5 files within the analysis 1 directory. The matrices are immediately subset to the cells selected during sampling, the genes shared with the spatial data, and are normalized.

Gene selection

To select genes, a Kruskal Wallis one-way nonparametric ANOVA is used. This method allows for selection of the genes that vary most between cell types, the most informative to utilize in deconvolution.. It is applied for each gene for each dataset across cell types, returning a chi-square statistic. To determine what genes are used for the deconvolution, a binary search technique is used, determining the set of genes above a given quantile threshold for all single cell datasets. The deconv.gene.num parameter is the target number of genes selected, with gene.num.tol being the maximum difference between the number of genes requested and ultimately selected. From current understanding, gene.num.tol < 10 is feasible without requiring excessive computational effort.

After this step, a file, gene_selection_output.RDS, is saved to the deconvolution directory, containing both the chi-square values calculated and the genes selected.

ANLS gene signature calculation

Gene signature calculation is completed with expression data corresponding to the sampled set of cells and the selected genes.

First, one-hot matrices of cell type assignment are constructed for each expression dataset. For dataset \(i\) of \(k\) datasets, the corresponding \(H_{indic,i}\) will be \(n_i\) by \(m\), where \(n_i\) is the number of cells sampled from that dataset and \(m\) is the total number of cell types. \(W\) and \(V_i\) are randomly initialized, such that they are all \(m\) by \(p\) matrices, where \(p\) is the number of genes selected.

We minimize the following objective.

\[\arg \min_{W\geq 0, V\geq 0} \sum_{i=1}^k\begin{Vmatrix} E_i-H_{indic,i}(W+V_i) \end{Vmatrix}^2_F + \lambda \sum_{i=1}^k\begin{Vmatrix} H_{indic,i}V_i \end{Vmatrix}^2_F\]

From this point, \(W\) and \(V_i\) are iteratively calculated, such that

\[W = \arg \min_{W \geq 0}\begin{Vmatrix} \begin{pmatrix} H_{indic,1}\\ \vdots \\ H_{indic,k} \end{pmatrix} W - \begin{pmatrix} E_1 -H_{indic,1}V_1\\ \vdots \\ E_k -H_{indic,k}V_k \end{pmatrix} \end{Vmatrix}^2_F\]

and

\[V_i = \arg \min_{V_i \geq 0}\begin{Vmatrix} \begin{pmatrix} H_{indic,i}\\ \sqrt{\lambda}H_{indic,i} \end{pmatrix}V_i - \begin{pmatrix} E_i -H_{indic,i}W\\ 0_{n_i\times m} \end{pmatrix} \end{Vmatrix}^2_F\]

lambda, the regularization factor, should be kept relatively small, with the default of 1 providing good results in the vast majority of cases. The alternating least squares algorithm completes updates while the difference between objective function evaluations between iterations is less than thresh and the number of iterations completed is less than max.iters

At completion, new \(H_i\) are calculated, such that

\[ H_i = \arg \min_{H_i \geq 0}\begin{Vmatrix} \begin{pmatrix} W^T+V_i^T\\ \sqrt{\lambda}V_i^T \end{pmatrix}H_{indic,i}^T - \begin{pmatrix} E_i^T1\\ 0_{n_i\times m} \end{pmatrix} \end{Vmatrix}^2_F \]

I have found that utilization of the calculated \(H_i\) may be useful if substituted for the \(H_i\) calculated with iNMF in cases where cell types are difficult to separate, but I have not developed this application sufficiently.

After this step, a file, gene_signature_output.RDS, is saved to the deconvolution directory, containing the calculated \(W\), \(V\), \(H\), and one hot matrices.

Deconvolution

After learning cell type signatures, the spatial data is subset to the selected genes, transposed, scaled not centered for each gene, and transposed again. After imposing nonnegativity on the transformed data, \(H_{deconv}\) is found by NMF regression, such that

\[H_{deconv} = \arg \min_{H \geq 0}\begin{Vmatrix} E_{spatial}-HW \end{Vmatrix})\]

A second matrix, \(H_{prop}\), is calculated by scaling the rows of \(H_{deconv}\) such that the rows sum to 1.

After this step, a file, deconvolution_output.RDS, is saved to the deconvolution directory, containing the calculated \(H_{deconv}\) and \(H_{prop}\)

Using `deconvolve_spatial`

The deconvolve_spatial function takes a large number of parameters. Here, they are listed and overviewed.

Data loading
- filepath - Path to directory within which the atlas structure was generated
- region - The anatomical region with which the analysis is associated
- spatial.data.name - A string, the name of the spatial dataset
- annotation.level A value, (1,2,3), corresponding to granularity of known resolution to use in the event that reference annotations are utilized. 3 by default
- known.annotations An object corresponding to cluster labels for single cells or a file path to a spreadsheet containing multiple levels of provided annotations
Preprocessing
- slide.seq - A logical, if the spatial data was collected with slide-seq overriding removal of genes from the spatial data. FALSE by default
- z - The number of standard deviations above the mean number of missing values at which to remove a gene from the spatial data. 1 by default
- n.umi.thresh - An integer, a minimum number of counts on selected spatial samples for selected genes. 150 by default
- n.cells - The maximum number of cells to select from each dataset for each cluster for use in gene selection and signature calculation. 500 by default
Gene selection
- deconv.gene.num - The number of genes to utilize for gene signature calculation. 2000 by default.
- gene.num.tol - The maximum difference between deconv.gene.num and the true number of genes used. 50 by default
Gene signature calculation
- lambda - Regularization parameter. Larger values penalize dataset-specific effects more strongly (ie. alignment should increase as lambda increases). 1 by default
- thresh - Convergence threshold. Convergence occurs when |obj0-obj|/(mean(obj0,obj)) < thresh. 1e-8 by default.
- max.iters - Maximum number of block coordinate descent iterations to perform. 100 by default
- nrep - Number of restarts to perform. 1 by default
- rand.seed Random seed to allow reproducible results. 1 by default
- print.obj - Print objective function values after convergence. FALSE by default
- verbose Print progress bar/messages. TRUE by default

Cell Type Deconvolution with DUMfound

Joshua Sodicoff

Method overview

Loading data

Cell type annotations

Expression data

Spatial data

Preprocessing

Sampling

Expression data

Gene selection

ANLS gene signature calculation

Deconvolution

Using `deconvolve_spatial`

Cell Type Deconvolution with DUMfound

Joshua Sodicoff

Method overview

Loading data

Cell type annotations

Expression data

Spatial data

Preprocessing

Sampling

Expression data

Gene selection

ANLS gene signature calculation

Deconvolution

Using deconvolve_spatial

Using `deconvolve_spatial`