DUMfound was developed as a technique for leveraging an annotated cell type atlas (generated with LIRIpipe) to complete cell type deconvolution for quantified ISH data. In this walkthrough, we will dissect the deconvolve_spatial function, the centerpiece of the pipeline.
DUMfound loads several objects from a fully annotated atlas directory or from user-specified locations.
By default, the method loads annotations from a pre-generated reference file, specifically “/nfs/turbo/umms-welchjd/BRAIN_initiative/BICCN_integration_Analyses/Base_Reference_Files/Reference_Annotations.RDS”. The annotation.level parameter can be used to specify if the annotations should be for neuronal or non-neuronal (1), the general type (2) or the specific type (3). However, by passing either the location of another file containing annotations, or a named factor vector to the known.annotations parameter, an alternative source can be selected.Alternatively, if known.annotations = NULL, then high resolution annotations are pulled from the results files for analyses 2, 4, and 5.
DUMfound loads expression data from all provided RNA datasets, both single cell and single nucleus.
DUMfound loads spatial data corresponding to the spatial.data.name slot, meaning it must be loaded into the altas with the save_spatial_data function in advance.
To ensure that the method performs similarly on common and rare cell types, single cell data is downsampled, such that the maximum number of cells included from any one annotated type is n.cells.
Sparse expression data is directly loaded from H5 files within the analysis 1 directory. The matrices are immediately subset to the cells selected during sampling, the genes shared with the spatial data, and are normalized.
To select genes, a Kruskal Wallis one-way nonparametric ANOVA is used. This method allows for selection of the genes that vary most between cell types, the most informative to utilize in deconvolution.. It is applied for each gene for each dataset across cell types, returning a chi-square statistic. To determine what genes are used for the deconvolution, a binary search technique is used, determining the set of genes above a given quantile threshold for all single cell datasets. The deconv.gene.num parameter is the target number of genes selected, with gene.num.tol being the maximum difference between the number of genes requested and ultimately selected. From current understanding, gene.num.tol < 10 is feasible without requiring excessive computational effort.
After this step, a file, gene_selection_output.RDS, is saved to the deconvolution directory, containing both the chi-square values calculated and the genes selected.
Gene signature calculation is completed with expression data corresponding to the sampled set of cells and the selected genes.
First, one-hot matrices of cell type assignment are constructed for each expression dataset. For dataset \(i\) of \(k\) datasets, the corresponding \(H_{indic,i}\) will be \(n_i\) by \(m\), where \(n_i\) is the number of cells sampled from that dataset and \(m\) is the total number of cell types. \(W\) and \(V_i\) are randomly initialized, such that they are all \(m\) by \(p\) matrices, where \(p\) is the number of genes selected.
We minimize the following objective.
\[\arg \min_{W\geq 0, V\geq 0} \sum_{i=1}^k\begin{Vmatrix} E_i-H_{indic,i}(W+V_i) \end{Vmatrix}^2_F + \lambda \sum_{i=1}^k\begin{Vmatrix} H_{indic,i}V_i \end{Vmatrix}^2_F\]
From this point, \(W\) and \(V_i\) are iteratively calculated, such that
\[W = \arg \min_{W \geq 0}\begin{Vmatrix} \begin{pmatrix} H_{indic,1}\\ \vdots \\ H_{indic,k} \end{pmatrix} W - \begin{pmatrix} E_1 -H_{indic,1}V_1\\ \vdots \\ E_k -H_{indic,k}V_k \end{pmatrix} \end{Vmatrix}^2_F\]
and
\[V_i = \arg \min_{V_i \geq 0}\begin{Vmatrix} \begin{pmatrix} H_{indic,i}\\ \sqrt{\lambda}H_{indic,i} \end{pmatrix}V_i - \begin{pmatrix} E_i -H_{indic,i}W\\ 0_{n_i\times m} \end{pmatrix} \end{Vmatrix}^2_F\]
lambda, the regularization factor, should be kept relatively small, with the default of 1 providing good results in the vast majority of cases. The alternating least squares algorithm completes updates while the difference between objective function evaluations between iterations is less than thresh and the number of iterations completed is less than max.iters
At completion, new \(H_i\) are calculated, such that
\[ H_i = \arg \min_{H_i \geq 0}\begin{Vmatrix} \begin{pmatrix} W^T+V_i^T\\ \sqrt{\lambda}V_i^T \end{pmatrix}H_{indic,i}^T - \begin{pmatrix} E_i^T1\\ 0_{n_i\times m} \end{pmatrix} \end{Vmatrix}^2_F \]
I have found that utilization of the calculated \(H_i\) may be useful if substituted for the \(H_i\) calculated with iNMF in cases where cell types are difficult to separate, but I have not developed this application sufficiently.
After this step, a file, gene_signature_output.RDS, is saved to the deconvolution directory, containing the calculated \(W\), \(V\), \(H\), and one hot matrices.
After learning cell type signatures, the spatial data is subset to the selected genes, transposed, scaled not centered for each gene, and transposed again. After imposing nonnegativity on the transformed data, \(H_{deconv}\) is found by NMF regression, such that
\[H_{deconv} = \arg \min_{H \geq 0}\begin{Vmatrix} E_{spatial}-HW \end{Vmatrix})\]
A second matrix, \(H_{prop}\), is calculated by scaling the rows of \(H_{deconv}\) such that the rows sum to 1.
After this step, a file, deconvolution_output.RDS, is saved to the deconvolution directory, containing the calculated \(H_{deconv}\) and \(H_{prop}\)
deconvolve_spatialThe deconvolve_spatial function takes a large number of parameters. Here, they are listed and overviewed.
filepath - Path to directory within which the atlas structure was generatedregion - The anatomical region with which the analysis is associatedspatial.data.name - A string, the name of the spatial datasetannotation.level A value, (1,2,3), corresponding to granularity of known resolution to use in the event that reference annotations are utilized. 3 by defaultknown.annotations An object corresponding to cluster labels for single cells or a file path to a spreadsheet containing multiple levels of provided annotationsslide.seq - A logical, if the spatial data was collected with slide-seq overriding removal of genes from the spatial data. FALSE by defaultz - The number of standard deviations above the mean number of missing values at which to remove a gene from the spatial data. 1 by defaultn.umi.thresh - An integer, a minimum number of counts on selected spatial samples for selected genes. 150 by defaultn.cells - The maximum number of cells to select from each dataset for each cluster for use in gene selection and signature calculation. 500 by defaultdeconv.gene.num - The number of genes to utilize for gene signature calculation. 2000 by default.gene.num.tol - The maximum difference between deconv.gene.num and the true number of genes used. 50 by defaultlambda - Regularization parameter. Larger values penalize dataset-specific effects more strongly (ie. alignment should increase as lambda increases). 1 by defaultthresh - Convergence threshold. Convergence occurs when |obj0-obj|/(mean(obj0,obj)) < thresh. 1e-8 by default.max.iters - Maximum number of block coordinate descent iterations to perform. 100 by defaultnrep - Number of restarts to perform. 1 by defaultrand.seed Random seed to allow reproducible results. 1 by defaultprint.obj - Print objective function values after convergence. FALSE by defaultverbose Print progress bar/messages. TRUE by default