Data Format¶
Input Format¶
The primary input format for BayesTME is AnnData.
The input AnnData object is expected to have the following fields:
- adata.X - N spots x N markers integer matrix representing read counts.
- adata.obsm['spatial'] - spatial coordinates of the reads in the tissue slide.
- adata.obs['in_tissue'] - boolean array indicating whether a read comes from a tissue or non-tissue spot.
- adata.uns['layout'] - either SQUARE, HEX or IRREGULAR, corresponding to the spot layout geometry.
- adata.obsp['connectivities'] - sparse boolean matrix indicating whether two spots neighbor each other or not.
This AnnData scheme is designed to be compatible with the scheme used by scanpy.
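As a rough illustration of the layout described above, the following sketch checks an object for the listed fields. The function name check_bayestme_input is hypothetical and not part of BayesTME. The toy object is a SimpleNamespace stand-in with dense NumPy arrays so the example runs without anndata installed, and the checks only read attributes (X, obsm, obs, uns, obsp) that a real AnnData object also exposes.

```python
from types import SimpleNamespace

import numpy as np

VALID_LAYOUTS = {"SQUARE", "HEX", "IRREGULAR"}


def check_bayestme_input(adata):
    """Return a list of problems; an empty list means the fields look as documented.

    Hypothetical helper for illustration only, not part of the BayesTME API.
    """
    problems = []
    n_spots = adata.X.shape[0]

    if not np.issubdtype(adata.X.dtype, np.integer):
        problems.append("X should hold integer read counts")

    if "spatial" not in adata.obsm:
        problems.append("obsm['spatial'] is missing")
    elif adata.obsm["spatial"].shape[0] != n_spots:
        problems.append("obsm['spatial'] needs one row per spot")

    if "in_tissue" not in adata.obs:
        problems.append("obs['in_tissue'] is missing")
    elif np.asarray(adata.obs["in_tissue"]).dtype != bool:
        problems.append("obs['in_tissue'] should be boolean")

    if adata.uns.get("layout") not in VALID_LAYOUTS:
        problems.append("uns['layout'] must be SQUARE, HEX or IRREGULAR")

    if "connectivities" not in adata.obsp:
        problems.append("obsp['connectivities'] is missing")
    elif adata.obsp["connectivities"].shape != (n_spots, n_spots):
        problems.append("obsp['connectivities'] must be N spots x N spots")

    return problems


# Toy 2x2 square grid: 4 spots, 3 markers (all values are made up).
rng = np.random.default_rng(0)
toy = SimpleNamespace(
    X=rng.poisson(5, size=(4, 3)),
    obsm={"spatial": np.array([[0, 0], [0, 1], [1, 0], [1, 1]])},
    obs={"in_tissue": np.array([True, True, False, True])},
    uns={"layout": "SQUARE"},
    # Dense stand-in for the sparse boolean neighbor matrix.
    obsp={
        "connectivities": np.array(
            [[0, 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]],
            dtype=bool,
        )
    },
)

problems = check_bayestme_input(toy)
print(problems)
```

With a real dataset, the same function could be called on the AnnData object directly, assuming its fields follow the scheme above.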
We have provided a helper method, bayestme.data.SpatialExpressionDataset.read_spaceranger(),
for creating the above AnnData object from raw spaceranger output.
Output Format¶
There are four high-level data classes that represent the outputs of the four steps:
- bayestme.data.BleedCorrectionResult - this is the output of bleeding correction.
- bayestme.data.PhenotypeSelectionResult - this is the output of one phenotype selection job in the larger grid search.
- bayestme.data.DeconvolutionResult - this is the output of sampling the deconvolution posterior distribution.
- bayestme.data.SpatialDifferentialExpressionResult - this is the output of sampling the spatial differential expression posterior distribution.
These are all saved as HDF5 archives on disk, a format also used by AnnData.
bleeding_correction produces two outputs. One is a modified version of the input
bayestme.data.SpatialExpressionDataset object with the read counts adjusted;
the other is the bayestme.data.BleedCorrectionResult object, which contains the basis functions
and the global and local weights.
phenotype_selection will produce one bayestme.data.PhenotypeSelectionResult object
for each node in the grid search. Having one output per parameter set enables parallelization.
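To make the "one output per node" idea concrete, here is a minimal sketch of enumerating a parameter grid so that each node maps to its own job index and output file. The parameter names (n_components, lam, fold), their values, and the file naming pattern are all hypothetical and chosen only for illustration; they are not taken from the BayesTME command line.

```python
from itertools import product
from pathlib import Path

# Hypothetical grid; names and values are illustrative assumptions.
grid = {
    "n_components": [2, 3, 4],
    "lam": [1, 10, 100],
    "fold": [0, 1],
}


def grid_jobs(grid, output_dir):
    """Yield (job_index, params, output_path) for every node in the grid."""
    names = list(grid)
    combos = product(*(grid[name] for name in names))
    for job_index, values in enumerate(combos):
        params = dict(zip(names, values))
        # Hypothetical naming scheme: one result file per grid node.
        output_path = Path(output_dir) / f"phenotype_selection_{job_index}.h5"
        yield job_index, params, output_path


jobs = list(grid_jobs(grid, "results"))
print(len(jobs))   # 3 * 3 * 2 = 18 independent jobs
print(jobs[0])
```

Because every job writes to a distinct path and depends only on its own parameters, the jobs can be dispatched to separate processes or cluster nodes in any order.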
deconvolve will produce a bayestme.data.DeconvolutionResult object in h5 format,
which represents all the raw samples from the posterior distribution.
This object can be quite large (>10GB) because it contains very high-dimensional arrays of floating-point numbers.
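A back-of-the-envelope calculation shows how such a size can arise. The sketch below assumes, purely for illustration, that the largest stored array has one 8-byte float per posterior sample, spot, cell type, and gene; the actual array shapes and dtypes in DeconvolutionResult may differ, and the dimensions used are made up.

```python
def posterior_size_gb(n_samples, n_spots, n_cell_types, n_genes, bytes_per_value=8):
    """Estimate the size in (decimal) gigabytes of one dense posterior array.

    Assumed shape: n_samples x n_spots x n_cell_types x n_genes.
    """
    n_values = n_samples * n_spots * n_cell_types * n_genes
    return n_values * bytes_per_value / 1e9


# Hypothetical dimensions: 100 samples, 3000 spots, 5 cell types, 1000 genes.
size_gb = posterior_size_gb(100, 3000, 5, 1000)
print(f"{size_gb:.1f} GB")  # 12.0 GB
```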
deconvolve will also modify the AnnData archive to add meaningful summary statistics
from these posterior samples which are used in data visualization and analysis.
The AnnData fields added by this step are as follows:
Deconvolve AnnData Fields¶
- adata.uns['bayestme_n_cell_types'] - integer, number of cell types
- adata.varm['bayestme_cell_type_counts'] - <N marker> x <N cell type> matrix with the average posterior count of each cell type in each spot
- adata.varm['bayestme_cell_type_probabilities'] - <N marker> x <N cell type> matrix with the cell type probability of each cell type in each spot
select_marker_genes will modify the AnnData archive to add indicators of which genes are marker genes for each cell type, and their order of significance.
The AnnData fields added by this step are as follows:
Marker Gene AnnData Fields¶
- adata.varm['bayestme_cell_type_marker'] - <N marker> x <N cell type> integer matrix. Set to -1 if the gene is not a marker gene for the cell type, otherwise set to monotonically increasing 0-indexed integers indicating marker gene significance.
- adata.varm['bayestme_omega_difference'] - <N marker> x <N cell type> floating point matrix. This statistic represents the “overexpression” of a gene in a cell type, and is used for scaling the dot size in our marker gene plot.
- adata.varm['bayestme_omega'] - <N marker> x <N cell type> floating point matrix. ω_kg from equation 6 of the preprint.
- adata.varm['bayestme_relative_expression'] - <N marker> x <N cell type> floating point matrix. Average expression in this cell type, minus the max expression in all other cell types, divided by the maximum expression in all cell types. A higher number for this statistic represents a better candidate marker gene. This statistic is used as a tiebreaker criterion for marker gene selection when ω_kg values are equal.
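The encoding of the marker matrix can be decoded with a few lines of NumPy. The sketch below is illustrative only: ranked_markers is a hypothetical helper, the toy matrix and gene names are made up, and it assumes that rank 0 denotes the most significant marker gene for a cell type.

```python
import numpy as np


def ranked_markers(marker_matrix, gene_names, cell_type):
    """List the marker genes of one cell type, ordered by their stored rank.

    marker_matrix: <N marker> x <N cell type> integers, -1 meaning "not a marker".
    Assumption: a lower non-negative value means a more significant marker.
    """
    column = np.asarray(marker_matrix)[:, cell_type]
    marker_rows = np.flatnonzero(column >= 0)              # drop the -1 entries
    order = marker_rows[np.argsort(column[marker_rows])]   # sort by rank
    return [gene_names[i] for i in order]


# Toy matrix: 4 genes x 2 cell types (values are made up).
toy_markers = np.array(
    [
        [1, -1],
        [-1, 0],
        [0, -1],
        [-1, 1],
    ]
)
genes = ["gene_a", "gene_b", "gene_c", "gene_d"]

print(ranked_markers(toy_markers, genes, 0))  # ['gene_c', 'gene_a']
print(ranked_markers(toy_markers, genes, 1))  # ['gene_b', 'gene_d']
```

On a real dataset the inputs would presumably be adata.varm['bayestme_cell_type_marker'] and adata.var_names, assuming the fields follow the description above.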
spatial_transcriptional_programs will produce a bayestme.data.SpatialDifferentialExpressionResult
object in h5 format which contains estimates of the model parameters.