Data Format

Input Format

The primary input format for BayesTME is AnnData.

The input AnnData object is expected to have the following fields:

  • adata.X - N spots x N markers integer matrix representing read counts.

  • adata.obsm['spatial'] - spatial coordinates of the reads in the tissue slide.

  • adata.obs['in_tissue'] - boolean array indicating whether a read comes from a tissue or non-tissue spot.

  • adata.uns['layout'] - either SQUARE or HEX, corresponding to the probe layout geometry.

  • adata.obsp['connectivities'] - sparse boolean matrix indicating whether two observations are neighbors in the probe grid.
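As an illustration of the fields listed above, the following sketch builds toy arrays for a 2 x 2 SQUARE layout using only NumPy. The helper square_grid_connectivities is hypothetical and not part of BayesTME. It assumes that neighbors are spots exactly one grid step apart along a single axis, and that the layout is stored as a plain string. The adjacency matrix is dense here for readability; before assigning it to adata.obsp['connectivities'] it would typically be wrapped in a sparse type such as scipy.sparse.csr_matrix.

```python
import numpy as np


def square_grid_connectivities(coords):
    """Return a dense boolean adjacency matrix for spots on a square grid.

    Assumption: two spots are neighbors when they are exactly one grid
    step apart along a single axis (4-neighborhood).
    """
    coords = np.asarray(coords)
    # Pairwise Manhattan distance between all spot coordinates
    dist = np.abs(coords[:, None, :] - coords[None, :, :]).sum(axis=-1)
    return dist == 1


# Toy data: 4 spots on a 2 x 2 grid, 3 markers.
# The comment on each line names the AnnData slot the array is meant for.
counts = np.array([[5, 0, 2],
                   [1, 3, 0],
                   [0, 0, 7],
                   [2, 2, 2]])                            # adata.X
spatial = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])      # adata.obsm['spatial']
in_tissue = np.array([True, True, True, False])           # adata.obs['in_tissue']
layout = "SQUARE"                                         # adata.uns['layout']
connectivities = square_grid_connectivities(spatial)      # adata.obsp['connectivities']
```

The first dimension of every per-spot array (counts, spatial, in_tissue, connectivities) must match, since AnnData indexes all of them by observation.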

This AnnData scheme is designed to be compatible with the scheme used by scanpy.

We have provided a helper method, bayestme.data.SpatialExpressionDataset.read_spaceranger(), for creating the above AnnData object from raw spaceranger output.

Output Format

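There are 4 high-level data classes that represent the outputs of the 4 steps:

  • bayestme.data.BleedCorrectionResult

  • bayestme.data.PhenotypeSelectionResult

  • bayestme.data.DeconvolutionResult

  • bayestme.data.SpatialDifferentialExpressionResult

These are all saved on disk as HDF5 archives, the same format used by AnnData.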

bleeding_correction produces two outputs. The first is a modified version of the input bayestme.data.SpatialExpressionDataset object with the read counts adjusted. The second is a bayestme.data.BleedCorrectionResult object, which contains the basis functions and the global and local weights.

phenotype_selection will produce one bayestme.data.PhenotypeSelectionResult object for each node in the grid search. Having one output per parameter set enables parallelization.
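To illustrate why one output per grid node lends itself to parallel execution, here is a minimal sketch that enumerates a search grid and maps a job index to a single parameter set. The parameter names, their values, and the output file naming are all placeholders invented for this example; they are not taken from the BayesTME API.

```python
from itertools import product

# Hypothetical search space: names and values are placeholders.
n_cell_types_options = [2, 3, 4]
smoothing_options = [1.0, 10.0]

# Every combination of the two parameters is one node in the grid.
grid = list(product(n_cell_types_options, smoothing_options))


def params_for_job(job_index):
    """Map a job index (e.g. from a cluster scheduler) to one grid node."""
    n_cell_types, smoothing = grid[job_index]
    return {"n_cell_types": n_cell_types, "smoothing": smoothing}


def output_name(job_index):
    """Hypothetical file name for the result written by one job."""
    return f"phenotype_selection_{job_index}.h5"
```

Because each job reads the same input and writes its own file, the jobs share no state and can run on separate machines in any order.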

deconvolve will produce a bayestme.data.DeconvolutionResult object in h5 format, which represents all the raw samples from the posterior distribution. This object can be quite large (>10GB) as it contains very high-dimensional arrays of floating point numbers. deconvolve will also modify the AnnData archive to add summary statistics computed from these posterior samples, which are used in data visualization and analysis.

The AnnData fields added by this step are as follows:

Deconvolve AnnData Fields

  • adata.uns['bayestme_n_cell_types'] - integer, number of cell types

  • adata.varm['bayestme_cell_type_counts'] - <N marker> x <N cell type> matrix with the average posterior count of each cell type in each spot

  • adata.varm['bayestme_cell_type_probabilities'] - <N marker> x <N cell type> matrix with the cell type probability of each cell type in each spot

select_marker_genes will modify the AnnData archive to add indicators of which genes are marker genes for each cell type, and their order of significance.

The AnnData fields added by this step are as follows:

Marker Gene AnnData Fields

  • adata.varm['bayestme_cell_type_marker'] - <N marker> x <N cell type> integer matrix. Set to -1 if gene is not a marker gene for cell type, otherwise set to monotonically increasing 0-indexed integers indicating marker gene significance.

  • adata.varm['bayestme_omega_difference'] - <N marker> x <N cell type> floating point matrix. This statistic represents the “overexpression” of a gene in a cell type, and is used for scaling the dot size in our marker gene plot.
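The marker encoding described above can be decoded with a few lines of NumPy. The sketch below uses a hypothetical helper, ranked_markers, which is not part of BayesTME. It assumes that rank 0 denotes the most significant marker gene; the description only states that the ranks are monotonically increasing, so verify the direction against your own results.

```python
import numpy as np


def ranked_markers(marker_matrix, cell_type):
    """Return gene indices that are markers for one cell type, ordered by rank.

    marker_matrix: <N marker> x <N cell type> integer matrix in the encoding
    of adata.varm['bayestme_cell_type_marker'] (-1 means "not a marker").
    Assumption: rank 0 is the most significant marker.
    """
    column = np.asarray(marker_matrix)[:, cell_type]
    gene_idx = np.flatnonzero(column >= 0)         # drop the -1 entries
    return gene_idx[np.argsort(column[gene_idx])]  # order by ascending rank


# Toy matrix: 4 genes x 2 cell types
toy = np.array([[-1,  0],
                [ 1, -1],
                [ 0,  1],
                [-1, -1]])

markers_type_0 = ranked_markers(toy, 0)  # genes 2 and 1, in rank order
markers_type_1 = ranked_markers(toy, 1)  # genes 0 and 2, in rank order
```

With a real AnnData object, the returned indices can be used to look up gene names through adata.var_names.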

spatial_expression will produce a bayestme.data.SpatialDifferentialExpressionResult object in h5 format, which represents all the raw samples from the posterior distribution. This object can be quite large (>10GB) as it contains very high-dimensional arrays of floating point numbers.
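A back-of-the-envelope calculation shows how posterior samples reach this size. The sketch below assumes a single trace array indexed by sample, spot, gene, and cell type, stored as 64-bit floats. Both the array shape and the dimensions used are illustrative assumptions, not the actual layout of the BayesTME result objects.

```python
def trace_size_gb(n_samples, n_spots, n_genes, n_cell_types, bytes_per_value=8):
    """Estimate the in-memory size of a dense posterior trace in gigabytes."""
    n_values = n_samples * n_spots * n_genes * n_cell_types
    return n_values * bytes_per_value / 1e9


# Hypothetical dimensions: 100 posterior samples, 4000 spots,
# 1000 genes, 5 cell types
size = trace_size_gb(n_samples=100, n_spots=4000, n_genes=1000, n_cell_types=5)
print(f"{size:.1f} GB")  # prints "16.0 GB"
```

Since the size grows with the product of all four dimensions, reducing the number of genes or retained samples shrinks the archive proportionally.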