Handle gem files for downstream analysis
GEM is a tabular structure format specified for saving stereo-seq data, which generally contains columns of x, y, geneID, MIDCount and decodes the feature information captured by each DNB on the stereo-seq chip. Additional columns counld be included for DNB/feature annotation. A demo description can be found at stereopy site.
[1]:
import sys
sys.path.append('../../')
[2]:
from spacipy import Gem
Read in example gem data, which holds mouse brain stereo-seq data from here.
[3]:
gem = Gem.readin('../_static/data/example.gem.gz')
[4]:
gem
[4]:
| geneID | x | y | MIDCounts | |
|---|---|---|---|---|
| 0 | Cr2 | 10059 | 14612 | 1 |
| 1 | Cr2 | 6066 | 11227 | 3 |
| 2 | Cr2 | 10521 | 11108 | 1 |
| 3 | Cr2 | 8072 | 11782 | 3 |
| 4 | Cr2 | 7891 | 16372 | 1 |
| ... | ... | ... | ... | ... |
| 106663420 | CAAA01147332.1 | 12756 | 17013 | 1 |
| 106663421 | CAAA01147332.1 | 11907 | 12871 | 1 |
| 106663422 | CAAA01147332.1 | 14707 | 4862 | 1 |
| 106663423 | CAAA01147332.1 | 9787 | 5437 | 1 |
| 106663424 | CAAA01147332.1 | 10471 | 15907 | 1 |
106663425 rows × 4 columns
preliminary understanding of your data
GEM saves spatial transcriptomics data, which can be easily understanding by seeing what it looks like.
[5]:
gem.plot(color='nCount', bin_size=20)
None
[5]:
[<Axes: >]
here you may focus on two points: 1. whether the spatial distributuion of the data is well decoded 2. how many features were captured in each square bin or DNB
extract data where tissue covered
Not all DNBs on the stereo-seq chip were covered by the tissue. An effective way for deleting non-related data is to use a mask, which is often a two dimensional numpy ndarray with tissue regions labeled as non-zero integer (see here). Once prepared a ready-to-use mask matrix, you can process as following:
[6]:
import numpy as np
mask = np.loadtxt('../_static/data/example.transformed_mask.txt.gz')
mask.shape
[6]:
(14451, 14248)
[7]:
gem.img_shape
[7]:
(14451, 14248)
[8]:
gem = gem.mask(matrix=mask, return_offset=False, label_object=True)
now the gem data will be
[9]:
gem.plot(color='nCount', bin_size=20)
None
[9]:
[<Axes: >]
aggregate the data into cell/bin unit
After getting the effective data set, the next step is to determine how to group the nanometer-resolution DNB array into a analysis-ready unit, here always in two way: 1. group into square bin with equal width and height 2. group into putative cells
1. group into square bin
Square binning strategy do NOT consider any other informations like nucleus location and molecular homogeny. The only parameter in this step is bin_size, which specify how many DNBs in width and height should be grouped as a single bin unit. People can change this value based on the physical area of each bin and the captured features in each bin unit.
[10]:
binning_gem = gem.binning(bin_size=20)
binning_gem
[10]:
| geneID | x | y | label | MIDCounts | |
|---|---|---|---|---|---|
| 0 | 0610005C13Rik | 5780 | 12540 | 16200 | 3 |
| 1 | 0610005C13Rik | 5960 | 12760 | 14612 | 1 |
| 2 | 0610005C13Rik | 5980 | 10700 | 28985 | 3 |
| 3 | 0610005C13Rik | 6100 | 10240 | 32093 | 1 |
| 4 | 0610005C13Rik | 6140 | 11420 | 23757 | 2 |
| ... | ... | ... | ... | ... | ... |
| 31798037 | mt-Nd6 | 12880 | 8940 | 37309 | 2 |
| 31798038 | mt-Nd6 | 12880 | 8960 | 37122 | 2 |
| 31798039 | mt-Nd6 | 12880 | 9140 | 36185 | 1 |
| 31798040 | mt-Nd6 | 12900 | 8640 | 39087 | 2 |
| 31798041 | mt-Nd6 | 12900 | 8840 | 37995 | 3 |
31798042 rows × 5 columns
2. group into putative cells
Currently stereo-seq comprehend ssDNA stainning protocal, which stains the nucleus of cells on the same section. With the location of cell nucleus decoded, labeled with numpy mask file. And in the previous gem.mask process, we have labeled each DNB with their cell IDs by setting label_object=True.
[11]:
gem
[11]:
| geneID | x | y | MIDCounts | label | |
|---|---|---|---|---|---|
| 0 | Cr2 | 8072 | 11782 | 3 | 20638 |
| 1 | Cr2 | 7832 | 10872 | 1 | 27089 |
| 2 | Cr2 | 6608 | 10674 | 1 | 28950 |
| 3 | Cr2 | 9790 | 14297 | 2 | 3574 |
| 4 | Cr2 | 6594 | 12299 | 2 | 17562 |
| ... | ... | ... | ... | ... | ... |
| 37759281 | CAAA01147332.1 | 12231 | 11534 | 1 | 20817 |
| 37759282 | CAAA01147332.1 | 10776 | 12505 | 2 | 14553 |
| 37759283 | CAAA01147332.1 | 9810 | 10370 | 1 | 29920 |
| 37759284 | CAAA01147332.1 | 10653 | 12974 | 1 | 11713 |
| 37759285 | CAAA01147332.1 | 8062 | 11252 | 2 | 24214 |
37759286 rows × 5 columns
with cell ID labels, the gem file could be aggregated into cell-gene matrix as AnnData object
[13]:
adata = gem.to_anndata()
adata
[13]:
AnnData object with n_obs × n_vars = 41502 × 25288
the anndata object can be write into h5ad file and further used in clustering analysis
[15]:
adata.write('../_static/data/example.h5ad')