CZ CELLxGENE
Preface
As more single-cell datasets become available, processing and exploring each dataset individually is no longer practical. Platforms such as the Chan Zuckerberg CELL by GENE Discover let researchers quickly locate, explore, and analyse a collection of datasets that helps answer a specific scientific question. I have included a small case study as an example of using the CELLxGENE platform.
I started mining single-cell datasets in early 2020. At the time, a good number of datasets were available from various public repositories, but different research groups deposited their data in very different ways. Much of my time was spent processing and reshaping deposited data, which came as raw sequencing data (fastq files), raw counts (gene-cell matrices), or processed counts (post filtering), into gene-cell matrices that I could work with. The metadata was either embedded in the cell names in the count matrix or provided in separate files. A platform where all datasets were processed and annotated through a standardised pipeline was lacking. Nowadays, thanks to platforms such as the Chan Zuckerberg CELL by GENE Discover, researchers can quickly find and explore a collection of datasets for a specific question.
Let's go through a small case study together. The scientific question I'm interested in here is: How does the pulmonary endothelial transcriptome change in non-small cell lung cancer?
FIND YOUR DATA. We go to the Collections tab of the platform, which has a list of filters, including Assay, Author, Cell Type, Developmental Stage, Disease, Organism, Publication Date, Self-Reported Ethnicity, Sex, and Tissue. We can search for non-small cell lung cancer using the Disease filter. Only one study is available: Salcher et al. (2022), bioRxiv. A further check of the Tissue and Disease metadata shows that this collection includes 6 tissues (adrenal tissue, brain, liver, lung, lymph node, pleural effusion) and 4 diseases (chronic obstructive pulmonary disease, lung adenocarcinoma, non-small cell lung carcinoma, and squamous cell lung carcinoma).
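The Disease and Tissue filters can be thought of as membership checks on each collection's metadata. The sketch below illustrates that idea in plain Python. It is not the platform's implementation: the first record copies the tissues and diseases listed above, while the second record, the field names, and the `filter_collections` helper are assumptions made for illustration.

```python
# Toy collection records. The first mirrors the metadata described above;
# the second is a hypothetical placeholder added only for contrast.
COLLECTIONS = [
    {
        "name": "Salcher et al. (2022)",
        "tissues": ["adrenal tissue", "brain", "liver", "lung",
                    "lymph node", "pleural effusion"],
        "diseases": ["chronic obstructive pulmonary disease",
                     "lung adenocarcinoma",
                     "non-small cell lung carcinoma",
                     "squamous cell lung carcinoma"],
    },
    {
        "name": "hypothetical collection",
        "tissues": ["lung"],
        "diseases": ["normal"],
    },
]


def filter_collections(collections, **criteria):
    """Keep collections whose metadata field contains every requested value."""
    return [
        c for c in collections
        if all(value in c[field] for field, value in criteria.items())
    ]


hits = filter_collections(
    COLLECTIONS,
    diseases="non-small cell lung carcinoma",
    tissues="lung",
)
print([c["name"] for c in hits])  # ['Salcher et al. (2022)']
```

Combining criteria narrows the result, in the same way that ticking more than one filter does on the Collections tab.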
CHOOSE YOUR DATA. If we click on the collection, we will see two datasets available, core atlas and extended atlas, with 892,296 and 1,283,972 cells respectively. We can download/explore the data by clicking the corresponding icons. Let's explore the extended atlas first.
EXPLORE YOUR DATA. If we click the "Explore" icon, we will enter the data Explorer. You will see a three-column page. On the left, there are clickable "Standard Categories" providing metadata such as "assay", "cell_type", and "donor_id", as well as "Author Categories", which for this specific study include details such as "tumor_stage" and "ever_smoker". Below these are several standard QC histograms. In the middle is the UMAP projection of the cells you are interested in, together with function buttons that help subset the data and perform differentially expressed gene analysis. On the right, you can check the expression of individual genes or custom-defined gene sets. Let's take this case for a spin. Remember that OUR KEY WORDS in the question are:
pulmonary -> lung tissue
endothelial -> endothelial cells
transcriptional changes -> differentially expressed gene analysis
non-small cell lung cancer -> disease type
-> Select normal (212,889 cells) and non-small cell lung cancer (120,796 cells) under "disease" in "Standard Categories".
-> Select lung (1,184,127 cells) under "tissue" in "Standard Categories".
-> Select Endothelial cell (47,421 cells) under "ann_coarse" in "Author Categories". (The standard cell type annotation is too fine-grained for this question, so we use the coarser cell types annotated by the authors.)
-> Click the "droplet" icon next to "ann_coarse". You will see the annotation of different cell types in the UMAP projection, including endothelial cells (Figure 1).
-> Toggle the ticked boxes under "disease" in "Standard Categories" and click the subset data boxes in the middle column, right above the clustering map. You will see 1,262 pulmonary endothelial cells from non-small cell lung cancer and 12,820 cells from normal lung tissue.
-> Click the "subset" icon to the right of the two cell number boxes to start the differentially expressed gene (DEG) analysis.
-> Once the DEG analysis is done, two new rows will appear, "Pop1 high" and "Pop2 high", listing the genes that are upregulated and downregulated, respectively, in the cancer endothelial cells compared with the endothelial cells in normal tissue.
-> Click the arrow symbol next to "Pop1 high" to expand the list. You will see the list of up-regulated genes including IGKC, MTND4P33, RPL17, MTCO3P18, IGLC2, RPL41, IGHA1, etc.
-> Click the "droplet" icon next to IGKC. Expression histograms for IGKC will appear next to each of the metadata factors we selected in the left column, and the expression heatmap will be painted over the clustering map. The histogram for non-small cell lung carcinoma shows a clear low peak, whereas there is little expression in normal tissue. Looking closely at the clustering map, you can also see that this gene is expressed mainly in the Plasma cells, not the Endothelial cells (Figure 2).
-> Let's look at another gene, MTND4P33. Click its "droplet" icon to look at the expression levels in the heatmap. You will see spots with high expression levels in the endothelial cell cluster (Figure 3).
-> Click the arrow symbol next to "Pop2 high" to expand the list. You will see the list of down-regulated genes including MTND4LP30, ALDOA, DLX5, WWTR1, MT2A, IFITM2, etc.
-> Click the "droplet" icon next to NFKBIA and expand "ann_fine" to look at vessel-type-specific endothelial cells: Endothelial cell arterial (3,793 cells), Endothelial cell capillary (16,898 cells), Endothelial cell lymphatic (5,887 cells), and Endothelial cell venous (20,843 cells). These four subtypes add up to the 47,421 endothelial cells in "ann_coarse". From the histograms next to each subtype, you can see that NFKBIA is expressed at higher levels in the arterial endothelial cells than in the other subtypes.
-> Look under "Author Categories" and you will see annotations for common mutations, including the ALK, BRAF, EGFR, ERBB2, KRAS, ROS, and TP53 mutations. You can then explore the gene expression pattern for each mutation type as well, which is very relevant for non-small cell lung cancer.
-> You can keep exploring and let your curiosity lead you onward.
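To make the logic of the steps above concrete, here is a minimal sketch of the same idea in plain Python: filter cells by metadata, split them into two populations by disease, and rank genes by how strongly they differ between the two. Everything in it is an assumption for illustration: the cells, the expression values, and the gene names GENE_A and GENE_B are made up, and Welch's t statistic is used here simply as one common way to compare two groups, not as a statement of what the Explorer computes internally. The metadata field names and labels follow the categories used in the walkthrough.

```python
import math
from statistics import mean, variance

NSCLC = "non-small cell lung carcinoma"


def cell(disease, cell_type, gene_a, gene_b, tissue="lung"):
    """Build one toy cell record (hypothetical expression values)."""
    return {"tissue": tissue, "disease": disease, "ann_coarse": cell_type,
            "expr": {"GENE_A": gene_a, "GENE_B": gene_b}}


CELLS = [
    cell(NSCLC, "Endothelial cell", 5.0, 1.0),
    cell(NSCLC, "Endothelial cell", 6.0, 0.5),
    cell(NSCLC, "Endothelial cell", 4.5, 1.5),
    cell("normal", "Endothelial cell", 1.0, 4.0),
    cell("normal", "Endothelial cell", 1.5, 5.0),
    cell("normal", "Endothelial cell", 0.5, 4.5),
    cell(NSCLC, "Plasma cell", 9.0, 0.0),                          # dropped by cell type
    cell("normal", "Endothelial cell", 9.0, 9.0, tissue="liver"),  # dropped by tissue
]


def select(cells, **filters):
    """Keep cells whose metadata matches every filter, like ticking boxes."""
    return [c for c in cells
            if all(c[field] in allowed for field, allowed in filters.items())]


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def rank_genes(pop1, pop2, genes):
    """Return (genes higher in pop1, genes higher in pop2), strongest first."""
    scores = {g: welch_t([c["expr"][g] for c in pop1],
                         [c["expr"][g] for c in pop2]) for g in genes}
    ranked = sorted(scores, key=scores.get, reverse=True)
    pop1_high = [g for g in ranked if scores[g] > 0]
    pop2_high = [g for g in reversed(ranked) if scores[g] < 0]
    return pop1_high, pop2_high


endothelial = select(CELLS, tissue={"lung"}, ann_coarse={"Endothelial cell"})
pop1 = select(endothelial, disease={NSCLC})      # cancer endothelial cells
pop2 = select(endothelial, disease={"normal"})   # normal endothelial cells
pop1_high, pop2_high = rank_genes(pop1, pop2, ["GENE_A", "GENE_B"])
print(pop1_high, pop2_high)  # ['GENE_A'] ['GENE_B']
```

Note that the plasma cell record is removed by the cell-type filter before the comparison. The IGKC example above shows why it is still worth checking where a top-ranked gene is actually expressed on the clustering map.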
DOWNLOAD THE DATA. If you would like to explore the data yourself, you can download it from the collection information page. Click the "download" symbol and you will see two formats available for the dataset of interest, .h5ad (AnnData v0.8) and .rds (Seurat v4), accommodating both Python and R users. No further conversion is needed.
Hopefully this will serve as a useful introductory protocol to help researchers start exploring the vast amount of single cell RNA-seq data curated, processed, and presented by CZ CELLxGENE.
Feel free to drop me a line for further discussion. Let's learn and grow together.
Figure 1 Annotated clustering.
Figure 2 IGKC expression.
Figure 3 MTND4P33 expression.
Reference
Chan Zuckerberg Initiative. (n.d.). CZ CELLxGENE Discover. Retrieved (insert date here), from https://cellxgene.cziscience.com/
Salcher et al., 2022. High-resolution single-cell atlas reveals diversity and plasticity of tissue-resident neutrophils in non-small cell lung cancer. bioRxiv doi: 10.1101/2022.05.09.491204