library(scran)
library(scater)
library(SCHNAPPs)
library(BiocParallel)
register(SerialParam()) # avoid problems with fastMNN parallelization.
set.seed(100)

Introduction

This vignette shows how SCHNAPPs can be applied as a single-cell RNA-sequencing platform by comparing it to a standard workflow as decribed under SCRAN (https://bioconductor.org/packages/release/bioc/vignettes/scran/inst/doc/scran.html).

Data setup

We pre-compute the scran workflow and make its results available for comparison within SCHNAPPs.

We use the same data set. Here, we have to create a singleCellExperiment object and save it in an external file to be able to load into SCHNAPPs later.

if (! "scRNAseq" %in% installed.packages()){
    BiocManager::install("scRNAseq")}
library(scRNAseq)
sce <- GrunPancreasData()
sce
## class: SingleCellExperiment 
## dim: 20064 1728 
## metadata(0):
## assays(1): counts
## rownames(20064): A1BG-AS1__chr19 A1BG__chr19 ... ZZEF1__chr17
##   ZZZ3__chr1
## rowData names(2): symbol chr
## colnames(1728): D2ex_1 D2ex_2 ... D17TGFB_95 D17TGFB_96
## colData names(2): donor sample
## reducedDimNames(0):
## mainExpName: endogenous
## altExpNames(1): ERCC

We prepare the singleCellExperiment object to contain the col/row Data that is needed by SCHNAPPs. At this point, which is usually performed by the bioinformatician who is preparing the data, we can also add other information and documentation.

colData has to contain the columns barcode = a unique identifier per experiment sampleNames = a name for a collection of cells.

rowData has to contian the columns symbol = the gene symbol Description = a text describing the gene id = unique identifier, e.g. ENSG number

qcstats <- scater::perCellQCMetrics(sce)
colData(sce) = cbind(colData(sce), qcstats)
colData(sce)$barcode = rownames(colData(sce))
colData(sce)$sampleNames = factor(colData(sce)$sample)

rowData(sce)$Description = ""
rowData(sce)$id = rownames(rowData(sce))

To show the effect of the scater filtering we also calculate those. Missing values are handled as genes with high ERCC expressing genes.

qcfilter <- quickPerCellQC(qcstats, percent_subsets="altexps_ERCC_percent")
qcfilter[is.na(qcfilter$high_altexps_ERCC_percent),"high_altexps_ERCC_percent"] = T
colData(sce) = cbind(colData(sce), qcfilter)
# here we redo the scran workflow with different clustering algorithms 
# and store the results in the singleCellExperiment object that we are going to use in SCHNAPPs
sceOrg = sce
sce <- sce[,!qcfilter$discard]
# summary(qcfilter$discard)

library(scran)
clusters <- quickCluster(sce)

sce <- computeSumFactors(sce, clusters=clusters)
# summary(sizeFactors(sce))

sce <- logNormCounts(sce)

dec <- modelGeneVar(sce)
# plot(dec$mean, dec$total, xlab="Mean log-expression", ylab="Variance")
# curve(metadata(dec)$trend(x), col="blue", add=TRUE)

dec2 <- modelGeneVarWithSpikes(sce, 'ERCC')

# Get the top 10% of genes.
top.hvgs <- getTopHVGs(dec, prop=0.1)
sce <- runPCA(sce, subset_row=top.hvgs)
# reducedDimNames(sce)
sced <- denoisePCA(sce, dec2, subset.row=getTopHVGs(dec2, prop=0.1))
# ncol(reducedDim(sced, "PCA"))
output <- getClusteredPCs(reducedDim(sce))
npcs <- metadata(output)$chosen
reducedDim(sce, "PCAsub") <- reducedDim(sce, "PCA")[,1:npcs,drop=FALSE]
# npcs
# In this case, using the PCs that we chose from getClusteredPCs().
g1 <- buildSNNGraph(sce, use.dimred="PCAsub")
cluster <- igraph::cluster_walktrap(g1)$membership
sce$cluster <- factor(cluster)
# table(sce$cluster)

g1a <- buildSNNGraph(sce, use.dimred="PCAsub")
cluster <- igraph::cluster_walktrap(g1a)$membership
sce$cluster1a <- factor(cluster)

# In this case, using the PCs that we chose from getClusteredPCs().
g <- buildSNNGraph(sce, use.dimred="PCAsub", type = "jaccard")
cluster <- igraph::cluster_walktrap(g)$membership
sce$cluster2 <- factor(cluster)

g <- buildSNNGraph(sce, use.dimred="PCAsub", type = "number")
cluster <- igraph::cluster_walktrap(g)$membership
sce$cluster3 <- factor(cluster)



# table(sce$cluster) 
sce <- runTSNE(sce, dimred="PCAsub")
# plotTSNE(sce, colour_by="cluster", text_by="cluster")
# plotTSNE(sce, colour_by="cluster2", text_by="cluster2")
# plotTSNE(sce, colour_by="cluster3", text_by="cluster3")


library(pheatmap)
ass.prob <- bootstrapCluster(sce, FUN=function(x) {
    g <- buildSNNGraph(x, use.dimred="PCAsub")
    igraph::cluster_walktrap(g)$membership
}, clusters=sce$cluster)

sceOrg$cluster = 0
sceOrg$cluster1a = 0
sceOrg$cluster2 = 0
sceOrg$cluster3 = 0
names(sceOrg$cluster) = colnames(sceOrg)
names(sceOrg$cluster1a) = colnames(sceOrg)
names(sceOrg$cluster2) = colnames(sceOrg)
names(sceOrg$cluster3) = colnames(sceOrg)
sceOrg$cluster[colnames(sce)] = sce$cluster
sceOrg$cluster1a[colnames(sce)] = sce$cluster1a
sceOrg$cluster2[colnames(sce)] = sce$cluster2
sceOrg$cluster3[colnames(sce)] = sce$cluster3
sceOrg$cluster = as.factor(sceOrg$cluster)
sceOrg$cluster1a = as.factor(sceOrg$cluster1a)
sceOrg$cluster2 = as.factor(sceOrg$cluster2)
sceOrg$cluster3 = as.factor(sceOrg$cluster3)

# pheatmap(ass.prob, cluster_cols=FALSE, cluster_rows=FALSE,
#     col=colorRampPalette(c("white", "blue"))(100))

We save all results including the cluster assignments from the scran workflow to be able to compare to SCHNAPPs.

save(file = "scran.grunPancreas.RData", list = c("sceOrg"))

SCHNAPPs

Now start SCHNAPPs.

Please see the help and other documentation on how to enable tracking data changes and saving plots (historyPath option). The defaultValues allows us to preset any user defineable variable in the GUI.

This is partially done to simplify your life as you don’t have to set all the parameters partially to avoid mistakes and for this to show the correct set-up.

The variables are described in the code chunk. To note: the number of genes to use is set to 951, which is the same value as in the scran vignette. We don’t have a way to calculate this. But we are aware that changing this value has substaintial effects for the cluster assignment. We strongly believe that the optimization of these parameters is not trivial and we are currently looking for solutions how to solve this. Similar argumenation goes for the number of PCs to use for the clustering algorithm.

library(SCHNAPPs)
defaultValues = list()

# input
defaultValues[["sampleInput"]] =FALSE # don't subsample
defaultValues[["whichscLog"]] = "calcLog" # calculate normalization

# Normaliztion
defaultValues[["normalizationRadioButton"]] = "DE_scaterNormalization" # use scaterNorm

# General parameters
## PArameters for PCA
defaultValues[["pcaScale"]] = FALSE # don't scale
defaultValues[["pcaN"]] = 951 # number of genes to use.
defaultValues[["pcaRank"]] = 14 # number PCs to use for clustering


# Parameters for clustering
defaultValues[["tabsetCluster"]] = "snnGraph" # use SNN graph function

# preset Alluvial plot axes
defaultValues[["alluiv1"]] = "cluster"
defaultValues[["alluiv2"]] = "dbCluster"


# Gene selection
# we usually remove genes from mitochrondrium and Ribosome associated proteins using the following regular expression ("^MT-|^RP|^MRP"). This is not done in the scran vignette 
defaultValues[["selectIds"]] = ""
defaultValues[["minGenes"]] = 1 # min number of reads per cell



# Cell selection
defaultValues[["minGenesGS"]] = 1 # minimum number of UMIs (sum) over all cells.
# cells to be removed
defaultValues[["cellsFiltersOut"]] = [4047 chars quoted with '"']

schnapps(defaultValues = defaultValues)

Removal of cells

This section shows how to remove the cells that have been identified in the scran vignette. To follow we have to remove the cells already selected. and set “Cells to be removed” under Cell selection - additional parameters to an empty string as shown below:

Filter based on ERCC percentage

Filter based on ERCC percentage

and click on apply changes.

Then load the data we just created.

Filter based on ERCC percentage

Filter based on ERCC percentage

We now show which filters have been applied to the data in the scran workflow. We use the perCellQCMetrics metrics that cannot be calculated from within SCHNAPPs but could be precalculated. We can then visualize the cells that are filtered out and recover the thresholds used in individual steps:

based on ERCC percentage:

This percentage is precalculated from the scr

The data can be visualized in the 2D plot (co-expression - selected) with the following paramters: X : altexps_ERCC_percent Y: sample color: high_altexps_ERCC_Percent

Filter based on ERCC percentage

Filter based on ERCC percentage

The selected cell names can be copy/pasted to the “Cells to be removed” under Cell selection - additional parameters” as described above.

filter low expressing cells

Using a similar 2D plot with the following paramters: X : detected Y: sample color: low_n_features

we can select and remove cells with low expressing features as described above.

Filter based on detected genes

Filter based on detected genes

We end up with the same 1219 cells as in the vignette from scran.

Normalization / clustering

We use “scaterNorm,” which is applying exactly the same procedure that is used in the vignette. The user has several other options available that are better described in the online documentation.

The snnGraph method is used for the clustering.

Results are the same

The alluvial plot between the cluster assignment from the scran package and the cluster assignment from SCHNAPPs (dbCluster) shows that they are the same.

Cluster assignment comparison using Alluvial plot

Cluster assignment comparison using Alluvial plot

The same result can be concluded from plotting the cluster assignments in a 2D plot:

Cluster assignment comparison using 2D plot

Cluster assignment comparison using 2D plot

Characterization of results

The following will show a few possible results that can be achieved using SCHNAPPs:

Co-expression - All clusters

Co-expression - All clusters; After list of gene names has been set to empty SCHNAPPs will automatically fill the list with characteristic genes for each cluster.

Co-expression - All clusters; After list of gene names has been set to empty SCHNAPPs will automatically fill the list with characteristic genes for each cluster.

The following video shows how to use the heatmap module. If the field with the gene names is empty the function FindMarkers is used to identify the genes that are most characteristic for each cluster. This list of genes is used for the heatmap.

The genes CPA1, CTRB1, PLA2G1B, PRSS1, PRSS3P2 are highly expressed in cluster 2. GCG, TTR are highly expressed in cluster 5.

Co-expression - selected

In the following we describe a few possibilities of the 2D plot

co-expression of two lists of genes

We can now compare the genes previously found and see how these cells express genes from different clusters.

2D plot of the sum of expression of two different lists of genes, colored by cluster association.

2D plot of the sum of expression of two different lists of genes, colored by cluster association.

The cell density can be used to identify where the majority of cells lies or if and where many cells are drawn on top of eachother.

2D plot of the sum of expression of two different lists of genes, colored by density of cells.

2D plot of the sum of expression of two different lists of genes, colored by density of cells.

The composition of clusters relative to the donor information can be also using the same tool as shown below:

Histogram of cell count per cluster assignment and split by donor information.

Histogram of cell count per cluster assignment and split by donor information.

UMAP projection. Selecting and naming of a group of cells.

UMAP projection. Selecting and naming of a group of cells.

Cells can be grouped and named using the 2D plot, Those are are binary groups, i.e. a cells belongs either to the group or not.

The following video shows how to name groups of cells: * Cells have to be selected in the 2D plot. * Give a name * ensure that the reference for the selection is set to “plot” * click “change current selection”

To note: * in-between, the cells seem not to be selected anymore, but that is a “feature” of the underlying visualization functions. * using the a previously defined group of cells allows to avoid selecting cells that are potentially hidden. This can happen, when the 2D coordinates have changed. This also allows a more cytometry way of thinking, where concurrent selections of cells are used to filter.

After having selected multiple groups, we have these groups as projections available with TRUE/FALSe assignment for each cell. Sometimes, we want to combine these projections, e.g. if there are more than two cell types. This is done under Parameters - Projections - combine projections:

  • Select the two projections that should be combined
  • name the new projection
  • repeat as needed.

Since it is not ensured that these projections are not overlapping, the individual values are concatenated and a new factor is created. The levels of these factors can be changed under the “rename levels” tab. The tables allows verifying the meaning of the levels.

When renaming the levels, don’t forget to add a meaningfull name, otherwise “X” is chosen for you.

If this happens, don’t worry, this can be changed in the “rename projections” tab, where projections can be deleted, but only those you have created manually.

Violin plots

The violin plot can show all possible combinations if “show Permutations” is selected, otherwise a normal violoin plot is shown (not shown here).

Violin plot with permutataions

Violin plot with permutataions

Under “grouped” a second variable can be used. Here, grp1-7 are clusters of cells that have been identified from the UMAP presentation.

grouped violin plots

grouped violin plots

Three dimensional plot

The sum of expression for a list of genes is shown in the 3D tSNE plot that can be found under Data Exploration -

3D tSNE plot with expression of specified genes

3D tSNE plot with expression of specified genes

Panel Plot

Panel Plot

Differential gene expression (DGE)

This short video shows how to perform a DGE analysis. The individual steps are also explained afterwards in more detail.

selections of groups of cells

The “selection of input cells” allows subsetting the cells even before the visualization.

Cells to be compared can be either selected in the two plot directly, of if cells are only selected in the first plot, all other cells are being used.

DGE between specific groups of cells

DGE between specific groups of cells

Volcano plot

Genes can be selected in the volcano plot and their names/symbols are shown above the plot to be copy/pasted.

DGE results as Volcano plot

DGE results as Volcano plot

DGE table

A complete table is available, that can also be downloaded.

Table of DGE

Table of DGE

Comparing different clustering methods

SCHNAPPs uses the projection “dbCluster” for the current clustering result. In order to compare different clustering procedures the rename projection tool allows the user to copy the results do a different name.

Projections can be renamed

Projections can be renamed

Please help

We hope you find SCHNAPPs useful. Please don’t forget to cite our work when you publish, nor tell us if you find any bugs or have questions/suggestions.

bernd jagla pasteur fr