# slalom documentation¶

slalom is a scalable modelling framework for single-cell RNA-seq data that can be used to dissect and model single-cell transcriptome heterogeneity, thereby allowing to identify biological drivers of cell-to-cell variability and model confounding factors.

Software by Florian Buettner and Oliver Stegle. slalom is explained in detail in the accompanying publication [1].

[1] Buettner, F.,Pratanwanich, N., Marioni, J., Stegle, O. Scalable latent-factor models applied to single-cell RNA-seq data separate biological drivers from confounding effects. Submitted

## Installation¶

slalom requires Python 2.7 or newer with

• scipy, h5py, numpy, matplotlib, scikit-learn, re

slalom can be installed via pip with pip install slalom. For best results, we recommend the Anaconda python distribution (https://anaconda.org).

## Quickstart¶

### Input¶

slalom requires two input files, a gene expression file and an annotation file. The gene expression file is a text file containing the normalised, log-transformed gene expression matrix, with every row corresponding to a cell. Column names should be gene identifiers matching those in the annotation file (i.e. if gene symbols are used in the annotation file, column names should also be gene symbols). Row names are optional and can eg be a known covariate which is used for plotting. The annotation file is a text file with every row containing the name of a gene set, followed by the gene identifiers annotated to that gene set. We recommend using annotations such as those published in the REACTOME database or the Molecular signature database (MSigDB) and provide an annotation file containing annotations from the RECTOME database. The license of MSigDB does not permit redistribution of the raw annotation files. To use MSigDB annotations, please register at http://software.broadinstitute.org/gsea/msigdb, download the hallmark gene sets (gene symbols) and place the file in data folder. Both text files can then be loaded using the load_text function.

NB: slalom works best on a subset of highly variable genes; these can be identified using a variance filter based on a mean-variance trend fitted using spike-in transcripts or endogenous genes. A step-by-step workflow on low-level processing of scRNA-seq data (including gene filtering) can be found here: https://f1000research.com/articles/5-2122/v2

### Model initialisation and fitting¶

An slalom model can be initialised using the initFA convenience function. Arguments can be used to specify options, incuding number of unannotated factors (nHidden), minimum number of genes in a pathway (minGenes), whether to use the fast option by pruning genes (pruneGenes), noise model (noise) and the data directory (data_dir). Once a model is initialised, it can be fit using the train method.

### Diagnostics, plotting and saving.¶

The printDiagnostics function can be used to print diagnositcs based on the number of genes included/excluded by the model. If more than 100% of gene annotations are changed for at least one factor, the model should be re-fitted with sparse unannotated facotrs.

## Tutorial¶

All steps required to run slalom are illustrated in a jupyter notebook that can be viewed interactively.

## The factorial single-cell latent variable model (slalom)¶

A detailed statistical description of the slalom model can be found in teh accompanyin publicaiton [1]. Here, a brief summary is given. slalom is based on a variant of matrix factorization, decomposing the observed gene matrix into a sum of sum of contributions from A annotated factors, whose inference is guided by pathway gene sets, and H additional unannotated factors:

$\mathbf{Y} = \underbrace{\sum_{a=1}^{A} \mathbf{p}_a \mathbf{R}_a^{T}}_{\text{annotated factors}} + \underbrace{\sum_{h=1}^{H} \mathbf{s}_h \mathbf{Q}_h^{T}}_{\text{unannotated factors}} + \underbrace{\mathbf{\psi}}_{\text{residuals}}.$

Here, the vectors $$\mathbf{p}_a$$ and $$\mathbf{s}_h$$ are factor states for annotated and unannotated factors and $$\mathbf{R}_a$$ and $$\mathbf{Q}_h$$ are the corresponding regulatory weights of a given factor on all genes. The matrix $$\mathbf{psi}$$ denotes residual noise. For the statistical derivation in the accompanying publication and the implementation, we express this mode using matrix notation, collapsing the factors into a factor activation matrix $$\mathbf{X} = [\mathbf{r}_1,\dots,\mathbf{r}_A,\mathbf{s}_1,\dots,\mathbf{s}_H]$$ (with the comma denoting concatenation of columns), where each factor is enumerated using an indicator $$k = 1 \dots K$$, and K denotes the total number of fitted factors $$K = A + H$$. The analogous matrix representation is used for weights $$\mathbf{W}$$, resulting in

$\mathbf{Y} = \mathbf{X}\mathbf{W}^T +\mathbf{\psi} .$

We employ two levels of regularization on the parts of the weight matrix $$\mathbf{W}$$ corresponding to annotated factors. First, gene sets are used to guide a spike-and-slab prior on the rows of $$\mathbf{W}$$ thereby confining the inferred weights to the set of genes annotated in the pathway database. To this end $$\mathbf{W}$$ is modelled as elementwise product of a Bernoulli random variable $$\mathbf{Z}$$, indicating whether a gene is active for a given factor and a Gaussian random variable $$\widetilde{\mathbf{W}}$$, quantifying the corresponding effect size (for details see [1]). A second level of regularization is then used to achieve sparseness on the level of factors, allowing the model to deactivate factors that are not needed to explain variation in the data; this is achieved using an automatic relevance determination (ARD) prior (i.e. factor-specific Gamma prior on the precision of the weights). The inverse of this ARD prior ($$1/\alpha_k$$) can be interpreted as a measure of the regulatory impact of factor $$k$$ and corresponds to the expected variance explained by this factor, for the subset of genes with a regulatory effect. It is therefore also referred to as relevance parameter. The slalom software implements an efficent deterministic approximate Bayesian inference scheme based on variational methods, allowing for the inference of $$\mathbf{X}$$, $$\mathbf{Z}$$, $$\widetilde{\mathbf{W}}$$, $$\mathbf{\alpha}$$, $$\mathbf{\psi}$$ and other parameters.

slalom.load_txt(dataFile, annoFiles, niceTerms=True, annoDBs='MSigDB', dataFile_delimiter=', ', verbose=True)

Load input file for slalom from txt files.

Loads an txt files and extracts all the inputs required by slalom

Parameters: dataFile (str) – Strong containing the file name of the text file with the expression levels dataFile_delimiter (str) – delimiter for reading the data_file. Defaults to ‘,’. annoFiles (str, list) – Either string containing the file name of the txt file with the gene set annotations or a list containing several anotation files. Each line in in an annotattion file corresponds one gene set; a line starts with the name of the gene set and is followed by the annotated genes. annoDBs (str, list) – database file (MsigDB/REACTOME). If several annotation files are provided this hast to be a list of the same length. niceTerms (bool) – Indicates whether to nice terms (omit prefix, capitalize, shorten). Defaults to true. dataFile_delimiter – Delimiter used in dataFile; defaults to ‘,’. verbose (bool) – Show progress on loading terms (defaults to True). An dictionary containing all the inputs required by slalom.
slalom.load_hdf5(dFile, anno='MSigDB')

Loads an hdf file and extracts all the inputs required by slalom

Parameters: dFile (str) – String contaning the file name of the hdf file with the input data. An dictionary containing all the inputs required by slalom.
slalom.initFA(Y, terms, I, gene_ids=None, nHidden=3, nHiddenSparse=0, pruneGenes=True, FPR=0.99, FNR=0.001, noise='gauss', minGenes=20, do_preTrain=True, nFix=None, priors=None, covariates=None, dropFactors=True, learnPi=False)

Initialise the slalom factor analysis model.

Required 3 inputs are first, a gene expression matrix Y containing normalised count values of N cells and G variable genes in log-space, second a vector terms contaning the names of all annotated gene set (correspondig to annotated factors) and third, a binary indicator matrix I linking G genes to K terms by indicating which genes are annotated to each factor. A variety of options can be specified as described below.

Parameters: Y (array_like) – Matrix of normalised count values of N cells and G variable genes in log-space. Dimension ($$N\times G$$). terms (vector_like) – Names of K annotated gene sets. Dimension ($$K\times 0$$). I (array_like) – Indicator matrix specifying whether a gene is annotated to a specific factor. Dimension ($$G\times K$$). gene_ids (array_like) – Gene identifiers (opitonal, defaults to None) FNR (float) – False negative rate of annotations. Defaults to 0.001 FPR (float) – False positive rate of annotations. Defaults to 0.99 nHidden (int) – Number of unannotated dense factors. Defaults to 3. nHiddenSparse (int) – Number of unannotated sparse factors. Defaults to 0. This value should be changed to e.g. 5 if the diagnositcs fail. pruneGenes (bool) – prune genes that are not annotated to a least one factor. This option allows fast inference and should be set to True either if the key objective is to rank factors or if the annotations cover all genes of interest. Defaults to True. dropFactors (bool) – drop factors from update schedule once they are shut off. In practice, factors that are switched off at some point during inference are usuallly not switched off. Allows faster inference. Defaults to True. Currently only supported for the Gaussian noise model. noise (str) – Specifies the observation noise model. Should be either ‘gauss’,’hurdle’ or ‘poisson’. Defaults to gauss. minGenes (int) – minimum number of genes required per term to retain it Defaults to 20. do_preTrain (bool) – Boolean switch indicating whether pre-training should be used to establish the initial update order. Can be set to False for very large datasets. Defaults to True priors (dict) – Dictionary containing the hyperparameters of the priors for Alpha, Eps and Pi (PiDense and PiSparse). Defaults to None; in this case default values are used. covariates (array_like) – Matrix with known covariates that are controlled for when fitting the model. Defaults to None. learnPi (bool) – Learn sparsity of spearse hidden factors (default False) A slalom.CSparseFA instance.
slalom.preTrain(Y, terms, P_I, noise='gauss', nFix=None, priors=None, covariates=None)

Pre-train the slalom factor analysis model.

Helper function to pre-train the slalom factor analysis model to achieve faster convergence and obtain an initial update order. Called by initFA.

Parameters: Y (array_like) – Matrix of normalised count values of N cells and G variable genes in log-space. Dimension ($$N\times G$$). terms (vector_like) – Names of K annotated gene sets. Dimension ($$K\times 0$$). P_I (array_like) – Matrix specifying the likelihood of whether a gene is annotated to a specific factor. Dimension ($$G\times K$$). noise (str) – Specifies the observation noise model. Should be either ‘gauss’,’hurdle’ or ‘poisson’. Defaults to gauss. nFix (int) – Number of terms which should be fixed and updated first. Defaults to None, resulting in the number of unannotated factors being updated first. A vector containing the initial update order of the terms

## Model fitting¶

class slalom.core.CSparseFA(init_data=None, **parameters)

Variational Bayesian Factor analysis module. AExpressionModule is definded in bayesnet.expressionnet

getAnnotations(terms=None)

Get annotations.

Parameters: term (str) – optional list of terms for which annotations are returned. Default None=all terms.
getDefaultParameters()

return a hash with default parameter value for this model

getF()

Get imputed expression values

getName(base_name='slalom')

return a name summarising the main parameters

getNchanged()

Return number of annotations changed by the model (sum of included and exluded genes )

getPi(terms=None)

Get prior on Z (Bernourlli part part of spike-and-slab prior) :param term: optional list of terms for which weights are returned. Default None=all terms. :type term: str

getRelevance()

Get posterior relevance (inverse of ARD score) $$1/Q( extbf{lpha)}$$

getTermIndex(terms)

get term index. Creates an index list based on a list of named terms :param terms: list with terms :type terms: list

getTerms(annotated=True, unannotated=True, unannotated_sparse=True)

Get terms

getW(terms=None)

Get weights (continous part of spike-and-slab prior) $$Q(\widetilde{W})$$ :param term: optional list of terms for which weights are returned. Default None=all terms. :type term: str

getX(terms=None)

Get factors

Parameters: terms (str) – optional list of terms for which weights are returned. Default None=all terms.
getZ(terms=None)

Get posterior of Z (Bernourlli part part of spike-and-slab prior) $$Q(Z)$$ :param term: optional list of terms for which weights are returned. Default None=all terms. :type term: str

getZchanged(terms=None, threshold=0.5)

get matrix indicating whether the posterior distribution has changed for individual terms/genes :param terms: optional list of terms for which weights are returned. Default None=all terms. :type terms: str

Rv:
matrix [0,-1,1]: 0: no change, -1: loss, +1: gain
printDiagnostics()

Print diagnostics of the model. If more than 100% of gene annotations are for at least one factor, the model should be re-fitted with sparse unannotated facotrs.

regressOut(idx=None, terms=None, use_latent=False, use_lm=False, Yraw=None)

Regress out unwanted variation

Parameters: idx (vector_like) – Indices of factors to be regressed out use_latent (bool) – Boolean varoable indicating whether to regress out the unwanted variation on the low-dimensional latent space or the high-dimensional gene expression space. use_lm (bool) – Regress out the factors by fitting a linear model for each gene Yraw (array_like) – Optionally a gene expression array can be passed from which the facotrs are regressed out A matrix containing the corrected expression values.
train(nIterations=None, forceIterations=False, tolerance=1e-08, minIterations=700)

Iterate updates of weights (with spike-and-slab prior), ARD parameters, factors, annd noise parameters.

Parameters: nIternation (int) – Number of iterations. forceIterations (bool) – Force the model to update nIteration times. tolerance (float) – Tolerance to monitor convergence of reconstruction error minIterations (int) – Minimum number of iterations the model should perform.
update()

Perform update of weights (with spike-and-slab prior), ARD parameters, factors, annd noise parameters. Called by iterate method.

## Plotting and saving results¶

slalom.plotFactors(FA=None, terms=None, X=None, lab=[], cols=None, isCont=True, madFilter=0.4)

Scatter plot of 2 selected factors

Parameters: FA (slalom.CSparseFA) – Factor analysis object, usually generated using initFA function terms (list) – List of strings containing names of factors to be plotted; if a factor analysis object is provided the corresponding factors are automatically extracted, otherwise this argument will only be used to label the axes idx2 (int) – Index of second factor to be plotted lab (vector_like) – Vector of labels for each data point isCont (bool) – Boolean variable indicating whether labels should be interpreted as discrete or continuous cols (vector_like) – Vector of colors. Should be the same length as unique labels. Default is None, then the brewer2mpl madFilter (float) – Filter factors by this mean absolute deviation to exclude outliers. For large datsets this can be set to 0.
slalom.plotRelevance(FA, Nactive=20, stacked=True, madFilter=0.4, annotated=True, unannotated=False, unannotated_sparse=False)

Plot results of slalom

Identified factors and corresponding gene set size ordered by relevance (white = low relevance; black = high relevance). Top panel: Gene set augmentation, showing the number of genes added (red) and removed (blue) by the model for each factor.

Parameters: FA (slalom.CSparseFA) – Factor analysis object, usually generated using initFA function Nactive (int) – Numer of terms to be plotted stacked (bool) – Boolean variable indicating whether bars should be stacked db (str) – Name of database used, either ‘MSigDB’ or ‘REACTOME’ madFilter (float) – Filter factors by this mean absolute deviation to exclude outliers. For large datasets this can be set to 0. annotated (bool) – Indicates whether annotated factors should be plotted. Defaults to True. unannotated (bool) – Indicates whether unannotated factors should be plotted. Defaults to False. unannotated – Indicates whether unannotated sparse factors should be plotted. Defaults to False.
slalom.saveFA(FA, out_name=None, saveF=False)

Saves output of slalom.CSparseFA object as hdf5 file

Parameters: FA (slalom.CSparseFA) – Factor analysis object, ususally generated using initFA function out_name (str) – Name of hdf file to save the model to. Default is None’ for which a filename is created automatically. saveF (bool) – Boolean variable indicating whether to save the imputed expression space.
slalom.dumpFA(FA)

Dumps output of slalom.CSparseFA object in a dictionary

Parameters: FA (slalom.CSparseFA`) – Factor analysis object, ususally generated using initFA function dictionary with results

Contents: