slalom documentation¶
slalom is a scalable modelling framework for singlecell RNAseq data that can be used to dissect and model singlecell transcriptome heterogeneity, thereby allowing to identify biological drivers of celltocell variability and model confounding factors.
Software by Florian Buettner and Oliver Stegle. slalom is explained in detail in the accompanying publication [1].
[1] Buettner, F.,Pratanwanich, N., Marioni, J., Stegle, O. Scalable latentfactor models applied to singlecell RNAseq data separate biological drivers from confounding effects. Submitted
Installation¶
slalom requires Python 2.7 or newer with
 scipy, h5py, numpy, matplotlib, scikitlearn, re
slalom can be installed via pip with pip install slalom. For best results, we recommend the Anaconda python distribution (https://anaconda.org).
Quickstart¶
Input¶
slalom requires two input files, a gene expression file and an annotation file. The gene expression file is a text file containing the normalised, logtransformed gene expression matrix, with every row corresponding to a cell. Column names should be gene identifiers matching those in the annotation file (i.e. if gene symbols are used in the annotation file, column names should also be gene symbols). Row names are optional and can eg be a known covariate which is used for plotting.
The annotation file is a text file with every row containing the name of a gene set, followed by the gene identifiers annotated to that gene set. We recommend using annotations such as those published in the REACTOME database or the Molecular signature database (MSigDB) and provide an annotation file containing annotations from the RECTOME database. The license of MSigDB does not permit redistribution of the raw annotation files. To use MSigDB annotations, please register at http://software.broadinstitute.org/gsea/msigdb, download the hallmark gene sets (gene symbols) and place the file in data folder.
Both text files can then be loaded using the load_text
function.
NB: slalom works best on a subset of highly variable genes; these can be identified using a variance filter based on a meanvariance trend fitted using spikein transcripts or endogenous genes. A stepbystep workflow on lowlevel processing of scRNAseq data (including gene filtering) can be found here: https://f1000research.com/articles/52122/v2
Model initialisation and fitting¶
An slalom model can be initialised using the initFA
convenience function. Arguments can be used to specify options, incuding number of unannotated factors (nHidden), minimum number of genes in a pathway (minGenes), whether to use the fast option by pruning genes (pruneGenes), noise model (noise) and the data directory (data_dir). Once a model is initialised, it can be fit using the train
method.
Diagnostics, plotting and saving.¶
The printDiagnostics
function can be used to print diagnositcs based on the number of genes included/excluded by the model. If more than 100% of gene annotations are changed for at least one factor, the model should be refitted with sparse unannotated facotrs.
Tutorial¶
All steps required to run slalom are illustrated in a jupyter notebook that can be viewed interactively.
The factorial singlecell latent variable model (slalom)¶
A detailed statistical description of the slalom model can be found in teh accompanyin publicaiton [1]. Here, a brief summary is given. slalom is based on a variant of matrix factorization, decomposing the observed gene matrix into a sum of sum of contributions from A annotated factors, whose inference is guided by pathway gene sets, and H additional unannotated factors:
Here, the vectors \(\mathbf{p}_a\) and \(\mathbf{s}_h\) are factor states for annotated and unannotated factors and \(\mathbf{R}_a\) and \(\mathbf{Q}_h\) are the corresponding regulatory weights of a given factor on all genes. The matrix \(\mathbf{psi}\) denotes residual noise. For the statistical derivation in the accompanying publication and the implementation, we express this mode using matrix notation, collapsing the factors into a factor activation matrix \(\mathbf{X} = [\mathbf{r}_1,\dots,\mathbf{r}_A,\mathbf{s}_1,\dots,\mathbf{s}_H]\) (with the comma denoting concatenation of columns), where each factor is enumerated using an indicator \(k = 1 \dots K\), and K denotes the total number of fitted factors \(K = A + H\). The analogous matrix representation is used for weights \(\mathbf{W}\), resulting in
We employ two levels of regularization on the parts of the weight matrix \(\mathbf{W}\) corresponding to annotated factors. First, gene sets are used to guide a spikeandslab prior on the rows of \(\mathbf{W}\) thereby confining the inferred weights to the set of genes annotated in the pathway database. To this end \(\mathbf{W}\) is modelled as elementwise product of a Bernoulli random variable \(\mathbf{Z}\), indicating whether a gene is active for a given factor and a Gaussian random variable \(\widetilde{\mathbf{W}}\), quantifying the corresponding effect size (for details see [1]). A second level of regularization is then used to achieve sparseness on the level of factors, allowing the model to deactivate factors that are not needed to explain variation in the data; this is achieved using an automatic relevance determination (ARD) prior (i.e. factorspecific Gamma prior on the precision of the weights). The inverse of this ARD prior (\(1/\alpha_k\)) can be interpreted as a measure of the regulatory impact of factor \(k\) and corresponds to the expected variance explained by this factor, for the subset of genes with a regulatory effect. It is therefore also referred to as relevance parameter. The slalom software implements an efficent deterministic approximate Bayesian inference scheme based on variational methods, allowing for the inference of \(\mathbf{X}\), \(\mathbf{Z}\), \(\widetilde{\mathbf{W}}\), \(\mathbf{\alpha}\), \(\mathbf{\psi}\) and other parameters.
Loading data and model initialisation¶

slalom.
load_txt
(dataFile, annoFiles, niceTerms=True, annoDBs='MSigDB', dataFile_delimiter=', ', verbose=True)¶ Load input file for slalom from txt files.
Loads an txt files and extracts all the inputs required by slalom
Parameters:  dataFile (str) – Strong containing the file name of the text file with the expression levels
 dataFile_delimiter (str) – delimiter for reading the data_file. Defaults to ‘,’.
 annoFiles (str, list) – Either string containing the file name of the txt file with the gene set annotations or a list containing several anotation files. Each line in in an annotattion file corresponds one gene set; a line starts with the name of the gene set and is followed by the annotated genes.
 annoDBs (str, list) – database file (MsigDB/REACTOME). If several annotation files are provided this hast to be a list of the same length.
 niceTerms (bool) – Indicates whether to nice terms (omit prefix, capitalize, shorten). Defaults to true.
 dataFile_delimiter – Delimiter used in dataFile; defaults to ‘,’.
 verbose (bool) – Show progress on loading terms (defaults to True).
Returns: An dictionary containing all the inputs required by slalom.

slalom.
load_hdf5
(dFile, anno='MSigDB')¶ Load input file for slalom
Loads an hdf file and extracts all the inputs required by slalom
Parameters: dFile (str) – String contaning the file name of the hdf file with the input data. Returns: An dictionary containing all the inputs required by slalom.

slalom.
initFA
(Y, terms, I, gene_ids=None, nHidden=3, nHiddenSparse=0, pruneGenes=True, FPR=0.99, FNR=0.001, noise='gauss', minGenes=20, do_preTrain=True, nFix=None, priors=None, covariates=None, dropFactors=True, learnPi=False)¶ Initialise the slalom factor analysis model.
Required 3 inputs are first, a gene expression matrix Y containing normalised count values of N cells and G variable genes in logspace, second a vector terms contaning the names of all annotated gene set (correspondig to annotated factors) and third, a binary indicator matrix I linking G genes to K terms by indicating which genes are annotated to each factor. A variety of options can be specified as described below.
Parameters:  Y (array_like) – Matrix of normalised count values of N cells and G variable genes in logspace. Dimension (\(N\times G\)).
 terms (vector_like) – Names of K annotated gene sets. Dimension (\(K\times 0\)).
 I (array_like) – Indicator matrix specifying whether a gene is annotated to a specific factor. Dimension (\(G\times K\)).
 gene_ids (array_like) – Gene identifiers (opitonal, defaults to None)
 FNR (float) – False negative rate of annotations. Defaults to 0.001
 FPR (float) – False positive rate of annotations. Defaults to 0.99
 nHidden (int) – Number of unannotated dense factors. Defaults to 3.
 nHiddenSparse (int) – Number of unannotated sparse factors. Defaults to 0. This value should be changed to e.g. 5 if the diagnositcs fail.
 pruneGenes (bool) – prune genes that are not annotated to a least one factor. This option allows fast inference and should be set to True either if the key objective is to rank factors or if the annotations cover all genes of interest. Defaults to True.
 dropFactors (bool) – drop factors from update schedule once they are shut off. In practice, factors that are switched off at some point during inference are usuallly not switched off. Allows faster inference. Defaults to True. Currently only supported for the Gaussian noise model.
 noise (str) – Specifies the observation noise model. Should be either ‘gauss’,`’hurdle’` or ‘poisson’. Defaults to gauss.
 minGenes (int) – minimum number of genes required per term to retain it Defaults to 20.
 do_preTrain (bool) – Boolean switch indicating whether pretraining should be used to establish the initial update order. Can be set to False for very large datasets. Defaults to True
 priors (dict) – Dictionary containing the hyperparameters of the priors for Alpha, Eps and Pi (PiDense and PiSparse). Defaults to None; in this case default values are used.
 covariates (array_like) – Matrix with known covariates that are controlled for when fitting the model. Defaults to None.
 learnPi (bool) – Learn sparsity of spearse hidden factors (default False)
Returns: A
slalom.CSparseFA
instance.

slalom.
preTrain
(Y, terms, P_I, noise='gauss', nFix=None, priors=None, covariates=None)¶ Pretrain the slalom factor analysis model.
Helper function to pretrain the slalom factor analysis model to achieve faster convergence and obtain an initial update order. Called by initFA.
Parameters:  Y (array_like) – Matrix of normalised count values of N cells and G variable genes in logspace. Dimension (\(N\times G\)).
 terms (vector_like) – Names of K annotated gene sets. Dimension (\(K\times 0\)).
 P_I (array_like) – Matrix specifying the likelihood of whether a gene is annotated to a specific factor. Dimension (\(G\times K\)).
 noise (str) – Specifies the observation noise model. Should be either ‘gauss’,`’hurdle’` or ‘poisson’. Defaults to gauss.
 nFix (int) – Number of terms which should be fixed and updated first. Defaults to None, resulting in the number of unannotated factors being updated first.
Returns: A vector containing the initial update order of the terms
Model fitting¶

class
slalom.core.
CSparseFA
(init_data=None, **parameters)¶ Variational Bayesian Factor analysis module. AExpressionModule is definded in bayesnet.expressionnet

getAnnotations
(terms=None)¶ Get annotations.
Parameters: term (str) – optional list of terms for which annotations are returned. Default None=all terms.

getDefaultParameters
()¶ return a hash with default parameter value for this model

getF
()¶ Get imputed expression values

getName
(base_name='slalom')¶ return a name summarising the main parameters

getNchanged
()¶ Return number of annotations changed by the model (sum of included and exluded genes )

getPi
(terms=None)¶ Get prior on Z (Bernourlli part part of spikeandslab prior) :param term: optional list of terms for which weights are returned. Default None=all terms. :type term: str

getRelevance
()¶ Get posterior relevance (inverse of ARD score) \(1/Q( extbf{lpha)}\)

getTermIndex
(terms)¶ get term index. Creates an index list based on a list of named terms :param terms: list with terms :type terms: list

getTerms
(annotated=True, unannotated=True, unannotated_sparse=True)¶ Get terms

getW
(terms=None)¶ Get weights (continous part of spikeandslab prior) \(Q(\widetilde{W})\) :param term: optional list of terms for which weights are returned. Default None=all terms. :type term: str

getX
(terms=None)¶ Get factors
Parameters: terms (str) – optional list of terms for which weights are returned. Default None=all terms.

getZ
(terms=None)¶ Get posterior of Z (Bernourlli part part of spikeandslab prior) \(Q(Z)\) :param term: optional list of terms for which weights are returned. Default None=all terms. :type term: str

getZchanged
(terms=None, threshold=0.5)¶ get matrix indicating whether the posterior distribution has changed for individual terms/genes :param terms: optional list of terms for which weights are returned. Default None=all terms. :type terms: str
 Rv:
 matrix [0,1,1]: 0: no change, 1: loss, +1: gain

printDiagnostics
()¶ Print diagnostics of the model. If more than 100% of gene annotations are for at least one factor, the model should be refitted with sparse unannotated facotrs.

regressOut
(idx=None, terms=None, use_latent=False, use_lm=False, Yraw=None)¶ Regress out unwanted variation
Parameters:  idx (vector_like) – Indices of factors to be regressed out
 use_latent (bool) – Boolean varoable indicating whether to regress out the unwanted variation on the lowdimensional latent space or the highdimensional gene expression space.
 use_lm (bool) – Regress out the factors by fitting a linear model for each gene
 Yraw (array_like) – Optionally a gene expression array can be passed from which the facotrs are regressed out
Returns: A matrix containing the corrected expression values.

train
(nIterations=None, forceIterations=False, tolerance=1e08, minIterations=700)¶ Iterate updates of weights (with spikeandslab prior), ARD parameters, factors, annd noise parameters.
Parameters:

update
()¶ Perform update of weights (with spikeandslab prior), ARD parameters, factors, annd noise parameters. Called by iterate method.

Plotting and saving results¶

slalom.
plotFactors
(FA=None, terms=None, X=None, lab=[], cols=None, isCont=True, madFilter=0.4)¶ Scatter plot of 2 selected factors
Parameters:  FA (
slalom.CSparseFA
) – Factor analysis object, usually generated using initFA function  terms (list) – List of strings containing names of factors to be plotted; if a factor analysis object is provided the corresponding factors are automatically extracted, otherwise this argument will only be used to label the axes
 idx2 (int) – Index of second factor to be plotted
 lab (vector_like) – Vector of labels for each data point
 isCont (bool) – Boolean variable indicating whether labels should be interpreted as discrete or continuous
 cols (vector_like) – Vector of colors. Should be the same length as unique labels. Default is None, then the brewer2mpl
 madFilter (float) – Filter factors by this mean absolute deviation to exclude outliers. For large datsets this can be set to 0.
 FA (

slalom.
plotRelevance
(FA, Nactive=20, stacked=True, madFilter=0.4, annotated=True, unannotated=False, unannotated_sparse=False)¶ Plot results of slalom
Identified factors and corresponding gene set size ordered by relevance (white = low relevance; black = high relevance). Top panel: Gene set augmentation, showing the number of genes added (red) and removed (blue) by the model for each factor.
Parameters:  FA (
slalom.CSparseFA
) – Factor analysis object, usually generated using initFA function  Nactive (int) – Numer of terms to be plotted
 stacked (bool) – Boolean variable indicating whether bars should be stacked
 db (str) – Name of database used, either ‘MSigDB’ or ‘REACTOME’
 madFilter (float) – Filter factors by this mean absolute deviation to exclude outliers. For large datasets this can be set to 0.
 annotated (bool) – Indicates whether annotated factors should be plotted. Defaults to True.
 unannotated (bool) – Indicates whether unannotated factors should be plotted. Defaults to False.
 unannotated – Indicates whether unannotated sparse factors should be plotted. Defaults to False.
 FA (

slalom.
saveFA
(FA, out_name=None, saveF=False)¶ Saves output of slalom.CSparseFA object as hdf5 file
Parameters:

slalom.
dumpFA
(FA)¶ Dumps output of slalom.CSparseFA object in a dictionary
Parameters: FA ( slalom.CSparseFA
) – Factor analysis object, ususally generated using initFA functionReturns: dictionary with results
Contents: