Gene expression in bulk samples reflects the combined contribution of multiple cell types and cell states, so many genes that are good markers of a particular cellular state in single-cell data are not necessarily good markers in bulk data. To address this, the function identifies genes that are associated with a high frequency of a particular cellular state. Among other things, this analysis can highlight genes whose sample-wide expression promotes particular cellular states.

state_high_genes(
  scores,
  groups,
  scores2 = scores,
  gene.occurence = 0.5,
  min = 0.5,
  diff = NULL
)

Arguments

scores

a dataframe of cell signature scores. Cells are rows and signatures are columns. Samples (as given in <groups>) are split into state-high and state-low groups according to the states defined by the columns of <scores>.

groups

a list of cell IDs by sample.

scores2

a dataframe of cell signature scores. Cells are rows and signatures are columns. Cells within the state-high and state-low groups are further subdivided into cellular states according to the columns of <scores2>, to allow state-frequency-controlled comparisons between state-high and state-low groups in bulk. In most cases this can be left as the default, in which the cellular state definitions are the same as those provided by <scores>. Default: scores

gene.occurence

the fraction of per-cellular-state comparisons in which a gene must be differentially expressed in order to be retained as a final high-frequency-associated gene. Default: 0.5

min

minimum value required for a cell (row in <scores>) to be assigned to a state (col in <scores>). Default: 0.5

diff

minimum difference in value to the 'next-best' row that is required for a row to be assigned to a column. Default: NULL
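
As a rough illustration of how <min> and <diff> could interact, the base R sketch below assigns each cell to its top-scoring state. It is not the scalop implementation: `assign_states` is a hypothetical helper, and two choices are assumptions made for this example only, namely that thresholds are inclusive (`>=`) and that 'next-best' means the second-highest score in the same row.

```r
# Hypothetical helper (not part of scalop): assign each row of `scores`
# to its top-scoring column, subject to `min` and, optionally, `diff`.
assign_states <- function(scores, min = 0.5, diff = NULL) {
  m <- as.matrix(scores)
  top <- apply(m, 1, max)                                  # best score per cell
  state <- colnames(m)[max.col(m, ties.method = "first")]  # best state per cell
  keep <- top >= min                                       # assumption: inclusive threshold
  if (!is.null(diff)) {
    # Assumption: 'next-best' is the second-highest score in the same row.
    second <- apply(m, 1, function(x) sort(x, decreasing = TRUE)[2])
    keep <- keep & (top - second >= diff)
  }
  state[!keep] <- NA                                       # unassigned cells
  setNames(state, rownames(m))
}

# Toy scores: three cells, two states
scores <- data.frame(
  A = c(0.9, 0.6, 0.2),
  B = c(0.1, 0.55, 0.3),
  row.names = c("cell1", "cell2", "cell3")
)

assign_states(scores, min = 0.5)              # cell1 and cell2 go to A; cell3 is unassigned
assign_states(scores, min = 0.5, diff = 0.2)  # cell2 is now unassigned too (0.6 vs 0.55)
```

With <diff> left as NULL, only the <min> threshold applies, so ambiguous cells such as `cell2` are still assigned to their top state.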

Value

a list of genes per state (as defined by the columns in <scores>)

Details

The function proceeds through the following steps:

1. We define state-high and state-low groups. This involves finding which samples have a significantly high, or significantly low, fraction of the cellular state in question. See `?scalop::assign_bulk_states` for details.

2. For each state, we perform differential expression analysis between the state-high samples and the state-low samples. To control for differing state frequencies between corresponding state-high and state-low groups, the differential expression analysis is performed for each cellular state in turn. Note that it is here that <scores2> is relevant. See `?scalop::dea` for details.

3. Per state-high vs. state-low comparison, we integrate the sets of DE genes (one set per cellular state) by keeping only those genes that appear across cellular states. You can modify the fraction of the total number of cellular states in which a DE gene needs to be observed in order to be kept. See <gene.occurence> for details.
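
The integration of DE gene sets across cellular states can be sketched in base R as follows. This is a minimal illustration rather than the package's code: `filter_by_occurrence`, the toy gene names, and the inclusive comparison (`>=`) against <gene.occurence> are all assumptions made for the example.

```r
# Hypothetical helper (not part of scalop): keep genes that are DE in at
# least a fraction `gene.occurence` of the per-state DE gene sets.
filter_by_occurrence <- function(de_sets, gene.occurence = 0.5) {
  # Count in how many sets each gene appears (each set counted once per gene)
  counts <- table(unlist(lapply(de_sets, unique)))
  frac <- counts / length(de_sets)
  names(frac)[as.vector(frac >= gene.occurence)]
}

# Toy input: one DE gene set per cellular state, for a single
# state-high vs. state-low comparison
de_sets <- list(
  state1 = c("GENE_A", "GENE_B", "GENE_C"),
  state2 = c("GENE_A", "GENE_C"),
  state3 = c("GENE_A", "GENE_D")
)

filter_by_occurrence(de_sets, gene.occurence = 0.5)
# GENE_A appears in 3/3 sets and GENE_C in 2/3, so both are kept;
# GENE_B and GENE_D appear in 1/3 and are dropped.
```

Raising the threshold to 1 in this sketch would keep only genes that are DE in every cellular state, here `GENE_A`.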