scRNA-seq datasets are built against different genome references, which differ by species and by genome version. infercna comes with a number of built-in genomes that it can be configured to use. See below for more information.
Note: Additional genomes / genome versions will be added in future releases of infercna.
currentGenome()
By default, infercna uses genome hg19. This is the penultimate release version of the H. sapiens genome. You can always check which genome is configured with infercna::currentGenome().
useGenome()
You can configure infercna to work with one of the built-in genomes. The built-in genomes are currently the two latest versions of H. sapiens, hg38 and hg19, and the latest version of M. musculus, mm10.
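A minimal sketch of switching between the built-in genomes. It assumes that useGenome() takes the genome name as its first argument (the exact signature is not shown in this section) and only calls into infercna when the package is installed. The builtin vector and the switch_genome() helper are hypothetical names introduced for illustration.

```r
# Built-in genome names, as listed in the text above.
builtin <- c("hg19", "hg38", "mm10")

# Hypothetical helper: validate the name, then switch genomes.
switch_genome <- function(name) {
  if (!name %in% builtin) {
    stop("Not a built-in genome: ", name)
  }
  # Only call into infercna when it is installed.
  if (requireNamespace("infercna", quietly = TRUE)) {
    infercna::useGenome(name)          # assumption: name is the first argument
    print(infercna::currentGenome())   # check which genome is now configured
  }
  invisible(name)
}

switch_genome("mm10")
```

Checking the name up front gives a clear error for a typo before any package state is touched.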
retrieveGenome()
You can also get the genome data (in dataframe form). By default retrieveGenome will return the current genome in use, but you can instead specify a different genome in the function call.
retrieveGenome()
#> Retrieving: mm10
#> # A tibble: 54,513 x 8
#> symbol start_position end_position chromosome_name arm band strand
#> <chr> <int> <int> <fct> <fct> <chr> <int>
#> 1 49334… 3073253 3074322 1 1 A1 1
#> 2 Gm262… 3102016 3102125 1 1 A1 1
#> 3 Xkr4 3205901 3671498 1 1 A1 -1
#> 4 Gm189… 3252757 3253236 1 1 A1 1
#> # … with 54,509 more rows, and 1 more variable: ensembl_gene_id <chr>
retrieveGenome(name = 'hg19')
#> Retrieving: hg19
#> # A tibble: 33,575 x 8
#> symbol start_position end_position chromosome_name arm band strand
#> <chr> <dbl> <dbl> <fct> <fct> <chr> <int>
#> 1 DDX11… 11869 14412 1 1p p36.… 1
#> 2 WASH7P 14363 29806 1 1p p36.… -1
#> 3 MIR13… 29554 31109 1 1p p36.… 1
#> 4 FAM13… 34554 36081 1 1p p36.… -1
#> # … with 33,571 more rows, and 1 more variable: ensembl_gene_id <chr>
retrieveGenome(name = 'hg38')
#> Retrieving: hg38
#> # A tibble: 37,501 x 8
#> symbol start_position end_position chromosome_name arm band strand
#> <chr> <int> <int> <fct> <fct> <chr> <int>
#> 1 DDX11… 11869 14409 1 1p p36.… 1
#> 2 WASH7P 14404 29570 1 1p p36.… -1
#> 3 MIR68… 17369 17436 1 1p p36.… -1
#> 4 MIR13… 29554 31109 1 1p p36.… 1
#> # … with 37,497 more rows, and 1 more variable: ensembl_gene_id <chr>
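Because the genome comes back as an ordinary dataframe, it can be summarised with base R. The sketch below counts genes per chromosome arm. The genes_per_arm() helper and the toy dataframe (made-up symbols and positions) are hypothetical; the call to retrieveGenome(name = 'hg19') only runs if infercna is installed.

```r
# Hypothetical helper: count genes on each chromosome arm of a genome dataframe.
genes_per_arm <- function(genome) {
  table(genome$arm)
}

# Toy stand-in using the same column names as the built-in genomes.
# Symbols and positions are made up for illustration.
toy <- data.frame(
  symbol = c("geneA", "geneB", "geneC"),
  chromosome_name = factor(c("1", "1", "2")),
  arm = factor(c("1p", "1q", "2p")),
  start_position = c(100, 5000, 200),
  stringsAsFactors = FALSE
)
print(genes_per_arm(toy))

# With infercna installed, the same helper applies to a built-in genome.
if (requireNamespace("infercna", quietly = TRUE)) {
  hg19 <- infercna::retrieveGenome(name = "hg19")
  print(head(genes_per_arm(hg19)))
}
```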
The default genomes available in infercna have the following columns.
symbol: gene name
chromosome_name: chromosome name
arm: chromosome arm (the same as chromosome_name in genome mm10)
start_position: of each gene
end_position: of each gene
band: chromosome band
strand: chromosome strand
ensembl_gene_id: Ensembl gene ID
Note: Only the first four of these columns are required if you intend to add your own genome (discussed in more detail below).
addGenome()
In some cases you may need to configure infercna with your own genome, either because it is not built into infercna, or because it is a genome that you built yourself, tailored to your purposes.
Using infercna::addGenome, it is possible to configure infercna to use the genome of your choosing, so long as it meets the following requirements.
Note: addGenome will return an error if any of these requirements are not met.
Be a dataframe
Contain the following columns:
    symbol: gene name, e.g. SOX11, BRAF, TP53
    chromosome_name: chromosome name, e.g. levels(data$chromosome_name) == c(1:22, "X", "Y")
    arm: chromosome arm name, e.g. levels(data$arm) == c('1p', '1q', ..., 'Yp', 'Yq'). If your genome has no chromosome arms, supply arm as a duplicate of chromosome_name.
    start_position: of each gene
Columns chromosome_name and arm are factor columns (see above)
Note: You can add a genome with as many additional columns as you like. For example, you may want to have end_position and band columns in addition to those listed above.
# Toy genome: the four required columns plus an optional end_position column
Data = data.frame(symbol = letters,
                  chromosome_name = factor(LETTERS),
                  arm = factor(LETTERS),
                  start_position = seq(1, length(letters)*2, by = 2),
                  end_position = seq(2, length(letters)*2, by = 2))
tibble::as_tibble(Data)
#> # A tibble: 26 x 5
#> symbol chromosome_name arm start_position end_position
#> <fct> <fct> <fct> <dbl> <dbl>
#> 1 a A A 1 2
#> 2 b B B 3 4
#> 3 c C C 5 6
#> 4 d D D 7 8
#> # … with 22 more rows
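The requirements listed above can be checked in plain R before a genome is handed to addGenome. The sketch below is a hypothetical check_genome() helper written for illustration. It does not reproduce addGenome's internal checks, which are not shown here; it only tests the conditions stated in the list.

```r
# Hypothetical helper: return a character vector describing unmet requirements.
# An empty result means the listed requirements are satisfied.
check_genome <- function(data) {
  required <- c("symbol", "chromosome_name", "arm", "start_position")
  if (!is.data.frame(data)) {
    return("not a dataframe")
  }
  problems <- character(0)
  absent <- setdiff(required, colnames(data))
  if (length(absent) > 0) {
    problems <- c(problems,
                  paste("missing columns:", paste(absent, collapse = ", ")))
  }
  # chromosome_name and arm must be factor columns.
  for (col in intersect(c("chromosome_name", "arm"), colnames(data))) {
    if (!is.factor(data[[col]])) {
      problems <- c(problems, paste(col, "is not a factor"))
    }
  }
  problems
}

Data <- data.frame(symbol = letters,
                   chromosome_name = factor(LETTERS),
                   arm = factor(LETTERS),
                   start_position = seq(1, length(letters) * 2, by = 2),
                   stringsAsFactors = FALSE)

# Same genome, but with arm stored as character instead of factor.
bad <- Data
bad$arm <- as.character(bad$arm)

print(check_genome(Data))
print(check_genome(bad))
```

Returning the list of problems, rather than stopping at the first one, makes it easier to fix a dataframe in a single pass.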
The required columns are those needed for one or more of the following:
Order the genes by their genomic position: chromosome_name, start_position
+ Note that the input ordering of the symbol column does not matter.
Split the genes by their chromosomal position: chromosome_name, arm
+ Note that you can additionally split genes by any extra columns that you add, such as band.
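The two operations above can be sketched in base R on a toy genome. This is an illustration of what the required columns make possible, not infercna's own implementation. The dataframe, its symbols and its positions are made up.

```r
# Toy genome with the four required columns, deliberately given out of order.
genome <- data.frame(
  symbol = c("g3", "g1", "g4", "g2"),
  chromosome_name = factor(c("2", "1", "2", "1"), levels = c("1", "2")),
  arm = factor(c("2p", "1p", "2q", "1q"),
               levels = c("1p", "1q", "2p", "2q")),
  start_position = c(50, 10, 900, 700),
  stringsAsFactors = FALSE
)

# Order genes by genomic position: chromosome first, then start position.
# Factors sort by their level order, so the levels define chromosome order.
sorted_genome <- genome[order(genome$chromosome_name, genome$start_position), ]
print(sorted_genome$symbol)

# Split the ordered gene symbols by chromosome arm.
by_arm <- split(sorted_genome$symbol, sorted_genome$arm)
print(by_arm)
```

Because the sort keys come from the genome columns, the order in which symbol is supplied has no effect on the result.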