Data preparation


From our perspective, Gemini provides a flexible but standard database to store SNPs, INDELs and CNVs. It directly integrates the pedigree information, which we use fully, and can include additional columns to describe more than one phenotype. (We still have some work to do to make the App easy to customize for varying phenotype column formats.)

It consists of a single SQLite file, which you will feed into the Cytoscape App.

More information on how to install Gemini and how to load your data into it can be found on their website:


For WGS, we recommend loading only the exonic variants into Gemini rather than building a database with all the variants. This greatly reduces the storage space needed for the database and greatly speeds up data loading.
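As an illustration of restricting a VCF to exonic positions, here is a minimal sketch using only awk on toy data. The file names, the coordinates and the exon BED file are all assumptions made for the example, and the filter only tests the POS column, so it is a simplification. On real data, a compressed and indexed VCF with a dedicated tool (for example the regions-file option of bcftools view) would likely be more appropriate.

```shell
workdir=$(mktemp -d)

# Toy exon interval in BED format (0-based start, end exclusive); coordinates are made up.
printf 'chr1\t100\t200\n' > "$workdir/exons.bed"

# Toy VCF with one variant inside the interval and one outside.
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t150\t.\tA\tG\t50\tPASS\t.\n'
  printf 'chr1\t500\t.\tC\tT\t50\tPASS\t.\n'
} > "$workdir/input.vcf"

# First file: remember the exon intervals. Second file: keep header lines
# and any variant whose 1-based POS falls inside one of the intervals.
awk -F'\t' '
  NR == FNR { chrom[NR] = $1; lo[NR] = $2 + 0; hi[NR] = $3 + 0; n = NR; next }
  /^#/      { print; next }
  {
    pos = $2 + 0
    for (i = 1; i <= n; i++)
      if ($1 == chrom[i] && pos > lo[i] && pos <= hi[i]) { print; next }
  }
' "$workdir/exons.bed" "$workdir/input.vcf" > "$workdir/exonic.vcf"

# Count the variant (non-header) lines that were kept.
kept=$(grep -vc '^#' "$workdir/exonic.vcf")
echo "variants kept: $kept"
```

The loop over intervals is linear per variant, which is fine for a toy example but would be slow on a full exon list.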


If you need to install additional tools to prepare your data and load it into Gemini, we recommend using bioconda.

The best way to do this is to install the conda package, which you can download there:

Once installed, you need to add the channels for bioconda to work:

  • conda config --add channels conda-forge
  • conda config --add channels defaults
  • conda config --add channels r
  • conda config --add channels bioconda

Then install the tools needed (we will specify the command for each tool).

Genome version

As it stands, Gemini is designed for hg19/GRCh37, so it is better to stick with it. However, this should evolve rapidly in their next releases. They offer a way to put other versions of the genome in the database using an independent script, but we have not tested it, nor Gravity with it.

VCF calling

All the samples must be called and put together to form a cohort VCF file.

The recommended way to do this is to use either the GATK HaplotypeCaller or the Freebayes joint calling option. You can find detailed information in: GATK Best Practices for Germline SNP & Indel Discovery in Whole Genome and Exome Sequence.


Install it using: conda install gatk

If the user has individual VCFs for all of their patients and cannot (or does not want to) redo the calling across all of them at the same time, it is possible to use a tool such as bcftools merge to create a kind of cohort VCF. However, this approach is not recommended, as the file will contain a lot of missing genotypes.


Install it using: conda install bcftools
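For the fallback described above, a merge could be sketched as follows. The sample and output file names are hypothetical, and the inputs are assumed to be bgzip-compressed and indexed. The command is printed first and only executed when bcftools is installed and the first input file actually exists.

```shell
# Hypothetical per-sample files; each is assumed to be bgzip-compressed and indexed.
out=cohort.vcf.gz
set -- sample1.vcf.gz sample2.vcf.gz sample3.vcf.gz

# Show the command that would be run.
merge_cmd="bcftools merge -O z -o $out $*"
echo "$merge_cmd"

# Run the merge only when bcftools is installed and the first input exists.
if command -v bcftools >/dev/null 2>&1 && [ -f "$1" ]; then
  bcftools merge -O z -o "$out" "$@"
  # Build a tabix index for the merged file.
  bcftools index -t "$out"
fi
```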

Variant annotation

We recommend using VEP to annotate the VCF file. There is no need to add further annotations such as ClinVar, CADD, ExAC or 1000 Genomes, as Gemini adds them automatically.


Install it using: conda install variant-effect-predictor

Pedigree file

The user should make a pedigree file for their samples. It consists of a tab-separated text file with at least 6 columns:

  • Family_ID
  • ID
  • Paternal_ID
  • Maternal_ID
  • Sex
  • Phenotype

It is possible to add other columns with other information. We usually have a Family column to indicate whether the family is ASD or Control, but this is very likely to evolve soon into something easily parameterized.
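A pedigree file with the six columns above plus an extra Family column might look like the sketch below. The family and sample IDs are placeholders, and the numeric coding (sex 1 = male, 2 = female; phenotype 1 = unaffected, 2 = affected; 0 for an unknown parent) is the usual PED convention, assumed here rather than taken from this page. The awk check at the end simply counts lines that have fewer than 6 tab-separated columns.

```shell
workdir=$(mktemp -d)
ped="$workdir/my_cohort_pedigree_file.ped"

# Header line, then one hypothetical trio: two unaffected parents and an affected child.
printf '#Family_ID\tID\tPaternal_ID\tMaternal_ID\tSex\tPhenotype\tFamily\n' > "$ped"
printf 'FAM1\tfather1\t0\t0\t1\t1\tASD\n' >> "$ped"
printf 'FAM1\tmother1\t0\t0\t2\t1\tASD\n' >> "$ped"
printf 'FAM1\tchild1\tfather1\tmother1\t1\t2\tASD\n' >> "$ped"

# Count lines with fewer than 6 tab-separated columns (0 means the layout is fine).
bad_lines=$(awk -F'\t' 'NF < 6 { n++ } END { print n + 0 }' "$ped")
echo "lines with fewer than 6 columns: $bad_lines"
```

A check like this catches the common mistake of separating columns with spaces instead of tabs before the file is handed to Gemini.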

Loading in Gemini

The annotated VCF and the pedigree file are then used to create a Gemini database, as described in their documentation.

gemini load -v my_cohort_vcf_file -t VEP -p my_cohort_pedigree_file my_gemini_database


Install it using: conda install gemini