================
Data preparation
================

------
Gemini
------

From our perspective, Gemini provides a flexible yet standard database for storing SNPs, indels and CNVs. It directly integrates the pedigree information, which we use fully, and can include additional columns to describe more than one phenotype. (We still have some work to do before the App can be easily customized to handle varying phenotype column formats.)

It consists of a single SQLite file, which you will feed into the Cytoscape App.

More information on how to install Gemini and how to load your data into it can be found on its website: https://gemini.readthedocs.io/en/latest/

.. note::
  For WGS data, we recommend loading only the exonic variants into Gemini rather than building a database with all the variants. This greatly reduces the storage space needed for the database and also greatly speeds up data loading.
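
One possible way to restrict a WGS cohort VCF to exonic variants before loading is to intersect it with a BED file of exon coordinates. The sketch below assumes *bcftools* is installed, that the input VCF is bgzip-compressed, and that ``exons.bed`` is a hypothetical BED file of hg19/GRCh37 exons that you provide yourself. The helper name ``keep_exonic`` is ours and is not part of any tool.

.. code-block:: bash

	# keep_exonic INPUT.vcf.gz EXONS.bed OUTPUT.vcf.gz
	# Hypothetical helper: keeps only the records overlapping the BED regions.
	keep_exonic() {
	    bcftools index -f -t "$1"                  # -R needs an indexed input
	    bcftools view -R "$2" -O z -o "$3" "$1"    # write bgzip-compressed output
	    bcftools index -f -t "$3"                  # index the filtered file
	}

	# Example: keep_exonic wgs_cohort.vcf.gz exons.bed exonic_cohort.vcf.gz

The BED file must use the same chromosome names as the VCF, and merging overlapping exon intervals beforehand may avoid duplicated records in the output.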

.. tip::
	If you need to install additional tools to prepare your data and load it into Gemini, we recommend using **bioconda**.

	The best way to do this is to install Miniconda, which you can download here: `conda.io/miniconda.html <https://conda.io/miniconda.html>`_

	Once it is installed, add the channels that bioconda needs:

	* *conda config --add channels conda-forge*
	* *conda config --add channels defaults*
	* *conda config --add channels r*
	* *conda config --add channels bioconda*

	Then install the tools you need (we give the install command for each tool below).

--------------
Genome version
--------------

Gemini is currently designed for hg19/GRCh37, so it is best to stick with that build. This should, however, change quickly in upcoming releases. The Gemini developers offer a way to load other genome versions into the database using an independent script, but we have tested neither that script nor Gravity with a database built that way.

-----------
VCF calling
-----------

All samples must be called and combined into a single cohort VCF file.

The **recommended** way to do this is to use either the GATK HaplotypeCaller or the Freebayes joint calling option. Detailed information can be found in: `GATK Best Practices for Germline SNP & Indel Discovery in Whole Genome and Exome Sequence <https://software.broadinstitute.org/gatk/best-practices/bp_3step.php?case=GermShortWGS>`_.

.. tip::
	Install it using: *conda install gatk*
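
As an illustration of joint calling, the sketch below passes all BAM files to a single Freebayes run so that every sample is genotyped at every variant site. It assumes *freebayes* is installed and that the BAM files are sorted, indexed and carry distinct sample names in their read groups. The helper name ``joint_call`` and the file names are hypothetical. For GATK, follow the Best Practices linked above.

.. code-block:: bash

	# joint_call REFERENCE.fa OUTPUT.vcf SAMPLE1.bam [SAMPLE2.bam ...]
	joint_call() {
	    reference=$1
	    output=$2
	    shift 2
	    # All remaining arguments are BAM files, called together in one run.
	    freebayes -f "$reference" "$@" > "$output"
	}

	# Example: joint_call hg19.fa my_cohort_vcf_file child.bam father.bam mother.bam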

If you have individual VCFs for all of your patients and cannot (or do not want to) redo the calling across all of them at once, you can use a tool such as *bcftools merge* to create a makeshift cohort VCF. However, this approach is **not recommended**: a variant absent from a sample's VCF is recorded as a missing genotype rather than as homozygous reference, so the resulting file will contain many missing genotypes.

.. tip::
	Install it using: *conda install bcftools*
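
If you do go the merge route, the command could look like the sketch below. It assumes one bgzip-compressed, single-sample VCF per patient (the file names are hypothetical) and defines a helper of our own, ``merge_vcfs``, that indexes each input before merging.

.. code-block:: bash

	# merge_vcfs OUTPUT.vcf.gz SAMPLE1.vcf.gz SAMPLE2.vcf.gz [...]
	merge_vcfs() {
	    output=$1
	    shift
	    for vcf in "$@"; do
	        bcftools index -f -t "$vcf"    # bcftools merge needs indexed inputs
	    done
	    # Samples lacking a record at a site receive a missing genotype.
	    bcftools merge -O z -o "$output" "$@"
	}

	# Example: merge_vcfs my_cohort_vcf_file.vcf.gz patient1.vcf.gz patient2.vcf.gz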

------------------
Variant annotation
------------------

We recommend using `VEP <http://www.ensembl.org/info/docs/tools/vep/index.html>`_ to annotate the VCF file. There is no need to add further annotations such as ClinVar, CADD, ExAC or 1000 Genomes, as Gemini adds them automatically.

.. tip::
	Install it using: *conda install variant-effect-predictor*

-------------
Pedigree file
-------------

You should create a pedigree file for your samples. It is a tab-separated text file with at least 6 columns:

* Family_ID
* ID
* Paternal_ID
* Maternal_ID
* Sex
* Phenotype

Other columns with additional information can be added. We usually include a *Family* column indicating whether the family is *ASD* or *Control*, but this is very likely to evolve soon into something easily parameterized.
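
As a concrete illustration, the sketch below writes a small pedigree for one trio plus an unrelated control and checks that every line has at least six tab-separated columns. The sample and family names are invented. The codes follow the usual PED conventions (sex: 1 = male, 2 = female; phenotype: 1 = unaffected, 2 = affected; 0 = unknown parent). The ``#`` header line naming the extra *Family* column is our assumption about how Gemini picks up additional columns, so check it against the Gemini documentation.

.. code-block:: bash

	ped=my_cohort_pedigree_file

	# Header plus one line per sample; columns are separated by tabs (\t).
	{
	    printf '#Family_ID\tID\tPaternal_ID\tMaternal_ID\tSex\tPhenotype\tFamily\n'
	    printf 'FAM1\tchild1\tfather1\tmother1\t1\t2\tASD\n'
	    printf 'FAM1\tfather1\t0\t0\t1\t1\tASD\n'
	    printf 'FAM1\tmother1\t0\t0\t2\t1\tASD\n'
	    printf 'FAM2\tcontrol1\t0\t0\t2\t1\tControl\n'
	} > "$ped"

	# Report every line with fewer than 6 columns; awk exits non-zero if any is found.
	awk -F'\t' 'NF < 6 { printf "line %d: only %d columns\n", NR, NF; bad = 1 }
	            END { exit bad }' "$ped" && echo "pedigree OK"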

-----------------
Loading in Gemini
-----------------

The annotated VCF and the pedigree file are then used to create a `Gemini <https://github.com/arq5x/gemini>`_ database, as described in its `documentation <https://gemini.readthedocs.io/en/latest/>`_.

.. code-block:: bash

	gemini load -v my_cohort_vcf_file -t VEP -p my_cohort_pedigree_file my_gemini_database

.. tip::
	Install it using: *conda install gemini*
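
Once loading has finished, a quick query is a simple way to confirm that the variants and the pedigree made it into the database. The sketch below assumes the database was created with the ``gemini load`` command above. The helper name ``check_gemini_db`` is ours, and the ``variants`` and ``samples`` tables are the ones described in the Gemini documentation.

.. code-block:: bash

	# check_gemini_db DATABASE
	check_gemini_db() {
	    # Number of variants loaded from the VCF.
	    gemini query -q "select count(*) from variants" "$1"
	    # Samples with the family and phenotype read from the pedigree file.
	    gemini query -q "select family_id, name, phenotype from samples" "$1"
	}

	# Example: check_gemini_db my_gemini_database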