==========
Input data
==========

GRAVITY accepts two input file formats:

#. The standard Variant Call Format (VCF) [DANECEK2011]_
#. A GEMINI database [PAILA2013]_

This section describes the options accepted for each format, along with additional formatting details.

-------------------
Variant Call Format
-------------------

VCF is by far the most common file format for genetic variant data. Most variant callers, such as GATK, produce it.

**VEP**

The VCF file must be annotated with VEP [MCLAREN2016]_. GRAVITY reads the following VEP fields:

* SYMBOL (required)
* Consequence (required)
* IMPACT (required)
* CANONICAL (optional, but highly recommended)
* HGVS (optional)
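
The field list above can be checked before loading a file. The snippet below is a minimal sketch, not part of GRAVITY: it assumes VEP wrote its annotations under the default ``CSQ`` INFO key, whose header description ends with ``Format: field1|field2|...``. The function names and the sample header are hypothetical.

```python
import re

# Fields listed as required in this section.
REQUIRED_FIELDS = ("SYMBOL", "Consequence", "IMPACT")


def csq_fields(header_lines):
    """Return the VEP field names declared in the CSQ INFO header line."""
    for line in header_lines:
        # Assumption: VEP used its default INFO key, CSQ.
        if line.startswith("##INFO=<ID=CSQ,"):
            match = re.search(r'Format: ([^"]+)"', line)
            if match:
                return match.group(1).strip().split("|")
    return []


def missing_required_fields(header_lines):
    """List the required fields that are absent from the CSQ declaration."""
    present = set(csq_fields(header_lines))
    return [name for name in REQUIRED_FIELDS if name not in present]


# Hypothetical header, shortened for illustration.
header = [
    "##fileformat=VCFv4.2",
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence '
    'annotations from Ensembl VEP. Format: '
    'Allele|Consequence|IMPACT|SYMBOL|CANONICAL">',
]
print(missing_required_fields(header))
```

An empty list means that all three required fields are declared in the header.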

**Additional annotations**

We also recommend adding supplementary annotations, such as *gnomAD* frequencies [LEK2016]_ or *CADD* [RENTZSCH2018]_ predictions of variant deleteriousness.

**bgzip & tabix**

Once the VCF file is produced, it needs to be compressed with *bgzip* and then indexed with *tabix*. Both tools are available at http://www.htslib.org/download/.
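
As an illustration, the two steps can be scripted. This is a sketch under assumptions: *bgzip* and *tabix* are installed and on the ``PATH``, and the helper names are hypothetical. ``bgzip`` replaces the input file with a ``.gz`` version, and ``tabix -p vcf`` writes the index next to it.

```python
import subprocess


def compression_commands(vcf_path):
    """Build the bgzip and tabix command lines for an uncompressed VCF."""
    compressed = vcf_path + ".gz"
    return [
        # bgzip replaces <file>.vcf with <file>.vcf.gz
        ["bgzip", vcf_path],
        # tabix writes <file>.vcf.gz.tbi using the VCF preset
        ["tabix", "-p", "vcf", compressed],
    ]


def compress_and_index(vcf_path):
    """Run both commands and return the path of the compressed file."""
    for command in compression_commands(vcf_path):
        # check=True raises CalledProcessError if a tool fails.
        subprocess.run(command, check=True)
    return vcf_path + ".gz"
```

For example, ``compress_and_index("CEU-Trio.vcf")`` would be expected to leave *CEU-Trio.vcf.gz* and its index in the same directory.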

**Pedigree**

Finally, the user can optionally provide a pedigree file alongside the compressed VCF file and its index. It gives GRAVITY additional information, such as the relationships between individuals, which simplifies the user's work, particularly on larger cohorts. It also tells GRAVITY the phenotype of each individual, which enables a few more features. The pedigree file must:

* be named exactly like the compressed VCF file, with the extension *.ped* appended (*e.g.* if the VCF file is called *CEU-Trio.vcf.gz*, the pedigree file must be named *CEU-Trio.vcf.gz.ped*).
* be formatted like a PLINK *fam* file:

  * tab-delimited
  * no header
  * 6 columns (in order: FID, IID, paternal ID, maternal ID, sex and phenotype)
  * -9 or 0 for missing information
  * sex: 1 for males, 2 for females
  * phenotype: 1 for unaffected, 2 for affected
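
The naming and formatting rules above can be sketched as follows. The family and individual IDs are hypothetical and do not describe the demo dataset. The helper names are illustrative only.

```python
import csv
from pathlib import Path


def pedigree_path(vcf_gz_path):
    """Append .ped to the full name of the compressed VCF file."""
    return Path(str(vcf_gz_path) + ".ped")


def write_pedigree(vcf_gz_path, rows):
    """Write a tab-delimited, header-less, 6-column pedigree file."""
    for row in rows:
        if len(row) != 6:
            raise ValueError("expected 6 columns, got %d" % len(row))
    target = pedigree_path(vcf_gz_path)
    with open(target, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)
    return target


# Hypothetical trio: FID, IID, paternal ID, maternal ID, sex, phenotype.
TRIO = [
    ("FAM1", "child", "father", "mother", 2, 2),  # affected daughter
    ("FAM1", "father", 0, 0, 1, 1),               # unaffected, parents missing
    ("FAM1", "mother", 0, 0, 2, 1),               # unaffected, parents missing
]
```

Calling ``write_pedigree("CEU-Trio.vcf.gz", TRIO)`` would create *CEU-Trio.vcf.gz.ped* next to the VCF file.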

.. note::
  Our demo dataset from the *Getting started* section was created by following the Broad Institute best practices [DEPRISTO2011]_ [VANDERAUWERA2013]_.

  We applied VEP version 91 for the functional annotations, then used *vcfanno* [PEDERSEN2016]_ to add the *gnomAD* [LEK2016]_ and *CADD* [RENTZSCH2018]_ annotations.

------
GEMINI
------

GEMINI [PAILA2013]_ is a powerful framework for exploring genetic data. It loads all the genetic data into a database and adds numerous annotations (*e.g.* gnomAD frequencies, CADD, ClinVar annotations). The database has the advantage of standardised attribute names and content formatting. GRAVITY accepts this database (a **.db** file) as input.

The GEMINI `documentation website <https://gemini.readthedocs.io/en/latest/>`_ explains how to create such a database.
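
A quick sanity check can be run on a **.db** file before giving it to GRAVITY. This is a rough sketch, not part of GRAVITY or GEMINI: it assumes the database is an SQLite file containing a ``variants`` table, as GEMINI databases normally are. The function names are hypothetical.

```python
import sqlite3


def list_tables(db_path):
    """Return the sorted table names found in an SQLite database file."""
    # Note: sqlite3.connect creates an empty file if db_path does not exist.
    connection = sqlite3.connect(db_path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]


def looks_like_gemini_db(db_path):
    """Heuristic check: assume a GEMINI database has a 'variants' table."""
    return "variants" in list_tables(db_path)
```

The check is only a heuristic. It does not verify that the annotations GRAVITY relies on are present.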

References
----------

.. [DANECEK2011] Danecek, Petr, et al. `"The variant call format and VCFtools." <https://doi.org/10.1093/bioinformatics/btr330>`_ Bioinformatics 27.15 (2011): 2156-2158.
.. [PAILA2013] Paila, Umadevi, et al. `"GEMINI: integrative exploration of genetic variation and genome annotations." <https://doi.org/10.1371/journal.pcbi.1003153>`_ PLoS computational biology 9.7 (2013): e1003153.
.. [MCLAREN2016] McLaren, William, et al. `"The ensembl variant effect predictor." <https://doi.org/10.1186/s13059-016-0974-4>`_ Genome biology 17.1 (2016): 122.
.. [PEDERSEN2016] Pedersen, Brent S., Ryan M. Layer, and Aaron R. Quinlan. `"Vcfanno: fast, flexible annotation of genetic variants." <https://doi.org/10.1186/s13059-016-0973-5>`_ Genome biology 17.1 (2016): 118.
.. [RENTZSCH2018] Rentzsch, Philipp, et al. `"CADD: predicting the deleteriousness of variants throughout the human genome." <https://doi.org/10.1093/nar/gky1016>`_ Nucleic acids research (2018).
.. [LEK2016] Lek, Monkol, et al. `"Analysis of protein-coding genetic variation in 60,706 humans." <https://doi.org/10.1038/nature19057>`_ Nature 536.7616 (2016): 285.
.. [DEPRISTO2011] DePristo, Mark A., et al. `"A framework for variation discovery and genotyping using next-generation DNA sequencing data." <https://doi.org/10.1038/ng.806>`_ Nature genetics 43.5 (2011): 491.
.. [VANDERAUWERA2013] Van der Auwera, Geraldine A., et al. `"From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline." <https://doi.org/10.1002/0471250953.bi1110s43>`_ Current protocols in bioinformatics 43.1 (2013): 11-10.