Input file preparation
======================


Samplesheet
-----------

**Valid file extensions**
  - ``.fq.gz``
  - ``.fastq.gz``
  - ``.fastq``
  - ``.fq``
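
The extension rule above can be checked before building the samplesheet. The following is a minimal sketch, not part of the pipeline: the helper name ``invalid_fastq_names`` is hypothetical, and it assumes file paths are supplied one per line on standard input.

.. code-block:: console

    # Hypothetical helper: print every path read from stdin that does NOT end
    # in one of the accepted extensions (.fq, .fastq, .fq.gz, .fastq.gz).
    invalid_fastq_names() {
        grep -Ev '\.(fq|fastq)(\.gz)?$'
    }

    # Example use (the file names are placeholders):
    #   printf '%s\n' sample1.fq.gz sample2.bam | invalid_fastq_names

Any path printed by the helper would need to be renamed or converted before it is listed in the samplesheet.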


Genome
------

Remove non-canonical chromosomes, for example unplaced contigs whose names start with ``chrUn``:

    .. code-block:: console

        seqkit grep -vrp "^chrUn" file.fa > cleaned.fa


Chromosome mappings NCBI -> UCSC  
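
One way to apply such a mapping to a FASTA file is sketched below. This is an illustration under stated assumptions, not the pipeline's own method: the function name ``rename_chroms`` and the two-column, tab-separated mapping file (NCBI accession, then UCSC name) are hypothetical.

.. code-block:: console

    # Hypothetical helper: rename FASTA headers using a mapping file.
    #   $1 = mapping file, one "NCBI_accession<TAB>UCSC_name" pair per line
    #   $2 = genome FASTA
    # Headers whose ID is not in the mapping are printed unchanged.
    # Renamed headers keep only the new name; any description text is dropped.
    rename_chroms() {
        awk -F'\t' '
            NR == FNR { map[$1] = $2; next }       # first file: load the mapping
            /^>/ {
                split(substr($0, 2), parts, " ")   # ID = header text up to first whitespace
                if (parts[1] in map) { print ">" map[parts[1]]; next }
            }
            { print }                              # sequence lines and unmapped headers
        ' "$1" "$2"
    }

    # Example use (the file names are placeholders):
    #   rename_chroms chrom_map.tsv genome.fa > genome.ucsc.fa

With a mapping line such as ``NC_000001.11<TAB>chr1``, a header beginning with ``>NC_000001.11`` would be rewritten to ``>chr1``.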


Gene Annotation
---------------
Follow these guidelines to prepare a gene annotation in a format compatible with various bioinformatics tools.
We require the gene annotation in ``GTF/GFF`` format suitable for the aligner, and in a "bed-like" format with the following columns:

::

   chr     start     end     name     strand     biotype     source

Requirements for Bed-like Gene Annotation File
----------------------------------------------

- **1-based Indexing**: Use 1-based coordinates, as in GTF/GFF. Note that standard BED files use 0-based start positions.
- **Unique "name" Field**: Ensure that values in the "name" field are unique.
- **No Duplicate Coordinates**: Ensure that no two genes have identical coordinates.
- **Chromosome Naming Consistency**: Chromosome names in the annotation file must match those in the genome assembly.

  .. note::
     Check non-canonical chromosomes in particular: their names may differ between the genome and the GTF even when the canonical chromosome names match.

- **Single GTF File Usage**: If you use a single GTF file, you can convert it to a bed-like file using our script.
  If you combine multiple annotations into a shared gene annotation, manual processing is required (refer to the provided examples).
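
Several of the requirements above can be checked with standard command-line tools. The sketch below is illustrative only. The function names are hypothetical, and it assumes a tab-separated file without a header line, with columns in the order ``chr, start, end, name, strand, biotype, source``. It also treats "identical coordinates" as identical ``chr``, ``start`` and ``end``, which is an assumption; include the strand column if genes on opposite strands may share coordinates.

.. code-block:: console

    # Hypothetical helpers; $1 is always the bed-like annotation file.

    # Print values of the "name" column (4th) that occur more than once.
    duplicate_names() {
        cut -f4 "$1" | sort | uniq -d
    }

    # Print chr/start/end combinations shared by more than one gene.
    duplicate_coords() {
        cut -f1-3 "$1" | sort | uniq -d
    }

    # Print chromosome names used in the annotation but absent from the genome.
    #   $2 = genome FASTA
    missing_chroms() {
        awk -F'\t' '
            NR == FNR {                            # first file: collect FASTA IDs
                if (/^>/) { split(substr($0, 2), id, " "); seen[id[1]] = 1 }
                next
            }
            !($1 in seen) && !reported[$1]++ { print $1 }
        ' "$2" "$1"
    }

    # Example use (the file names are placeholders):
    #   duplicate_names genes.bed
    #   duplicate_coords genes.bed
    #   missing_chroms genes.bed genome.fa

Each helper prints nothing when the corresponding requirement is met, so any output points at records that need attention.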

Shared Gene Annotation Issues
-----------------------------

When using multiple annotations, be aware of potential issues related to shared gene annotation: