Input file preparation

Samplesheet

Valid file extensions
  • “.fq.gz”

  • “.fastq.gz”

  • “.fastq”

  • “.fq”
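As a quick pre-flight check, file names can be tested against the extensions listed above. The following is a minimal sketch, not part of the pipeline: the function name has_valid_extension is hypothetical, and the match is assumed to be case-sensitive.

```python
# Accepted extensions, copied from the list above.
VALID_EXTENSIONS = (".fq.gz", ".fastq.gz", ".fastq", ".fq")


def has_valid_extension(filename: str) -> bool:
    """Return True if the file name ends with one of the accepted extensions.

    Assumption: matching is case-sensitive.
    """
    return filename.endswith(VALID_EXTENSIONS)


if __name__ == "__main__":
    for name in ["sample_R1.fastq.gz", "sample_R2.fq", "sample.bam"]:
        print(name, has_valid_extension(name))
```

Running such a check over every read file named in the samplesheet may surface unsupported inputs before the pipeline starts.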

Genome

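Remove non-canonical chromosomes

The command below drops every sequence whose ID starts with "chrUn" (-p sets the pattern, -r treats it as a regular expression, -v inverts the match). Other non-canonical sequences, if present, need their own patterns.

seqkit grep -vrp "^chrUn" file.fa > cleaned.fa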

Chromosome mappings NCBI -> UCSC
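One possible way to apply such a mapping is to rewrite the sequence IDs in the FASTA headers. The sketch below is only an illustration: the function rename_fasta_headers is hypothetical, the mapping dictionary must be supplied by you (for example from an assembly report), and IDs missing from the mapping are assumed to be left unchanged.

```python
def rename_fasta_headers(lines, mapping):
    """Yield FASTA lines with sequence IDs renamed according to `mapping`.

    `lines` is any iterable of text lines (e.g. an open file).
    `mapping` is a dict of old ID -> new ID; it is user-supplied.
    Assumption: IDs that are not in `mapping` are kept as they are.
    """
    for line in lines:
        if line.startswith(">"):
            header = line[1:].rstrip("\n")
            # The sequence ID is the text up to the first space.
            seq_id, _, description = header.partition(" ")
            new_id = mapping.get(seq_id, seq_id)
            suffix = f" {description}" if description else ""
            yield f">{new_id}{suffix}\n"
        else:
            # Sequence lines pass through untouched.
            yield line


if __name__ == "__main__":
    # Illustrative one-entry mapping; a real table covers every sequence.
    example_mapping = {"NC_000001.11": "chr1"}
    fasta = [">NC_000001.11 Homo sapiens chromosome 1\n", "ACGT\n"]
    print("".join(rename_fasta_headers(fasta, example_mapping)), end="")
```

Whatever names are chosen for the genome, the same names have to appear in the gene annotation.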

Gene Annotation

Follow these guidelines to prepare the gene annotation in a format compatible with various bioinformatics tools. We require the gene annotation in GTF/GFF format suitable for the aligner, and in a "bed-like" format with the following columns:

chr     start     end     name     strand     biotype     source

Requirements for Bed-like Gene Annotation File

  • 1-based Indexing: Specify coordinates with 1-based indexing (as in GTF), not the 0-based start used by standard BED files.

  • Unique “name” Field: Ensure that values in the “name” field are unique.

  • No Duplicate Coordinates: No two genes may have identical coordinates.

  • Chromosome Naming Consistency: Chromosome names in the annotation file must match those in the genome assembly.

    Note

    A common pitfall: the names of canonical chromosomes match between the genome and the GTF, but the names of non-canonical sequences do not. Check the non-canonical names as well.

  • Single GTF File Usage: If you use a single GTF file, you can convert it to a bed-like file using our script.
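To show what such a conversion involves, here is a minimal sketch of turning one GTF "gene" record into a bed-like row. It is not the provided script. The function name gtf_gene_to_bedlike is hypothetical, and several choices are assumptions: the "name" column is taken from gene_id, the biotype from gene_biotype or gene_type, and "NA" is used when neither attribute is present. GTF coordinates are already 1-based, so they are copied unchanged.

```python
import re


def gtf_gene_to_bedlike(line: str):
    """Convert one GTF 'gene' line to a bed-like tuple, or return None.

    Output columns: chr, start, end, name, strand, biotype, source.
    Assumptions: name = gene_id; biotype = gene_biotype or gene_type.
    """
    if line.startswith("#"):
        return None  # header or comment line
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[2] != "gene":
        return None  # malformed line or not a gene-level record
    chrom, source, _feature, start, end, _score, strand, _frame, attrs = fields
    # Parse attributes of the form: key "value";
    attr = dict(re.findall(r'(\S+) "([^"]*)"', attrs))
    name = attr.get("gene_id")
    if name is None:
        return None
    biotype = attr.get("gene_biotype") or attr.get("gene_type") or "NA"
    # GTF is 1-based and inclusive, so start and end need no shift.
    return (chrom, int(start), int(end), name, strand, biotype, source)


if __name__ == "__main__":
    # Made-up record, for illustration only.
    record = 'chr1\tsrcA\tgene\t100\t500\t.\t+\t.\tgene_id "geneA"; gene_biotype "lncRNA";'
    print("\t".join(str(x) for x in gtf_gene_to_bedlike(record)))
```

The attribute names differ between annotation providers, so the keys used here may need adjusting for a given GTF.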

If you combine multiple annotations into a shared gene annotation, you must process them manually (refer to the provided examples).
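The requirements listed above can be checked mechanically before running anything. The following is a minimal sketch under stated assumptions: check_bedlike is a hypothetical helper, rows are already parsed into tuples in the column order given earlier with integer coordinates, "identical coordinates" is taken to mean the same chr, start and end regardless of strand, and the set of genome chromosome names is supplied by you.

```python
from collections import Counter


def check_bedlike(rows, genome_chroms):
    """Return a list of problems found in bed-like rows (empty if none).

    `rows`: iterable of (chr, start, end, name, strand, biotype, source).
    `genome_chroms`: chromosome names present in the genome assembly.
    Assumption: duplicates are detected on (chr, start, end), ignoring strand.
    """
    rows = list(rows)
    problems = []

    # Values in the "name" field must be unique.
    for name, n in Counter(r[3] for r in rows).items():
        if n > 1:
            problems.append(f"duplicate name: {name} ({n} rows)")

    # No two genes may share identical coordinates.
    for (chrom, start, end), n in Counter((r[0], r[1], r[2]) for r in rows).items():
        if n > 1:
            problems.append(f"duplicate coordinates: {chrom}:{start}-{end} ({n} rows)")

    # 1-based coordinates: start is at least 1 and not greater than end.
    for _chrom, start, end, name, *_rest in rows:
        if start < 1 or start > end:
            problems.append(f"bad 1-based interval for {name}: {start}-{end}")

    # Chromosome names must match those in the genome.
    for chrom in sorted({r[0] for r in rows} - set(genome_chroms)):
        problems.append(f"chromosome not in genome: {chrom}")

    return problems


if __name__ == "__main__":
    # Made-up rows, for illustration only.
    example = [
        ("chr1", 100, 200, "geneA", "+", "protein_coding", "srcA"),
        ("chr1", 100, 200, "geneB", "-", "lncRNA", "srcA"),
    ]
    for problem in check_bedlike(example, {"chr1"}):
        print(problem)
```

A check like this is most useful after merging several annotations by hand, where duplicated names and coordinates are easiest to introduce.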

Shared Gene Annotation Issues

When using multiple annotations, be aware of potential issues related to shared gene annotation: