Stages of Data Analysis
========================

This page outlines the key stages of the pipeline for analyzing all-to-all interactome sequencing data, focusing on the steps that lead to its final results: RNA-DNA contact pairs, RNA annotation, and significant peaks of chromatin-interacting RNAs.

1. Input and Preprocessing
--------------------------
* Read input data (FASTQ files) and validate sample information
* Perform quality control with FastQC

2. Deduplication (optional)
---------------------------
* Remove duplicate reads using tools like:

  - fastq-dupaway
  - fastuniq
  - clumpify

PCR duplicates are identified and removed at this step. The pipeline also supports our own tool,
`fastq-dupaway <https://github.com/AndrewSigorskih/fastq-dupaway>`_, which, in contrast to programs
such as FastUniq, uses memory efficiently and fixes several other problems.
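
As an illustration of the idea only, the sketch below removes exact duplicates from paired-end reads by keeping the first pair seen for each combination of sequences. It is not how fastq-dupaway, FastUniq, or clumpify work internally; the record layout and the ``dedup_pairs`` name are assumptions made for this example.

```python
from typing import Iterable, Iterator, Tuple

# A FASTQ record as (header, sequence, quality); a read pair is two records.
Record = Tuple[str, str, str]


def dedup_pairs(
    pairs: Iterable[Tuple[Record, Record]]
) -> Iterator[Tuple[Record, Record]]:
    """Yield only the first read pair seen for each (seq1, seq2) combination."""
    seen = set()
    for r1, r2 in pairs:
        # Assumption of this sketch: duplicates are pairs with identical sequences.
        key = (r1[1], r2[1])
        if key not in seen:
            seen.add(key)
            yield r1, r2


pairs = [
    (("@a/1", "ACGT", "IIII"), ("@a/2", "TTGA", "IIII")),
    (("@b/1", "ACGT", "IIII"), ("@b/2", "TTGA", "IIII")),  # duplicate of pair a
    (("@c/1", "ACGT", "IIII"), ("@c/2", "GGGA", "IIII")),
]
unique = list(dedup_pairs(pairs))
print([r1[0] for r1, _ in unique])
```

Holding every sequence in a set is simple but memory-hungry on large files, which is the kind of cost dedicated deduplication tools are designed to avoid.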

3. Trimming
-----------
* Trim low-quality bases and adapters using tools like:

  - fastp
  - Trimmomatic
  - BBduk
  - cutadapt

4. Bridge Processing (for specific experiment types)
----------------------------------------------------

When building the description sequence for experiments with non-separated raw reads, we
include a bridge processing step.

Based on the bridge sequence specified in the configuration file, single-end or paired-end reads are
split into the files ``RNA.fastq`` and ``DNA.fastq``. The bridge (linker) sequence in the configuration file
always has the following orientation: ``5'-{RNA}-{Forward Bridge Sequence}-{DNA}-3'``.

Bridge processing tools include our own tool, based on bitap search, and the ``debridge.jl`` program, based on fuzzy
search, from the ChAR-seq chartools repository: https://github.com/straightlab/chartools/tree/main/Jchartools

* For experiments like GRID-seq, RADICL-seq, iMARGI, etc., process the bridge sequences that connect RNA and DNA parts
* Use tools like BBMerge or PEAR to merge paired-end reads
* Separate RNA and DNA parts based on the bridge sequence
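
The splitting step can be sketched as follows, assuming a single-end (or already merged) read in the ``5'-{RNA}-{bridge}-{DNA}-3'`` orientation. This example uses a plain sliding-window Hamming-distance search rather than bitap or the fuzzy search of ``debridge.jl``, and it tolerates substitutions only, not insertions or deletions. The bridge sequence, function names, and mismatch budget are hypothetical.

```python
from typing import Optional, Tuple


def find_bridge(read: str, bridge: str, max_mismatches: int = 1) -> Optional[int]:
    """Return the leftmost start of `bridge` in `read` within the mismatch budget."""
    n, m = len(read), len(bridge)
    for start in range(n - m + 1):
        mismatches = 0
        for a, b in zip(read[start:start + m], bridge):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many substitutions, try the next window
        else:
            return start  # window scanned without exceeding the budget
    return None


def split_read(
    read: str, bridge: str, max_mismatches: int = 1
) -> Optional[Tuple[str, str]]:
    """Split a read into (RNA part, DNA part), or return None if no bridge is found."""
    pos = find_bridge(read, bridge, max_mismatches)
    if pos is None:
        return None
    return read[:pos], read[pos + len(bridge):]


BRIDGE = "AGTCGGA"  # hypothetical bridge, not taken from any real protocol
read = "ACGTACGT" + "AGTCGCA" + "TTGGCCAA"  # bridge copy carries one substitution
print(split_read(read, BRIDGE))
print(split_read("ACGTACGTTTGGCCAA", BRIDGE))
```

In a real run the quality string would be cut at the same coordinates so that the resulting RNA and DNA records stay valid FASTQ.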

5. Restriction Sites Filtering
------------------------------

- DNA sequence start (``*``): begin the DNA sequence with the ``*`` symbol to mark the start of the DNA part.

- Add sequence (``+[CATG]``): use the ``+`` operator followed by the sequence to add, in square brackets. For example, ``+[CATG]*`` adds the sequence ``CATG``
  to the 5' end of the DNA part.

- RNA sequence start (``.``): to specify an RNA sequence, use the ``.`` symbol to mark the start of the RNA part. In this case it appears to be used as an endpoint.

6. Alignment and Post-alignment Processing
------------------------------------------
* Align RNA and DNA reads to the reference genome using tools like:

  - HISAT2
  - STAR
  - BWA-MEM
  - Bowtie2

* Filter aligned reads for uniqueness and mismatches
* Convert BAM files to BED format
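
The filtering and conversion can be sketched on plain-text SAM records, as below. The sketch assumes that the aligner writes the ``NH`` (number of reported hits) and ``NM`` (edit distance) optional tags, which not every aligner does by default. The mismatch threshold and the ``sam_to_bed`` name are hypothetical, and the output is a BED-like tuple rather than a full BED6 line.

```python
import re
from typing import Optional, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference

BedRow = Tuple[str, int, int, str, str]  # chrom, start, end, read name, strand


def sam_to_bed(line: str, max_mismatches: int = 2) -> Optional[BedRow]:
    """Return a BED-like row for a unique, low-mismatch alignment, else None."""
    fields = line.rstrip("\n").split("\t")
    name, flag, chrom, pos, cigar = (
        fields[0], int(fields[1]), fields[2], int(fields[3]), fields[5]
    )
    if flag & 0x4:  # read is unmapped
        return None
    tags = {}
    for tag in fields[11:]:
        key, _type, value = tag.split(":", 2)
        tags[key] = value
    # Assumption: a missing NH tag is treated as a unique hit, a missing NM as 0.
    if int(tags.get("NH", "1")) != 1:
        return None
    if int(tags.get("NM", "0")) > max_mismatches:
        return None
    ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
    start = pos - 1  # SAM positions are 1-based, BED starts are 0-based
    strand = "-" if flag & 0x10 else "+"
    return chrom, start, start + ref_len, name, strand


unique_hit = "r1\t0\tchr1\t101\t60\t5M2D3M\t*\t0\t0\tACGTACGT\tIIIIIIII\tNH:i:1\tNM:i:2"
multi_hit = "r2\t16\tchr2\t50\t1\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\tNH:i:3\tNM:i:0"
print(sam_to_bed(unique_hit))
print(sam_to_bed(multi_hit))
```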

7. Contact Generation
---------------------
* Join RNA and DNA parts to create raw contacts
* Perform strand detection and correction
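
A minimal sketch of the join, assuming that the RNA and DNA parts of one read share a read ID and that each part is already reduced to a genomic interval. The ``rna_strand_flipped`` switch stands in for strand correction in protocols where the RNA part is sequenced as a reverse complement; how the pipeline detects this is not covered here, and all names are hypothetical.

```python
from typing import Dict, Iterator, Tuple

Interval = Tuple[str, int, int, str]  # chrom, start, end, strand
Contact = Tuple[str, Interval, Interval]  # read ID, RNA part, DNA part

FLIP = {"+": "-", "-": "+"}


def join_contacts(
    rna: Dict[str, Interval],
    dna: Dict[str, Interval],
    rna_strand_flipped: bool = False,
) -> Iterator[Contact]:
    """Pair RNA and DNA parts by read ID, keeping reads where both parts exist."""
    for read_id, (chrom, start, end, strand) in rna.items():
        dna_part = dna.get(read_id)
        if dna_part is None:
            continue  # the DNA part was lost in an earlier filtering step
        if rna_strand_flipped:
            strand = FLIP[strand]
        yield read_id, (chrom, start, end, strand), dna_part


rna_parts = {"r1": ("chr1", 100, 110, "+"), "r2": ("chr2", 5, 30, "-")}
dna_parts = {"r1": ("chr3", 500, 520, "-"), "r3": ("chr4", 40, 60, "+")}
contacts = list(join_contacts(rna_parts, dna_parts, rna_strand_flipped=True))
print(contacts)
```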

8. CIGAR Filtering (optional)
-----------------------------
* Filter contacts based on CIGAR strings to improve quality
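
The criteria used by this filter are not specified above, so the sketch below shows only one plausible rule: reject alignments in which too large a fraction of the read is soft-clipped. The 20% threshold and the ``passes_cigar_filter`` name are assumptions made for illustration.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_CONSUMING = set("MIS=X")  # CIGAR operations that consume read bases


def passes_cigar_filter(cigar: str, max_clipped_fraction: float = 0.2) -> bool:
    """Keep an alignment only if its soft-clipped fraction is within the limit."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    read_len = sum(n for n, op in ops if op in QUERY_CONSUMING)
    clipped = sum(n for n, op in ops if op == "S")
    # An unavailable CIGAR ("*") yields no operations and is rejected.
    return read_len > 0 and clipped / read_len <= max_clipped_fraction


for cigar in ("50M", "10S40M", "20S30M", "*"):
    print(cigar, passes_cigar_filter(cigar))
```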

9. Merging Replicates
---------------------
* Combine data from replicate experiments

10. Chromosome Splitting (optional)
-----------------------------------
* Split data by chromosomes for parallel processing

11. Annotation and Voting
-------------------------
* Annotate RNA parts of contacts using reference annotation
* Perform voting to resolve conflicting annotations
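
The voting rule itself is not spelled out above. As one plausible reading, the sketch below assigns each ambiguous RNA part to the overlapping gene with the highest contact density (candidate contacts per base of gene length). This rule, the input layout, and the ``vote`` name are assumptions, and the pipeline's actual criterion may differ.

```python
from collections import Counter
from typing import Dict, List


def vote(
    candidates: Dict[str, List[str]], gene_length: Dict[str, int]
) -> Dict[str, str]:
    """Pick one gene per contact; `candidates` maps contact ID to overlapping genes."""
    counts = Counter(g for genes in candidates.values() for g in genes)
    density = {g: counts[g] / gene_length[g] for g in counts}
    return {
        # sorted() makes ties deterministic: the alphabetically first gene wins
        contact_id: max(sorted(genes), key=lambda g: density[g])
        for contact_id, genes in candidates.items()
        if genes  # contacts without any overlapping gene stay unannotated
    }


candidates = {
    "c1": ["geneA"],
    "c2": ["geneA", "geneB"],
    "c3": ["geneB"],
    "c4": ["geneA", "geneB"],
    "c5": [],
}
gene_length = {"geneA": 1000, "geneB": 4000}
winners = vote(candidates, gene_length)
print(winners)
```

Here both genes collect three candidate contacts, but ``geneA`` is four times shorter, so it wins the two ambiguous contacts.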

12. Background Model Generation
-------------------------------
* Create a background model for normalization

13. Normalization
-----------------
* Normalize raw contacts using the background model
* Perform additional normalization steps (N2, scaling)

14. Peak Calling (for One-to-All experiments)
---------------------------------------------
* Use MACS2 to call significant peaks of chromatin-interacting RNAs

15. Statistics and Visualization
--------------------------------
* Generate statistics at various stages of the pipeline
* Create plots and visualizations of the results

16. MultiQC Report
------------------
* Compile a comprehensive quality control report using MultiQC

Main Results
------------
The main results of this pipeline are:

1. Pairs of RNA and DNA contacts, stored in tab-separated files
2. Annotation of the RNA parts of the contacts
3. Significant peaks of chromatin-interacting RNAs (for One-to-All experiments)
4. Various statistics and quality control metrics throughout the process

Note: This pipeline is flexible and can handle different types of all-to-all interactome sequencing data, with options to customize the workflow based on the specific experiment type and analysis requirements.