Stages of Data Analysis

This pipeline outlines the key stages for analyzing all-to-all interactome sequencing data, focusing on the steps that lead to the final results of RNA-DNA contact pairs, RNA annotation, and significant peaks of chromatin-interacting RNAs.

1. Input and Preprocessing

Read input data (FASTQ files) and validate sample information
Perform quality control with FastQC

2. Deduplication (optional)

Remove duplicate reads using tools like:
- fastq-dupaway
- fastuniq
- clumpify

PCR duplicates are identified and removed. The pipeline also supports our own tool, fastq-dupaway, which, in contrast to programs such as FastUniq [], uses memory efficiently and fixes several other problems.
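To make the idea of duplicate removal concrete, here is a minimal sketch that keeps the first occurrence of each read pair with identical sequences. It is only an illustration: the function name and the tuple layout are hypothetical, it holds every distinct pair in memory, and it does not reflect how fastq-dupaway, FastUniq, or clumpify work internally.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical in-memory representation of a read pair: (read_id, seq1, seq2).
ReadPair = Tuple[str, str, str]


def drop_exact_duplicates(pairs: Iterable[ReadPair]) -> Iterator[ReadPair]:
    """Yield the first occurrence of each (seq1, seq2) combination."""
    seen = set()
    for read_id, seq1, seq2 in pairs:
        key = (seq1, seq2)
        if key in seen:
            continue  # identical sequences were already emitted: treat as duplicate
        seen.add(key)
        yield read_id, seq1, seq2
```

Because the set of seen sequences grows with the number of distinct pairs, this naive approach is exactly the kind of memory cost that dedicated tools try to avoid.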

3. Trimming

Trim low-quality bases and adapters using tools like:
- fastp
- Trimmomatic
- BBduk
- cutadapt

4. Bridge Processing (for specific experiment types)

For experiments whose raw reads are not yet separated into RNA and DNA parts, the pipeline includes a bridge processing step.

Based on the bridge sequence specified in the configuration file, single-end or paired-end reads are split into the files RNA.fastq and DNA.fastq. The bridge (linker) sequence in the configuration file always has the following orientation: 5’-{RNA}-{Forward Bridge Sequence}-{DNA}-3’

Bridge processing tools include our own tool based on bitap search and the debridge.jl program from ChAR-seq, which is based on fuzzy search: https://github.com/straightlab/chartools/tree/main/Jchartools

For experiments like GRID-seq, RADICL-seq, iMARGI, etc., process the bridge sequences that connect RNA and DNA parts
Use tools like BBMerge or PEAR to merge paired-end reads
Separate RNA and DNA parts based on the bridge sequence
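The splitting step can be sketched as follows, assuming a single-end (or already merged) read in the orientation 5’-{RNA}-{Forward Bridge Sequence}-{DNA}-3’. This sketch uses a plain sliding-window Hamming-distance search, not the bitap search or the debridge.jl fuzzy search mentioned above; the function names and the `max_mismatches` parameter are hypothetical.

```python
from typing import Optional, Tuple


def find_bridge(read: str, bridge: str, max_mismatches: int = 1) -> Optional[int]:
    """Return the 0-based start of the first window that matches the bridge
    with at most max_mismatches substitutions, or None if there is none."""
    k = len(bridge)
    for start in range(len(read) - k + 1):
        window = read[start:start + k]
        mismatches = sum(1 for a, b in zip(window, bridge) if a != b)
        if mismatches <= max_mismatches:
            return start
    return None


def split_by_bridge(
    read: str, bridge: str, max_mismatches: int = 1
) -> Optional[Tuple[str, str]]:
    """Split a read into (rna_part, dna_part) around the bridge.

    Returns None when no bridge is found, so the caller can discard the read.
    """
    start = find_bridge(read, bridge, max_mismatches)
    if start is None:
        return None
    rna_part = read[:start]
    dna_part = read[start + len(bridge):]
    return rna_part, dna_part
```

A read without a recognizable bridge yields `None`; whether such reads are dropped or kept in a separate file is a design choice that this sketch leaves open. Insertions and deletions inside the bridge are not handled here.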

5. Restriction sites filtering

DNA Sequence Start (*): The * symbol marks the start of the DNA part.
Add Sequence (+[CATG]): The + operator followed by a sequence in square brackets adds that sequence. For example, +[CATG]* adds the sequence “CATG” to the 5’ end of the DNA part.
RNA Sequence Start (.): The . symbol marks the start of the RNA part. Within a pattern it appears to serve as an endpoint.
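As an illustration of the +[CATG]* notation, the sketch below handles only that one documented form and prepends the bracketed sequence to the 5’ end of a DNA part. The function name, the accepted alphabet, and the decision to reject every other pattern are assumptions; the pipeline's actual pattern grammar may be richer.

```python
import re

# Matches only the documented form "+[SEQ]*": add SEQ before the DNA start marker.
_DNA_5PRIME_ADDITION = re.compile(r"^\+\[([ACGTN]+)\]\*$")


def apply_dna_5prime_addition(pattern: str, dna_seq: str) -> str:
    """Prepend the bracketed sequence of a "+[SEQ]*" pattern to dna_seq."""
    match = _DNA_5PRIME_ADDITION.match(pattern)
    if match is None:
        # Other pattern forms are outside the scope of this sketch.
        raise ValueError(f"unsupported pattern: {pattern!r}")
    return match.group(1) + dna_seq
```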

6. Alignment

Align RNA and DNA reads to the reference genome using tools like:
- HISAT2
- STAR
- BWA-MEM
- Bowtie2

6. Post-alignment Processing

Filter aligned reads for uniqueness and mismatches
Convert BAM files to BED format
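A uniqueness and mismatch filter can be sketched on plain-text SAM records. The sketch assumes that the aligner writes the optional `NH` tag (number of reported alignments) and the `NM` tag (edit distance, used here as a stand-in for the mismatch count); whether these tags are present depends on the aligner and its settings. The function names and the threshold of 2 are hypothetical.

```python
from typing import Dict, List


def parse_optional_tags(fields: List[str]) -> Dict[str, str]:
    """Collect optional SAM tags (columns 12 and later) as {tag: value}."""
    tags = {}
    for field in fields[11:]:
        name, _type, value = field.split(":", 2)
        tags[name] = value
    return tags


def keep_alignment(sam_line: str, max_mismatches: int = 2) -> bool:
    """Keep a mapped read that aligns once and has few enough edits."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    if flag & 0x4:  # SAM flag bit 0x4: segment unmapped
        return False
    tags = parse_optional_tags(fields)
    if "NH" not in tags or "NM" not in tags:
        return False  # conservative choice: cannot verify, so drop
    return int(tags["NH"]) == 1 and int(tags["NM"]) <= max_mismatches
```

Records that lack either tag are dropped here; an alternative for aligners that do not report `NH` would be a mapping-quality threshold.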

7. Contact Generation

Join RNA and DNA parts to create raw contacts
Perform strand detection and correction
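Joining the two parts amounts to matching RNA and DNA records that share a read identifier. The sketch below assumes simplified BED-like tuples and a hypothetical output layout; it does not implement strand detection or correction.

```python
from typing import Iterable, Iterator, Tuple

# Simplified BED-like record: (chrom, start, end, read_id, strand).
BedRecord = Tuple[str, int, int, str, str]
# Raw contact: read_id, then RNA chrom/start/end/strand, then DNA chrom/start/end/strand.
Contact = Tuple[str, str, int, int, str, str, int, int, str]


def join_contacts(
    rna: Iterable[BedRecord], dna: Iterable[BedRecord]
) -> Iterator[Contact]:
    """Pair RNA and DNA parts by read ID; reads missing either part are skipped."""
    dna_by_id = {rec[3]: rec for rec in dna}
    for r_chrom, r_start, r_end, read_id, r_strand in rna:
        dna_rec = dna_by_id.get(read_id)
        if dna_rec is None:
            continue  # the DNA part was filtered out or never aligned
        d_chrom, d_start, d_end, _, d_strand = dna_rec
        yield (read_id, r_chrom, r_start, r_end, r_strand,
               d_chrom, d_start, d_end, d_strand)
```

Loading the DNA records into a dictionary keeps the join simple; for very large files, a merge over inputs sorted by read ID would avoid holding one side in memory.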

8. CIGAR Filtering (optional)

Filter contacts based on CIGAR strings to improve quality
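One possible CIGAR-based filter is sketched below. The criteria (a cap on soft-clipped bases and an optional ban on insertions and deletions) and the default values are assumptions chosen for illustration, not the pipeline's actual rules.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM specification.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
_CIGAR_FULL = re.compile(r"(?:\d+[MIDNSHP=X])+")


def passes_cigar_filter(
    cigar: str, max_soft_clip: int = 5, allow_indels: bool = False
) -> bool:
    """Return True if the alignment described by the CIGAR string is acceptable."""
    if _CIGAR_FULL.fullmatch(cigar) is None:
        return False  # "*" (no alignment) or a malformed string
    ops = [(int(length), op) for length, op in _CIGAR_OP.findall(cigar)]
    soft_clipped = sum(length for length, op in ops if op == "S")
    if soft_clipped > max_soft_clip:
        return False
    if not allow_indels and any(op in "ID" for _, op in ops):
        return False
    return True
```

For spliced RNA alignments the `N` operation (skipped region) is left untouched here, since introns are expected on the RNA side.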

9. Merging Replicates

Combine data from replicate experiments

10. Chromosome Splitting (optional)

Split data by chromosomes for parallel processing

11. Annotation and Voting

Annotate RNA parts of contacts using reference annotation
Perform voting to resolve conflicting annotations
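When the RNA part of a contact overlaps several annotated genes, some rule has to pick one. The sketch below uses a largest-overlap rule with an alphabetical tie-break; this rule is an assumption made for illustration and may differ from the voting procedure the pipeline actually applies. Coordinates are treated as half-open intervals, and all names are hypothetical.

```python
from typing import Iterable, Optional, Tuple

# Candidate annotation: (gene_name, start, end), half-open coordinates.
Annotation = Tuple[str, int, int]


def overlap_length(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Number of bases shared by two half-open intervals (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def vote_annotation(
    rna_start: int, rna_end: int, candidates: Iterable[Annotation]
) -> Optional[str]:
    """Pick the gene with the largest overlap; ties go to the smaller name."""
    scored = []
    for name, start, end in candidates:
        shared = overlap_length(rna_start, rna_end, start, end)
        if shared > 0:
            scored.append((shared, name))
    if not scored:
        return None  # the RNA part overlaps no annotated gene
    # Sort key: larger overlap first, then alphabetical name for determinism.
    return min(scored, key=lambda item: (-item[0], item[1]))[1]
```

A deterministic tie-break matters because the same contact should receive the same annotation on every run.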

12. Background Model Generation

Create a background model for normalization

13. Normalization

Normalize raw contacts using the background model
Perform additional normalization steps (N2, scaling)

14. Peak Calling (for One-to-All experiments)

Use MACS2 to call significant peaks of chromatin-interacting RNAs

15. Statistics and Visualization

Generate statistics at various stages of the pipeline
Create plots and visualizations of the results

16. MultiQC Report

Compile a comprehensive quality control report using MultiQC

Main Results

The main results of this pipeline are:

Pairs of RNA and DNA contacts, stored in tab-separated files
Annotation of the RNA parts of the contacts
Significant peaks of chromatin-interacting RNAs (for One-to-All experiments)
Various statistics and quality control metrics throughout the process

Note: This pipeline is flexible and can handle different types of all-to-all interactome sequencing data, with options to customize the workflow based on the specific experiment type and analysis requirements.