Stages of Data Analysis
========================

This document outlines the key stages of the pipeline for analyzing all-to-all interactome sequencing data, focusing on the steps that lead to the final results: RNA-DNA contact pairs, RNA annotation, and significant peaks of chromatin-interacting RNAs.

1. Input and Preprocessing
--------------------------

* Read input data (FASTQ files) and validate sample information
* Perform quality control with FastQC

2. Deduplication (optional)
---------------------------

* Remove duplicate reads using tools like:

  - fastq-dupaway
  - fastuniq
  - clumpify

PCR duplicates are identified and removed. This step also supports our own software, ``fastq-dupaway``, which, in contrast to programs such as FastUniq, uses memory efficiently and fixes several other problems.

3. Trimming
-----------

* Trim low-quality bases and adapters using tools like:

  - fastp
  - Trimmomatic
  - BBDuk
  - cutadapt

4. Bridge Processing (for specific experiment types)
----------------------------------------------------

When the description sequence is created for experiments that have non-separated raw reads, a bridge processing step is included. Based on the bridge sequence given in the configuration file, single-end or paired-end reads are split into the files ``RNA.fastq`` and ``DNA.fastq``.

The bridge/linker sequence in the configuration file always has the following orientation:

``5'-{RNA}-{Forward Bridge Sequence}-{DNA}-3'``

Bridge processing tools include our own tool, based on bitap search, and the ``debridge.jl`` program from charseq, based on fuzzy search: https://github.com/straightlab/chartools/tree/main/Jchartools

* For experiments like GRID-seq, RADICL-seq, iMARGI, etc., process the bridge sequences that connect RNA and DNA parts
* Use tools like BBMerge or PEAR to merge paired-end reads
* Separate RNA and DNA parts based on the bridge sequence

5. Restriction Sites Filtering
----------------------------------------------------

- DNA Sequence Start (``*``): Begin the DNA sequence with the ``*`` symbol to indicate the start of the DNA part.
- Add Sequence (``+[CATG]``): Use the ``+`` operator followed by the sequence to add, in square brackets. For example, ``+[CATG]*`` adds the sequence ``CATG`` to the 5' end of the DNA part.
- RNA Sequence Start (``.``): To specify an RNA sequence, use the ``.`` symbol to indicate the start of the RNA part. In this case, it appears to be used as an endpoint.

6. Alignment
------------

* Align RNA and DNA reads to the reference genome using tools like:

  - HISAT2
  - STAR
  - BWA-MEM
  - Bowtie2

7. Post-alignment Processing
----------------------------

* Filter aligned reads for uniqueness and mismatches
* Convert BAM files to BED format

8. Contact Generation
---------------------

* Join RNA and DNA parts to create raw contacts
* Perform strand detection and correction

9. CIGAR Filtering (optional)
-----------------------------

* Filter contacts based on CIGAR strings to improve quality

10. Merging Replicates
----------------------

* Combine data from replicate experiments

11. Chromosome Splitting (optional)
-----------------------------------

* Split data by chromosomes for parallel processing

12. Annotation and Voting
-------------------------

* Annotate RNA parts of contacts using reference annotation
* Perform voting to resolve conflicting annotations

13. Background Model Generation
-------------------------------

* Create a background model for normalization

14. Normalization
-----------------

* Normalize raw contacts using the background model
* Perform additional normalization steps (N2, scaling)

15. Peak Calling (for One-to-All experiments)
---------------------------------------------

* Use MACS2 to call significant peaks of chromatin-interacting RNAs

16. Statistics and Visualization
--------------------------------

* Generate statistics at various stages of the pipeline
* Create plots and visualizations of the results

17. MultiQC Report
------------------

* Compile a comprehensive quality control report using MultiQC

Main Results
------------

The main results of this pipeline are:

1. Pairs of RNA and DNA contacts, stored in tab-separated files
2. Annotation of the RNA parts of the contacts
3. Significant peaks of chromatin-interacting RNAs (for One-to-All experiments)
4. Various statistics and quality control metrics throughout the process

Note: This pipeline is flexible and can handle different types of all-to-all interactome sequencing data, with options to customize the workflow based on the specific experiment type and analysis requirements.
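To illustrate the bridge processing stage, here is a minimal sketch of splitting a single non-separated read into its RNA and DNA parts, given the ``5'-{RNA}-{Forward Bridge Sequence}-{DNA}-3'`` orientation. It is not the pipeline's own implementation: the pipeline's tool is based on bitap search, whereas this sketch uses a plain sliding-window Hamming comparison that tolerates substitutions only. The function names, the ``max_mismatches`` parameter, and the bridge sequence are hypothetical and chosen for illustration.

```python
from typing import Optional, Tuple


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def split_by_bridge(
    read: str, bridge: str, max_mismatches: int = 1
) -> Optional[Tuple[str, str]]:
    """Split a read at the first approximate occurrence of the bridge.

    Assumes the orientation 5'-{RNA}-{bridge}-{DNA}-3'.
    Returns (rna_part, dna_part), or None if the bridge is not found.
    Only substitutions are tolerated (no insertions or deletions).
    """
    k = len(bridge)
    for start in range(len(read) - k + 1):
        window = read[start:start + k]
        if hamming(window, bridge) <= max_mismatches:
            return read[:start], read[start + k:]
    return None


# Hypothetical bridge and read; the read carries the bridge with one mismatch.
example_bridge = "ACGTTGCA"
example_read = "GGGAAATTT" + "ACGTAGCA" + "CCCTTTGGG"
print(split_by_bridge(example_read, example_bridge))
```

In a real run the two parts would be written, together with the matching slices of the quality string, to ``RNA.fastq`` and ``DNA.fastq``; reads without a detectable bridge would be discarded or reported separately.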
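The contact generation stage joins the RNA and DNA parts of each read into a raw contact. The sketch below shows one way such a join could work on BED-like records keyed by read name. It is an illustration under assumptions, not the pipeline's actual code: the six-column BED input, the output column order, and the names ``BedRecord``, ``parse_bed``, and ``join_contacts`` are all hypothetical.

```python
from typing import Dict, Iterable, List, NamedTuple


class BedRecord(NamedTuple):
    chrom: str
    start: int
    end: int
    name: str
    strand: str


def parse_bed(lines: Iterable[str]) -> Dict[str, BedRecord]:
    """Parse six-column BED lines into a dict keyed by read name."""
    records: Dict[str, BedRecord] = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, start, end, name, _score, strand = fields[:6]
        records[name] = BedRecord(chrom, int(start), int(end), name, strand)
    return records


def join_contacts(
    rna: Dict[str, BedRecord], dna: Dict[str, BedRecord]
) -> List[str]:
    """Inner-join RNA and DNA parts on read name.

    Each output row is tab-separated (assumed layout):
    name, RNA chrom/start/end/strand, DNA chrom/start/end/strand.
    Reads with only one mapped part are dropped.
    """
    rows = []
    for name, r in rna.items():
        d = dna.get(name)
        if d is None:
            continue
        fields = (name, r.chrom, r.start, r.end, r.strand,
                  d.chrom, d.start, d.end, d.strand)
        rows.append("\t".join(str(f) for f in fields))
    return rows


# Made-up records: read1 has both parts, read2 has only an RNA part.
rna_parts = parse_bed(["chr1\t100\t150\tread1\t0\t+\n",
                       "chr2\t5\t55\tread2\t0\t-\n"])
dna_parts = parse_bed(["chr3\t700\t750\tread1\t0\t-\n"])
for row in join_contacts(rna_parts, dna_parts):
    print(row)
```

Holding both sides in dictionaries keeps the example short; for full-size data, sorting both files by read name and merging them in a single pass would avoid loading everything into memory.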