An in vivo parallelized reporter assay to uncover tissue-specific splicing regulatory sequences in a multicellular animal

As described in the Methods section of the manuscript, this pipeline begins by assessing the quality of sequences generated from paired-end MiSeq multiplexed sequencing. The R1 and R2 files contain reads from both ends of the cDNA sequences. Prior to analysis, these paired-end reads are merged into a single file to ensure both ends contribute to the calculation of PSI values for the corresponding reporter they map back to.

The reads are then classified into the following categories:

Spliced-in isoforms
Spliced-out isoforms
Unspliced isoforms
Cryptic spliced isoforms
Unknown isoforms

The script Splicing_Classifier_PRA_PSIcal.py performs the following functions:

Classifies each read based on splicing status
Collapses reads using Unique Molecular Identifiers (UMIs)
Parses barcodes introduced downstream of exon 3 in the minigene reporter
Calculates PSI values for each barcode, representing a unique reporter

python Splicing_Classifier_PRA_PSIcal.py CL1_merged.fastq
python Splicing_Classifier_PRA_PSIcal.py CL2_merged.fastq

This step extracts barcodes from reporter transcripts, linking specific intronic PRA elements to the observed splicing patterns, as quantified by the PSI values computed in the same script.

Once PSI values are calculated, Fisher's Exact Test is used to identify intronic PRA elements with statistically significant effects on splicing. P-values are corrected for multiple testing using a False Discovery Rate (FDR) < 0.01 (Benjamini-Hochberg method).

The script FishersExactTest.py is used on the output file generated by Splicing_Classifier_PRA_PSIcal.py, which contains both read counts and PSI values. The FDR correction was applied manually post-analysis.

1.2. Identifying Core k-mers within Splicing Regulatory Intronic PRA Elements

1.2.1. Generating k-mer Count Matrices for Neuronal and Muscle PRA Elements

Each intronic PRA element is scanned for k-mers of length 4 to 7 (k = 4, 5, 6, 7). This produces a k-mer count matrix where:

Rows correspond to PRA elements
Columns correspond to k-mers
Values indicate the number of times each k-mer appears

As described in the study, 3 nucleotides flanking both ends of the PRA elements are included to capture junctional regulatory motifs.

The script kmer_count_matrix.py is used to generate separate count matrices for neuronal and muscle PRA elements.

1.2.2. Identifying Core Splicing Regulatory k-mers via Univariate Linear Regression

PSI values are independently regressed against k-mer counts. In this setup:

Independent variable: k-mer count
Dependent variable: PSI value

Only k-mers that appear in at least 5 intronic PRA elements are included in the analysis.

The script UniLinearRegression_kmercounts_PSI.py outputs:

Regression coefficients (slopes)
P-values (adjusted using FDR < 0.05)
t-statistics to represent regulatory effect sizes

As described in the Methods, t-stat values are preferred as they normalize regression weights across datasets and account for standard error, enabling comparison across k-mers of different lengths.

1.2.3. Clustering k-mers with Tissue-specific Splicing Regulatory Effects

k-mers showing significant regulatory effects are clustered based on sequence similarity using methods described in the PEKA paper (Ule lab). This helps identify motifs with similar regulatory functions or shared binding specificities.

2. PAM score representations of PRA splicing regulatory clusters across switch-like exon flanking intron fragments

2.1. Performing protein alignments to find best orthologs

The protein alignments were conducted using MAFFT (Katoh et al., 2019). MAFFT can be installed using conda -

conda install -c bioconda mafft

MAFFT was used with the following parameters. This can be enclosed within shell scripts to parallely run on multiple protein sequence files.

~/mafft --quiet --auto --thread 3  protein_seq > protein_seq.aln

Next, best_multi_homologfinder.sh is ran on these protein sequence alignments, and this picks the best ortholog per species for C. elegans from the ortholog group by calculating percent identity (PID). It calls the Best_homologfinder.py script and outputs -

Best Ortholog, with highest PID sequence containing files into the user specified directory ("*_best_homologs.fa" extension). For example -

>CELEG.F36H1.2e
---------------------------------MV---------QT--------------
--LRR---------------P--W----------QEAA--SSAFAVASALPV-TMNSTQI
AELFEQVEHGTTE-LRCALTAEISALRNANGESLLTVAVRSGNTAVAKQLAQLDPDA-ID
...
>CAFRA.g8544.t2_PID:0.1400137268359643
---------------------------------MVV--------QT--------------
--LLR---------------P--W----------QEAA--TSAFAVASALPV-TMNSTQI
AELFEQVEQGESEQLRCALTAELISMRNANGESLLVVAARVGNSAVAKQLIHLESSQFLN
...

File containing list of all orthologs from a species and corresponding PID with C.elegans ortholog within the group.

Event	Ref	Sps	PID
./AminoAcid_SwitchEvents_aln/B0348.4.fa.aln	CELEG	CBECE.CSP29.g10568.t2	0.10133333333333333
./AminoAcid_SwitchEvents_aln/B0348.4.fa.aln	CELEG	CBOVI.g1040.t1	0.20364238410596028
./AminoAcid_SwitchEvents_aln/B0348.4.fa.aln	CELEG	CBOVI.g1041.t2	0.19103773584905662
./AminoAcid_SwitchEvents_aln/B0348.4.fa.aln	CELEG	CBREN.CBN18509	0.07034372501998401
./AminoAcid_SwitchEvents_aln/B0348.4.fa.aln	CELEG	CBREN.CBN25584	0.07344632768361582
...

2.2. Extracting best orthologs nucleotide sequences based on previously calculated protein alignment PIDs.

Next, WG_multi_besthomolog.sh is run to subset the whole gene sequences and only include ortholog sequences with highest PID for a given species, per protein alignment. The shell script runs WG_besthomolog.py on each file in the given directory.

nohup bash WG_multi_besthomolog.sh BH_protien_alignments_directory nucleotide_fasta_directory ./output_folder/ > output.txt &

It outputs files with *_PIDfilter.fa!

Additional steps are performed to ensure there's only one occurence of each species name in each file in the directory and thus only the best ortholog is chosen from each species.

2.3. Whole gene sequence alignments within ortholog groups

The obtained gene sequence files with best orthologs, are aligned using MAFFT. Before running MAFFT, reference species, C.elegans is made the first sequence in every ortholog group file as MAFFT forces the orientation of the first sequence. The following awk command was used -

for file in *.fa; do     awk '{if ($0 ~ /^>CELEG/) {header=$0; getline; seq=$0} else {body=body ORS $0}} END {print header ORS seq body}' "$file" > temp.fa && mv temp.fa "$file"; done

The purpose behind aligning the whole gene sequences within ortholog groups was to further extract introns in each species based on exon alignments. The assumption here is that an exon in C.elegans aligns to an exon in the orthologous sequences from other species, whereas introns fall in the same region as they do for C.elegans gene sequence. We later checked exon-intron boundaries to be confident with the intron fragments we extract from these alignments. This time --adjustdirectionaccurately parameter is used so that MAFFT can find the best orientation for a given orthologous sequence and align it to C.elegans ortholog per ortholog group. Following command is used -

~/mafft --adjustdirectionaccurately --quiet --auto --thread 3  transcript_name_PIDfilter_file > transcript_name_PIDfilter_file.aln

Before proceeding the orientation of C.elegans sequence is checked and it is ensured that negative strand genes and positive strand genes are in correct orientation. Since, the exons and introns are located using coordinates, negative strand gene transcripts should be on negative strand whereas positive strand genes shoud be on positive strand (following the reference species, C.elegans) for the next step.

2.4. Extracting per species intron fragments from whole gene sequence alignments.

nohup bash Fragments_AlignedSeqs.sh ./Neuro_Mus_Switch/TranscriptNames_Cutterlab.txt BH_nucleotide_fasta_aln_directory ./output_folder/ > output.txt &

#This in turn calls IntronSegments_alnfastas.py and it is run on different ortholog groups as follows - 
#python IntronSegments_alnfastas.py B0348.4 ./Neuro_Mus_Switch//BH_nucleotide_fasta_aln_directory/B0348.4_PIDfilter.fa.aln #~/Exon_IntronFragment_coordinates.tsv ~/introns_BH_nucleotide_fasta_directory/

It outputs this file while running and the information can be checked to ensure it is running as expected.

Directory for search: BH_nucleotide_fasta_aln_directory
dir ./BH_nucleotide_fasta_aln_directory/
Searching aligned files for : B0348.4d.1
The B0348.4d.1 matched with B0348.4 fasta :o. Now extracting intron fragments! :)
The exon start : 0. The exon end : 83
       start  end
10424     84  154
The length of C.elegans intron fragment is : 70
The relative ungapped coordinates are 84 and 154
The corresponding gapped coordinates are 386 and 1751
*-------------------------------------*-*-------------------------------------*
The exon start : 1488. The exon end : 1849
       start   end
10425   1417  1487
10427   1850  1920
The length of C.elegans intron fragment is : 70
The relative ungapped coordinates are 1417 and 1487
The corresponding gapped coordinates are 6224 and 6464
...

The inference of adjacent introns relies on a given exon. Therefore, if an exon in one species cannot be aligned to exons in others, the adjacent introns for that exon will not be inferred in those species. The inferred introns are checked for fragments of same size across each species in the files chosen arbitrarily. In this case, we are using all introns shorter than 70 bp and the longer introns are broken into 70 bp fragments from either splice site.

Additionally, the introns inferred for negative strand genes were reverse complemented to bring them to the correct orientation before proceeding. The following command was run -

nohup bash revcomp_Introns.sh ./Negative_Strand_gene_names.txt - introns_BH_nucleotide_fasta_directory ./output_folder/ &

For calculating the conservation scores, we ensure that sequences within an ortholog group does not contain ambiguous nucleotides, doesn't contain redundant sequences and contains sequences with a minimum distance of 0.01. Hence, the intronic sequences are aligned just for this quality control step where sequences not passing the abovementioned criteria are filtered out and the sequences that pass are stored in files with extension '.filtered'. The custom scripts from Alam et al. were used for this step. It additionally yeilds a file where the summary of the changes made to each sequence file are logged. The output is -

FILE	REF	Nbefore	Nafter	total_distance	stats
B0348.4d.1_exon_0_83_DownstreamIntronFragment_84_154.fasta	CELEG.B0348.4d	34	50	165.0	too close:16 
B0348.4d.1_exon_10333_10575_DownstreamIntronFragment_10576_10646.fasta	CELEG.B0348.4d	36	50	175.0	too close:14 
B0348.4d.1_exon_14008_14149_DownstreamIntronFragment_14150_14220.fasta	CELEG.B0348.4d	40	50	195.0	Ns:1 too close:9 
B0348.4d.1_exon_14008_14149_UpstreamIntronFragment_13937_14007.fasta	CELEG.B0348.4d	43	46	210.0	too close:3 
B0348.4d.1_exon_1488_1849_DownstreamIntronFragment_1850_1920.fasta	CELEG.B0348.4d	45	50	220.0	too close:5 
B0348.4d.1_exon_1488_1849_UpstreamIntronFragment_1417_1487.fasta	CELEG.B0348.4d	42	46	205.0	short:1 too close:3

2.5. Computing phylogenetically averaged motif (PAM) scores for PRA splicing regulatory cluster PWMs in intronic fragments

The script compute_PAMS.py, was adapted from Alam et al. for intronic fragments. Certain parameters were modified for this RNA context -

MOL_TYPE = "RNA" 
minN = 5 # Minimum number of orthologs required for PAM score calculation
SEQ_L = 70 # Length of intronic fragments
FILE_EXT = ".filtered" # Ensuring that PAM scores are only calculated using clean files with filtered sequences from the quality control step.
outfilename = "./PAMS_celegansintrons_PRAkmerclusters_switchsplc_minN5.csv" # Name of the output file can be specified within the script
dropped_seqs = "./Cels_dropped_seqsminN5.txt" # The ortholog groups containing less than 5 orthologs are dropped and their name is stored in this file

The script requires tensorflow to be installed. The version of tensorflow depends on the CUDA version of the server the script will be run on. Hence, tensorflow version 2.7 was installed. This was done using the following command, after which a virtual environment was created for the script to run.

conda install tensorflow==2.7
conda create -n tensorflow
conda activate tensorflow

Lastly, the compute_pams.py script was run using PRA splicing regulatory cluster PWMs and intronic fragments using the following command.

nohup python compute_pams_SB_edited.py ./finalintrons_BH/ PRA-SplcRegCluster.PWM.txt  > PAMS.nohup.out &

This yeilds PAMS_2025celegansintrons_PRAkmerclusters_switchsplc_minN5.txt file with PAM scores for all PRA splciing regulatory PWMs. It looks like follows -

FILE    NeuSEcluster1_UGGAGRS   NeuSEcluster2_DUUGUG    NeuSEcluster3_-AAGUU    NeuSEcluster4_GCAU      NeuSEcluster5_WGGUWMA   MusSECluster1_-RGRAS    MusSECluster2_-YGUGUG   MusSECluster3_-KACAG    MusSECluster4_GAUGG     MusSECluster5-UAAYW     MusSECluster6_GAGU      MusSECluster7_-GKKUG    MusSECluster8_GGACUA    MusSECluster9_-GGUAAY   MusSECluster10-AUGUG    MusSSCluster1_UGGCA     MusSSCluster2_-CYAR     MusSSCluster3_UCYS      MusSSCluster4_AAAU      NeuSSCluster1-YGGS
B0348.4d.1_exon_3781_4252_UpstreamIntronFragment_3710_3780.fasta.aln.RevComp.fa.filtered        -0.5715919      -2.0252972      0.27339175      0.7248105       1.0379028       0.97454596      -1.3302739      -1.7360657      -1.0541451      0.6657335       -1.3233831      0.008267455     0.24173972      0.68004733      -0.3339234      -1.2455492      0.3099639       -0.044258855    -0.16147317     -2.5894935
B0348.4d.1_exon_20722_20837_DownstreamIntronFragment_20838_20908.fasta.aln.RevComp.fa.filtered  1.8435016       1.6923336       0.5837084       -0.39636886     -1.3698089      1.3717935       -1.6330974      -0.05333686     -0.90812415     0.000974301     0.44604513      0.71645886      -1.1334926      -0.70597774     0.4796527       0.014692736     -0.889162       -0.068109155    1.3347595       0.44796857
...

2.6. Clustering the PAM scores output file and visualizing the resulting clusters using JavaTreeview

cluster3 was used to cluster the PAM scores output file. cluster3 can be found here - http://bonsai.hgc.jp/~mdehoon/software/cluster/. The parameters chosen for clustering are same as Alam et al. paper are -

Adjusting data by centering arrays on median
Performing heirarchical clustering on genes and on arrays using average linkage Manual referred - http://bonsai.hgc.jp/~mdehoon/software/cluster/cluster.pdf

Following command was run -

nohup cluster -f PAMS_2025celegansintrons_PRAkmerclusters_switchsplc_minN5.csv -ca m -g 1 -e 1 -m a &

The resulting clustered PAM scores output file with extension '.cdt' is further processed and metadata inferred intially using gtf_exon-intron_annotations pipeline is used - metadata_Introns_annotated.tsv

nohup python Adding_meta_toPAMS.py metadata_Introns_annotated.tsv PAMS_celegansintrons_PRAkmerclusters_switchsplc_minN5.cdt PAMS_PRAclusters_meta.csv 70 &

The clustered and annotated file can be then imported into JavaTreeview. This software can be downloaded from - https://jtreeview.sourceforge.net The pixel settings were adjusted accordingly.

Data Release

Note : Description of all releases is embedded in text https://github.com/sanjanabhatnagar/PRA-and-C.elegans-PAM-paper/releases

Data is available in releases under these tags -

Neuronal PRA Plasmid Sequence data - Release v1.1.1
Neuronal PRA cDNA Sequence data - Release v1.1.2
Muscle PRA Plasmid Sequence data - Release v1.2.1
Muscle PRA cDNA Sequence data - Release v1.2.2 (R1 reads) and Release v1.2.3 (R2 reads)
Smaller-Scale PRA data - Release v1.3.1
k-mer count and Univariate Linear Regression Data - Release v1.4.0
PAM data - Release v2.0.0

References

Fusca, D. D., Kasimatis, K. R., Zhu, H. V., & Cutter, A. D. (2025). Dynamic Birth and Death of Argonaute Gene Family Functional Repertoire Across Caenorhabditis Nematodes. Genome Biol. Evol, 17(6). https://doi.org/10.1093/gbe/evaf106

Kuret, K., Amalietti, A. G., Jones, D. M., Capitanchik, C., & Ule, J. (2022). Positional motif analysis reveals the extent of specificity of protein-RNA interactions observed by CLIP. Genome Biology, 23(1). https://doi.org/10.1186/s13059-022-02755-2

Katoh, K., Rozewicki, J., & Yamada, K. D. (2019). MAFFT online service: multiple sequence alignment, interactive sequence choice and visualization. 20(4), 1160–1166. https://doi.org/10.1093/bib/bbx108

Alam, A., Duncan, A. G., Mitchell, J. A., & Moses, A. M. (biorxiv). Functional similarity of non-coding regions is revealed in phylogenetic average motif score representations. https://doi.org/10.1101/2023.04.09.536185

M. J. L. de Hoon, S. Imoto, J. Nolan, and S. Miyano: Open Source Clustering Software. Bioinformatics, 20 (9): 1453--1454 (2004).

Alok J. Saldanha, Java Treeview—extensible visualization of microarray data, Bioinformatics, Volume 20

Name		Name	Last commit message	Last commit date
Latest commit History 74 Commits
1_PRA_code		1_PRA_code
2_PAM_code		2_PAM_code
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

An in vivo parallelized reporter assay to uncover tissue-specific splicing regulatory sequences in a multicellular animal

Table of contents

1. Processing PRA MiSeq data and identifying ISS and ISE from PRA elements

1.1. Processing cDNA MiSeq Reads and Calculating PSI Values

1.2. Identifying Core k-mers within Splicing Regulatory Intronic PRA Elements

1.2.1. Generating k-mer Count Matrices for Neuronal and Muscle PRA Elements

1.2.2. Identifying Core Splicing Regulatory k-mers via Univariate Linear Regression

1.2.3. Clustering k-mers with Tissue-specific Splicing Regulatory Effects

2. PAM score representations of PRA splicing regulatory clusters across switch-like exon flanking intron fragments

2.1. Performing protein alignments to find best orthologs

2.2. Extracting best orthologs nucleotide sequences based on previously calculated protein alignment PIDs.

2.3. Whole gene sequence alignments within ortholog groups

2.4. Extracting per species intron fragments from whole gene sequence alignments.

2.5. Computing phylogenetically averaged motif (PAM) scores for PRA splicing regulatory cluster PWMs in intronic fragments

2.6. Clustering the PAM scores output file and visualizing the resulting clusters using JavaTreeview

Data Release

References

About

Uh oh!

Releases 10

Packages

Languages

sanjanabhatnagar/in-vivoPRA-and-C.elegans-PAM-paper

Folders and files

Latest commit

History

Repository files navigation

An in vivo parallelized reporter assay to uncover tissue-specific splicing regulatory sequences in a multicellular animal

Table of contents

1. Processing PRA MiSeq data and identifying ISS and ISE from PRA elements

1.1. Processing cDNA MiSeq Reads and Calculating PSI Values

1.2. Identifying Core k-mers within Splicing Regulatory Intronic PRA Elements

1.2.1. Generating k-mer Count Matrices for Neuronal and Muscle PRA Elements

1.2.2. Identifying Core Splicing Regulatory k-mers via Univariate Linear Regression

1.2.3. Clustering k-mers with Tissue-specific Splicing Regulatory Effects

2. PAM score representations of PRA splicing regulatory clusters across switch-like exon flanking intron fragments

2.1. Performing protein alignments to find best orthologs

2.2. Extracting best orthologs nucleotide sequences based on previously calculated protein alignment PIDs.

2.3. Whole gene sequence alignments within ortholog groups

2.4. Extracting per species intron fragments from whole gene sequence alignments.

2.5. Computing phylogenetically averaged motif (PAM) scores for PRA splicing regulatory cluster PWMs in intronic fragments

2.6. Clustering the PAM scores output file and visualizing the resulting clusters using JavaTreeview

Data Release

References

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases 10

Packages 0

Languages

Packages