Lim et al. (2021) processing code

Summary

This directory contains data-munging code to combine the DNA, total RNA, and polysome counts with corresponding 5' UTR sequence data from Lim et al. (2021) ("Multiplexed functional genomic analysis of 5’ untranslated region mutations across the spectrum of prostate cancer").

Input

The code takes three inputs:

lim_2021_fig6a_utr5_seqs.csv: lightly munged CSV export of Supplementary Data 6 from the paper. After the CSV export, I removed a few extraneous rows toward the end and filtered out invalid rows with this command:

grep -v -i -e "Incomplete set" -e "Not in pool" -e "Not in pacbio" | grep -v '^,,,$'

lim_2021_totalrna_dna.csv: total RNA and DNA counts provided via e-mail by the authors on 2022.04.11, then exported to CSV.
lim_2021_totalrna_polysome.csv: total RNA and polysome counts. For a given barcode, the total RNA counts in this file should exactly match those in lim_2021_totalrna_dna.csv (though this assumption was not validated beyond spot-checking a few cases by eye).

Note that lim_2021_totalrna_dna.csv and lim_2021_totalrna_polysome.csv were extracted from a single Excel file, where they were present in the same sheet as distinct column sets.
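
The spot check mentioned above could be made exhaustive with a few lines of Python. The following is only a sketch: the column names (barcode, total_rna_1, total_rna_2) and the toy rows are hypothetical stand-ins, since the real headers of the two CSV files are not listed here, and it is not taken from process.py.

```python
import csv
import io


def load_total_rna(handle, barcode_col="barcode",
                   rna_cols=("total_rna_1", "total_rna_2")):
    """Map each barcode to its tuple of total RNA counts.

    The column names are assumptions for illustration only.
    """
    return {
        row[barcode_col]: tuple(float(row[col]) for col in rna_cols)
        for row in csv.DictReader(handle)
    }


def mismatched_barcodes(table_a, table_b):
    """Return barcodes present in both tables whose total RNA counts differ."""
    shared = table_a.keys() & table_b.keys()
    return sorted(bc for bc in shared if table_a[bc] != table_b[bc])


# Toy stand-ins for lim_2021_totalrna_dna.csv and lim_2021_totalrna_polysome.csv.
dna_csv = io.StringIO(
    "barcode,total_rna_1,total_rna_2,dna_1\n"
    "AAA,10,12,5\n"
    "CCC,3,4,9\n"
)
polysome_csv = io.StringIO(
    "barcode,total_rna_1,total_rna_2,polysome_1\n"
    "AAA,10,12,7\n"
    "CCC,3,5,2\n"
)

mismatches = mismatched_barcodes(load_total_rna(dna_csv),
                                 load_total_rna(polysome_csv))
print(mismatches)  # ['CCC']
```

An empty list would mean the total RNA columns agree for every shared barcode. With the real files, open() handles would replace the io.StringIO objects.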

Outputs

The code produces two output files.

mutants.csv: mutated 5' UTR sequences where, for a given barcode, all of (polysome, total RNA, DNA) read-outs were available. Only SNVs should be present.
wildtype.csv: wildtype 5' UTR sequences where, for a given barcode, all of (polysome, total RNA, DNA) read-outs were available.
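
The "all read-outs available" condition amounts to intersecting the barcodes seen in each table. A minimal sketch, assuming each read-out has been loaded into a dictionary keyed by barcode (the dictionaries, barcodes, and values below are made up, and process.py may implement this differently):

```python
def complete_barcodes(*tables):
    """Return barcodes that have a non-missing value in every table."""
    shared = set(tables[0])
    for table in tables[1:]:
        shared &= set(table)
    # Treat None as a missing read-out (an assumption about how gaps are encoded).
    return sorted(bc for bc in shared if all(t[bc] is not None for t in tables))


# Hypothetical per-barcode read-outs.
polysome = {"AAA": 7.0, "CCC": 2.0, "TTT": None}
total_rna = {"AAA": 10.0, "CCC": 3.0, "GGG": 8.0, "TTT": 4.0}
dna = {"AAA": 5.0, "GGG": 1.0, "TTT": 6.0}

kept = complete_barcodes(polysome, total_rna, dna)
print(kept)  # ['AAA']
```

Here CCC is dropped because it has no DNA count, GGG because it has no polysome count, and TTT because its polysome value is missing.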

For both files, the full 5' UTR sequences for the genes (with or without SNV) should be provided. Records were retained only when the 5' UTR sequence could be resolved unambiguously from the gene (and, for mutant records, the mutation) listed.

Read-outs were combined (by taking the mean) across multiple barcodes (i.e., collapsing the dataframes row-wise), and across multiple replicates (i.e., collapsing the dataframes column-wise).
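
As an illustration of the two averaging steps, the sketch below uses plain Python instead of dataframes so that it has no dependencies. The record layout, identifiers, and values are hypothetical, and the order of the two steps (replicates first, then barcodes) is an assumption; when every barcode has the same number of replicates, either order gives the same result.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (sequence identifier, barcode, replicate values).
rows = [
    ("GENE_A_wt", "AAA", [10.0, 14.0]),
    ("GENE_A_wt", "CCC", [6.0, 10.0]),
    ("GENE_A_mut", "GGG", [3.0, 5.0]),
]


def collapse(records):
    """Average over replicates, then over barcodes sharing a sequence."""
    per_sequence = defaultdict(list)
    for seq_id, _barcode, replicates in records:
        # Column-wise collapse: one mean per barcode across its replicates.
        per_sequence[seq_id].append(mean(replicates))
    # Row-wise collapse: one mean per sequence across its barcodes.
    return {seq_id: mean(values) for seq_id, values in per_sequence.items()}


collapsed = collapse(rows)
print(collapsed)  # {'GENE_A_wt': 10.0, 'GENE_A_mut': 4.0}
```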

Exactly the same genes should be represented in mutants.csv and wildtype.csv, with a many-to-one relationship between mutant and wildtype records (i.e., every wildtype sequence should have one or more corresponding mutated sequences).
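
This expectation can be checked after the output files are written. The sketch below assumes both files have a column named gene, which is a guess at the header, and uses toy rows with placeholder gene names in place of the real outputs.

```python
import csv
import io
from collections import Counter


def gene_counts(handle, gene_col="gene"):
    """Count records per gene; the column name is an assumption."""
    return Counter(row[gene_col] for row in csv.DictReader(handle))


# Toy stand-ins for mutants.csv and wildtype.csv.
mutants_csv = io.StringIO("gene,seq\nGENE_A,ACGT\nGENE_A,ACGA\nGENE_B,GGCC\n")
wildtype_csv = io.StringIO("gene,seq\nGENE_A,ACGG\nGENE_B,GGCA\n")

mutant_counts = gene_counts(mutants_csv)
wildtype_counts = gene_counts(wildtype_csv)

# Genes that appear in only one of the two files (expected to be empty).
only_in_mutants = sorted(set(mutant_counts) - set(wildtype_counts))
only_in_wildtype = sorted(set(wildtype_counts) - set(mutant_counts))

print(only_in_mutants, only_in_wildtype)  # [] []
print(dict(mutant_counts))  # {'GENE_A': 2, 'GENE_B': 1}
```

Both difference lists being empty means the two files cover the same genes, so every wildtype gene has at least one mutant record.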

Values for DNA, RNA, and polysome are in units of counts-per-million (CPM).
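
For readers unfamiliar with the unit, CPM scales each count by the library total so that the values sum to one million. The sketch below shows the standard formula on made-up numbers; whether the normalization was applied by the paper's authors or by process.py is not stated here.

```python
def counts_per_million(counts):
    """Scale raw counts so that they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot normalize an all-zero library")
    return {key: value / total * 1_000_000 for key, value in counts.items()}


# Hypothetical raw counts for two barcodes in one library.
raw = {"AAA": 250, "CCC": 750}
cpm = counts_per_million(raw)
print(cpm)  # {'AAA': 250000.0, 'CCC': 750000.0}
```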
