Decoding Example: Reed-Solomon Codes
note: this example was generated using data and a jupyter notebook that are freely available at the SeqFISHSyndromeDecoding github repository and on Google Colab.
using Pkg
Pkg.activate(".")
Pkg.instantiate()
using DataFrames
using CSV
using SeqFISHSyndromeDecoding
using GLPK
using DelimitedFiles
This notebook shows demonstrates how to use SeqFISHSyndromeDecoding.jl. The smallest cell with the fewest dots in our Reed-Solomon encoded experiment, chosen for computational convienience. We also reduce computation time by using the highest lateral position variance reported in our manuscript with half the search radius used in the manuscript computations. The larger positional variance penalty would prohibit most additional candidate barcodes found with larger search radius.
First, load the codebook that we will use to decode our sample data.
cb = DataFrame(CSV.File("../example_data/full_RS_q11_k7_half_pool_cb.csv"))
println(first(cb, 5))
DataFrame
Row │ gene block1 block2 block3 block4 block5 block6 block7 block8 block9 block10
│ String31 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64
─────┼───────────────────────────────────────────────────────────────────────────────────────────
1 │ Aars 2 8 5 0 0 0 0 0 2 0
2 │ Aco2 10 7 0 0 0 0 8 0 0 6
3 │ Actn4 9 7 0 0 9 2 0 0 0 0
4 │ Aebp1 2 0 1 3 0 0 0 2 0 0
5 │ Aqp1 10 0 3 0 0 0 9 1 0 0
Define the parity check matrix for the codebook
H = readdlm("../example_data/RS_q11_k7_H.csv", ',', UInt8)
3×10 Matrix{UInt8}:
0x02 0x04 0x08 0x05 0x0a 0x09 0x07 0x03 0x06 0x01
0x04 0x05 0x09 0x03 0x01 0x04 0x05 0x09 0x03 0x01
0x08 0x09 0x06 0x04 0x0a 0x03 0x02 0x05 0x07 0x01
We can verify that H is actually the parity check matrix of the codebook.
all(H * Matrix(cb[:,2:end])' .% 11 .== 0)
true
Next we can load the aligned points from each hybridization for our example cell.
pnts = DataFrame(CSV.File("../example_data/example_RS_cell_points.csv"))
filter!(pnt -> ~ismissing(pnt.pseudocolor), pnts)
pnts.block = UInt8.(pnts.block)
select!(pnts, Not([:ch,:hyb]))
SeqFISHSyndromeDecoding.sort_readouts!(pnts)
println(first(pnts, 5))
DataFrame
Row │ Column1 x y s w z pos pseudocolor block cellid
│ Int64 Float64 Float64 Float64 Float64 Float64 Int64 Float64? UInt8 Int64
─────┼─────────────────────────────────────────────────────────────────────────────────────────
1 │ 0 94.938 1884.65 1.21555 267.268 0.0 6 1.0 1 6
2 │ 3815 105.651 1873.63 1.32669 253.041 0.0 6 1.0 1 6
3 │ 8 109.834 1836.6 2.0 304.417 0.0 6 1.0 1 6
4 │ 26 110.489 1873.2 1.47603 884.79 0.0 6 1.0 1 6
5 │ 33 110.8 1885.17 1.50077 571.21 0.0 6 1.0 1 6
Next we initialize a DecodeParams
object, and set the parameters
params = DecodeParams()
set_zeros_probed(params, false)
set_lat_var_cost_coeff(params, 7.0)
set_z_var_cost_coeff(params, 0.0)
set_lw_var_cost_coeff(params, 0.0)
set_s_var_cost_coeff(params, 0.0)
set_free_dot_cost(params, 1.0)
set_n_allowed_drops(params, 0)
set_xy_search_radius(params, 2)
set_z_search_radius(params, 0.0);
We can then decode
barcodes = decode_syndromes!(pnts, cb, H, params);
println(first(barcodes, 5))
DataFrame
Row │ gene gene_number cpath cost x y z cc cc_size
│ String31? Any Any Float64 Any Any Any Int64 Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ St13 220 [7127, 9160, 11942, 13432] 0.209086 114.12 1890.67 0.0 1 11
2 │ Ybx1 263 [13625, 16733, 21893, 24052] 0.621507 113.441 1891.12 0.0 1 11
3 │ negative_control 299 [6, 7923, 10905, 16128] 0.790891 110.843 1889.99 0.0 1 11
4 │ Msn 112 [5598, 7267, 19024, 23060] 3.25967 111.708 1890.39 0.0 1 11
5 │ Rarg 179 [11, 4496, 16538, 22858] 1.42675 114.981 1918.55 0.0 2 1
Alternatively, if we aren't sure what parameters we want to use, we can save time by splitting decodesyndromes! into its two steps. First we can identify barcode candidates with the ```getcodepaths``` (named for the paths that candidate barcodes take the the decoding graph in figure 1a) function using the least strict parameter set that we are interested in.
candidates = get_codepaths(pnts, cb, H, params);
println(first(candidates, 5))
1m5×7 DataFrame
Row │ gene gene_number cpath cost x y z
│ String31? Any Any Any Any Any Any
─────┼─────────────────────────────────────────────────────────────────────────────────────────────
1 │ Ercc1 40 [4965, 12136, 16528, 19370] 0.286652 106.317 1843.72 0.0
2 │ Vim 261 [9474, 14973, 16125, 21132] 0.651009 108.019 1846.97 0.0
3 │ Ercc1 40 [4971, 12138, 16529, 19371] 1.27286 108.758 1849.4 0.0
4 │ negative_control 971 [763, 4117, 6469, 19371] 3.28321 108.719 1849.74 0.0
5 │ Slc38a2 209 [1571, 7619, 12630, 14130] 1.16154 108.261 1854.62 0.0
We can then use the choose_optimal_codepaths
function to find the same barcodew that we found earlier
barcodes_again = choose_optimal_codepaths(pnts, cb, H, params, candidates, GLPK.Optimizer)
barcodes == barcodes_again
true
We can now also try choosing candidates using stricter parameters. This saves computation time by reducing the number of times that we have to run get_codepaths
.
strict_params = DecodeParams()
set_zeros_probed(strict_params, false)
set_lat_var_cost_coeff(strict_params, 10.0)
set_z_var_cost_coeff(strict_params, 0.0)
set_lw_var_cost_coeff(strict_params, 0.0)
set_s_var_cost_coeff(strict_params, 0.0)
set_free_dot_cost(strict_params, 1.0)
set_n_allowed_drops(strict_params, 0)
set_xy_search_radius(strict_params, 2)
set_z_search_radius(strict_params, 0.0);
stricter_barcodes = choose_optimal_codepaths(pnts, cb, H, strict_params, candidates, GLPK.Optimizer)
println(first(stricter_barcodes, 5))
DataFrame
Row │ gene gene_number cpath cost x y z cc cc_size
│ String31? Any Any Float64 Any Any Any Int64 Int64
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ negative_control 299 [6, 7923, 10905, 16128] 1.12984 110.843 1889.99 0.0 1 2
2 │ Rarg 179 [11, 4496, 16538, 22858] 2.03821 114.981 1918.55 0.0 2 1
3 │ Khdrbs1 73 [13, 15343, 19604, 23071] 0.231704 117.603 1892.83 0.0 3 1
4 │ Spp1 214 [18, 10922, 11844, 14141] 2.79514 119.057 1867.67 0.0 4 1
5 │ Spp1 214 [19, 10921, 11846, 14142] 2.42644 119.15 1862.08 0.0 5 2
We can compare the decoding results using the two different sets of parameters. For brevity, we use gene encoding barcodes found in decoding runs that include searches for negative control barcodes, which differs from the procedure described in our manuscript in which datasets are also decoded with the negative control codewords ommitted from the codebook.
println("Number of gene encoding barcodes: ", sum(barcodes.gene .!= "negative_control"))
estimated_false_discovery_rate = sum(barcodes.gene .== "negative_control")*sum(cb.gene .!= "negative_control")/sum(cb.gene .== "negative_control")/sum(barcodes.gene .!= "negative_control")
println("Estimated False Discovery rate: ", estimated_false_discovery_rate)
Number of gene encoding barcodes: 1737
Estimated False Discovery rate: 0.01654845665984571
println("Number of gene encoding barcodes: ", sum(stricter_barcodes.gene .!= "negative_control"))
estimated_false_discovery_rate = sum(stricter_barcodes.gene .== "negative_control")*sum(cb.gene .!= "negative_control")/sum(cb.gene .== "negative_control")/sum(stricter_barcodes.gene .!= "negative_control")
println("Estimated False Discovery rate: ", estimated_false_discovery_rate)
Number of gene encoding barcodes: 1303
Estimated False Discovery rate: 0.011712467380864363
The less strict parameter set decodes about 40% more gene encoding barcodes at a cost of having twice the estimated false discovery rate. Since the estimated false positive rate is still small, it is probably an acceptable trade off.
To save your results, use the CSV.write
command.
CSV.write("example_RS_results.csv", barcodes)
"example_RS_results.csv"