Decoding Example: SeqFISH+

note: this example was generated using data and a jupyter notebook that are freely available at the SeqFISHSyndromeDecoding github repository and can be run on Google Colab.

using DataFrames
using CSV
using SeqFISHSyndromeDecoding
using GLPK

This notebook shows demonstrates how to use SeqFISHSyndromeDecoding. The example data was taken from the 561 channel of cell number 8 in position 4 of replicate 2 of the 2019 SeqFISH+ NIH3T3 cell experiment. This particular subset of the data was chosen for its small size.

First load the codebook that we will use to decode our sample data.

cb = DataFrame(CSV.File("../example_data/codebook_ch_561.csv"))
println(first(cb, 5))
5×5 DataFrame
 Row │ gene  block1  block2  block3  block4 
     │ String31   Int64   Int64   Int64   Int64  
─────┼───────────────────────────────────────────
   1 │ Atp6v0e2       17      16      13       0
   2 │ Pclo           14      14       4       4
   3 │ Higd1a          0       1       1       0
   4 │ Srrm1           6      14       4      16
   5 │ Mapk8ip3       14      19      13       0

Define the parity check matrix for the codebook

H = [1 1 -1 -1;]
1×4 Matrix{Int64}:
 1  1  -1  -1

We can verify that H is actually the parity check matrix of the codebook.

all(H * Matrix(cb[:,2:end])' .% 20 .== 0)
true

Next we can load the aligned points from each hybridization for our example cell.

pnts = DataFrame(CSV.File("../example_data/example_cell_points.csv"))
println(first(pnts, 5))
5×5 DataFrame
 Row │ hyb    x        y        s        w        
     │ Int64  Float64  Float64  Float64  Float64  
─────┼────────────────────────────────────────────
   1 │     1  767.664  1463.64     1.22   633.14
   2 │     1  759.413  1534.17     1.22  1118.99
   3 │     1  757.458  1501.22     1.22   866.506
   4 │     1  808.817  1400.84     1.22  1292.37
   5 │     1  804.688  1448.16     1.22  1734.99

The SeqFISHSyndromeDecoding package requires that they hybridization column be UInt8s (to increase efficiency), and that there be a z column (for generality to 3d data)

pnts.z = zeros(Float64, nrow(pnts))
pnts.hyb = UInt8.(pnts.hyb);

Next we initialize a DecodeParams object, and set the parameters

params = DecodeParams()

set_lat_var_cost_coeff(params, 8.0)
set_z_var_cost_coeff(params, 0.0)
set_lw_var_cost_coeff(params, 3.2)
set_s_var_cost_coeff(params, 0.0)
set_free_dot_cost(params, 1.0)

set_xy_search_radius(params, sqrt(size(H)[2]/6.0)*3)
set_z_search_radius(params, 0.0);

We can then decode

barcodes = decode_syndromes!(pnts, cb, H, params)
first(barcodes, 5)
5×9 DataFrame
 Row │ gene  gene_number  cpath                      cost      x        y        z    cc     cc_size 
     │ String31?  Int64        Array…                     Float64   Any      Any      Any  Int64  Int64   
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Higd1a               3  [4000, 4203, 8632, 16480]  1.06754   990.275  1517.47  0.0   1406        1
   2 │ Srrm1                4  [1133, 7101, 9168, 15521]  2.14305   764.521  1474.04  0.0     95        2
   3 │ Srrm1                4  [1020, 7057, 9256, 15609]  1.17522   873.084  1535.57  0.0    390        1
   4 │ Srrm1                4  [1177, 7120, 9365, 15689]  2.8198    882.568  1479.06  0.0    431        1
   5 │ Srrm1                4  [1180, 7119, 9368, 15691]  0.241011  869.379  1540.38  0.0    432        1

Alternatively, if we aren't sure what parameters we want to use, we can save time by splitting decode_syndromes! into its two steps. First we can identify barcode candidates with the get_codepaths (named for the paths that candidate barcodes take the the decoding graph in figure 1a) function using the least strict parameter set that we are interested in.

candidates = get_codepaths(pnts, cb, H, params)
println(first(candidates, 5))
5×6 DataFrame
 Row │ cpath                       cost       gene_number  x        y        z   
     │ Array…                      Any        Int64        Any      Any      Any 
─────┼───────────────────────────────────────────────────────────────────────────
   1 │ [4034, 7497, 9354, 14834]   0.0322611          738  792.754  1543.43  0.0
   2 │ [2479, 6290, 9655, 15972]   0.0770821         1300  876.973  1460.59  0.0
   3 │ [1484, 8356, 10071, 16418]  0.0919919          271  884.483  1527.45  0.0
   4 │ [1148, 7491, 11855, 13651]  0.139732          2287  833.953  1495.81  0.0
   5 │ [2959, 6294, 8792, 13517]   0.168859          1755  845.245  1458.23  0.0

We can then use the choose_optimal_codepaths function to find the same barcodew that we found earlier

barcodes_again = choose_optimal_codepaths(pnts, cb, H, params, candidates, GLPK.Optimizer)
barcodes == barcodes_again
true

We can now also try choosing candidates using stricter parameters. This saves computation time by reducing the number of times that we have to run get_codepaths.

strict_params = DecodeParams()
set_lat_var_cost_coeff(strict_params, 12.0)
set_z_var_cost_coeff(strict_params, 0.0)
set_lw_var_cost_coeff(strict_params, 4.8)
set_s_var_cost_coeff(strict_params, 0.0)
set_free_dot_cost(strict_params, 1.0)

set_xy_search_radius(strict_params, sqrt(size(H)[2]/6.0)*3)
set_z_search_radius(strict_params, 0.0);


stricter_barcodes = choose_optimal_codepaths(pnts, cb, H, strict_params, candidates, GLPK.Optimizer)
println(first(stricter_barcodes, 5))
5×9 DataFrame
 Row │ gene  gene_number  cpath                      cost      x        y        z    cc     cc_size 
     │ String31?  Int64        Array…                     Float64   Any      Any      Any  Int64  Int64   
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Higd1a               3  [4000, 4203, 8632, 16480]  1.60131   990.275  1517.47  0.0   1091        1
   2 │ Srrm1                4  [1020, 7057, 9256, 15609]  1.76283   873.084  1535.57  0.0    291        1
   3 │ Srrm1                4  [1133, 7101, 9168, 15521]  3.21457   764.521  1474.04  0.0    312        1
   4 │ Srrm1                4  [1180, 7119, 9368, 15691]  0.361517  869.379  1540.38  0.0    326        1
   5 │ Wbp11                7  [3862, 7010, 9228, 14463]  1.26753   777.584  1569.5   0.0   1049        1

We can compare the decoding results using the two different sets of parameters.

println("Number of gene encoding barcodes: ", sum(barcodes.gene .!= "negative_control"))
estimated_false_discovery_rate = sum(barcodes.gene .== "negative_control")*sum(cb.gene .!= "negative_control")/sum(cb.gene .== "negative_control")/sum(barcodes.gene .!= "negative_control")
println("Estimated False Discovery rate: ", estimated_false_discovery_rate)
Number of gene encoding barcodes: 1474
Estimated False Discovery rate: 0.024709855479406056
println("Number of gene encoding barcodes: ", sum(stricter_barcodes.gene .!= "negative_control"))
estimated_false_discovery_rate = sum(stricter_barcodes.gene .== "negative_control")*sum(cb.gene .!= "negative_control")/sum(cb.gene .== "negative_control")/sum(stricter_barcodes.gene .!= "negative_control")
println("Estimated False Discovery rate: ", estimated_false_discovery_rate)
Number of gene encoding barcodes: 1118
Estimated False Discovery rate: 0.015330875292705262

The less strict parameter set decodes about 40% more gene encoding barcodes at a cost of having twice the estimated false discovery rate. Since the estimated false discovery rate is still small, it is probably an acceptable trade off.

To save your results, use the CSV.write command.

CSV.write("example_results.csv", barcodes)
"example_results.csv"