Rule Set 2: the Science Behind the Metric
In 2014, Doench et al. developed the “on-target score” for measuring how efficient a given sgRNA is at guiding Cas9 to the correct spot for cleaving. The score has helped scientists select better sgRNAs for CRISPR experiments. In February 2016, Doench, Fusi et al. published an improved algorithm, the Rule Set 2 score, to measure the on-target activity of sgRNA.
In this post, we dive in to explain how Doench, Fusi et al. improved predictive power with the Rule Set 2 score!
1. Doench, Fusi et al. used more data to model the on-target activity of sgRNAs.
The first version of the on-target score was developed by Doench et al. in 2014 using data from 1,841 sgRNAs. To improve the predictive power of the new modeling, Doench, Fusi et al. targeted all possible NGG PAM sites in 15 genes. Combined with the previous dataset, they used more than 4,000 sgRNAs to develop the Rule Set 2 score.
2. They used a new modeling approach that captures more information.
Previously, Doench et al. trained a model to identify whether a given sgRNA is in the top 20% of on-target activity among all sgRNAs. This kind of model is called “classification modeling” since the model classifies an sgRNA as one of two types: top 20% (high on-target activity) or not top 20% (low on-target activity).
This approach is convenient and easy to understand. However, it loses a lot of information: a top 1% sgRNA might be much more efficient than a top 10% sgRNA in your CRISPR experiments, but their on-target scores would be equivalent in the previous modeling approach.
To overcome this problem, in developing the Rule Set 2 score, Doench, Fusi et al. used machine learning algorithms that can rank all sgRNAs. This kind of computational modeling is called a “regression model.” Using this modeling method, Rule Set 2 scores for a top 1% sgRNA will be higher than those for a top 10% sgRNA. Now you can compare which sgRNA may have the highest on-target activity among all sgRNAs.
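The difference between the two kinds of training target can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the activity values and the helper names percentile_ranks and top_fraction_labels are hypothetical, and ties between sgRNAs are not handled.

```python
def percentile_ranks(activities):
    """Rank each activity on a 0-1 scale; the most active sgRNA gets 1.0."""
    n = len(activities)
    order = sorted(range(n), key=lambda i: activities[i])
    ranks = [0.0] * n
    for position, index in enumerate(order):
        ranks[index] = position / (n - 1) if n > 1 else 1.0
    return ranks


def top_fraction_labels(ranks, fraction=0.2):
    """Classification-style target: 1 for the top `fraction`, else 0."""
    return [1 if r >= 1.0 - fraction else 0 for r in ranks]


# Hypothetical measured activities for ten sgRNAs (made-up numbers).
activities = [0.05, 0.40, 0.10, 0.95, 0.80, 0.30, 0.60, 0.20, 0.70, 0.50]

ranks = percentile_ranks(activities)   # regression-style target
labels = top_fraction_labels(ranks)    # classification-style target

for activity, rank, label in zip(activities, ranks, labels):
    print(f"activity={activity:.2f}  rank={rank:.2f}  top20={label}")
```

In this toy data, the two most active sgRNAs receive the same class label but different ranks. That difference is the information a classification target discards and a regression target keeps.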
3. They included additional feature sets.
Previously, Doench et al. mainly used three features to compute the previous on-target score of sgRNAs: the position of single nucleotides; the position of dinucleotides; and the number of GC bases in the sgRNA. For the new model, Doench, Fusi et al. tested whether additional feature sets would help predict the on-target activity of sgRNAs.
Their machine learning algorithm developed a final model that included additional new feature sets. Previous feature sets explain 58% of the new model's results (as measured by Gini importance), suggesting that the old model, while useful, did not take some important information into account. There are three other features that explain more than 10% of the final model:
The frequency of single nucleotides and dinucleotides.
Location of the sgRNA within the protein coding region.
Melting temperatures of the first 5, middle 8, and last 5 base pairs of the sgRNA.
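To make the sequence-based feature sets more concrete, here is a minimal Python sketch of how they might be computed from a guide sequence. It is an illustration under stated assumptions, not the encoding used by Doench, Fusi et al.: the function name sgrna_features, the feature key names, and the example 20-mer are all hypothetical. Melting temperatures and the location within the protein coding region are left out because they need a thermodynamic model and gene annotation, respectively.

```python
from collections import Counter


def sgrna_features(guide):
    """Build a dictionary of simple sequence features for one guide."""
    guide = guide.upper()
    if not guide or set(guide) - set("ACGT"):
        raise ValueError("guide must be a non-empty string over A, C, G, T")

    features = {}

    # Position-dependent single nucleotides, e.g. "pos1_G".
    for i, base in enumerate(guide, start=1):
        features[f"pos{i}_{base}"] = 1

    # Position-dependent dinucleotides, e.g. "pos1_GA".
    for i in range(len(guide) - 1):
        features[f"pos{i + 1}_{guide[i:i + 2]}"] = 1

    # Position-independent counts of single nucleotides and dinucleotides.
    for base, count in Counter(guide).items():
        features[f"count_{base}"] = count
    dinucleotides = Counter(guide[i:i + 2] for i in range(len(guide) - 1))
    for pair, count in dinucleotides.items():
        features[f"count_{pair}"] = count

    # Number of G and C bases in the guide.
    features["gc_count"] = guide.count("G") + guide.count("C")
    return features


# Hypothetical 20-nt guide sequence, used only for illustration.
example = sgrna_features("GACGTTCAGGCTAACGTTCA")
print(example["gc_count"], example["count_GT"])
```

A dictionary of named features like this can be turned into a numeric matrix for whichever regression model you choose.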
How they validated the new Rule Set 2 model
To validate the Rule Set 2 score, Doench, Fusi et al. tested how well the score can predict the on-target activity of sgRNAs in three independent datasets. The Rule Set 2 score can distinguish effective sgRNAs from ineffective ones with P-values of 5.9 x 10^-80, 2.1 x 10^-24 and 3.9 x 10^-35. In comparison, the previous on-target score shows less distinction on the same datasets, with P-values of 1.4 x 10^-32, 1.8 x 10^-16 and 1.1 x 10^-11, respectively.
These results suggest that the Rule Set 2 score can help you rank and pick candidate sgRNAs for CRISPR experiments with better predictive power than the previous on-target score algorithm.
If you want to go beyond this introduction and delve deeper into the model, you can check out Doench, Fusi et al.’s paper here.