A Guide RNA Sequence Design Platform for the CRISPR/Cas9 System for Model Organism Genomes

Cas9/CRISPR has been reported to efficiently induce targeted gene disruption and homologous recombination in both prokaryotic and eukaryotic cells. Thus, we developed a Guide RNA Sequence Design Platform for the Cas9/CRISPR silencing system for model organisms. The platform is easy to use for gRNA design with input query sequences. It finds potential targets by PAM and ranks them according to factors including uniqueness, SNP, RNA secondary structure, and AT content. The platform allows users to upload and share their experimental results. In addition, most guide RNA sequences from published papers have been put into our database.


Introduction
Gene engineering technology has long been a hot topic in life science research. With the development of gene modification technology, specific genes can be knocked out or knocked down to a lower expression level. The appearance of zinc finger nucleases (ZFNs) and transcription activator-like effector nucleases (TALENs) has greatly accelerated progress in this field, but their efficiency is often unpredictable and it is difficult to target selected genes [1][2][3][4][5][6][7][8].
Recently, Cas9/CRISPR has been reported to successfully induce targeted gene disruption and homologous recombination in both prokaryotic and eukaryotic cells with higher efficiency than ZFN and TALEN [9][10][11][12][13]. Additionally, guide sequences for the Cas9/CRISPR system are easier to design, and the system itself is easy to use [10]. This novel technology has great potential for application in both research and clinical trials.
However, no tool is currently available for guide RNA design for the Cas9/CRISPR silencing system. Although Mali et al. have reported the construction of a unique guide RNA library for the whole human genome, covering more than 40% of human exons [9], they did not provide a tool for researchers to design novel target sequences for other model organisms.
The existing library also did not take into consideration related influencing factors, such as SNPs, deletions or insertions in the genome, and potential RNA secondary structure. According to our current understanding of the gRNA maturation process, the secondary structure of the gRNA is crucial for the Cas9-gRNA complex [14]. The 20 bp guide RNA sequence binds the target site in the genome. If most of its bases are involved in RNA loops, the efficiency of binding to the target site would be low. Thus, this factor should be taken into consideration. In addition, the interference efficiency is probably closely related to the melting temperature of the gRNA-DNA hybrid. A relatively high AT content is negatively correlated with the off-target effect, and thus sequences with an extremely low AT percentage are, to some extent, not recommended [9].
Thus, we developed an online platform for guide RNA design for the Cas9/CRISPR silencing system (http://cas9.cbi.pku.edu.cn/), with DNA variant information integrated. This tool helps researchers design candidate guide RNA sequences more easily and helps users choose better candidates based on preliminary results.

Materials and Methods
Both guide RNA sequences and their corresponding efficiencies were manually collected from the literature and stored in our database. For guide RNA design, we used a Java framework consisting of five main steps and connected to a Tomcat web server.
In the first step, the program finds all candidate sequences matching the N20NGG pattern, where NGG represents the PAM sequence, using Java regular expression matching. In the second step, the program writes all candidate sequences to a FASTA file and runs bowtie 0.12.9 to check whether each can be mapped uniquely to the selected model organism's genome [15]. The bowtie parameters were "-f -v 1 -k 10 -l 16 -S": "-f" tells bowtie that the input is a FASTA file, "-v 1" allows at most one mismatch, "-k 10" reports up to 10 good alignments, "-l 16" sets the seed length to 16, and "-S" outputs SAM format. As the target region is only 23 bp long, bowtie's default seed length of 28 is not suitable for this job, so we reduced it to 16. We expected the number of mismatches to strongly affect effectiveness, and this step focuses mainly on checking mapping uniqueness, so we only looked for hits with at most one mismatch and output at most 10 hits.
The mapping result is parsed in Java. In the third step, if the target genome is human hg19, the program calls tabix 0.2.5 to find any overlapping SNPs or indels reported in dbSNP135 [16][17][18]. The dbSNP135 VCF file was downloaded from the GATK bundle. In the fourth step, the program predicts the RNA secondary structure [19]. In the last step, the program collects all the information for the designed gRNAs and formats it as HTML. The AT% and the distance of the variants to the 3′ end of the target region are also calculated. The output gRNAs are sorted by both the number of mapping hits and the number of overlapping SNPs.
The running time of this pipeline is dominated by bowtie and, when there are many target sequences, sometimes by tabix; it is roughly three seconds per query sequence.
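The first two steps can be illustrated with a minimal Python sketch. It is not the platform's Java code. The function names, the scan of the reverse strand, the choice to compute AT% over the 20-nt protospacer, and the file and index names are all assumptions made for illustration. Only the N20NGG pattern and the bowtie flags come from the text above.

```python
import re

# N20NGG: 20 arbitrary bases followed by the NGG PAM (23 nt in total).
# The lookahead makes overlapping sites visible to finditer.
TARGET_PATTERN = re.compile(r"(?=([ACGT]{20}[ACGT]GG))")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def at_percent(seq):
    """Percentage of A/T bases in seq."""
    return 100.0 * sum(base in "AT" for base in seq) / len(seq)


def find_candidates(query):
    """List N20NGG sites in the query.

    Assumption: both strands are scanned; the source does not say
    whether the platform does this.
    """
    query = query.upper()
    candidates = []
    for strand, seq in (("+", query), ("-", reverse_complement(query))):
        for match in TARGET_PATTERN.finditer(seq):
            site = match.group(1)
            candidates.append({
                "strand": strand,
                "start": match.start(),  # 0-based offset on the scanned strand
                "protospacer": site[:20],
                "pam": site[20:],
                # Assumption: AT% is taken over the 20-nt protospacer only.
                "at_percent": at_percent(site[:20]),
            })
    return candidates


def bowtie_command(index_basename, fasta_path, sam_path):
    """Build the bowtie argument list with the flags quoted in the text.

    The index basename and file paths are hypothetical placeholders.
    """
    return ["bowtie", "-f", "-v", "1", "-k", "10", "-l", "16", "-S",
            index_basename, fasta_path, sam_path]
```

For example, `find_candidates("A" * 20 + "TGG")` yields a single site on the "+" strand with PAM `TGG`, and the list returned by `bowtie_command` could be passed to `subprocess.run` once an index for the chosen genome has been built.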

Results and Discussion
Multiple gene sequences are allowed for batch gRNA design, and the workflow of this platform is shown in Figure 1. The results contain the genomic loci of the gRNAs and any SNPs or indels inside them. This helps researchers choose a more unique target candidate and avoid SNPs, insertions, and deletions. Moreover, the platform evaluates all candidates based on their RNA secondary structure and AT content, allowing users to choose better candidates (Figure 2). Recently, Jiang et al. reported that only the first six base pairs near the PAM are of great importance for recognition efficiency in bacteria [20]. It is unknown whether this also holds for eukaryotic or even mammalian cells. We will keep updating our algorithm for ranking candidate gRNAs.
We validated the platform against reported results, considering factors such as uniqueness, SNPs, and bases in loops (Table 1; italic font marks low-efficiency targets). In general, a gRNA that is more unique and has fewer SNPs and fewer bases in loops is more efficient. For the gene PVALB, the first target sequence is 50% more efficient than the other two, since the first has 0 SNPs while the others have 3 or 2 SNPs. The first target sequence also has fewer base pairs involved in RNA secondary structure loops, allowing it to bind the target genome more readily, while the other two both have 9 base pairs in loops. For the gene AAVS1, the first target is more than twice as efficient as the other, since the other has an off-target site in the genome. For the gene VEGFA, the first target is about half as efficient as the other two, since it has 1 SNP while the others have none.
AT content is not as crucial a factor as those previously mentioned, since the evidence for its effect is not clear. Thus, we list it here as a consideration for users.
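The ordering of candidates by mapping hits and overlapping SNPs can be sketched in Python as follows. This is an illustration under stated assumptions, not the platform's implementation. The `Candidate` record and its field names are hypothetical, and using the number of bases in loops as a final tiebreaker is an assumption suggested by the validation above. AT% is carried along for display only and does not affect the order.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one designed gRNA."""
    name: str
    mapping_hits: int      # genome hits reported by the aligner
    overlapping_snps: int  # SNPs/indels overlapping the target region
    bases_in_loops: int    # guide bases in predicted RNA loops
    at_percent: float      # shown to the user, not used for sorting


def rank_candidates(candidates):
    """Sort by mapping hits, then overlapping SNPs (both ascending).

    Assumption: bases_in_loops breaks remaining ties; the source only
    names the first two keys.
    """
    return sorted(
        candidates,
        key=lambda c: (c.mapping_hits, c.overlapping_snps, c.bases_in_loops),
    )
```

A candidate with one genome hit and no SNPs therefore ranks ahead of one with one hit and one SNP, which in turn ranks ahead of any candidate with two or more hits.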

Conclusions
Our platform is an easy-to-use tool for identifying potentially efficient gRNA sites within given sequences for model organisms, helping users avoid off-target effects and SNPs. The platform also allows users to search existing guide RNA/protospacer sequences and share their results. We have manually extracted most reported gRNA/protospacer sequences into our database for reference and will expand it with newly published work.

Authors' Contribution
Ming Ma and Adam Y. Ye contributed equally to this work. Ming Ma conceived the idea, and Adam Y. Ye and Weiguo Zheng conducted the programming and website construction. Lei Kong supervised the whole project and gave guidance. Ming Ma, Adam Y. Ye, and Lei Kong drafted the paper.