A universal, genome-wide guide finder for CRISPR/Cas9 targeting in microbial genomes

Background The CRISPR/Cas system has significant potential to facilitate gene editing in a variety of bacterial species. CRISPR interference (CRISPRi) and CRISPR activation (CRISPRa) represent modifications of the CRISPR/Cas9 system utilizing a catalytically inactive Cas9 protein for transcription repression or activation, respectively. While CRISPRi and CRISPRa have tremendous potential to systematically investigate gene function in bacteria, no pan-bacterial, genome-wide tools exist for guide discovery. We have created Guide Finder: a customizable, user-friendly program that can design guides for any annotated bacterial genome. Results Guide Finder designs guides from NGG PAM sites for any number of genes using an annotated genome and fasta file input by the user. Guides are filtered according to user-defined design parameters and removed if they contain any off-target matches. Iteration with lowered parameter thresholds allows the program to design guides for genes that did not produce guides with the more stringent parameters, a feature unique to Guide Finder. Guide Finder has been tested on a variety of diverse bacterial genomes, on average finding guides for 95% of genes. Moreover, guides designed by the program are functionally useful—focusing on CRISPRi as a potential application—as demonstrated by essential gene knockdown in two staphylococcal species. Conclusions Through the large-scale generation of guides, this open-access software will improve accessibility to CRISPR/Cas studies for a variety of bacterial species.

Background
customizability and flexibility of user-defined design constraints, pan-bacterial applicability, gene iteration, and paired guide selection in a user-friendly format. Thus, we have created Guide Finder to address these limitations. Our program has a simple input for any annotated complete or draft genome and accepts default or user-defined guide design parameters, which is important given the broad characteristics of different microbial genomes, such as GC content, size, or the presence of repetitive regions. Finally, the automated and iterative guide design is capable of designing guides to target any number of genes for any annotated bacterial genome, including optimizing selection of multiple guides for double targeting. Focusing on its applications for CRISPRi, we have demonstrated its utility in selecting guides genome-wide for a diverse set of bacterial species and its ability to select functional guides suitable for gene knockdown. Guide Finder is the first publicly available, automated guide selection program designed specifically for bacteria that incorporates user-defined filtering parameters, off-target searching, and iterative guide design with utility for both complete and draft genome annotations. This tool will help facilitate flexible, large-scale guide design and thus improve access to high-throughput studies of gene function.

Implementation
Guide Finder is written in the R programming language and is available free to use. Guide Finder was written such that it can be used to find guides for both complete and draft genomes, recognizing that many users may not have a complete genome for their organism of interest. The workflow of the program, including inputs and outputs, is described (Fig. 1).

Inputs & Outputs
Inputs
Guide Finder is capable of designing guides for both complete and draft genomes, although the inputs differ slightly.

Complete Genome
For complete genomes, users simply supply the GenBank accession number and fasta file.

Draft Genome
Given the variable organization and notation of draft genomes, annotated draft genome files must be preprocessed prior to input into the program. Using the supplied pre-processing script, multi-sequence fasta files (e.g., fasta files containing sequence information for multiple contigs) are concatenated into a single sequence, with a series of Ns added between contigs. The coordinates of the coding sequences are then identified by aligning the coding sequences against the concatenated fasta file using BLAST and adjusted to the format required by the main Guide Finder script (i.e., the smaller coordinate designated as the "start" coordinate). These coordinates are then input into the main script, along with the single-sequence fasta file.

Outputs
There are two main outputs of the Guide Finder program: the Top Hits and Paired Guides lists. Intermediate outputs, such as a list of all possible unfiltered guides, are also made available to the user for reference.
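The draft-genome preprocessing described under Draft Genome can be illustrated with a minimal Python sketch. This is not the supplied pre-processing script (Guide Finder itself is written in R); the function names and the spacer length of 100 Ns are assumptions chosen for illustration, and the BLAST alignment step is omitted.

```python
def concatenate_contigs(contigs, spacer_len=100):
    """Join contigs into one sequence with runs of N between them.

    contigs: list of (name, sequence) tuples.
    spacer_len is an assumed value; the source only says "a series of Ns".
    Returns the concatenated sequence and each contig's 0-based offset.
    """
    spacer = "N" * spacer_len
    parts = []
    offsets = {}
    position = 0
    for index, (name, seq) in enumerate(contigs):
        if index > 0:
            parts.append(spacer)
            position += spacer_len
        offsets[name] = position
        parts.append(seq)
        position += len(seq)
    return "".join(parts), offsets


def normalize_coordinates(a, b):
    """Return (start, end) with the smaller coordinate designated as the start."""
    return (min(a, b), max(a, b))
```

The N spacers keep a candidate guide from spanning two unrelated contigs, and the returned offsets allow contig-level coordinates to be mapped onto the concatenated sequence.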

Top Hits List
A list of guides preferentially selected based on their proximity to the transcription start site. The maximum number of guides supplied per gene is set by the user.

potential guides, many of which will be lost to filtering, as described below.

Guide Filtering
Guides are filtered according to default and user-defined parameters. By default, the program removes any guides that contain a homopolymer run of As or Ts and guides of inadequate length (<20 bp). A user-set threshold is used to filter based on maximum distance from the start site, as targets closest to the transcriptional start site are most likely to disrupt gene function. Guides are also filtered to minimize potential off-target effects. The first 12 nucleotides closest to and including the PAM site for each guide are aligned to the fasta file, and guides that match more than one location in the genome are discarded.
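The filtering rules above can be sketched in Python as follows. This is an illustrative approximation, not the Guide Finder implementation: the homopolymer run length of 4, the reading of the seed as the 12 guide nucleotides nearest the PAM followed by any NGG, and all function names are assumptions.

```python
import re

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def has_homopolymer(guide, bases="AT", run=4):
    """True if the guide contains a run of A or T (run length is assumed)."""
    return any(base * run in guide for base in bases)


def seed_site_count(guide, genome, seed_len=12):
    """Count genome sites, on both strands, where the PAM-proximal seed
    is immediately followed by an NGG PAM (overlapping sites included)."""
    seed = guide[-seed_len:]
    pattern = re.compile("(?=" + re.escape(seed) + "[ACGT]GG)")
    forward = len(pattern.findall(genome))
    reverse = len(pattern.findall(reverse_complement(genome)))
    return forward + reverse


def passes_filters(guide, tss_distance, genome, max_distance, min_len=20):
    """Apply the length, homopolymer, distance, and off-target filters."""
    return (
        len(guide) >= min_len
        and not has_homopolymer(guide)
        and tss_distance <= max_distance
        and seed_site_count(guide, genome) == 1
    )
```

A guide whose seed plus PAM occurs exactly once (its own target site) is kept; any additional occurrence is treated as a potential off-target match and the guide is discarded.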

Final Guide Selection
For each PAM site, the program selects the longest guide that meets the GC minimum set by the user. From these guides, two final guide lists are created: Top Hits and Paired Guides, which provide guides and guide pairs suitable for single and dual gene knockdown, respectively.
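A minimal Python sketch of this selection step is shown below. The record layout (`pam_pos`, `seq`, `tss_distance`), the function names, and the rule of taking the first pair that satisfies the minimum spacing are assumptions for illustration; the source does not specify how pairs are ranked.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def longest_guide_per_pam(candidates, gc_min):
    """Keep, for each PAM position, the longest candidate meeting the GC minimum."""
    best = {}
    for cand in candidates:
        if gc_fraction(cand["seq"]) < gc_min:
            continue
        current = best.get(cand["pam_pos"])
        if current is None or len(cand["seq"]) > len(current["seq"]):
            best[cand["pam_pos"]] = cand
    return list(best.values())


def top_hits(guides, max_guides):
    """Return up to max_guides guides, closest to the transcription start site first."""
    return sorted(guides, key=lambda g: g["tss_distance"])[:max_guides]


def first_pair(guides, min_gap):
    """Return the first pair (nearest the start site) whose PAM positions
    are at least min_gap apart, or None if no such pair exists."""
    ordered = sorted(guides, key=lambda g: g["tss_distance"])
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if abs(second["pam_pos"] - first["pam_pos"]) >= min_gap:
                return (first, second)
    return None
```

Returning `None` from `first_pair` corresponds to a gene that yields single guides but no suitable pair, which is one reason paired guide creation succeeds for fewer genes.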

Iteration
The program identifies genes that did not produce any guides with the primary parameters. Users have the option to lower these thresholds and re-run these genes through the program to identify additional guides. Users can elect to reduce the GC minimum, increase the maximum guide distance from the transcription start site, retain guides that contain homopolymers, and relax off-target searching. Users can relax each of these guide design constraints individually or in combination.
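The iteration logic can be sketched in Python as a loop over progressively relaxed parameter sets. This is a hedged illustration only: `design_fn` stands in for the whole guide-design step and, like the parameter dictionaries, is a hypothetical interface rather than part of Guide Finder.

```python
def design_with_iteration(genes, design_fn, primary, relaxed_sets):
    """Design guides with primary parameters, then re-run only the genes
    that produced no guides using each relaxed parameter set in turn.

    design_fn(gene, params) is a hypothetical callable returning a list
    of guides (empty if none pass the filters).
    Returns (results, genes_still_without_guides).
    """
    results = {}
    pending = list(genes)
    for params in [primary] + list(relaxed_sets):
        still_pending = []
        for gene in pending:
            guides = design_fn(gene, params)
            if guides:
                # Record which parameter set produced the guides.
                results[gene] = {"guides": guides, "params": params}
            else:
                still_pending.append(gene)
        pending = still_pending
        if not pending:
            break
    return results, pending
```

Because only the failed genes are re-run, guides obtained under the stricter primary thresholds are never replaced by guides designed under relaxed ones.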

Results & Discussion
Guide Finder is intended to reduce the effort required to design guides targeting genes in any bacterial species and accommodates both complete and draft genome annotations, the latter of which is important given the large number of unique isolates being sequenced and investigated. The program is customizable and incorporates user-defined guide constraints, including minimum GC content, maximum distance from the transcription start site, and distance between guides (for dual targeting knockdown). Recognizing the diversity of bacterial species, we aimed to create a program where users could tailor guide design parameters based on the characteristics of their organism of interest, for example, setting a relatively low guide GC minimum while working with a GC-poor species. Additionally, the program identifies genes for which no guides meeting set thresholds could be identified, allowing iterative guide-calling to maximize the number of genes targeted. Users have the option to re-run these genes through the Guide Finder program with relaxed design constraints to identify additional guides.

Although users can elect to design guides for just one gene or a handful of genes, if desired, the program is intended to be particularly useful for large-scale guide design. To investigate these intended uses, we conducted tests in silico and in vitro focusing on CRISPRi to determine: 1) the utility of the program across diverse bacterial species and 2) the ability of the program to design functional guides.

Guides for diverse genomes
Testing on Complete Genomes
Guide Finder was utilized to create guides across the genome for a diverse set of ten complete bacterial genomes (Table 1) rational design constraints. For each genome, genes that did not produce suitable guide pairs or single guides were identified by the program. These genes were re-run with the following constraints: a GC minimum of 30%, a maximum distance from the TSS of 50%, retention of guides with homopolymers, and relaxed off-target searching. These parameters were relaxed individually and in combination. The Guide Finder program was able to successfully select guides for each of the diverse genomes irrespective of genome size or GC content, but differences in output and run-time were observed (Fig. 2).

GC Content
As expected, genomes with lower GC content (<40%) were less successful in producing usable guides for each gene. For the S. epidermidis, S. aureus, A. baumannii, and L. jensenii genomes (GC contents of 33%, 32%, 39%, and 34%, respectively), the percentage of genes producing guides under the primary filtering thresholds was 68%, 67%, 79%, and 79%, respectively, considerably lower than the average for all ten genomes (87.5%). The average for genomes with >40% GC content was 97.5%. However, for genomes with low GC content, iteration with lowered parameters was very useful in recovering genes that did not originally produce guides. When the design constraints were relaxed in combination, the percentage of genes with guides improved to 98%, 93%, 89%, and 96% for S. epidermidis, S. aureus, A. baumannii, and L. jensenii, respectively (Fig. 2A).

Gene Duplications
We hypothesized that a genome known to contain a high percentage of gene duplications, such as Mycobacterium tuberculosis, would have difficulty producing a large number of usable guides, owing to the high probability of off-target matching. Surprisingly, however, this genome was able to create guides for 98% of genes using primary thresholds, probably owing to its relatively high GC content (65%).

information becomes available, Guide Finder allows users to set a minimum distance threshold that guides selected for dual knockdown must meet. As expected, paired guide creation, including a 100 bp distance-between-guides threshold, is feasible for fewer genes than single guide creation, because some genes may produce only a single suitable guide or produce guides that are located in close proximity (

were obtained from NCBI. Incomplete genomes were pre-processed with the supplied script to identify gene coordinates. Incomplete genome annotations were successfully used to design guides across the genome for each of the three species tested. In terms of percentage of genes with identified guides and run-time, there are no appreciable differences between complete and incomplete genome annotations (Fig. 2C). This result highlights the utility of the program for both types of genome annotation files.

Essential gene knockdown to validate guides
We evaluated the functional utility of Guide Finder guides by random assessment of essential gene knockdown in Staphylococcus aureus and S. epidermidis, focusing on CRISPRi as a potential application. Nearly all guides showed effective knockdown, manifested as growth defects, with the exception of those targeting groEL and rpoC (Fig. 3). Further investigation measuring transcription of the locus using qPCR showed that the guide targeting rpoC did not reduce transcription (highlighting the value of predicting and testing multiple guides). groEL was effectively targeted but was either non-essential under our tested condition, or residual transcript could be rescuing cell function (Fig. 4).

Availability of data and material
Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study. Genomes used for analysis were obtained from NCBI and PATRIC. Specific strains (including accession and genome ID numbers) are listed in the supplemental material.

Competing interests
The authors declare that they have no competing interests.

were performed for each assay as a technical replicate.

Genomes Used in Analysis
Genomes used for complete genome analysis were obtained from NCBI. Accession numbers for each strain are listed below:

Figure. Essential gene knockdown. Essential genes were targeted for knockdown in S. aureus (A) and S. epidermidis (B), and growth curves were created from OD measurements over a 16-hour growth assay. ATc = anhydrotetracycline induction; uninduced = control. Control: the empty vector (no guide) acts as a control, indicating that the growth defect is not due to ATc administration. With the exception of groEL and rpoC, the knockdown of most essential genes caused a growth defect, as expected.