title: A Tiny, Static, Full-Text Search Engine using Rust and WebAssembly
url: https://endler.dev/2019/tinysearch/
hash_url: d97914db7d2e525edc27669adbc0f917
<div class="info"><p>I wrote a basic search module that you can add to a static website. It's very lightweight (50kB-100kB gzipped) and works with Hugo, Zola, and Jekyll. Only searching for entire words is supported. Try the search box on the left for a demo. <a href="https://github.com/mre/tinysearch">The code is on GitHub</a>.</p></div><p>Static site generators are magical. They combine the best of both worlds: dynamic content without sacrificing performance.</p><p>Over the years, this blog has been running on <a href="https://github.com/mre/mre.github.io.v1">Jekyll</a>, <a href="https://github.com/mre/mre.github.io.v2">Cobalt</a>, and, lately, <a href="https://www.getzola.org/">Zola</a>.</p><p>One thing I always disliked, however, was the fact that static websites don't come with "static" search engines, too. Instead, people resort to <a href="https://cse.google.com/about">custom Google searches</a>, external search engines like <a href="https://www.algolia.com/">Algolia</a>, or pure JavaScript-based solutions like <a href="https://lunrjs.com/">lunr.js</a> or <a href="http://elasticlunr.com/">elasticlunr</a>.</p><p>All of these work fine for most sites, but none of them ever felt like the final answer.</p><p>I didn't want to add yet another dependency on Google; neither did I want to use a stand-alone web-backend like Algolia, which adds latency and is proprietary.</p><p>On the other hand, I'm not a huge fan of JavaScript-heavy websites. For example, just the search indices that lunr creates can be <a href="https://github.com/olivernn/lunr.js/issues/268#issuecomment-304490937">multiple megabytes in size</a>. That feels lavish - even by today's bandwidth standards.
On top of that, <a href="https://v8.dev/blog/cost-of-javascript-2019">parsing JavaScript is still time-consuming</a>.</p><p>I wanted some simple, lean, and self-contained search, that could be deployed next to my other static content.</p><p>As a consequence, I refrained from adding search functionality to my blog at all. That's unfortunate because, with a growing number of articles, it gets harder and harder to find relevant content.</p><h2 id="the-idea"><a class="anchor" href="#the-idea"> <svg viewbox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"> <path d="M0 0h24v24H0z" fill="none"></path> <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path> </svg> </a>The Idea</h2><p>Many years ago, in 2013, I read <a href="https://www.stavros.io/posts/bloom-filter-search-engine/">"Writing a full-text search engine using Bloom filters"</a> — and it was a revelation.</p><p>The idea was simple: Let's run all my blog articles through a generator that creates a tiny, self-contained search index using this magical data structure called a ✨<em>Bloom Filter</em> ✨.</p><h2 id="wait-what-s-a-bloom-filter"><a class="anchor" href="#wait-what-s-a-bloom-filter"> <svg viewbox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"> <path d="M0 0h24v24H0z" fill="none"></path> <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path> </svg> </a>Wait, what's a Bloom Filter?</h2><p>A <a href="https://en.wikipedia.org/wiki/Bloom_filter">Bloom filter</a> is a space-efficient way to check if an element is in a set.</p><p>The trick is that it doesn't store the elements themselves; it just knows with some confidence that they were stored before. 
In our case, it can say with a certain <em>error rate</em> that a word is in an article.<figure><img alt="A Bloom filter stores a 'fingerprint' (a number of hash values) of all input values instead of the raw input. The result is a low-memory-footprint data structure. This is an example of 'hello' as an input." src="https://endler.dev/2019/tinysearch/bloomfilter.svg"><figcaption>A Bloom filter stores a 'fingerprint' (a number of hash values) of all input values instead of the raw input. The result is a low-memory-footprint data structure. This is an example of 'hello' as an input.</figcaption></figure></p><p>Here's the Python code from the original article that generates the Bloom filters for each post (courtesy of <a href="https://www.stavros.io">Stavros Korokithakis</a>):</p><pre class="language-python" data-lang="python"><code class="language-python" data-lang="python"><span>filters </span><span>= </span><span>{}
</span><span>for </span><span>name</span><span>, </span><span>words </span><span>in </span><span>split_posts</span><span>.</span><span>items</span><span>():
</span><span>    filters[name] </span><span>= </span><span>BloomFilter</span><span>(</span><span>capacity</span><span>=</span><span>len</span><span>(words)</span><span>, </span><span>error_rate</span><span>=</span><span>0</span><span>.</span><span>1</span><span>)
</span><span>    </span><span>for </span><span>word </span><span>in </span><span>words:
</span><span>        filters[name]</span><span>.</span><span>add</span><span>(word)
</span></code></pre><p>The memory footprint is extremely small, thanks to <code>error_rate</code>, which allows for a negligible number of false positives.</p><p>I immediately knew that I wanted something like this for my homepage. My idea was to directly ship the Bloom filters and the search engine to the browser. I could finally have a small, static search without the need for a backend!</p><h2 id="headaches"><a class="anchor" href="#headaches"> <svg viewbox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"> <path d="M0 0h24v24H0z" fill="none"></path> <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path> </svg> </a>Headaches</h2><p>Disillusionment came quickly.</p><p>I had no idea how to bundle and minimize the generated Bloom filters, let alone run them on clients. The original article briefly touches on this:</p><blockquote><p>You need to implement a Bloom filter algorithm on the client-side. This will probably not be much longer than the inverted index search algorithm, but it’s still probably a bit more complicated.</p></blockquote><p>I didn't feel confident enough in my JavaScript skills to pull this off. Back in 2013, NPM was a mere three years old, and WebPack just turned one, so I also didn't know where to look for existing solutions.</p><p>Unsure what to do next, I let the idea remain a pipe dream.</p><h2 id="a-new-hope"><a class="anchor" href="#a-new-hope"> <svg viewbox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"> <path d="M0 0h24v24H0z" fill="none"></path> <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path> </svg> </a>A New Hope</h2><p>Five years later, in 2018, the web had become a different place.
Bundlers were ubiquitous, and the Node ecosystem was flourishing. One thing, in particular, revived my dreams about the tiny static search engine: <a href="https://webassembly.org/">WebAssembly</a>.</p><blockquote><p>WebAssembly (abbreviated Wasm) is a binary instruction format for a stack-based virtual machine. Wasm is designed as a portable target for compilation of high-level languages like C/C++/Rust, enabling deployment on the web for client and server applications. [<a href="https://webassembly.org/">source</a>]</p></blockquote><p>This meant that I could use a language that I was familiar with to write the client-side code — Rust! 🎉</p><p>My journey started with a <a href="https://github.com/mre/tinysearch/commit/82c1d36835348718f04c9ca0dd2c1ebf8b19a312">prototype back in January 2018</a>. It was just a direct port of the Python version from above:</p><pre class="language-rust" data-lang="rust"><code class="language-rust" data-lang="rust"><span>let mut</span><span> filters </span><span>= </span><span>HashMap</span><span>::</span><span>new()</span><span>;
</span><span>for </span><span>(name</span><span>,</span><span> words) </span><span>in</span><span> articles {
</span><span>    </span><span>let mut</span><span> filter </span><span>= </span><span>BloomFilter</span><span>::</span><span>with_rate(</span><span>0.1</span><span>,</span><span> words</span><span>.</span><span>len</span><span>() </span><span>as </span><span>u32</span><span>)</span><span>;
</span><span>    </span><span>for</span><span> word </span><span>in</span><span> words {
</span><span>        filter</span><span>.</span><span>insert</span><span>(</span><span>&amp;</span><span>word)</span><span>;
</span><span>    }
</span><span>    filters</span><span>.</span><span>insert</span><span>(name</span><span>,</span><span> filter)</span><span>;
</span><span>}
</span></code></pre><p>While I managed to create the Bloom filters for every article, I <em>still</em> had no clue how to package it for the web... until <a href="https://github.com/rustwasm/wasm-pack/commit/125431f97eecb6f3ca5122f8b345ba5b7eee94c7">wasm-pack came along in February 2018</a>.</p><h2 id="whoops-i-shipped-some-rust-code-to-your-browser"><a class="anchor" href="#whoops-i-shipped-some-rust-code-to-your-browser"> <svg viewbox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"> <path d="M0 0h24v24H0z" fill="none"></path> <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path> </svg> </a>Whoops! I Shipped Some Rust Code To Your Browser.</h2><p>Now I had all the pieces of the puzzle:</p><ul><li>Rust: a language I was comfortable with</li><li><a href="https://github.com/rustwasm/wasm-pack">wasm-pack</a>: a bundler for WebAssembly modules</li><li>A working prototype that served as a proof-of-concept</li></ul><p>The search box you see on the left side of this page is the outcome. It fully runs on Rust using WebAssembly (a.k.a. the <a href="https://twitter.com/timClicks/status/1181822319620063237">RAW stack</a>).
Try it now if you like.</p><p>There were quite a few obstacles along the way.</p><h2 id="bloom-filter-crates"><a class="anchor" href="#bloom-filter-crates"> <svg viewbox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"> <path d="M0 0h24v24H0z" fill="none"></path> <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path> </svg> </a>Bloom Filter Crates</h2><p>I looked into a few Rust libraries (crates) that implement Bloom filters.</p><p>First, I tried jedisct1's <a href="https://github.com/jedisct1/rust-bloom-filter">rust-bloom-filter</a>, but the types didn't implement <a href="https://docs.serde.rs/serde/trait.Serialize.html">Serialize</a>/<a href="https://docs.serde.rs/serde/trait.Deserialize.html">Deserialize</a>. This meant that I could not store my generated Bloom filters inside the binary and load them on the client-side.</p><p>After trying a few others, I found the <a href="https://github.com/seiflotfy/rust-cuckoofilter">cuckoofilter</a> crate, which supported serialization. The behavior is similar to Bloom filters, but if you're interested in the differences, you can look at <a href="https://brilliant.org/wiki/cuckoo-filter/">this summary</a>.</p><p>Here's how to use it:</p><pre class="language-rust" data-lang="rust"><code class="language-rust" data-lang="rust"><span>let mut</span><span> cf </span><span>= </span><span>cuckoofilter</span><span>::</span><span>new()</span><span>;
</span><span>
</span><span>// Add data to the filter
</span><span>let</span><span> value</span><span>: </span><span>&amp;</span><span>str </span><span>= </span><span>"hello world"</span><span>;
</span><span>let</span><span> success </span><span>=</span><span> cf</span><span>.</span><span>add</span><span>(value)</span><span>?</span><span>;
</span><span>
</span><span>// Lookup if data was added before
</span><span>let</span><span> success </span><span>=</span><span> cf</span><span>.</span><span>contains</span><span>(value)</span><span>;
</span><span>// success ==&gt; true
</span></code></pre><p>Let's check the output size when bundling the filters for ten articles on my blog using cuckoo filters:</p><pre><code><span>~/C/p/tinysearch ❯❯❯ l storage
</span><span>Permissions Size User    Date Modified Name
</span><span>.rw-r--r--   44k mendler 24 Mar 15:42  storage
</span></code></pre><p><strong>44kB</strong> doesn't sound too shabby, but these are just the cuckoo filters for ten articles, serialized as a Rust binary. On top of that, we have to add the search functionality and the helper code. In total, the client-side code weighed in at <strong>216kB</strong> using vanilla wasm-pack. Too much.</p><h2 id="trimming-binary-size"><a class="anchor" href="#trimming-binary-size"> <svg viewbox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"> <path d="M0 0h24v24H0z" fill="none"></path> <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path> </svg> </a>Trimming Binary Size</h2><p>After the sobering first result of 216kB for our initial prototype, we have a few options to bring the binary size down.</p><p>The first is following <a href="https://github.com/johnthagen">johnthagen's</a> advice on <a href="https://github.com/johnthagen/min-sized-rust">minimizing Rust binary size</a>.</p><p>By setting a few options in our <code>Cargo.toml</code>, we can shave off quite a few bytes:</p><pre><code><span>"opt-level = 'z'" =&gt; 249665 bytes
</span><span>"lto = true" =&gt; 202516 bytes
</span><span>"opt-level = 's'" =&gt; 195950 bytes
</span></code></pre><p>Setting <code>opt-level</code> to <code>s</code> means we trade speed for size, but we're primarily interested in minimal size anyway. After all, a small download size also improves performance.</p><p>Next, we can try <a href="https://github.com/rustwasm/wee_alloc">wee_alloc</a>, an alternative Rust allocator producing a small <code>.wasm</code> code size.</p><blockquote><p>It is geared towards code that makes a handful of initial dynamically sized allocations, and then performs its heavy lifting without any further allocations. This scenario requires some allocator to exist, but we are more than happy to trade allocation performance for small code size.</p></blockquote><p>Exactly what we want. Let's try!</p><pre><code><span>"wee_alloc and nightly" =&gt; 187560 bytes
</span></code></pre><p>We shaved off another 4% from our binary.</p><p>Out of curiosity, I tried to set <a href="https://doc.rust-lang.org/rustc/codegen-options/index.html#codegen-units">codegen-units</a> to 1, meaning we only use a single thread for code generation. Surprisingly, this resulted in a slightly smaller binary size.</p><pre><code><span>"codegen-units = 1" =&gt; 183294 bytes
</span></code></pre><p>Then I got word of a Wasm optimizer called <code>binaryen</code>. On macOS, it's available through homebrew:</p><pre><code><span>brew install binaryen
</span></code></pre><p>It ships a binary called <code>wasm-opt</code> and that shaved off another 15%:</p><pre><code><span>"wasm-opt -Oz" =&gt; 154413 bytes
</span></code></pre><p>Then I removed web-sys as we don't have to bind to the DOM: 152858 bytes.</p><p>There's a tool called <a href="https://github.com/rustwasm/twiggy">twiggy</a> to profile the code size of Wasm binaries. It printed the following output:</p><pre><code><span>twiggy top -n 20 pkg/tinysearch_bg.wasm
</span><span> Shallow Bytes │ Shallow % │ Item
</span><span>───────────────┼───────────┼────────────────────────────────
</span><span>         79256 ┊    44.37% ┊ data[0]
</span><span>         13886 ┊     7.77% ┊ "function names" subsection
</span><span>          7289 ┊     4.08% ┊ data[1]
</span><span>          6888 ┊     3.86% ┊ core::fmt::float::float_to_decimal_common_shortest::hdd201d50dffd0509
</span><span>          6080 ┊     3.40% ┊ core::fmt::float::float_to_decimal_common_exact::hcb5f56a54ebe7361
</span><span>          5972 ┊     3.34% ┊ std::sync::once::Once::call_once::{{closure}}::ha520deb2caa7e231
</span><span>          5869 ┊     3.29% ┊ search
</span></code></pre><p>From what I can tell, the biggest chunk of our binary is occupied by the raw data section for our articles. Next come the function names and some float-to-decimal helper functions, which most likely come from deserialization.</p><p>Finally, I tried <a href="https://github.com/rustwasm/wasm-snip">wasm-snip</a>, which replaces a WebAssembly function's body with an <code>unreachable</code> instruction. I ran it like so, but it didn't reduce code size:</p><pre><code><span>wasm-snip --snip-rust-fmt-code --snip-rust-panicking-code -o pkg/tinysearch_bg_snip.wasm pkg/tinysearch_bg_opt.wasm
</span></code></pre><p>After tweaking the parameters of the cuckoo filters a bit and removing <a href="https://en.wikipedia.org/wiki/Stop_words">stop words</a> from the articles, I arrived at <strong>121kB</strong> (51kB gzipped), which is not bad considering the average image size on the web is <a href="https://httparchive.org/reports/state-of-images#bytesImg">around 900kB</a>. On top of that, the search functionality only gets loaded when a user clicks into the search field.</p><h2 id="update"><a class="anchor" href="#update"> <svg viewbox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"> <path d="M0 0h24v24H0z" fill="none"></path> <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path> </svg> </a>Update</h2><p>Recently I moved the project from cuckoo filters to <a href="https://arxiv.org/abs/1912.08258">XOR filters</a>. I used the awesome <a href="https://github.com/ayazhafiz/xorf">xorf</a> project, which comes with built-in serde serialization, which allowed me to remove a lot of custom code.</p><p>With that, I could reduce the payload size by another 20-25%. I'm down to <strong>99kB</strong> (<strong>49kB gzipped</strong>) on my blog now.
🎉</p><p>The new version is released <a href="https://crates.io/crates/tinysearch">on crates.io</a> already, if you want to give it a try.</p><h2 id="frontend-and-glue-code"><a class="anchor" href="#frontend-and-glue-code"> <svg viewbox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"> <path d="M0 0h24v24H0z" fill="none"></path> <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path> </svg> </a>Frontend- and Glue Code</h2><p>wasm-pack will auto-generate the JavaScript code to talk to Wasm.</p><p>For the search UI, I customized a few JavaScript and CSS bits from <a href="https://www.w3schools.com/howto/tryit.asp?filename=tryhow_js_autocomplete">w3schools</a>. It even has keyboard support! Now when a user enters a search query, we go through the cuckoo filter of each article and try to match the words. The results are scored by the number of hits. Thanks to my dear colleague <a href="https://github.com/jorgelbg/">Jorge Luis Betancourt</a> for adding that part.</p><p><img alt="Video of the search functionality" src="./anim-opt2.gif"></p><p>(Fun fact: this animation is about the same size as the uncompressed Wasm search itself.)</p><h2 id="caveats"><a class="anchor" href="#caveats"> <svg viewbox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"> <path d="M0 0h24v24H0z" fill="none"></path> <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path> </svg> </a>Caveats</h2><p>Only whole words are matched. 
I would love to add prefix-search, but the binary became too big when I tried.</p><h2 id="usage"><a class="anchor" href="#usage"> <svg viewbox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"> <path d="M0 0h24v24H0z" fill="none"></path> <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path> </svg> </a>Usage</h2><p>The standalone binary to create the Wasm file is called <code>tinysearch</code>. It expects a single path to a JSON file as an input:</p><pre><code><span>tinysearch path/to/corpus.json
</span></code></pre><p>This <code>corpus.json</code> contains the text you would like to index. The format is pretty straightforward:</p><pre class="language-json" data-lang="json"><code class="language-json" data-lang="json"><span>[
</span><span>  {
</span><span>    </span><span>"title"</span><span>: </span><span>"Article 1"</span><span>,
</span><span>    </span><span>"url"</span><span>: </span><span>"https://example.com/article1"</span><span>,
</span><span>    </span><span>"body"</span><span>: </span><span>"This is the body of article 1."
</span><span>  }</span><span>,
</span><span>  {
</span><span>    </span><span>"title"</span><span>: </span><span>"Article 2"</span><span>,
</span><span>    </span><span>"url"</span><span>: </span><span>"https://example.com/article2"</span><span>,
</span><span>    </span><span>"body"</span><span>: </span><span>"This is the body of article 2."
</span><span>  }
</span><span>]
</span></code></pre><p>You can generate this JSON file with any static site generator. <a href="https://github.com/mre/mre.github.io/tree/1c731717b48afb584e54ca4dd5fd649f9b74e51c/templates">Here's my version for Zola</a>:</p><pre class="language-t" data-lang="t"><code class="language-t" data-lang="t"><span>{</span><span>% </span><span>set </span><span>section </span><span>= </span><span>get_section</span><span>(</span><span>path</span><span>=</span><span>"_index.md"</span><span>) </span><span>%</span><span>}
</span><span>
</span><span>[
</span><span>  {</span><span>%- </span><span>for </span><span>post in </span><span>section</span><span>.</span><span>pages </span><span>-%</span><span>}
</span><span>  {</span><span>% </span><span>if </span><span>not </span><span>post</span><span>.</span><span>draft </span><span>%</span><span>}
</span><span>  {
</span><span>    </span><span>"title"</span><span>: </span><span>{{ </span><span>post</span><span>.</span><span>title </span><span>| </span><span>striptags </span><span>| </span><span>json_encode </span><span>| </span><span>safe </span><span>}}</span><span>,
</span><span>    </span><span>"url"</span><span>: </span><span>{{ </span><span>post</span><span>.</span><span>permalink </span><span>| </span><span>json_encode </span><span>| </span><span>safe </span><span>}}</span><span>,
</span><span>    </span><span>"body"</span><span>: </span><span>{{ </span><span>post</span><span>.</span><span>content </span><span>| </span><span>striptags </span><span>| </span><span>json_encode </span><span>| </span><span>safe </span><span>}}
</span><span>  }
</span><span>  {</span><span>% </span><span>if </span><span>not </span><span>loop</span><span>.</span><span>last </span><span>%</span><span>}</span><span>,</span><span>{</span><span>% </span><span>endif </span><span>%</span><span>}
</span><span>  {</span><span>% </span><span>endif </span><span>%</span><span>}
</span><span>  {</span><span>%- </span><span>endfor </span><span>-%</span><span>}
</span><span>]
</span></code></pre><p>I'm pretty sure that the Jekyll version looks quite similar. <a href="https://learn.cloudcannon.com/jekyll/output-json/">Here's a starting point</a>. If you get something working for your static site generator, <a href="https://github.com/tinysearch/tinysearch/tree/master/howto">please let me know</a>.</p><h2 id="observations"><a class="anchor" href="#observations"> <svg viewbox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"> <path d="M0 0h24v24H0z" fill="none"></path> <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path> </svg> </a>Observations</h2><ul><li>This is still the wild west: unstable features, nightly Rust, and documentation that gets outdated almost every day.<br> Bring your thinking cap!</li><li>Creating a product out of a good idea is a lot of work. One has to pay attention to many factors: ease-of-use, generality, maintainability, documentation, and so on.</li><li>Rust is very good at removing dead code, so you usually don't pay for what you don't use. I would still advise you to be very conservative about the dependencies you add to a Wasm binary because it's tempting to add features that you don't need and which will add to the binary size. For example, I used <a href="https://github.com/TeXitoi/structopt">StructOpt</a> during testing, and I had a <code>main()</code> function that was parsing these command-line arguments. This was not necessary for Wasm, so I removed it later.</li><li>I understand that not everyone wants to write Rust code. It's <a href="https://endler.dev/2017/go-vs-rust/">complicated to get started with</a>, but the cool thing is that you can use almost any other language, too. For example, you can write Go code and compile it to Wasm, or maybe you prefer PHP or Haskell.
There is support for <a href="https://github.com/appcypher/awesome-wasm-langs">many languages</a> already.</li><li>A lot of people dismiss WebAssembly as a toy technology. They couldn't be further from the truth. In my opinion, WebAssembly will revolutionize the way we build products for the web and beyond. What was very hard just two years ago is now easy: shipping code in any language to every browser. I'm super excited about its future.</li><li>If you're looking for a standalone, self-hosted search index for your company website, check out <a href="https://journal.valeriansaliou.name/announcing-sonic-a-super-light-alternative-to-elasticsearch/">sonic</a>. Also check out <a href="https://github.com/jameslittle230/stork">stork</a> as an alternative.</li></ul><div class="info"><p>✨<strong>WOW!</strong> This tool is getting quite a bit of traction lately.✨</p><p>I don't run ads on this website, but if you like these kinds of experiments, please consider <a href="https://github.com/sponsors/mre/">sponsoring me on GitHub</a>. This allows me to write more tools like this in the future.</p><p>Also, if you're interested in <strong>hands-on Rust consulting</strong>, <a href="https://github.com/sponsors/mre/sponsorships?sponsor=mre&amp;tier_id=78832">pick a date from my calendar</a> and we can talk about how I can help.</p></div><h2 id="try-it"><a class="anchor" href="#try-it"> <svg viewbox="0 0 24 24" xmlns="http://www.w3.org/2000/svg"> <path d="M0 0h24v24H0z" fill="none"></path> <path d="M3.9 12c0-1.71 1.39-3.1 3.1-3.1h4V7H7c-2.76.0-5 2.24-5 5s2.24 5 5 5h4v-1.9H7c-1.71.0-3.1-1.39-3.1-3.1zM8 13h8v-2H8v2zm9-6h-4v1.9h4c1.71.0 3.1 1.39 3.1 3.1s-1.39 3.1-3.1 3.1h-4V17h4c2.76.0 5-2.24 5-5s-2.24-5-5-5z"></path> </svg> </a>Try it!</h2><p>The code for <a href="https://github.com/mre/tinysearch">tinysearch is on GitHub</a>.</p><p>Please be aware of these limitations:</p><ul><li><strong>Only searches for entire words.</strong> There are no search suggestions.
The reason is that prefix search blows up binary size like <a href="https://www.youtube.com/watch?v=b6u9WJ01Oxs">Mentos and Diet Coke</a>.</li><li>Since we bundle all search indices for all articles into one static binary, I <strong>only recommend using it for low- to medium-sized websites</strong>. Expect around 4kB (non-compressed) per article.</li><li><strike>The <strong>compile times are abysmal</strong> at the moment (around 1.5 minutes after a fresh install on my machine), mainly because we're compiling the Rust crate from scratch every time we rebuild the index.</strike><br> Update: This is mostly fixed thanks to the awesome work of <a href="https://github.com/CephalonRho">CephalonRho</a> in PR <a href="https://github.com/mre/tinysearch/pull/13">#13</a>. Thanks again!</li></ul><p>The final Wasm code is laser-fast because we save the round trips to a search server. The instant feedback loop feels more like filtering a list than searching through posts. It can even work fully offline, which might be nice if you like to bundle it with an app.</p>
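
<p>To make the approach concrete, here is a minimal, self-contained sketch of the idea described in this article: one small filter per article, and a search that scores each article by the number of query words its filter reports. It is not the tinysearch implementation. The <code>BloomFilter</code> type, the <code>tokenize</code>, <code>build_index</code>, and <code>search</code> helpers, and the sizing constants (10 bits per word, 4 probes) are assumptions made up for illustration, and it uses a plain Bloom filter built on the standard library's hasher rather than the cuckoo or XOR filters mentioned above.</p>

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// A deliberately tiny Bloom filter (hypothetical type, for illustration only).
struct BloomFilter {
    bits: Vec<bool>,
    probes: u64,
}

impl BloomFilter {
    fn new(num_bits: usize, probes: u64) -> Self {
        // At least one bit, so the modulo in `position` never divides by zero.
        BloomFilter { bits: vec![false; num_bits.max(1)], probes }
    }

    /// Derive a bit position by hashing the pair (probe number, word).
    fn position(&self, word: &str, probe: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        probe.hash(&mut hasher);
        word.hash(&mut hasher);
        (hasher.finish() % self.bits.len() as u64) as usize
    }

    fn insert(&mut self, word: &str) {
        for probe in 0..self.probes {
            let pos = self.position(word, probe);
            self.bits[pos] = true;
        }
    }

    /// May report a false positive, but never a false negative.
    fn contains(&self, word: &str) -> bool {
        (0..self.probes).all(|probe| self.bits[self.position(word, probe)])
    }
}

/// Split text into lowercase words, dropping punctuation.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|w| !w.is_empty())
        .map(|w| w.to_lowercase())
        .collect()
}

/// Build one filter per (title, body) pair.
fn build_index(articles: &[(&str, &str)]) -> Vec<(String, BloomFilter)> {
    articles
        .iter()
        .map(|(title, body)| {
            let words = tokenize(body);
            // Assumed sizing for this sketch: 10 bits per word, 4 probes.
            let mut bloom = BloomFilter::new(words.len() * 10, 4);
            for word in &words {
                bloom.insert(word);
            }
            (title.to_string(), bloom)
        })
        .collect()
}

/// Score every article by how many query words its filter reports.
fn search(index: &[(String, BloomFilter)], query: &str) -> Vec<(String, usize)> {
    let terms = tokenize(query);
    let mut results: Vec<(String, usize)> = index
        .iter()
        .map(|(title, bloom)| {
            let hits = terms.iter().filter(|t| bloom.contains(t.as_str())).count();
            (title.clone(), hits)
        })
        .filter(|(_, hits)| *hits > 0)
        .collect();
    // Highest number of hits first.
    results.sort_by(|a, b| b.1.cmp(&a.1));
    results
}

fn main() {
    let index = build_index(&[
        ("Tiny search", "A static search engine using Rust and WebAssembly"),
        ("Pasta night", "Cooking fresh pasta at home"),
    ]);
    for (title, hits) in search(&index, "rust search") {
        println!("{}: {} matching word(s)", title, hits);
    }
}
```

<p>Because a filter can report false positives but never false negatives, an article that really contains every query word always gets the full score, while an unrelated article may occasionally show up with a spurious hit.</p>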