benjamin-paine's picture
Switch to kokoro
66fa322
raw
history blame
12.7 kB
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Anachrovox</title>
<!-- Cloud assets -->
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Roboto:ital,wght@0,100;0,300;0,400;0,500;0,700;0,900;1,100;1,300;1,400;1,500;1,700;1,900&family=Meie+Script&family=Lekton:ital,wght@0,400&display=swap" rel="stylesheet">
<script src="https://cdn.jsdelivr.net/npm/[email protected]/dist/ort.min.js"></script>
<script src="https://cdn.jsdelivr.net/npm/[email protected]/dist/taproot.min.js"></script>
<script src="https://cdn.jsdelivr.net/npm/[email protected]/dist/hey-buddy.min.js"></script>
<!-- Local assets -->
<link rel="stylesheet" href="index.css">
</head>
<body>
<main>
<section id="transcription">
<div id="bezel"></div>
<div class="reflection dodge"></div>
<div class="reflection"></div>
<div id="content">
<div id="welcome">Welcome to Anachrovox. Speak your command to 'Vox' or type it in below.</div>
<div id="history"></div>
<div id="input">
<textarea id="prompt" rows=3></textarea>
</div>
</div>
</section>
<section id="waveform">
<div class="circle-reflection dodge"></div>
<div class="circle-reflection"></div>
<div class="oscilloscope-grid"></div>
<canvas width=256 height=256></canvas>
<div class="circle-shading"></div>
</section>
<section id="voice">
<input
id="voice-id"
class="solari-board"
type="text"
value=""
maxlength="7"
/>
<input
id="voice-id-wheel",
class="wheel"
type="button"
/>
<input
id="volume"
class="dial"
type="number"
min="0"
max="1"
step="0.01"
value="1.0"
data-angle-start="-45"
data-angle-end="45"
data-labels='{"0":"Off", "1":"Max"}'
/>
<input
id="speed"
class="dial"
type="number"
min="0.75"
max="1.75"
step="0.01"
value="1.25"
data-angle-start="-90"
data-angle-end="90"
data-labels='{"0.75":"Slw", "1.25":"Nrm", "1.75":"Fst"}'
/>
</section>
<section id="brain">
<input
id="top-k-display"
type="number"
class="seven-segment"
min=1
max=100
value=40
/>
<input
id="top-k"
class="dial decoupled"
type="number"
min=1
max=100
step=1
value=40
/>
<input
id="min-p"
class="slider"
type="number"
min=0
max=1
step=0.01
value=0.5
data-labels=6
/>
<input
id="top-p"
class="slider"
type="number"
min=0
max=1
step=0.01
value=0.95
data-labels=6
/>
<input
id="temperature"
class="slider"
type="number"
min=0
max=1
step=0.01
value=0.20
data-labels=6
/>
</section>
<section id="microphone">
<div id="listen-button">
<button id="listen"></button>
<label>Call</label>
</div>
<div id="speaker"></div>
<div id="power-switch">
<input
id="power-switch-input"
type="checkbox"
class="power"
checked
/>
</div>
<div class="bulb-indicator" id="recording"><label>RECORD</label></div>
<div class="bulb-indicator" id="listening"><label>LISTEN</label></div>
<div class="bulb-indicator" id="power"><label>POWER</label></div>
</section>
<section id="logo">
Anachrovox
</section>
<section id="credits">
Released under the <a href="https://www.apache.org/licenses/LICENSE-2.0.txt" target="_blank">Apache License v2.0</a> in 2024 by <a href="mailto:[email protected]">Benjamin Paine</a>.
<div class="smaller">Anachrovox can make mistakes. Check all outputs.</div>
</section>
</main>
<aside id="info">
<button id="info-button">i</button>
<div id="info-content">
<h1 id="info-logo">Anachrovox</h1>
<h2 id="info-version">Alpha Release v0.1.3</h2>
<h3>Instructions</h3>
<p>To issue a voice command in a hands-free fashion, start your command with one of the supported wake phrases, then issue your command naturally (i.e. you do not need to pause.) There are many supported wake phrases but they all include 'Vox' - for example, 'Hey Vox', 'Hi Vox', or just 'Vox'.</p>
<p>There are numerous ways to shortcut the speech-to-speech workflow:</p>
<ul>
<li>Turning the volume all the way down will disable the text-to-speech step, effectively creating a text-only mode.</li>
<li>Directly type your query into the text box and press enter to issue commands without needing to speak.</li>
<li>Press and hold the call button to issue a voice command without needing to use a wake phrase.</li>
</ul>
<h3>About</h3>
<p>Anachrovox is a real-time voice assistant that combines two of my other projects:</p>
<ol>
<li><a href="https://github.com/painebenjamin/taproot" href="_blank">Taproot</a>, a scalable and lightning-fast task-centric backend inference engine, enabling easy installation and deployment of models for speech recognition, natural language understanding, and text-to-speech synthesis.</li>
<li><a href="https://github.com/painebenjamin/hey-buddy" href="_blank">Hey Buddy</a>, a real-time in-browser audio wake-word detector and training library to listen for wake phrases to trigger the assistant, enabling hands-free, always-on voice interaction without the need to use the backend until specifically requested.</li>
</ol>
<p>This is an <strong>alpha</strong> release of both Anachrovox and the underlying libraries, so bugs in all aspects of operation are expected. As it uses a large language model at it's heart, you should take the same precautions as you would with any other language model, such as not using it for sensitive tasks and not trusting it's output as fact without verification.</p>
<h3>High Availability</h3>
<p>In addition to Taproot's low-overhead design, it is designed to be clustered and highly available. This means that you can run multiple instances of Anachrovox and they will automatically load-balance and fail-over between each other, ensuring that the assistant is always available and responsive. You will need to perform some networking and configuration to enable this, see the GitHub repository for more information. If you are using Anachrovox on one of the official HuggingFace spaces, this is happening automatically between them.</p>
<h3>Features</h3>
<p>The main feature of Anachrovox is real-time, hands-free speech-to-speech large language model interaction. This is achieved through the following components, all of which are open-source and available for use in your own projects:</p>
<ol>
<li>Wake-word detection using <a href="https://github.com/painebenjamin/hey-buddy" target="_blank">Hey Buddy</a></li>
<li>Speech-to-text using <a href="https://huggingface.co/distil-whisper/distil-large-v3" target="_blank">Distil-Whisper Large</a></li>
<li>Text generation using <a href="https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct" target="_blank">Llama 3.2 3B Instruct</a></li>
<li>Text-to-speech using <a href="https://huggingface.co/hexgrad/Kokoro-82M" target="_blank">Kokoro</a></li>
<li>Audio enhancement using <a href="https://github.com/Rikorose/DeepFilterNet" target="_blank">DeepFilterNet</a></li>
</ol>
<p>All backend models are ran through Taproot, which is made to be as low-overhead as possible, allowing for real-time operation on consumer hardware. These are just a small selection of the supported model set, but were chosen for their balance of speed, size and capability. Visit <a href="https://github.com/painebenjamin/anachrovox" target="_blank">the Anachrovox GitHub</a> to see how to build with different supported components and/or visit <a href="https://github.com/painebenjamin/taproot" target="_blank">the Taproot GitHub</a> to request a new supported component or learn how to build your own (and hopefully contribute it back to the community!)</p>
<p>To improve the usefulness of the assistant, the following tools are available to use. These are invoked conversationally, either when you ask directly or sometimes when the assistant thinks it can help.</p>
<ul>
<li>DuckDuckGo news headlines, blurb search, and deep-dive.</li>
<li>Wikipedia search</li>
<li>Fandom (formerly Wikia) search</li>
<li>Date and time by timezone or location</li>
<li>Current and forecasted weather by location</li>
</ul>
<h3>Citations</h3>
<cite>@misc{gandhi2023distilwhisper,
title = {Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling},
author = {Sanchit Gandhi and Patrick von Platen and Alexander M. Rush},
year = {2023},
eprint = {2311.00430},
archivePrefix = {arXiv},
primaryClass = {cs.CL}
}</cite>
<cite>@misc{dubey2024llama3herdmodels,
title = {The Llama 3 Herd of Models},
author = {Llama Team, AI @ Meta},
year = {2024}
eprint = {2407.21783},
archivePrefix = {arXiv},
primaryClass = {cs.AI},
url = {https://arxiv.org/abs/2407.21783}
}</cite>
<cite>@misc{kokoro82m,
title = {Kokoro-82M}
author = {@rzvzn}
year = {2024}
url = {https://huggingface.co/hexgrad/Kokoro-82M}
}</cite>
<cite>@inproceedings{schroeter2023deepfilternet3,
title = {{DeepFilterNet}: Perceptually Motivated Real-Time Speech Enhancement},
author = {Schröter, Hendrik and Rosenkranz, Tobias and Escalante-B., Alberto N. and Maier, Andreas},
booktitle = {INTERSPEECH},
year = {2023},
}</cite>
</div>
</aside>
<a id="github" href="https://github.com/painebenjamin/anachrovox" target="_blank" title="GitHub"><img src="static/github.svg" alt="GitHub"></a>
</body>
<script type="module" src="inputs.js"></script>
<script type="module" src="index.js"></script>
</html>