Text to Speech
LearnKit can read flashcard content aloud using your device’s built-in speech engine or a cloud TTS provider.
Configure it in Settings -> Audio.
Main controls
The current Audio tab includes:
| Setting | What it does |
|---|---|
| Text to speech | Turns TTS on for study |
| Limit to group | Reads aloud only cards in one group, if set |
| Autoplay | Reads the question automatically, then the answer when revealed |
| Widget read-aloud | Enables replay buttons and speech in the study widget |
| Gatekeeper read-aloud | Enables replay buttons and speech in Gatekeeper |
| Cloze read mode | Reads either just the answer or the full sentence |
| Default voice | Chooses the accent and dialect for Latin-script text |
| Advanced options | Chooses fallback languages for Arabic, Chinese, Cyrillic, and Devanagari scripts |
| Speech rate / Speech pitch | Tunes playback |
| Preview voice | Plays a test sample |
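To make the script fallback idea concrete, here is a minimal Python sketch of how text in a non-Latin script could be routed to a fallback language. It is not LearnKit's implementation: the function names, the `SCRIPT_FALLBACKS` table, and the language codes in it are assumptions chosen for illustration.

```python
import unicodedata
from collections import Counter

# Hypothetical fallback choices; in LearnKit these would come from Advanced options.
SCRIPT_FALLBACKS = {
    "ARABIC": "ar",
    "CJK": "zh-CN",
    "CYRILLIC": "ru",
    "DEVANAGARI": "hi",
}


def dominant_script(text):
    """Return the most common script among the letters in text, or None."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        # For these scripts the Unicode character name starts with the script,
        # e.g. "CYRILLIC SMALL LETTER A" or "CJK UNIFIED IDEOGRAPH-4F60".
        counts[unicodedata.name(ch, "UNKNOWN").split(" ")[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


def pick_language(text, default_voice="en-GB"):
    """Use the script fallback when one applies, otherwise the default voice."""
    return SCRIPT_FALLBACKS.get(dominant_script(text), default_voice)
```

With this sketch, `pick_language("Привет")` falls back to the Cyrillic choice, while Latin-script text keeps the default voice.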
Cloud TTS providers
When Use external TTS provider is enabled, LearnKit sends card text to a cloud API instead of the device speech engine. Cloud voices generally sound more natural than built-in system voices.
Supported providers
| Provider | Model(s) | Flag-based language | Notes |
|---|---|---|---|
| OpenAI | GPT-4o Mini TTS | ✓ | Steerable, expressive voice. The accent instruction is sent silently (not spoken aloud). |
| ElevenLabs | Multilingual v2, Turbo v2.5 | ✓ | 29–32 languages supported. |
| Google Cloud | Default (voice-native) | ✓ | Language set from flag code or voice selection. |
| Custom | User-defined | — | Any HTTP endpoint that returns audio. Fully user-configured. |
How language detection works
When Use flags for language and accent is enabled, inline flag codes such as {{es}} or {{fr-ca}} are detected automatically. The detected language is then passed to the provider:
- OpenAI: Sent as an instruction telling the model to speak with native pronunciation. The instruction itself is not spoken aloud.
- ElevenLabs: Sent as a `language_code` parameter so the multilingual model pronounces text correctly.
- Google Cloud: Sent as the `languageCode` in the voice configuration.
This means a card with {{es}} Sensible on the question and {{en}} Sensible on the answer will be read with a Spanish accent and an English accent respectively — even though the word is identical.
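The mapping from a flag code to a provider-specific field can be sketched in Python as follows. This is an illustration under assumptions, not LearnKit's code: the regular expression for flag tokens, the function names, the wording of the OpenAI instruction, and the simplified dictionaries (which are not complete request bodies) are all hypothetical. Real providers may also need the code normalised before it is sent.

```python
import re

# Assumed token shape: {{es}}, {{fr-ca}}, and similar.
FLAG_RE = re.compile(r"\{\{([a-z]{2,3}(?:-[a-z]{2,4})?)\}\}", re.IGNORECASE)


def detect_flag(text):
    """Return (language_code, text_without_flags); code is None if no flag."""
    match = FLAG_RE.search(text)
    code = match.group(1).lower() if match else None
    cleaned = " ".join(FLAG_RE.sub(" ", text).split())
    return code, cleaned


def provider_params(provider, text):
    """Attach the detected language using the field each provider expects."""
    code, cleaned = detect_flag(text)
    params = {"text": cleaned}
    if code is None:
        return params
    if provider == "openai":
        # Hypothetical wording; the instruction steers the voice and is not read out.
        params["instructions"] = f"Speak with native pronunciation for language '{code}'."
    elif provider == "elevenlabs":
        params["language_code"] = code
    elif provider == "google":
        params["languageCode"] = code
    else:
        raise ValueError(f"unknown provider: {provider}")
    return params
```

For the card above, `provider_params("elevenlabs", "{{es}} Sensible")` and `provider_params("elevenlabs", "{{en}} Sensible")` carry the same text but different language fields, which is why the two sides sound different.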
Provider settings
Each provider shows:
| Setting | Description |
|---|---|
| API key | Your provider API key. Stored locally, never synced. |
| Voice | Provider-specific voice. Known providers show a searchable list; custom uses a text field. |
| Model | Provider-specific model. Only multilingual models are offered. |
| Endpoint URL | Shown only for the Custom provider. |
Caching
The Cache generated audio setting saves MP3 files locally in the plugin folder so that repeated reviews do not make duplicate API calls.
How caching works
Each card side (question and answer) gets its own cached audio file. The cache key is based on the card ID and field side, not the text content, so a cached file is reused as long as the card exists and has not been edited.
For cloze cards, cloze read mode (Just the answer vs Full sentence) also factors into the cache key. Switching mode triggers a new API request.
For MCQ cards, the shuffled display order is encoded into the cache key. If the option order changes between reviews (e.g. because randomisation is enabled), a new API call is made for that ordering.
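The rules above can be illustrated with a small Python sketch of a cache key. The function name, the separator, the hash, and the file-name layout are assumptions for illustration only. The one point taken from the text is which inputs participate: card ID, side, cloze read mode, and MCQ option order, but never the card text.

```python
import hashlib


def cache_key(card_id, side, cloze_mode=None, mcq_order=None):
    """Derive a cache file name from card identity, never from the card text."""
    parts = [card_id, side]
    if cloze_mode is not None:
        # Switching cloze read mode must produce a different key.
        parts.append(f"cloze={cloze_mode}")
    if mcq_order is not None:
        # Each shuffled option order gets its own audio file.
        parts.append("order=" + "-".join(str(i) for i in mcq_order))
    digest = hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()[:16]
    # Hypothetical layout: prefixing the card ID makes per-card cleanup easy.
    return f"{card_id}_{digest}.mp3"
```

Calling `cache_key("card42", "question")` twice gives the same name, so the second review can reuse the file, while a different side, cloze mode, or option order yields a new name and therefore a new API request.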
When cached audio is deleted
All cached audio files for a card are deleted whenever any field on that card is edited, regardless of which field changed. This covers the question, answer, cloze text, options, title, and info fields. Editing a single field clears every cached side (question, answer, options, etc.) so that stale audio is never replayed.
Cache files are also cleaned up automatically when cards are deleted or updated during sync.
You can manually clear the entire cache from Settings -> Audio -> Clear TTS cache.
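As a rough Python sketch of per-card invalidation, the function below removes every cached file that belongs to one card. It assumes, hypothetically, that cache file names start with the card ID followed by an underscore. LearnKit's real file layout is not documented here, and the demo uses a temporary directory rather than the plugin folder.

```python
import tempfile
from pathlib import Path


def clear_card_cache(cache_dir, card_id):
    """Delete every cached side for one card; return the removed file names."""
    removed = []
    # Assumed naming scheme: "<card_id>_<anything>.mp3".
    for path in sorted(Path(cache_dir).glob(f"{card_id}_*.mp3")):
        path.unlink()
        removed.append(path.name)
    return removed


# Demo: editing card1 clears both of its sides and leaves card2 untouched.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("card1_question.mp3", "card1_answer.mp3", "card2_question.mp3"):
        (Path(tmp) / name).write_bytes(b"")
    removed = clear_card_cache(tmp, "card1")
    remaining = sorted(p.name for p in Path(tmp).iterdir())
```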
Cross-device sync
Cached audio files are stored inside the plugin data folder (.obsidian/plugins/learnkit/tts-cache/). This folder is not synced by Obsidian Sync or most cloud sync tools, so each device builds its own cache independently. The first playback of a card on a new device will make a fresh API call.
What TTS can read
TTS is designed for text-based flashcards such as:
- Basic and reversed flashcards
- Cloze flashcards
- Multiple choice questions
- Ordered questions
Image occlusion is a visual format, so it is not well suited to a TTS-first workflow.
Cloze read mode
Cloze read mode has two options:
- Just the answer reads only the missing term.
- Full sentence reads the sentence with the answer filled back in.
If a cloze uses hint syntax such as {{c1::Paris::city}}, LearnKit reads the hint on the hidden side and the full answer on reveal.
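The two read modes and the hint behaviour can be sketched in Python like this. The regular expression, the function names, the mode strings, and the spoken placeholder "blank" for a cloze without a hint are all assumptions for illustration. The text above only specifies what happens when a hint is present.

```python
import re

# Assumed cloze shape: {{c<number>::answer}} or {{c<number>::answer::hint}}.
CLOZE_RE = re.compile(r"\{\{c\d+::(.*?)(?:::(.*?))?\}\}")


def hidden_side(text):
    """Question side: read the hint in place of each hidden answer."""
    # "blank" is a hypothetical placeholder for clozes that have no hint.
    return CLOZE_RE.sub(lambda m: m.group(2) or "blank", text)


def revealed_side(text, mode):
    """Answer side: read only the answers, or the sentence with answers filled in."""
    if mode == "answer":
        return ", ".join(m.group(1) for m in CLOZE_RE.finditer(text))
    if mode == "sentence":
        return CLOZE_RE.sub(lambda m: m.group(1), text)
    raise ValueError(f"unknown cloze read mode: {mode}")
```

For the card text `The capital of France is {{c1::Paris::city}}.`, the hidden side reads "The capital of France is city.", Just the answer reads "Paris", and Full sentence reads "The capital of France is Paris."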
Flag-aware routing
The Flag-aware routing section lets inline flags control language and accent during playback.
- Use flags for language and accent lets tokens like {{es}} or {{es-mx}} change voice selection.
- Announce language name speaks the language name before each flag-switched segment.
See Flags and Flag-Codes for the token system.
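One way to picture flag-aware routing is as splitting a card's text into segments, each tagged with the language that should voice it. The Python sketch below does that under assumptions: the token pattern, the `LANGUAGE_NAMES` table, the default language, and the choice to voice announcements in the default language are hypothetical and not taken from LearnKit.

```python
import re

# Assumed token shape: {{es}}, {{es-mx}}, and similar.
FLAG_RE = re.compile(r"\{\{([a-z]{2,3}(?:-[a-z]{2,4})?)\}\}", re.IGNORECASE)

# Hypothetical names used when "Announce language name" is on.
LANGUAGE_NAMES = {"en": "English", "es": "Spanish", "fr": "French"}


def split_by_flags(text, default_lang="en", announce=False):
    """Return (language_code, text) segments in reading order."""
    segments = []
    lang = default_lang
    # With one capture group, re.split alternates: text, flag, text, flag, ...
    for index, piece in enumerate(FLAG_RE.split(text)):
        if index % 2 == 1:
            lang = piece.lower()  # a flag switches the language for what follows
            continue
        chunk = piece.strip()
        if not chunk:
            continue
        if announce and index > 0:
            name = LANGUAGE_NAMES.get(lang.split("-")[0], lang)
            segments.append((default_lang, name))
        segments.append((lang, chunk))
    return segments
```

For example, `split_by_flags("Hello {{es}} hola")` yields an English segment followed by a Spanish one. With `announce=True`, the word "Spanish" is queued before the Spanish segment.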
Voice availability
LearnKit uses system voices loaded through the Web Speech API. Voice quality depends on your platform and installed voices.
- macOS and iOS usually have the best built-in voice selection.
- Windows quality depends on which Microsoft voices are installed.
- Linux depends heavily on local speech packages.
If the Audio tab shows no available system voices yet, reopen the tab and let the voice list load again.
Practical tips
- Set the default voice to the language you review most often.
- Use advanced script options when the same script could map to multiple languages.
- Combine TTS with flags when a single card mixes languages or accents.
- Keep group-limited TTS for language subsets if you do not want every card read aloud.
Last modified: 24/04/2026