
New SoTA open source TTS model from Boson AI

Boson AI has released Higgs Audio v3 TTS, a 4B-parameter, state-of-the-art open-source text-to-speech model that supports 100+ languages with zero-shot voice cloning and expressive control. It targets voice chat use cases and is released for research and non-commercial use.

Source: Hacker News AI · Author: silinmeng

Higgs Audio v3 TTS

Higgs Audio v3 TTS is built for voice chat: it speaks, not just reads. It turns model responses into expressive conversational speech across 100+ languages, with zero-shot voice cloning and inline control over emotion, style, prosody, pauses, and sound effects.

Released for research and non-commercial use under the Boson Higgs Audio v3 Research and Non-Commercial License. Production, hosted APIs, or revenue-generating use requires a separate commercial license. Prohibited: voice cloning without consent, impersonation, fraud, election deception, biometric surveillance, or any unlawful use.

The Higgs autoregressive decoder consumes interleaved text and audio tokens. The Higgs Tokenizer encodes audio into 8 codebooks at 25 fps; the codebooks are staggered via a delay pattern and then mapped to backbone hidden states through a multi-codebook fused embedding. On the output side, codes pass through a multi-codebook fused head, are de-delayed, and are decoded back to a waveform.
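The staggering and de-delaying steps described above can be illustrated with a small sketch. It assumes the common form of a delay pattern, in which codebook k is shifted right by k frames; the model card does not spell out the exact offsets. The filler value `PAD` and the function names are hypothetical and are not taken from the Higgs code.

```python
from typing import List

NUM_CODEBOOKS = 8  # from the model card
PAD = -1           # hypothetical filler id, not the model's real special token


def apply_delay(codes: List[List[int]], pad: int = PAD) -> List[List[int]]:
    """Shift codebook k right by k frames: K x T -> K x (T + K - 1)."""
    k = len(codes)
    if any(len(row) != len(codes[0]) for row in codes):
        raise ValueError("all codebooks must have the same number of frames")
    # Row i gets i filler entries in front and (k - 1 - i) behind.
    return [[pad] * i + list(row) + [pad] * (k - 1 - i)
            for i, row in enumerate(codes)]


def remove_delay(delayed: List[List[int]]) -> List[List[int]]:
    """Undo apply_delay: K x (T + K - 1) -> K x T."""
    k = len(delayed)
    t = len(delayed[0]) - (k - 1)
    return [row[i:i + t] for i, row in enumerate(delayed)]


# Toy input: 8 codebooks, 4 frames, code value = 10 * codebook + frame.
codes = [[10 * cb + frame for frame in range(4)] for cb in range(NUM_CODEBOOKS)]
delayed = apply_delay(codes)
restored = remove_delay(delayed)

for row in delayed:
    print(row)
```

Under this assumption, each column of the delayed grid mixes frame t of codebook 0 with frame t - 1 of codebook 1, and so on, so an autoregressive step can condition later codebooks on earlier ones for the same frame. The cost is K - 1 = 7 extra steps per audio segment.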

Component | Spec
--- | ---
Backbone | ~4B autoregressive decoder (36 layers, hidden=2560, GQA 32/8)
Multi-codebook embedding / head | Fused single-tensor, tied with text embedding
Context length | 8,192 tokens (training sequence length)
Audio tokens | 8 codebooks × 1026 vocab, delay pattern
Sample rate | 24 kHz
Frame rate | 25 fps (40 ms / frame)
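The figures in the spec table imply a few derived quantities that are handy when budgeting audio length. The sketch below only does arithmetic on the published numbers. The last value rests on an assumption: that the fused embedding maps each audio frame to exactly one context position, and that text tokens and delay padding are ignored. The model card does not confirm either point, so treat that value as a rough upper bound.

```python
import math

# Values taken from the spec table.
SAMPLE_RATE_HZ = 24_000
FRAME_RATE_FPS = 25
NUM_CODEBOOKS = 8
CODEBOOK_VOCAB = 1026
CONTEXT_TOKENS = 8_192

# Waveform samples covered by one audio frame.
samples_per_frame = SAMPLE_RATE_HZ // FRAME_RATE_FPS

# Duration of one frame in milliseconds (matches the table's 40 ms).
frame_ms = 1000 / FRAME_RATE_FPS

# Discrete codes emitted per second of audio across all codebooks.
codes_per_second = NUM_CODEBOOKS * FRAME_RATE_FPS

# Upper bound on token bitrate; part of the 1026-entry vocab may be
# reserved for special tokens, which would lower the real figure slightly.
max_bits_per_second = codes_per_second * math.log2(CODEBOOK_VOCAB)

# ASSUMPTION: one context position per audio frame, no text tokens.
max_audio_seconds = CONTEXT_TOKENS / FRAME_RATE_FPS

print(f"samples per frame : {samples_per_frame}")
print(f"frame duration    : {frame_ms} ms")
print(f"codes per second  : {codes_per_second}")
print(f"bitrate bound     : {max_bits_per_second:.0f} bit/s")
print(f"audio in context  : {max_audio_seconds:.2f} s")
```

Under these assumptions the training context corresponds to at most about five and a half minutes of audio; any text prompt or reference audio for cloning reduces that budget.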

Supported Languages

The model reaches single-digit WER/CER on 102 languages, which split into two tiers.
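For readers unfamiliar with the metrics, the following sketch shows the standard definition of word and character error rate: edit distance between a reference transcript and a hypothesis, divided by the reference length, expressed as a percentage. The model card does not say which ASR system or text normalization was used to score the synthesized speech, so this illustrates the metric only, not the evaluation pipeline. The function names are illustrative.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance over reference length, as a percentage."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: compare whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: compare individual characters."""
    return error_rate(list(reference), list(hypothesis))


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
print(f"WER: {wer('the cat sat on the mat', 'the cat sit on mat'):.1f}")
print(f"CER: {cer('kitten', 'sitting'):.1f}")
```

CER is typically reported instead of WER for languages without whitespace word boundaries, such as Chinese or Japanese, which is presumably why the model card quotes both.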

WER/CER under 5 — polished, production-quality (85)

🇿🇦 Afrikaans · 🇸🇦🇪🇬 Arabic · 🇦🇲 Armenian · 🇮🇳 Assamese · 🇪🇸 Asturian · 🇦🇿 Azerbaijani · 🇷🇺 Bashkir · 🇪🇸 Basque · 🇧🇾 Belarusian · 🇧🇩🇮🇳 Bengali · 🇧🇦 Bosnian · 🇧🇬 Bulgarian · 🇪🇸 Catalan · 🇵🇭 Cebuano · 🇮🇶 Central Kurdish · 🇨🇳 Chinese · 🇭🇷 Croatian · 🇨🇿 Czech · 🇩🇰 Danish · 🇳🇱🇧🇪 Dutch · 🇷🇺 Eastern Mari · 🇺🇸🇬🇧🇦🇺 English · 🌐 Esperanto · 🇪🇪 Estonian · 🇫🇮 Finnish · 🇫🇷🇨🇦 French · 🇪🇸 Galician · 🇬🇪 Georgian · 🇩🇪🇦🇹 German · 🇬🇷 Greek · 🇮🇳 Gujarati · 🇭🇹 Haitian Creole · 🇳🇬 Hausa · 🇮🇱 Hebrew · 🇮🇳 Hindi · 🇭🇺 Hungarian · 🇮🇩 Indonesian · 🇮🇹 Italian · 🇯🇵 Japanese · 🇮🇩 Javanese · 🇮🇳 Kannada · 🇰🇿 Kazakh · 🇰🇷 Korean · 🇷🇼 Kinyarwanda · 🇰🇬 Kyrgyz · 🇱🇻 Latvian · 🇨🇩 Lingala · 🇱🇹 Lithuanian · 🇰🇪 Luo · 🇲🇰 Macedonian · 🇲🇾🇮🇩 Malay · 🇮🇳 Malayalam · 🇲🇹 Maltese · 🇳🇿 Māori · 🇮🇳 Marathi · 🇲🇳 Mongol

[truncated for AI cost control]