hok

joined 1 year ago
[–] hok@lemmy.dbzer0.com 14 points 2 days ago (6 children)

Curious, how do you evaluate the performance without breaking the law?

 

I want to fine tune an LLM to "steer" it in the right direction. I have plenty of training examples in which I stop the generation early and correct the output to go in the right direction, and then resume generation.

Basically, for my dataset doing 100 "steers" on a single task is much cheaper than having to correct 100 full generations completely, and I think each of these "steer" operations has value and could be used for training.

So maybe I'm looking for some kind of localized DPO. Does anyone know if something like this exists?

[–] hok@lemmy.dbzer0.com 3 points 2 weeks ago

You are right. Their description of "SOTA Open Source TTS" caused me to assume it was open source, but it's clear that

This codebase and all models are released under CC-BY-NC-SA-4.0 License.

So, it's "source available" and not released under a permissive licence.

 

People are talking about the new Llama 3.3 70b release, which has generally better performance than Llama 3.1 (approaching 3.1's 405b performance): https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_3

However, something to note:

Llama 3.3 70B is provided only as an instruction-tuned model; a pretrained version is not available.

Is this the end of open-weight pretrained models from Meta, or is Llama 3.3 70b instruct just a better-instruction-tuned version of a 3.1 pretrained model?

Comparing the model cards: 3.1: https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md 3.3: https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/MODEL_CARD.md

The same knowledge cutoff, same amount of training data, and same training time give me hope that it's just a better finetune of maybe Llama 3.1 405b.

[–] hok@lemmy.dbzer0.com 4 points 2 weeks ago* (last edited 2 weeks ago)

I followed their instructions here: https://speech.fish.audio/

I am using the locally-run API server to do inference: https://speech.fish.audio/inference/#http-api-inference

I don't know about other ways. To be clear, this is not (necessarily) an LLM, it's just for speech synthesis, so you don't run it on ollama. That said I think it does technically use Llama under the hood since there are two models, one for encoding text and the other for decoding to audio. Honestly the paper is terrible but it explains the architecture somewhat: https://arxiv.org/pdf/2411.01156

 

I've been waiting for an open source TTS model that was actually good enough to capture some of the subtleties of language and synthesize them in a natural-sounding way that makes sense. I think I finally found one that fits the requirements.

Model: https://huggingface.co/fishaudio/fish-speech-1.5

It uses an encoder rather than relying on phonemes, and generations sometimes vary because of that, but the amount of errors I've gotten are minimal, and the variations in the generation are all surprisingly natural in slightly different ways, which is very exciting.

Give it a spin if you are also looking for a TTS model that sounds good. It uses voice cloning, so find a good 10-20 second reference clip to have the generations use the same voice.

[–] hok@lemmy.dbzer0.com 15 points 2 weeks ago

On Lemmy, everything is a bit leftist at the moment.

 

I'd like to fine tune a model that does img2img with a text prompt to guide the output. I think img2img-turbo might be the closest to what I'm after, though by default it uses a fixed prompt which can be made variable with some tweaking of the training code.

At the moment I only have access to 24GB VRAM which limits my options. What I'm after is training a model to make specific text-based modifications to images, and I have plenty of before to after images plus the modification text prompts to train on. Worst case, I can try to see if reducing the image size during training makes it possible with my setup.

Are there any other options available today?