I’m currently shopping around for an alternative to ollama, partly because I want something a bit faster and partly because I could not get it to use a different context and output length, which seems to be a known and long-ignored issue. But everything I’ve tried so far has been missing one or more critical features, like:
- “Hot” model replacement, i.e. loading and unloading models on demand
- Function calling
- Support for most models
- OpenAI API compatibility (to work well with Open WebUI)
I’d be happy to hear any recommendations!
Are you using a tiny model (1.5B-7B parameters)? ollama pulls a 4-bit quant by default. It looks like vllm does not use quantized models by default, so that is likely the difference. Tiny models are impacted more by quantization.
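If you want to rule quantization out, something along these lines should work (the model and tag names here are just examples; which quant tags are actually published varies per model, so check the ollama library page):

```
# Show what the locally pulled model is using; ollama show prints the
# quantization level along with the context length and other details
ollama show llama3.1:8b

# Pull an explicitly higher-precision quant to compare against the
# default ~4-bit one
ollama pull llama3.1:8b-instruct-q8_0
ollama run llama3.1:8b-instruct-q8_0
```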
I have no problems with changing num_ctx or num_predict
It was multiple models, mainly 32-70B
Can you try setting num_ctx and num_predict using a Modelfile with ollama? https://github.com/ollama/ollama/blob/main/docs/modelfile.md#parameter
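Something like this, for example (the FROM line is just a placeholder, use whatever base model you pulled; the parameter names are the ones from the linked docs):

```
# Modelfile: derive a variant with a larger context window and output limit
FROM llama3.1:8b
PARAMETER num_ctx 16384
PARAMETER num_predict 2048
```

Then build and run it with:

```
ollama create llama3.1-longctx -f Modelfile
ollama run llama3.1-longctx
```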
I’ve read about this method in the GitHub issues, but to me it seemed impractical to maintain separate models just to change the context size, and that was the point where I started looking for alternatives.
If that bothers you, you can overwrite the existing model by reusing the same name instead of creating one with a new name. Either way, there is no duplication of the LLM model file.
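For example, assuming the base model you pulled is llama3.1:8b, recreating it under the original name just replaces the stored parameters on top of the same weights:

```
# Reuses the already-downloaded weights; only the parameter layer changes
ollama create llama3.1:8b -f Modelfile
```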