How To Deploy Mistral 7B
November 21, 2023
Your users deserve stability
In the wake of the OpenAI leadership shakeup, many AI teams close to us are wondering if they should be exploring alternatives outside of GPT-4 and managed APIs.
For teams who depend on API providers to serve their product, diversifying to open source models can be very tempting, since owning the API can help mitigate business risk and keep your AI business thriving.
We’re no strangers to this, and are here to help!
Banana is a serverless GPU platform for inference, and a great place to deploy and host open source LLMs as your own APIs. Many teams graduating from third-party APIs find Banana an easy stepping stone, because we take care of the infrastructure and autoscaling while you fully own your API code and model weights.
Mistral 7B is often considered the best open source alternative to OpenAI’s GPT-3.5 and GPT-4. Today we’re going to show you how to deploy it.
Why Mistral 7B
- The Mistral 7B model outperforms Llama 2 13B across benchmarks and approaches CodeLlama 7B performance on code.
- It uses Grouped-Query Attention (GQA) for faster inference and Sliding Window Attention (SWA) for handling longer sequences efficiently (see the snippet after this list).
- It ships under the Apache 2.0 license, an extremely permissive license that allows deployment across various platforms, including local setups and cloud services like Banana.
- It’s quite easy to fine-tune. (Excellent blog here, by the way!)
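If you want to verify those architecture details yourself, they’re visible right in the model’s Hugging Face config. A minimal sketch, assuming you have transformers installed and can reach the mistralai/Mistral-7B-v0.1 repo:

```python
# Inspect Mistral 7B's attention setup straight from its Hugging Face config.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")

print(cfg.sliding_window)       # 4096: SWA attends to a 4096-token window per layer
print(cfg.num_attention_heads)  # 32 query heads
print(cfg.num_key_value_heads)  # 8 KV heads: GQA shares each KV head across query heads
```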
Deploying to Banana
Cover the Requirements
You’ll need:
- python3
- a GitHub account
Optional, but helpful for testing:
- docker
- a GPU
Optional, but healthy:
- a banana (the fruit, for eating. Full of potassium!)
Create your GitHub Repo
Banana works off of GitHub and Docker: whenever you push to “main” in your GitHub repo, Banana builds the Dockerfile in the repo and deploys it to a live production endpoint.
This makes it easy to get Mistral running. We’ve already implemented the Mistral 7B model for you, using TheBloke’s quantized version from Hugging Face, so if you’re not fine-tuning you can simply fork the repo and deploy it without modification!
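Before the step-by-step, here’s the rough shape of the kind of Dockerfile Banana builds on push. This is an illustrative sketch, not the demo repo’s exact file; the base image and file names are assumptions:

```dockerfile
# Illustrative sketch -- not the demo repo's exact Dockerfile.
FROM pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime

WORKDIR /app

# Install dependencies first so Docker caches this layer between builds
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .

# Pre-cache the Mistral weights into the image at build time
RUN python3 download.py

# Start the inference server
CMD ["python3", "app.py"]
```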
- Go to the GitHub repo: Mistral 7B Demo
- Click “Use this template” to create a copy of it in your account
Create a Banana Account
Open up app.banana.dev to set up your account.
Then click “Deploy from Github” in the sidebar, and then “Manage Github Integration” in the top right.
This will direct you into the GitHub App, where you will authorize Banana to connect to your repo. You’re free to install the app for your whole GitHub org, or scope it to the specific repo you just created.
Deploy this new repo
You should now see the repo you created in Banana. Click “Deploy” to build and deploy it.
It will build from the Dockerfile. During this stage, the download.py script is run to pre-cache the Mistral weights into the container.
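If you’re curious what that pre-cache step amounts to, it’s just a script that pulls the weights at build time so cold boots never have to download them. A minimal sketch, assuming the huggingface_hub client; the model ID is an assumption, since the demo repo pins its own quantized variant:

```python
# download.py -- sketch of a build-time weight pre-cache script.
from huggingface_hub import snapshot_download

if __name__ == "__main__":
    # Pull every file in the repo into the local Hugging Face cache,
    # so the baked image never downloads weights at cold boot.
    # Model ID is an assumption; the demo repo pins its own variant.
    snapshot_download("TheBloke/Mistral-7B-Instruct-v0.1-GPTQ")
```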
While we wait for this to build and deploy, let’s familiarize ourselves with the code.
Serverside
- app.py is the primary home of your app.
- It’s built with Banana’s serving framework, Potassium.
- We load the Mistral model in the init() block to keep it warm in memory for subsequent calls.
- In the http handler() block, we implement inference with HTTP response streaming, yielding each generated token as a JSON object (a minimal sketch of the app structure follows this list).
- Dockerfile is the Docker build file for your app. This is ultimately what’s built when you push to your repo, so you can customize the environment as much as you want.
- download.py is a helper script for your build, pre-caching the Mistral weights into the image.
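Here’s a minimal sketch of that structure, following Potassium’s documented init/handler pattern. The model ID and loading code are assumptions, and the token-streaming logic in the demo repo is simplified to a single response here; treat this as the shape of app.py, not its exact contents:

```python
# app.py -- minimal Potassium app sketch (not the demo repo's exact code).
from potassium import Potassium, Request, Response
from transformers import AutoModelForCausalLM, AutoTokenizer

app = Potassium("mistral-7b")

@app.init
def init():
    # Runs once at startup; whatever it returns stays warm in memory.
    model_id = "TheBloke/Mistral-7B-Instruct-v0.1-GPTQ"  # assumed variant
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    return {"model": model, "tokenizer": tokenizer}

@app.handler()
def handler(context: dict, request: Request) -> Response:
    # Runs on every call, reusing the warm model from init().
    # The real app yields tokens as they generate; this returns one blob.
    tokenizer, model = context["tokenizer"], context["model"]
    inputs = tokenizer(request.json["prompt"], return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256)
    text = tokenizer.decode(output[0], skip_special_tokens=True)
    return Response(json={"output": text}, status=200)

if __name__ == "__main__":
    app.serve()
```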
Clientside
- client.js is a JavaScript example of how to receive the stream from the server.
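client.js is the reference implementation, but the same call works from any HTTP client that can read a chunked response. For illustration, here’s a rough Python equivalent; the header name, payload shape, and URL format are assumptions, so treat client.js as the source of truth:

```python
# Rough Python equivalent of client.js -- header name, payload shape,
# and URL format are assumptions; check client.js for the exact contract.
import requests

PROJECT_URL = "https://YOUR-PROJECT-URL"  # from your Banana dashboard
API_KEY = "YOUR-API-KEY"                  # also on the dashboard

with requests.post(
    PROJECT_URL,
    headers={"X-Banana-API-Key": API_KEY},  # assumed header name
    json={"prompt": "Tell me about bananas."},
    stream=True,  # keep the connection open and read tokens as they arrive
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:
            print(line.decode("utf-8"))  # each chunk is a JSON-encoded token
```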
Call it in production!
Your project should now be built and live on your Banana dashboard.
To call it, we’ll be using client.js from the repo. You’ll need to make two updates to it:
- Add your Project URL in the URL field of the options
- Add your Banana API key (found on the Dashboard)
And you’ll need to install the package requirements:
```bash
npm install
```
And finally, we call the production endpoint:
```bash
node client.js
```
On your first call you should see a cold boot. About 2-3 minutes is expected for a model of this size, though there are active projects at Banana to drive these times down significantly.
Once the replica is warm, it will start streaming its response back to the client.
Notice that if you keep calling the replica, it will be warm and ready to go for you, finishing the whole stream response within a few seconds. The replica will stay warm for a minute of idle time before shutting down, but you're free to configure this in your project's Settings tab.
That’s all folks!
On Banana, you’re in full control of the code being run. By deploying your first Mistral 7B model, you’re taking a huge step into the world of self-hosted open source models, and toward owning your own destiny.
But the journey is never done! Deploying models is a constant iteration of tweaks and fine-tunes. Thankfully, you’ve got a GitHub repo to keep iterating on, so go wild, and let us know if you need any more support.
You can reach me at [email protected], or join our Discord for community support.