LLMs
2023-11-27

Large Language Models popped off with the release of ChatGPT, a service that launched almost exactly a year ago, at the end of November 2022, as a research preview, and quickly took the internet by storm. It gained over a million users in just a few days and reached over 100 million users within a few weeks. LLMs caused a lot of buzz, which definitely accelerated the evolution of AI, and we've seen a lot of developments in that space over just this last year. While there is also a lot of valid criticism regarding AI, and plenty that I could pour out in that direction myself, LLMs sure are a damn useful tool.

In simple terms, an LLM provides text completions by predicting what the next word should be based on what it has seen previously - mainly the given context, but also the training data. Given "The cat sat on the", the model will most likely continue with "mat".

It would be pretty cool if we could run our own local instance of some open-source LLM, right? Unfortunately, there isn't much open source in OpenAI's LLM solutions like ChatGPT, but they will still serve as a helpful example to better understand how it all works. With ChatGPT we use a web frontend to interact with a language model like GPT-3.5 or GPT-4. You can also interact with the available models through the API (an example request follows a few paragraphs down). GPT-3.5 itself is "a series of models" that consists of:

- `code-davinci-002` (the base model),
- `text-davinci-002` (an InstructGPT model based on the previous one),
- `text-davinci-003` (an improvement over the previous one),
- `gpt-3.5-turbo` (another improvement, optimized for chat).

InstructGPT and other chat models are useful because of their fine-tuning towards instruction following and proper AI alignment. After the model has been trained for valid completions, there's an additional training step that teaches it the pattern for interaction - that step is called fine-tuning. With open-source LLMs, people can fine-tune a model to their specific needs. The AI alignment is a preprogrammed bit of context that tells the model how to behave. In the case of ChatGPT, it is the invisible prompt that tells the model to play the role of a helpful chat assistant. All of this together makes LLMs useful for various tasks besides plain text completion (and it's what this software needed to take over the mainstream). Without alignment, the base GPT will not hold a dialogue by itself, but it can still be used successfully for code completion tasks.

Below the interface, whether web UI or API, there is a plethora of models to choose from, and that's the most interesting part. The "deprecations" page in the OpenAI docs lists a whole graveyard of GPT-3 models with the naming convention `[class]-{ada,babbage,curie,davinci}-[suffix]`. They can all behave slightly differently.

Following OpenAI, other tech giants such as G00gle, Micro$$oft, and Faceb00k have been toying with this tech too. A particularly interesting model was LLaMA by Meta AI (Faceb00k). The open-sourcing of LLMs started off with LLaMA, which was leaked on 4chan soon after its release. The source code for inference had already been published, but the leak contained the model's weights (the result of the model's training), which allowed the community to skip the access application process with the corp. It became easier for people to run the model on their own machines and to fine-tune it further. The significantly smaller 13B-parameter LLaMA model was exceeding the 175B-parameter GPT-3 in performance. As with all things open source, advancements were quickly made by the community. This laid the groundwork for the de facto standard platform of open-source LLMs.
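As promised earlier, here's what talking to a model over the API looks like - a minimal sketch of a chat completion request, assuming you have an API key exported as `OPENAI_API_KEY`. Note how the "system" message is that invisible alignment prompt from earlier, just made explicit:

```
$ curl https://api.openai.com/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -d '{
      "model": "gpt-3.5-turbo",
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain LLMs in one sentence."}
      ]
    }'
```

The completion comes back as JSON, with the generated message under `choices[0].message.content`.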
Llama2 was released a few months later; you can learn all about it here: https://ai.meta.com/llama/get-started/

This is how Facebook itself praises the perks of open source:

- Crowdsourced optimization: The open source community has really embraced our models. To date, the community has fine-tuned and released over 7,000 derivatives on Hugging Face. On average, across standard benchmarks, these have improved performance on common benchmarks by nearly 10%, with remarkable improvements of up to 46% for benchmark datasets like TruthfulQA.
- Developer community: There are now over 7,000 projects on GitHub built on or mentioning Llama. New tools, deployment libraries, methods for model evaluation, and even “tiny” versions of Llama are being developed to bring Llama to edge devices and mobile platforms. Additionally, the community has expanded Llama to support larger context windows, added support for additional languages, and so much more.
- Hardware support: The hardware community has fully embraced Llama as a key model architecture. Major hardware platforms AMD, Intel, Nvidia, and Google have boosted the performance of Llama 2 through hardware and software optimizations.

One of the platforms mentioned above is Hugging Face, which hosts a huge library of different models and datasets. Some of the models that I think might be worth checking out are: Mistral, Vicuna, WizardLM. Hugging Face also has a model leaderboard, so you can compare different open-source LLMs and pick something out by yourself: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

Besides fine-tuning the model and feeding it a good prompt, there are still some parameters (like the sampling temperature) that you can change when you run the model - you'll see one of them in an API call near the end of this post. Check out this article for an idea: https://rentry.org/llm-settings

When you do finally run the model, the most important thing for making good use of it is feeding it effective prompts. The model will respond better or worse depending on the way you express yourself. This sparked interest in "prompt engineering" - figuring out how to communicate with ready-made models effectively, to make them do the right job and achieve the desired outcomes. This could be very useful to you as an operator.

https://platform.openai.com/docs/guides/prompt-engineering/
https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/
https://www.promptingguide.ai/
https://learnprompting.org/docs/intro

So, finally, how do you actually run this stuff? With something like ollama (https://ollama.ai/) or LocalAI (https://localai.io/), it's pretty simple and painless. You can easily interact with the LLM from your terminal or through the HTTP API, which is usually similar to the one from OpenAI (so maybe check out their docs too?). Both of these open-source projects have decent documentation. From the website of LocalAI:

> LocalAI acts as a drop-in replacement REST API that’s compatible with OpenAI API specifications for local inferencing. It allows you to run LLMs, generate images, audio (and not only) locally or on-prem with consumer grade hardware, supporting multiple model families that are compatible with the ggml format. Does not require GPU.

An example of installing and running `ollama` with the 'mistral' model:

```
# pacman -S ollama
$ ollama serve &
$ ollama pull mistral
$ ollama run mistral
```

You can also install a local WebUI and plug it into your locally deployed API. There are plenty of open-source options to choose from, for example big-AGI (https://big-agi.com/) or Cheshire Cat (https://cheshirecat.ai/).
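Once `ollama serve` is up, you can also query it over its HTTP API. A minimal sketch with curl (11434 is ollama's default port; the `options` field is where sampling parameters like the temperature from the settings article above go):

```
$ curl http://localhost:11434/api/generate -d '{
    "model": "mistral",
    "prompt": "Why is the sky blue?",
    "stream": false,
    "options": {"temperature": 0.8}
  }'
```

With `"stream": false` the whole completion is returned as a single JSON object instead of being streamed token by token.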
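And since LocalAI mirrors the OpenAI API, the request from the earlier OpenAI example works locally with barely any changes. A sketch, assuming LocalAI is listening on its default port 8080 and you've configured a model under the name `mistral` (both depend on your setup):

```
$ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "mistral",
      "messages": [{"role": "user", "content": "Hello!"}]
    }'
```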
Where to go next? Here are two delightful pieces with a ton of links to other sources:

https://www.borealisai.com/research-blogs/a-high-level-overview-of-large-language-models/
https://flyte.org/blog/getting-started-with-large-language-models-key-things-to-know

No AI was harmed in the making of this blog post.