March 2, 2024
Let me start by saying that I am no expert in this field whatsoever. This is my experience, documented mostly for my own convenience, of fine-tuning an LLM for constrained output generation using a predefined JSON schema.
First of all, I would like to highlight some of the resources I have referred to. I believe you will understand my point of view better if you read them yourself.
I am not going to delve into which fine-tuning options are available or what their upsides and downsides are, since there are already resources out there that can help you with that.
Let me tell you what I found convenient. Hugging Face's TRL library is really powerful and abstracts away a lot, which is convenient because the developer can focus on what matters most: the generation quality of the responses. I found QLoRA to be more efficient than LoRA, and in my experience it also gave me better results.
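To make the QLoRA part a bit more concrete, here is a minimal sketch of the two config objects such a setup usually needs. Everything numeric here is an assumption for illustration (rank 16, alpha 32, dropout 0.05, NF4 quantization, bfloat16 compute), not a tuned setting, and build_qlora_configs is a hypothetical helper name.

```python
# Keyword arguments kept as plain dicts so they are easy to inspect and tweak.
# All values are illustrative assumptions, not tuned settings.
BNB_KWARGS = {
    "load_in_4bit": True,              # QLoRA: load the frozen base model in 4-bit
    "bnb_4bit_quant_type": "nf4",      # NormalFloat4 quantization
    "bnb_4bit_use_double_quant": True, # also quantize the quantization constants
}
LORA_KWARGS = {
    "r": 16,                  # adapter rank (assumed)
    "lora_alpha": 32,         # adapter scaling factor (assumed)
    "lora_dropout": 0.05,     # assumed
    "bias": "none",
    "task_type": "CAUSAL_LM",
}


def build_qlora_configs():
    """Return (BitsAndBytesConfig, LoraConfig); needs torch, transformers and peft."""
    # Imported lazily so the dicts above can be inspected without a GPU stack.
    import torch
    from transformers import BitsAndBytesConfig
    from peft import LoraConfig

    bnb_config = BitsAndBytesConfig(bnb_4bit_compute_dtype=torch.bfloat16, **BNB_KWARGS)
    lora_config = LoraConfig(**LORA_KWARGS)
    return bnb_config, lora_config
```

The first object would typically be passed as quantization_config when loading the base model with from_pretrained, and the second as peft_config to TRL's SFTTrainer. Check the current documentation for both, since these libraries change quickly.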
The most inconvenient part was how different CUDA versions messed with the fine-tuning pipeline, so make sure your CUDA version is at least 12.1. By the time you read this, the dependency packages might have been updated as well, so it is a safe bet to read their documentation and check the Hugging Face forum to see whether anyone else has run into issues before proceeding. If you don't, I guarantee you are going to waste at least half a day trying to find out what went wrong.
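A quick sanity check before starting a run can save some of that half day. The sketch below is a minimal one, assuming PyTorch is the framework in use. The helper names cuda_at_least and torch_cuda_version are made up for this example. Note that torch.version.cuda reports the CUDA version PyTorch was built against, which may differ from the driver or toolkit installed on the machine.

```python
def cuda_at_least(version, minimum=(12, 1)):
    """Return True if a CUDA version string such as '12.1' meets the minimum."""
    if not version:  # None or empty string: CPU-only build or CUDA not found
        return False
    parts = [int(part) for part in version.split(".")[:2]]
    parts += [0] * (2 - len(parts))  # treat "12" as "12.0"
    return tuple(parts) >= tuple(minimum)


def torch_cuda_version():
    """Return the CUDA version PyTorch was built against, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda
```

Calling cuda_at_least(torch_cuda_version()) at the top of a training script and stopping early when it returns False is one way to fail fast instead of debugging a half-broken pipeline.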
Most online resources recommend setting the EOS token as the padding token, but this did not give me constrained generation: the model kept on generating JSON objects even after emitting the EOS token. A likely reason is that padding positions are usually masked out of the training loss, so with EOS doubling as padding the model may never be trained to stop. What worked for me was setting the UNK token as the padding token instead. If you have no idea what these tokens are, just have a look at the special_tokens_map.json file.
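In code, the change described above is tiny. This is a minimal sketch, assuming a Hugging Face style tokenizer object that exposes unk_token and pad_token attributes and a local model directory containing special_tokens_map.json. The helper names are hypothetical.

```python
import json
from pathlib import Path


def show_special_tokens(model_dir):
    """Load special_tokens_map.json from a local model directory as a dict."""
    path = Path(model_dir) / "special_tokens_map.json"
    return json.loads(path.read_text(encoding="utf-8"))


def use_unk_as_pad(tokenizer):
    """Point the padding token at UNK instead of EOS."""
    if tokenizer.unk_token is None:
        raise ValueError("this tokenizer defines no UNK token")
    tokenizer.pad_token = tokenizer.unk_token
    return tokenizer
```

With a real tokenizer loaded through AutoTokenizer.from_pretrained, calling use_unk_as_pad(tokenizer) before building the training dataset should be enough. Not every model defines an UNK token, which is why the helper checks for it first.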
Now you might ask why I wanted to fine-tune the model to generate responses according to a certain schema. It is just that I did not know at the time that there were other methods of obtaining JSON outputs from models, such as grammars. I did find out about jsonformer, but it gave me little to no flexibility.
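Whichever method produces the JSON, it helps to check the output on the way out. The sketch below is a minimal one using only the standard library: it decodes the first JSON object in the generated text, ignores anything the model keeps generating after it, and checks a few required fields. The field names question, options and answer are a hypothetical schema made up for this example, not the schema from my project.

```python
import json

REQUIRED_KEYS = {"question", "options", "answer"}  # hypothetical schema fields


def first_json_object(text):
    """Decode the first JSON object in text, ignoring anything after it."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    # raw_decode stops at the end of the first complete JSON value.
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


def is_valid_mcq(obj):
    """Check the decoded object against the assumed MCQ schema."""
    return (
        isinstance(obj, dict)
        and REQUIRED_KEYS.issubset(obj)
        and isinstance(obj["options"], list)
        and obj["answer"] in obj["options"]
    )
```

A full JSON Schema validator would be more thorough, but for a small fixed schema a check like this is often enough to catch malformed generations.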
Frankly, I was hoping to fine-tune for at least 3 epochs, and I did, but I found that the model loses the ability to perform constrained generation once training goes beyond a single epoch. Unfortunately, I have no idea what the issue might be, so I settled on fine-tuning for a single epoch. Then again, this is just fine-tuning, and a single epoch is more than enough for a simple fine-tune.
Now let's get to the juicy part. The model was able to analyze a given text and generate reasonable and sound MCQs. Give it any type of text and you will receive a question, although how good it is depends on how context-dependent the question is. For example, the model does not know how to generalize the context before generating a question; it simply generates from whatever was given as the context. Another interesting fact is that the questions were deterministic, which means that entering the same text would yield the same response. I therefore tried to fiddle with generation parameters like temperature and top_k, but to my surprise these changes made the model's responses inaccurate or just gibberish.
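To illustrate why the same input kept giving the same question, and what temperature and top_k actually change, here is a toy sketch in plain Python. It assumes the default generation was greedy decoding, which I have not verified for my setup, and it is a simplified illustration of the idea rather than the library's implementation. The function names are made up for this example.

```python
import math
import random


def greedy(logits):
    """Always pick the highest-scoring token index: same input, same output."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_top_k(logits, temperature=1.0, top_k=50, rng=random):
    """Sample a token index from the top_k logits after temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep only the k highest-scoring candidates.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Higher temperature flattens the distribution, lower sharpens it.
    scaled = [logits[i] / temperature for i in top]
    # Softmax weights, with the max subtracted for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]
```

With sampling, lower-probability tokens can be picked at any step. For a format as strict as JSON, a single wrong token such as a missing quote or brace can derail everything after it, which may be part of why sampling hurt the outputs so much.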
I was able to use the AWQ algorithm to quantize the model and speed up inference significantly. However, AWQ also requires specific CUDA versions, specifically something above 12.1. The model could handle batches of up to 8 inputs while still generating accurate responses, but since that has a large memory footprint I decided to generate one response at a time.
The model was hosted on Runpod as a RESTful API endpoint. Runpod provides a customizable Docker template that abstracts away the endpoint implementation and leaves you with a single handler function for handling requests.
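The handler pattern looks roughly like the sketch below. It is a minimal version under assumptions: the input field name text and the generate_mcq placeholder are hypothetical stand-ins, and the real handler would call the quantized model instead. The runpod import is kept inside main so the handler can be tested locally without the SDK installed.

```python
def generate_mcq(text):
    """Hypothetical placeholder for the fine-tuned model's inference call."""
    return {"question": f"Placeholder question about: {text[:40]}"}


def handler(job):
    """Handle one request; the request payload arrives under job['input']."""
    text = (job.get("input") or {}).get("text")
    if not text:
        return {"error": "input.text is required"}
    return {"mcq": generate_mcq(text)}


def main():
    """Start the serverless worker; call this from the container's entry script."""
    # Imported lazily so handler() can be exercised without the Runpod SDK.
    import runpod

    runpod.serverless.start({"handler": handler})
```

Keeping the handler a plain function that takes a dict and returns a dict makes it easy to test on a laptop before paying for GPU time.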
Since this was my first fine-tuning attempt, I thoroughly enjoyed it. I can tell you what wasn't enjoyable, which is the fact that I am GPU poor.
My source code is available in the GitHub repository; I hope it can be of value to you.
Thanks for reading. See you in the next one.