A series of Open Source Bilingual Kannada-English Large Language Models
In this blog, I am thrilled to share insights into the meticulous approach we took to train Ambari Base and Ambari Instruct. Offering a high-level glimpse into our process, this narrative serves as a precursor to the forthcoming release of the full technical details, the culmination of extensive testing and evaluation. Stay tuned as we unravel the intricacies that led to the creation of Ambari, an innovative open-source bilingual Kannada-English Large Language Model.
In the dynamic landscape of Large Language Models (LLMs), the creation of Ambari stemmed from a multifaceted purpose:
As LLMs increasingly permeate mainstream usage, open-source models, while rich in world knowledge, predominantly emerge from English-centric training. Ambari serves as a pioneering initiative to broaden this scope and adapt LLMs to diverse languages.
In the evolving landscape of LLMs, the demand for vast amounts of training data, ranging from 1 trillion to 10 trillion tokens, has become the norm. However, this poses a challenge for languages with limited documented resources. In our pursuit, we focused on adapting a pre-trained LLM, such as Llama or Mistral, to comprehend the nuances of a new language, Kannada in the case of Ambari. Despite Kannada not being classified as a very low-resource language, it served as an ideal candidate to test our hypothesis and methodologies. Rigorously defining the stages of training and fine-tuning, we set a cap of 1 billion training tokens for the entire process.
Subsequently, we meticulously crafted datasets, distributed them across stages, and delineated the process; each stage is described below.
This deliberate approach laid the foundation for Ambari's development, pushing the boundaries of language adaptability within the realm of LLMs.
Tokenization, a critical component in the efficiency of language models, posed a unique challenge for Kannada text within the context of open-source LLMs. Many existing models inefficiently resort to character-level tokenization for Kannada, especially during inference, impacting overall performance. To address this, we developed a specialized tokenization model for Kannada text using SentencePiece. This model was merged with the base Llama tokenizer, resulting in a comprehensive vocabulary of 49,600 tokens, an expansion of 17,600 over the base Llama vocabulary of 32,000.
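To make the merge step concrete, the sketch below shows one common way to train a Kannada SentencePiece model and fold its new pieces into the base Llama tokenizer. The corpus file, checkpoint name, and vocabulary size here are illustrative assumptions, not the exact values used for Ambari.

```python
# Illustrative sketch only: file names, checkpoint, and vocab sizes are assumptions.
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2
from transformers import LlamaTokenizer

# 1. Train a Kannada-only SentencePiece model on a raw-text corpus (hypothetical file).
spm.SentencePieceTrainer.train(
    input="kannada_corpus.txt",
    model_prefix="kannada_sp",
    vocab_size=20000,
    model_type="bpe",
    character_coverage=1.0,
)

# 2. Load the base Llama tokenizer and the new Kannada model as protobufs.
llama_tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
llama_proto = sp_pb2.ModelProto()
llama_proto.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())

kannada_proto = sp_pb2.ModelProto()
with open("kannada_sp.model", "rb") as f:
    kannada_proto.ParseFromString(f.read())

# 3. Append Kannada pieces that the Llama vocabulary does not already contain.
existing_pieces = {p.piece for p in llama_proto.pieces}
for piece in kannada_proto.pieces:
    if piece.piece not in existing_pieces:
        new_piece = sp_pb2.ModelProto().SentencePiece()
        new_piece.piece = piece.piece
        new_piece.score = 0.0
        llama_proto.pieces.append(new_piece)

# 4. Save the merged model; it can then be loaded as a LlamaTokenizer vocab file.
with open("merged_tokenizer.model", "wb") as f:
    f.write(llama_proto.SerializeToString())
print(f"Merged vocabulary size: {len(llama_proto.pieces)}")
```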
Our approach involved training the tokenizer model on three different dataset sizes, revealing optimal results with a dataset comprising 100,000 tokens. As we evolve Amabri, the upcoming iteration will feature a refined tokenization strategy, employing a reduced vocabulary size of 48,000. This adjustment, validated by insights shared by Andrej Karpathy in his Twitter post (Andrej Karpathy on Twitter), is geared towards enhancing overall efficiency.
With an efficient tokenizer in place, our next crucial step was the pre-training phase, aimed at familiarizing the model with the newly enriched vocabulary. To optimize this process, we curated a comprehensive dataset from diverse sources. Notably, we explored two distinct approaches during this phase: pre-training with LoRA and full-parameter training. This strategic decision stemmed from our desire to discern the optimal path for Ambari's development.
A detailed comparison between these methodologies will be unveiled shortly, but we've gleaned some initial observations:
While we acknowledge that our ongoing testing may refine these observations, this snapshot provides valuable insight into our progress. The pre-training phase ran on two A100 GPUs and took approximately 25 hours for full-weight pre-training on a corpus of 500 million tokens.
It's worth mentioning that the weights of the fully fine-tuned model are now available on Hugging Face🤗 - https://huggingface.co/Cognitive-Lab/Ambari-7B-base-v0.1, contributing to the open-source knowledge sharing within the community.
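For the LoRA variant of continued pre-training, a setup along the following lines is one plausible way to wire things up with transformers and peft; the tokenizer path, corpus file, base checkpoint, and hyperparameters are assumptions for illustration, not the exact Ambari configuration.

```python
# Illustrative sketch only: paths, checkpoint, and hyperparameters are assumptions.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Load the merged Kannada-English tokenizer (hypothetical directory) and the base model.
tokenizer = AutoTokenizer.from_pretrained("merged_tokenizer_dir")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                             torch_dtype=torch.bfloat16)
model.resize_token_embeddings(len(tokenizer))  # account for the expanded vocabulary

# LoRA on the attention projections; embeddings and LM head are trained fully so
# the newly added Kannada tokens receive useful representations.
lora_config = LoraConfig(
    r=64, lora_alpha=128, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Plain-text corpus, tokenized for causal-LM training.
dataset = load_dataset("text", data_files="kannada_english_corpus.txt")["train"]
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True, max_length=2048),
                      remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ambari-pretrain-lora",
                           per_device_train_batch_size=4,
                           gradient_accumulation_steps=16,
                           num_train_epochs=1,
                           learning_rate=2e-4,
                           bf16=True),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```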
This phase, inspired by the OpenHathi series by Sarvam.ai, was an unplanned yet pivotal addition to our training strategy. Creating a dataset of 200,000 tokens, we utilized LoRA for fine-tuning, aiming to equip the model with enhanced language understanding. As we progressed, our focus shifted towards instilling 'world knowledge' in Kannada. Given the scarcity of Kannada content, especially compared to English, we turned to translation. Leveraging IndicTrans2, we translated English content, primarily sourced from Wikipedia, into Kannada. However, instead of conventional monolingual next-token prediction, we introduced bilingual next-token prediction: by alternating sentences between Kannada and English, this method compelled the model to attend cross-lingually during next-token prediction. This nuanced approach not only fostered increased alignment between Kannada and English but also naturally balanced the model's exposure to Kannada and English tokens during training. This stage added an extra layer of sophistication to Ambari's training journey.
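As one plausible reading of this alternating scheme, the sketch below interleaves a document's sentences between the Kannada translation and the original English; the translate() argument is a placeholder for an IndicTrans2 (or similar) call, not the exact Ambari pipeline.

```python
# Illustrative sketch only: one possible interpretation of bilingual next-token prediction.
from typing import Callable, List

def make_bilingual_document(english_sentences: List[str],
                            translate: Callable[[str], str]) -> str:
    """Alternate Kannada (translated) and original English sentences in one document,
    so predicting the next token forces the model to attend across languages."""
    rendered = []
    for i, sentence in enumerate(english_sentences):
        rendered.append(translate(sentence) if i % 2 == 0 else sentence)
    return " ".join(rendered)

# Usage with a dummy translator that just tags sentences (for illustration only):
doc = make_bilingual_document(
    ["The sun rises in the east.", "It sets in the west."],
    translate=lambda s: f"[KN] {s}",
)
print(doc)
```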
The intention behind this phase was to establish a coherent relationship between English tokens and their corresponding Kannada tokens. Employing low-rank adaptation for fine-tuning, we encountered some challenges, notably our decision to use a very low LoRA rank, which proved less effective. With a dataset of 100,000 tokens, this stage had clear limitations, and we acknowledge the need for improvements. As we refine this aspect of the training process, our commitment to enhancing the bilingual capabilities of Ambari remains unwavering.
In this pivotal stage, we employed supervised fine-tuning with low-rank adaptation to mold the model's responsiveness. Embracing a chat template structure consisting of user prompts/instructions and corresponding responses, we ventured into the realm of Bilingual Instruct Fine-tuning. This approach involved training the model to adeptly respond in either English or Kannada based on the language specified in the user prompt or instruction.
<|user|>{user prompt / instruction}<|endoftext|><|assistant|>{response}<|endoftext|>
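A minimal sketch of rendering a single (instruction, response) pair into this template; the example record itself is hypothetical.

```python
def format_example(instruction: str, response: str) -> str:
    """Render one (instruction, response) pair into the SFT chat template above."""
    return (f"<|user|>{instruction}<|endoftext|>"
            f"<|assistant|>{response}<|endoftext|>")

# Hypothetical example: English instruction, Kannada response.
print(format_example("Translate 'Good morning' to Kannada.", "ಶುಭೋದಯ"))
```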
For instance, given a user prompt/instruction like:
1. Pay attention before studying: prepare your classes ahead of time for better teaching.
2. Develop a study strategy: make a manageable plan by creating a to-do list at a specific time each day.
3. Be sure to take regular breaks or exercise time to avoid confusion.
4. Get extra help: consider a change in contact with your teachers, classmates, family members or friends.
5. Collect information: learn from experience and make a list of reminders after class.
6. Work to enjoy reading: read a variety of books to maintain a balance that is incredibly easy to read.
7. Familiarize yourself with realism: provide a lesson that supports rationality and be sure to understand what its results imply.
8. Choose ways to explore content: participate in community conversations with other students to start meaningful discussions, and learn in-depth about all the topics you need in your educational setting using any special learning materials available in your classroom.
9. Allow for greater concentration: spend time on research because it is automatic but not complicated.
10. Be organized: divide your time into a few subjects across multiple disciplines and learn to challenge yourself to complete tasks.
the model seamlessly generates a response in Kannada, maintaining linguistic coherence. To enrich the training process, we amalgamated various instruction datasets, including Alpaca Instruct, Dolly Instruct, and more. Leveraging translation APIs such as Google, Azure, and a custom deployment of the IndicTrans2 model from ai4bharat, we crafted a comprehensive bilingual instruct dataset.
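For the IndicTrans2 route specifically, a translation helper could look roughly like the sketch below, assuming the publicly released ai4bharat/indictrans2-en-indic-1B checkpoint and its companion IndicProcessor preprocessing utility; our own custom deployment may differ in detail.

```python
# Illustrative sketch only: assumes the public IndicTrans2 checkpoint and toolkit.
from IndicTransToolkit import IndicProcessor
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "ai4bharat/indictrans2-en-indic-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, trust_remote_code=True)
ip = IndicProcessor(inference=True)

def translate_to_kannada(sentences):
    """Translate a batch of English sentences into Kannada."""
    batch = ip.preprocess_batch(sentences, src_lang="eng_Latn", tgt_lang="kan_Knda")
    inputs = tokenizer(batch, padding="longest", truncation=True, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=256, num_beams=5)
    decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return ip.postprocess_batch(decoded, lang="kan_Knda")

print(translate_to_kannada(["Develop a study strategy."]))
```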
The dataset, now publicly available on Hugging Face here, encompasses diverse linguistic scenarios. During training, we implemented supervised fine-tuning with four distinct representations:
This meticulous approach not only familiarized the model with responding in different languages but also laid the groundwork for mastering various cross-lingual tasks.
The weights of this fine-tuned model are accessible on Hugging Face🤗 - https://huggingface.co/Cognitive-Lab/Ambari-7B-Instruct-v0.1, and for a hands-on experience, you can explore the 4-bit quantized version on chat.annyai.tech. (It's hosted on a T4 instance, so inference will be slow.)
In the culminating phase of our model refinement, we delved into Direct Preference Optimization (DPO). This strategic choice, inspired by the success observed in various open-source models, aimed not only to align our model but also to drive improvements in benchmarks. For this experiment, we leveraged the Anthropic/hh-rlhf dataset: after translating it to Kannada, we subjected the model to DPO fine-tuning, and the resulting model is currently undergoing a comprehensive evaluation to gauge its performance impact.
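As a rough illustration of this step, the sketch below uses the trl library's DPOTrainer on a preference dataset already reshaped into prompt/chosen/rejected records (for example, a Kannada translation of Anthropic/hh-rlhf). The file name and hyperparameters are placeholders, and argument names vary somewhat across trl releases.

```python
# Illustrative sketch only: dataset path and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Cognitive-Lab/Ambari-7B-Instruct-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Expected columns: "prompt", "chosen", "rejected" (hypothetical translated file).
train_dataset = load_dataset("json", data_files="hh_rlhf_kannada.json")["train"]

training_args = DPOConfig(
    output_dir="ambari-7b-dpo",
    beta=0.1,                       # strength of the preference regularization
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,                    # a frozen reference copy is created internally
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,     # named `tokenizer=` in older trl versions
)
trainer.train()
```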
The following are the benchmarks comparing Ambari with other Indic models.
Scope of Improvement
Our journey has been shaped by inspiration drawn from impactful projects. Special mention goes to the OpenHathi series by Sarvam.ai and the illuminating Tamil Llama project by Abhinand Balachandran. Their contributions have been instrumental in steering the course of our own endeavours.