Birth of Falcon 40B - Capable, Open-Source, Permissively Licensed LLM from the UAE. Emerging AI Superpower?!

What if GPT-4 became open source tomorrow? Integrating AI systems into your workflow would be cheaper and smoother. For now, Sam Altman has no such plan. But the UAE is here with its first large language model, Falcon 40B.

In this article, we dig into what you should know about Falcon 40B, from the meaning of "40B" to its training process and model architecture. So, let's jump in.

What is actually Falcon 40B?

The Technology Innovation Institute (TII) of the United Arab Emirates has developed an open-source LLM called Falcon 40B. At the time of its release, it topped Hugging Face's Open LLM Leaderboard, making it the best-ranked open-source and royalty-free LLM at that point. The "40B" in its name refers to its 40 billion parameters; the model was trained on a dataset of 1 trillion tokens.
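To put the 40 billion parameters in perspective, here is a back-of-the-envelope sketch of how much memory the weights alone would occupy at different numeric precisions. The bytes-per-parameter figures are standard for these formats, but the helper name `weight_memory_gb` is made up for illustration, and the estimate ignores activations and any serving overhead.

```python
# Rough memory needed just to hold the model weights.
PARAMS = 40e9  # "40B" = about 40 billion parameters

# Standard storage sizes for common numeric formats.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype):
    """Return the weight footprint in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(PARAMS, dtype):.0f} GB")
```

Even at 16-bit precision the weights come to roughly 80 GB, which is why a model of this size is usually spread across several GPUs.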

Amazon SageMaker JumpStart offers Falcon 40B in AWS Regions across North America, Asia Pacific (Tokyo, Seoul, and Mumbai), and Europe (London, Frankfurt, and Ireland), with availability in additional AWS Regions expected.

Behind the Scenes: TII and AWS

Falcon LLM, a large language model with 40 billion parameters, was released by the UAE's Technology Innovation Institute (TII), the applied research pillar of Abu Dhabi's Advanced Technology Research Council. TII is an international research institute that aims to push the frontiers of knowledge. Its scientists, researchers, and engineers work on cutting-edge research and technologies intended to future-proof society.

Many companies, such as Stability AI, AI21 Labs, Hugging Face, and LG AI, use Amazon SageMaker, AWS's end-to-end machine learning (ML) service, to build, train, and deploy their LLMs because it improves performance and reduces costs.

That's why TII used Amazon SageMaker to train Falcon 40B. SageMaker is a fully managed service, so TII didn't have to maintain its own ML infrastructure and could concentrate on building custom training mechanisms and optimizations. To cut training costs and shorten time to market, TII implemented several optimizations, one of which was a custom matrix multiplication routine. To keep training running smoothly and minimize interruptions that needed developer attention, AWS worked closely with TII to improve training resilience on SageMaker.

In May 2023, TII released Falcon 40B under the Apache 2.0 license. Since its debut, Falcon 40B has shown superior performance over contemporaneous open models without requiring fine-tuning.

Falcon vs LLaMA 2

Falcon and LLaMA 2 are two of the most talked-about LLMs right now. We will compare the two models' size, training data, capabilities, and performance here.

Falcon vs Dolly 2.0

Dolly 2.0 is an open-source, instruction-following LLM that offers ChatGPT-like interactivity. How does it stack up against Falcon 40B? Let's take a look.

Model Architecture and Objective

Falcon 40B is a transformer-based, decoder-only model trained with the causal language modeling objective: predicting the next token in a sequence of tokens. Its architecture is broadly adapted from GPT-3's, with a few key modifications.
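The next-token objective can be made concrete with a small sketch. This is not Falcon's training code; it is a minimal pure-Python illustration of the loss a causal language model minimizes, and the function names and toy inputs are assumptions chosen for the example.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of vocabulary scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def causal_lm_loss(logits_per_position, token_ids):
    """Average cross-entropy of predicting token t+1 from position t.

    logits_per_position[t] holds one score per vocabulary entry, produced
    after the model has seen tokens 0..t (and nothing later).
    """
    losses = []
    for t in range(len(token_ids) - 1):
        log_probs = log_softmax(logits_per_position[t])
        losses.append(-log_probs[token_ids[t + 1]])
    return sum(losses) / len(losses)


# Toy check: a model that scores every token equally has loss log(vocab_size).
vocab_size = 4
tokens = [0, 2, 1, 3]
uniform_logits = [[0.0] * vocab_size for _ in tokens]
loss = causal_lm_loss(uniform_logits, tokens)
print(loss, math.log(vocab_size))
```

Training pushes this loss below the uniform baseline by assigning higher scores to the tokens that actually come next.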

FlashAttention

FlashAttention is a key feature of Falcon 40B's architecture. Self-attention allows the model to weigh the importance of different tokens within a sequence when processing and generating text. FlashAttention does not change what attention computes; it is an exact, memory-efficient implementation that processes the attention computation in blocks, cutting reads and writes to GPU memory. The result is faster computation and lower memory use than a standard attention implementation. According to its model card, Falcon pairs FlashAttention with multi-query attention, in which attention heads share keys and values.
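The core numerical idea behind block-wise attention can be sketched in a few lines. The real FlashAttention is a fused GPU kernel, so the code below is only a pure-Python illustration, for a single query vector, of how a softmax-weighted sum can be accumulated one block of keys and values at a time without ever holding all scores at once. The function names and the block size are assumptions made for this example.

```python
import math


def naive_attention(q, keys, values):
    """Reference: compute all scores first, then one softmax over them."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def tiled_attention(q, keys, values, block_size=2):
    """Same result, but keys/values are visited block by block."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # softmax denominator under running_max
    acc = [0.0] * len(values[0])  # unnormalized weighted sum of values
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the old maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.1, 0.2, 0.3, 0.4]
keys = [[1.0, 0.0, 0.5, 0.2], [0.3, 0.8, 0.1, 0.9], [0.7, 0.7, 0.2, 0.1],
        [0.0, 0.4, 0.9, 0.6], [0.5, 0.5, 0.5, 0.5]]
values = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0],
          [1.5, 2.5, 3.5], [0.0, 1.0, 0.0]]
out_naive = naive_attention(q, keys, values)
out_tiled = tiled_attention(q, keys, values)
print(out_naive)
print(out_tiled)
```

Both functions return the same vector up to floating-point rounding, which is the sense in which block-wise attention is exact rather than an approximation.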

Positional Embeddings

Positional embeddings play a crucial role in language models because they capture the order of tokens in a text. In Falcon 40B, they represent each token's position within the input sequence, which helps the model understand context and the relationships between words. GPT-3 uses learned absolute positional embeddings, whereas Falcon 40B uses rotary positional embeddings (RoPE), which encode position by rotating the query and key vectors so that attention scores depend on the relative distance between tokens.
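Falcon's model card lists rotary positional embeddings as its positional scheme. The sketch below is a minimal pure-Python illustration of the rotary idea, not Falcon's implementation: it rotates adjacent pairs of vector components by an angle proportional to the token position. The pairing convention, the base of 10000, and the function names are assumptions taken from the common formulation of the technique.

```python
import math


def rotate(vec, position, base=10000.0):
    """Apply a rotary position embedding to an even-length vector."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Each pair of components rotates at its own frequency.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -0.1, 0.8, 0.5]
k = [0.7, 0.2, -0.4, 0.9]

# Both pairs of positions are 2 apart, so the scores should match.
score_near = dot(rotate(q, 5), rotate(k, 3))
score_far = dot(rotate(q, 12), rotate(k, 10))
print(score_near, score_far)
```

The two scores agree up to rounding because the rotation makes the query-key dot product depend only on how far apart the two tokens are, not on where they sit in the sequence.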

Decoder Blocks

Decoder blocks are fundamental to transformer-based language models like GPT-3 and Falcon 40B. Each block combines self-attention, a feed-forward neural network, and layer normalization, and the stacked blocks process the input sequence to produce the output. Falcon 40B's decoder block differs from GPT-3's: according to its model card, the attention and feed-forward (MLP) sub-layers run in parallel on the same block input rather than one after the other, with two layer norms per block.
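Falcon's model card describes its decoder block as running attention and the MLP in parallel. To show what that means structurally, the sketch below contrasts a sequential residual block with a parallel one. The `attn` and `mlp` arguments are toy stand-ins, not real sub-layers, and the layer norm has no learnable parameters, so this only illustrates the wiring under those stated assumptions.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add(*vectors):
    """Element-wise sum of any number of equal-length vectors."""
    return [sum(vals) for vals in zip(*vectors)]


def sequential_block(x, attn, mlp):
    # GPT-style wiring: the MLP sees the output of the attention step.
    h = add(x, attn(layer_norm(x)))
    return add(h, mlp(layer_norm(h)))


def parallel_block(x, attn, mlp):
    # Parallel wiring: both sub-layers read the same block input.
    # A real block would use two separately parameterized layer norms here.
    return add(x, attn(layer_norm(x)), mlp(layer_norm(x)))


# Toy stand-ins for the real sub-layers (hypothetical).
toy_attn = lambda h: [0.5 * v for v in h]
toy_mlp = lambda h: [max(0.0, v) for v in h]

x = [1.0, 2.0, 3.0]
print(sequential_block(x, toy_attn, toy_mlp))
print(parallel_block(x, toy_attn, toy_mlp))
```

Because the two sub-layers in the parallel form do not depend on each other, they can be computed at the same time, which is the usual motivation for this layout.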

How was Falcon trained?

Before we look at Falcon's training process, let's understand why you should care about it at all.

Transparency

We can better comprehend an AI model's strengths, shortcomings, biases, and limits if we know how it was trained. Because of this openness, we can analyze the model's dependability and make informed judgments regarding its uses.

Accountability

When AI systems are deployed in vital areas such as healthcare, finance, or criminal justice, they must be rigorously trained. Understanding the training process helps establish responsibility and ensures that possible biases or inaccuracies are discovered and corrected.

Ethical considerations

Training AI models requires collecting and using large datasets. Knowing how a model was trained allows us to assess the ethical implications of that data collection, such as privacy, consent, and fairness.

Reproducibility

With detailed information about the training process, researchers and practitioners can reproduce the training of an AI model. Reproducibility is an essential feature of scientific study since it allows for verification, improvement, and peer review.

Progress in the field

Sharing training methodology information promotes cooperation and the creation of more robust and effective AI models. Researchers may progress by studying various training strategies and building on prior work.

Here is how Falcon was trained:

Dataset Collection

The initial step involved collecting a large amount of publicly available data from the web. CommonCrawl, a project that regularly crawls the web and stores its contents, was used to gather this data.

Filtering Phase

The collected dataset underwent a thorough filtering phase to ensure high-quality training data. Various filtering techniques were applied during this phase to remove machine-generated text and adult content. Additionally, any duplicate content was eliminated to create a cleaner dataset.
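A filtering phase like this can be pictured with a small sketch. The actual pipeline is far more elaborate, so treat the following as an illustration only: it drops very short documents, drops documents containing a blocked phrase, and removes exact duplicates after normalizing case and whitespace. The blocklist, the `min_words` threshold, and the function names are all hypothetical.

```python
import hashlib

# Hypothetical stand-in for a real content blocklist.
BLOCKED_TERMS = {"lorem ipsum"}


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def filter_and_deduplicate(documents, min_words=5):
    seen = set()
    kept = []
    for doc in documents:
        norm = normalize(doc)
        if len(norm.split()) < min_words:
            continue  # too short to be useful training text
        if any(term in norm for term in BLOCKED_TERMS):
            continue  # matches the blocklist
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of a document already kept
        seen.add(digest)
        kept.append(doc)
    return kept


docs = [
    "Falcon is a large language model trained by TII.",
    "falcon is a   large language model trained by TII.",  # duplicate
    "Too short.",
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
]
print(filter_and_deduplicate(docs))
```

Only the first document survives here. Exact hashing catches identical copies but not near-duplicates, which is why large-scale pipelines typically add fuzzy matching on top.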

Pretraining

The filtered dataset was used for the pretraining phase. Pretraining involves training a language model on a large corpus of text to learn the statistical patterns and structures of the language. The refined dataset as a whole contains nearly five trillion tokens; Falcon 40B's training run used 1 trillion tokens.

Model Architecture

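Falcon itself is a text-only language model. Its training dataset, RefinedWeb, was built on top of CommonCrawl and was designed to be multimodal-friendly, meaning the processed samples keep links and alt text for images so the data can also support work that combines textual and visual information.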

Training Hardware

Training Falcon required a significant amount of computational resources. Specifically, it used 384 GPUs (Graphics Processing Units) on the Amazon Web Services (AWS) platform, and the training process took over two months to complete.
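These figures allow a rough compute estimate. The sketch below assumes 60 days for "two months" (the text says "over", so this is a lower bound) and uses the common rule of thumb of about six floating-point operations per parameter per training token. Both are assumptions for illustration, not numbers published for Falcon.

```python
NUM_GPUS = 384
TRAINING_DAYS = 60        # assumption: "two months" taken as 60 days
PARAMS = 40e9             # 40 billion parameters
TOKENS = 1e12             # 1 trillion training tokens
FLOPS_PER_PARAM_TOKEN = 6  # rule-of-thumb approximation, not a measured value

gpu_hours = NUM_GPUS * TRAINING_DAYS * 24
total_flops = FLOPS_PER_PARAM_TOKEN * PARAMS * TOKENS
flops_per_gpu_second = total_flops / (gpu_hours * 3600)

print(f"GPU-hours: {gpu_hours:,}")
print(f"Estimated training compute: {total_flops:.1e} FLOPs")
print(f"Implied sustained throughput per GPU: {flops_per_gpu_second:.2e} FLOP/s")
```

Under these assumptions the run comes to 552,960 GPU-hours and about 2.4e23 floating-point operations, which implies a sustained throughput on the order of 10^14 operations per second per GPU.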

Token Count

Falcon was trained on 1,000 billion (1 trillion) tokens from the RefinedWeb dataset. The token count refers to the number of individual units (such as words or subwords) used to represent the text during training.
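To see why token counts differ from word counts, consider a toy subword tokenizer. This is a greedy longest-match scheme over a tiny made-up vocabulary, chosen because it fits in a few lines; it is not the tokenizer Falcon uses, and the vocabulary and function name are hypothetical.

```python
def tokenize(text, vocab):
    """Split each word into the longest vocabulary pieces, left to right."""
    tokens = []
    for word in text.lower().split():
        start = 0
        while start < len(word):
            # Try the longest candidate first, then shrink it.
            for end in range(len(word), start, -1):
                piece = word[start:end]
                if piece in vocab:
                    break
            else:
                # No vocabulary entry matched: fall back to one character.
                end = start + 1
                piece = word[start:end]
            tokens.append(piece)
            start = end
    return tokens


VOCAB = {"falcon", "train", "ing", "token", "s"}  # hypothetical vocabulary
pieces = tokenize("Falcon training tokens", VOCAB)
print(pieces)       # ['falcon', 'train', 'ing', 'token', 's']
print(len(pieces))  # 5
```

Three words become five tokens here, which is why a dataset's token count is usually larger than its word count.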

Validation

Once training was finished, Falcon's performance was evaluated and validated against several open-source benchmarks. These benchmarks, such as the EleutherAI evaluation harness (EAI Harness), HELM, and BigBench, are standardized tests used to assess a language model's capabilities and compare it with other models.

How can Falcon 40B outperform GPT-4?

Training Data

Falcon 40B is trained on a vast amount of text, broken into tokens that the model maps internally to embeddings (numerical vectors). Most of this training data comes from the RefinedWeb dataset, supplemented by curated sources such as Reddit conversations.

RefinedWeb Dataset

The RefinedWeb dataset is built upon the extensive archives of CommonCrawl, a non-profit organization that has been collecting web data since 2008. It stores petabytes (1 petabyte = 1 million gigabytes) of data from various websites. RefinedWeb is a subset of this collection, carefully filtered and deduplicated to ensure high quality and relevance.

Reddit Conversations

In addition to the RefinedWeb dataset, Falcon 40B also benefits from training on Reddit conversations. Reddit is a popular social media platform where users discuss and share content. By including Reddit conversations in the training data, Falcon 40B gains exposure to diverse and informal language usage.

High-Quality Training Data

The training data used for Falcon 40B is described as exceptionally high quality, which implies that significant effort went into curating and filtering it for relevance and reliability. High-quality training data is crucial for language models because it helps them learn accurate representations of language patterns and improves their performance.

Unparalleled Strength and Effectiveness

The sheer magnitude of the training data, drawn from the vast archives of CommonCrawl and from Reddit conversations, contributes to Falcon 40B's strength and effectiveness. By training on such a large and diverse dataset, Falcon 40B can learn from varied linguistic patterns, semantic relationships, and contextual nuances, enabling it to generate more accurate and contextually appropriate responses.

So what can Falcon 40B actually do?

Falcon 40B has several possible applications.

Natural language processing (NLP)

In natural language processing (NLP), the Falcon 40B language model has several potential uses. Text classification, sentiment analysis, and question answering are only some of the tasks it can handle well. It can sort texts into categories such as blog entries, product reviews, and news items. It can also analyze the tone of a piece of text and tell you whether it is positive, negative, or neutral. In addition, Falcon 40B can answer detailed questions about a text, such as who wrote it, what it's about, or what's most important in it.

Generative text tasks

Falcon 40B does well at generative text tasks, creating original material like poetry, code, screenplays, musical compositions, emails, and letters. It can write rhymed poetry in a prescribed meter, generate syntactically valid code, and write scripts with natural dialogue.

Chatbots

Falcon 40B is also quite good at creating chatbots. It paves the way for developing conversational bots that can perform tasks like answering questions about goods and services and providing customer care.

Machine translation

Falcon 40B can help with machine translation by translating text between the languages it was trained on, allowing people from different cultural backgrounds to share ideas and information with fewer language barriers.

Data augmentation

Falcon 40B is beneficial in data augmentation, where it improves the efficiency of machine-learning models by simulating real-world data via synthetic data generation.

Research

Falcon 40B allows researchers to test novel approaches to deploying LLMs across various applications. It can improve natural language processing (NLP), machine translation (MT), chatbot creation, and even more out-of-the-box uses like fiction authoring and video game design.

Are we going to choose Falcon 40B over GPT-4?

In the coding tasks, both ChatGPT and Falcon-40b were able to write a Python script that printed the numbers from 1 to 100. But ChatGPT couldn't produce a working "snake" game in Python, and Falcon-40b's attempt at the same task was also incorrect. This shows that both models' code generation needs to improve.

When it came to creative writing, both ChatGPT and Falcon-40b did a great job of writing 50-word poems about AI. Their outputs were good and showed how creative they could be.

Both ChatGPT and Falcon-40b knew who the president of the United States was in 1996. This shows that they can give the right information about historical events.

In terms of logic and reasoning, both ChatGPT and Falcon-40b showed that they could work out the order of people's speeds and accurately answer questions about relative speed.

In math, ChatGPT got the right answer to the expression "25 – 4 * 2 + 3," which is 20, while Falcon-40b got the wrong answer. This shows that Falcon-40b needs to get better at answering math problems.
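For reference, the expression from the math test can be checked directly. The short sketch below spells out the order of operations step by step; the variable names are only for illustration.

```python
# Multiplication binds tighter than addition and subtraction,
# so 4 * 2 is evaluated first.
product = 4 * 2           # 8
difference = 25 - product  # 17
result = difference + 3    # 20

# Python applies the same precedence rules to the full expression.
assert result == 25 - 4 * 2 + 3
print(result)  # 20
```

A common mistake is to evaluate strictly left to right, which gives (25 - 4) * 2 + 3 = 45 instead of 20.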

In planning tasks, ChatGPT and Falcon-40b made healthy meal plans for the day that were good enough. This showed that they could help with practical and health-related issues.

To wrap it up…

The comparative assessment of ChatGPT and Falcon-40b across tasks provides useful insights into their performance. Both models had their advantages and disadvantages. On writing ability, knowledge, reasoning ability, and speed, ChatGPT came out ahead.

Although Falcon-40b showed promise in the creative writing task, it still has room for improvement in code generation and mathematical problem-solving. The strengths and weaknesses of each model can help determine when and how they should be used. The models will continue to improve as AI research progresses, increasing their usefulness and influence.

Are you planning to integrate AI into your workflow? Schedule a call with our CEO, Rohan, today!
