
INTRODUCTION

It is our team's pleasure to work with you and offer our latest cutting-edge Large Language Model (LLM), Enteli-49B, for your business needs. This collaboration marks a significant step in using advanced Natural Language Processing (NLP) to enhance your business operations.

This Hugging Face repository is divided into five sections: Model Architecture, Model Usage, Immediate Integration, Deployment and Future Work. Please check out our demo for the model: https://huggingface.co/spaces/arhanovich/Enteli-49B_Demo

Key Features of Enteli-49B:

-SOTA Performance: As the benchmarks below show, Enteli-49B clearly outperforms other language models such as GPT-3.5. Our LLM excels at understanding and generating human-like text, with advanced reasoning, coding and math abilities.

-Customization and Scalability: Tailor the model to your specific industry needs, ensuring relevance and efficiency in a plethora of tasks.

-Computational Efficiency: Despite its high performance, our LLM's parameter count is relatively low, so it is less computationally intensive at inference time.

-Seamless Integration: Easy integration with your existing systems and workflows.

Choosing HuggingFace for Delivery and Demonstration:

Our choice of Hugging Face as the platform for demonstrating and delivering our LLM to you is strategic and deliberate. Hugging Face is well known for its robust, user-friendly, and versatile environment. This platform not only simplifies the integration and deployment of advanced AI models but also ensures that you stay at the forefront of AI technology with continuous updates and community support. Prominent AI firms such as Google, Meta, OpenAI and Microsoft use Hugging Face to share LLMs safely and easily.

MODEL ARCHITECTURE

Successful LLMs such as GPT-4 are widely reported to have been trained with a method called Mixture of Experts, owing to its strong performance and higher efficiency. Thus, we at EnteliMind trained Enteli-49B using the Mixture of Experts algorithm.

When it comes to improving the quality of machine learning models, scale is key. Given a fixed compute budget, training a larger model for fewer steps is better than training a smaller model for more steps. An intriguing approach to achieving better scale with limited computational resources is the Mixture of Experts (MoE) model. This method allows larger models or datasets to be pre-trained with the same compute budget as traditional dense models, but significantly faster. Instead of training a single monolithic language model, whose training behaves like a "black box" with no visibility into domain-specific abilities, expert models can be trained separately, with each expert dedicated to a single ability.

At its core, a MoE model comprises two primary components:

Sparse MoE Layers: These replace the usual dense feed-forward network (FFN) layers. A MoE layer consists of several "experts" – each being a separate neural network. Typically, these experts are FFNs themselves, but they can also be more intricate, even forming hierarchical structures.

Gate Network/Router: This component directs specific tokens to specific experts. For instance, one token might be routed to one expert while another goes to a different one. The routing process is critical in MoE models and is based on learned parameters that are pre-trained alongside the network.

[Figure: an MoE layer, with a router dispatching tokens to sparse expert FFNs.]

Gating Network Mechanics:

The gating network's function is to efficiently distribute input across various experts. It's mathematically defined as:

$$G(x) = \mathrm{Softmax}(x \cdot W_g)$$

where $W_g$ is the learned gating weight matrix.

Sparsity and Conditional Computation:

Sparsity in MoE models is about using conditional computation - activating only parts of the network for specific inputs. This approach enables scaling up the model size without a proportional increase in computation. This is mathematically represented as:

$$y = \sum_{i=1}^{n} G(x)_i \, E_i(x)$$

Where $y$ is the output, $G(x)_i$ is the gating weight assigned to the $i$-th expert, $E_i(x)$ is the operation performed by the $i$-th expert, and $n$ is the number of experts.
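
To make these two components concrete, here is a minimal PyTorch sketch of a sparse MoE layer with a softmax router. It is illustrative only, not the Enteli-49B implementation; the layer sizes and expert count are placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Illustrative sparse MoE layer: a softmax router (gate) plus several expert FFNs."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(d_model, n_experts, bias=False)  # W_g
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        gate_probs = F.softmax(self.gate(x), dim=-1)           # G(x): one weight per expert
        weights, expert_idx = gate_probs.topk(self.top_k, dim=-1)
        y = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_idx[:, slot] == e
                if mask.any():
                    # y = sum_i G(x)_i * E_i(x), restricted to the selected experts
                    y[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return y

layer = SparseMoELayer()
tokens = torch.randn(4, 512)   # 4 token embeddings
print(layer(tokens).shape)     # torch.Size([4, 512])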

Innovative Gating and Load Balancing:

Beyond traditional gating, techniques like Noisy Top-k Gating add noise to the gating process, keeping only the top k values. This method, while introducing complexity, aids in faster training and inference by activating fewer experts. Additionally, noise helps in load balancing, ensuring an equitable distribution of tokens among experts, preventing any single expert from becoming a bottleneck. Here is its mathematical representation:

$$H(x)_i = (x \cdot W_g)_i + \mathrm{StandardNormal}() \cdot \mathrm{Softplus}\big((x \cdot W_{\text{noise}})_i\big)$$

$$\mathrm{KeepTopK}(v, k)_i = \begin{cases} v_i & \text{if } v_i \text{ is among the top } k \text{ elements of } v \\ -\infty & \text{otherwise} \end{cases}$$

$$G(x) = \mathrm{Softmax}\big(\mathrm{KeepTopK}(H(x), k)\big)$$
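
As an illustration, the standard noisy top-k gating computation can be sketched in a few lines of PyTorch. This follows the generic formulation above, not Enteli-49B's internal code.

import torch
import torch.nn.functional as F

def noisy_top_k_gating(x, w_gate, w_noise, k=2):
    """Noisy top-k gating: add learned, input-dependent noise, keep the k largest logits, zero out the rest."""
    clean_logits = x @ w_gate                               # (num_tokens, n_experts)
    noise_std = F.softplus(x @ w_noise)                     # Softplus((x . W_noise)_i)
    noisy_logits = clean_logits + torch.randn_like(clean_logits) * noise_std   # H(x)
    top_vals, top_idx = noisy_logits.topk(k, dim=-1)
    masked = torch.full_like(noisy_logits, float("-inf"))   # KeepTopK: everything else becomes -inf
    masked.scatter_(-1, top_idx, top_vals)
    return F.softmax(masked, dim=-1)                        # non-selected experts get weight 0

gates = noisy_top_k_gating(torch.randn(4, 512), torch.randn(512, 8), torch.randn(512, 8), k=2)
print(gates.shape, (gates > 0).sum(dim=-1))                 # (4, 8); exactly 2 non-zero gates per token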

Our Research Findings:

We have simplified our entire model architecture to match the transformers library's Mixture-of-Experts implementation, known as MixtralForCausalLM. This allows easy integration with Hugging Face and the transformers library, which will certainly facilitate future work such as supervised fine-tuning.

That said, the difference between the original implementation and this simplified version is very small, and we would like to share the extra research findings from training Enteli-49B.
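
Because the checkpoint ships as a standard MixtralForCausalLM, its MoE hyper-parameters can be inspected through the transformers config. The field names below are the ones used by Mixtral-style configs, and the auth token is a placeholder as in the usage examples later in this card.

from transformers import AutoConfig

auth_token = "There goes the auth token"   # placeholder, as in the usage examples
config = AutoConfig.from_pretrained("arhanovich/Enteli-49B", use_auth_token=auth_token)

print(config.model_type)             # "mixtral"
print(config.num_local_experts)      # number of experts in each MoE layer
print(config.num_experts_per_tok)    # how many experts each token is routed to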

1-) Exponential Mean Absolute Deviation Normalization (EMADNorm):

Enteli-49B incorporates EMADNorm to normalize the data, which divides each element by an exponential factor dependent on the dataset's mean absolute deviation (MAD). The MAD and EMADNorm are defined as:

$$\mathrm{MAD} = \frac{1}{N} \sum_{i=1}^{N} \lvert x_i - \mu \rvert \qquad \mathrm{EMADNorm}(x_i) = \frac{x_i}{e^{\mathrm{MAD}}}$$

Where $N$ is the number of elements, $x_i$ is each individual element, $\mu$ is the mean of all elements, and $e$ is the base of the natural logarithm.

EMADNorm focuses on the spread of the data by considering the mean absolute deviation. This aspect is particularly beneficial in datasets where the dispersion is an important feature and needs to be emphasized or normalized differently from the mean. By using an exponential function of the MAD, EMADNorm adapts the degree of normalization to the characteristics of the dataset. This adaptability can be crucial for datasets with varying levels of volatility or dispersion. Moreover, by normalizing the input data effectively, EMADNorm can contribute to more stable and efficient model training. It ensures that the scale of the inputs does not adversely affect the learning process, which can be critical for the convergence and performance of deep learning models.
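
For clarity, here is a minimal sketch of EMADNorm as described above, assuming the straightforward reading of dividing every element by e raised to the MAD:

import torch

def emad_norm(x: torch.Tensor) -> torch.Tensor:
    """EMADNorm as described above: divide every element by e raised to the mean absolute deviation."""
    mad = (x - x.mean()).abs().mean()   # MAD = (1/N) * sum(|x_i - mu|)
    return x / torch.exp(mad)           # x_i / e^MAD

print(emad_norm(torch.tensor([1.0, 2.0, 6.0])))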

2-) CurveLu Activation Function

The feed-forward network in Enteli-49B utilizes the CurveLu activation function, a blend of ReLU and Tanh, allowing sensitivity to both positive and negative inputs. The network can be represented as:

[Equation: the feed-forward network written with the CurveLu activation; the original image is not reproduced here.]

Where the CurveLu activation function is defined as: [equation image not reproduced here].

Here $k$ is a hyper-parameter that dictates the steepness of the tanh component; alternatively, it can simply be set to the constant 1.

This novel activation function is smooth and more forgiving to positive values, as the original card's graph (not reproduced here) illustrates.
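
Since the exact definition is only given as an image in the original card, the sketch below shows one plausible reading of "a blend of ReLU and Tanh" with steepness parameter k. This is an assumption for illustration, not the published formula.

import torch

def curvelu(x: torch.Tensor, k: float = 1.0) -> torch.Tensor:
    """Hypothetical CurveLu: identity (ReLU-like) for positive inputs, tanh(k*x) for negative inputs."""
    return torch.where(x > 0, x, torch.tanh(k * x))

print(curvelu(torch.tensor([-2.0, -0.5, 0.5, 2.0])))   # negative side squashed by tanh, positive side kept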

More Details:

Enteli-49B is pre-trained on data extracted from the open Web, with experts and routers trained simultaneously on over 1.9 trillion tokens.

Benchmarks

| Benchmark | Enteli-49B (EnteliMind) | GPT-3.5 (OpenAI) | LLaMA 2 70B (Meta AI) |
| --- | --- | --- | --- |
| MMLU | 73.6% | 70% | 69.9% |
| HellaSwag (10-shot) | 90.6% | 85.5% | 87.1% |
| ARC Challenge (25-shot) | 87.9% | 85.2% | 85.1% |
| WinoGrande (5-shot) | 83.2% | 81.6% | 83.2% |
| GSM-8K (5-shot) | 61.1% | 57.1% | 53.6% |

These benchmarks indicate that our model outperforms models like GPT-3.5 and LLaMA 2 70B despite having fewer parameters.

Model Usage

Our model can be used easily with the transformers Python library.

The chat template that must be strictly used is as follows:

<s> [INST] There goes the prompt [/INST] There goes the answer</s> [INST] Follow-up prompt [/INST]
  • <s> is the BOS (beginning-of-string) token
  • </s> is the EOS (end-of-string) token
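
For reference, a multi-turn prompt in this format can be assembled by hand as in the sketch below; the messages are placeholders.

def build_prompt(turns):
    """Assemble a multi-turn prompt in the <s> [INST] ... [/INST] answer</s> format shown above.
    `turns` is a list of (user_message, assistant_answer) pairs; the last answer may be None."""
    text = "<s>"
    for user, assistant in turns:
        text += f" [INST] {user} [/INST]"
        if assistant is not None:
            text += f" {assistant}</s>"
    return text

prompt = build_prompt([
    ("What is EnteliMind?", "EnteliMind is an AI company."),   # completed turn
    ("And what is Enteli-49B?", None),                          # the turn the model should answer next
])
print(prompt)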

Here is example code for using the model in Python on a GPU:

#pip install transformers accelerate bitsandbytes 
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "arhanovich/Enteli-49B"
auth_token = "There goes the auth token" # Since this is a private model, you must use your auth token to access the model and the tokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name, use_default_system_prompt=False, use_auth_token=auth_token)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32, device_map='auto',local_files_only=False, load_in_4bit=True, use_auth_token=auth_token)

prompt = input("Query: ")
full_prompt = f"<s>[INST] You are a helpful AI called Enteli trained by the AI company EnteliMind.[/INST]\nUser: {prompt}\nAssistant:"
input_ids = tokenizer(full_prompt, return_tensors="pt").input_ids.to("cuda")
generation_output = model.generate(
input_ids=input_ids, max_new_tokens=500)
# Strip the prompt from the decoded output (keep everything after the final "Assistant:" marker)
answer = tokenizer.decode(generation_output[0], skip_special_tokens=True).split("Assistant:")[-1].strip()
print(f"Answer: {answer}")

Important Notes:

  • This chat template must be strictly used.
  • In this code torch.float32 has been used; alternatively, torch.float16 can be used, which gives faster computation and lower memory usage at the cost of some precision (see the sketch after this list).
  • In this code the model is loaded in 4-bit, a form of quantization in which the weights of the neural network are represented with only 4 bits each. Quantization reduces the model size and can speed up inference. If maximum precision is needed, the load_in_4bit flag can be removed so the weights are loaded at full precision, which requires considerably more GPU memory.
  • Other parameters of model.generate() such as temperature, top_p, top_k or max_new_tokens can also be altered upon request.
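
For example, a half-precision load and a sampling-based generation call could look like the sketch below, reusing model_name, auth_token and input_ids from the example above.

# Alternative: half-precision weights instead of 4-bit quantization, plus sampling-based generation
# (reuses model_name, auth_token and input_ids from the example above)
model_fp16 = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,     # faster and lighter than float32, at a small cost in precision
    device_map="auto",
    use_auth_token=auth_token,
)

generation_output = model_fp16.generate(
    input_ids=input_ids,
    do_sample=True,                # required for temperature / top_p / top_k to have an effect
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    max_new_tokens=500,
)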

Immediate Integration

In the dynamic landscape of artificial intelligence, the fusion of Enteli-49B with external functions heralds a groundbreaking era of innovation and utility. This integration is not just an advancement; it's a revolution, poised to redefine the boundaries of technology and human interaction.

To exemplify, here are some potential use cases of the combination of Enteli-49B with external functions:

  • Combining it with a calculator function to enable it to carry out flawless calculations
  • Combining it with a web browser or a search engine, making it aware of current data
  • Combining it with complex financial calculation tools, such as market analysis or investment-portfolio tools.

Thus, any API or function in a coding environment can be integrated with Enteli-49B. Things get very interesting when you combine multiple Enteli-49B agents, each with its own tools, enabling them to carry out complex tasks that humans cannot perform efficiently. This could potentially be the dawn of a new form of intelligence.

We, the EnteliMind team, have written two example scripts as a starting point for that journey:

Example 1: Integration with Functions of a Single Parameter

In the first example script, we combine Enteli-49B with a calculator tool and a web search tool. Here is the code:

#pip install transformers accelerate bitsandbytes duckduckgo_search

import torch
import transformers

model_name = "arhanovich/Enteli-49B"

auth_token = "There goes the auth token" # Since this is a private model, you must use your auth token to access the model and the tokenizer

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name, use_default_system_prompt=False, use_auth_token=auth_token)
model = transformers.AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32, device_map='auto',local_files_only=False, load_in_4bit=True, use_auth_token=auth_token)

generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=False, 
    task="text-generation",
    temperature=0.1,  # 'randomness' of outputs, 0.0 is the min and 1.0 the max
    top_p=0.15,  # select from top tokens whose probability add up to 15%
    top_k=0,  # select from top 0 tokens (because zero, relies on top_p)
    max_new_tokens=512,  # max number of tokens to generate in the output
    repetition_penalty=1.1  
)


def instruction_format(sys_message: str, query: str):
    return f'<s> [INST] {sys_message} [/INST]\nUser: {query}\nAssistant: ```json\n{{\n"tool_name": '

system_message= """You are a helpful AI assistant, you are an agent capable of using a variety of tools to answer a question. Here are a few of the tools available to you:

- Calculator: the calculator should be used whenever you need to perform a calculation, no matter how simple. It uses Python so make sure to write complete Python code required to perform the calculation required and make sure the Python returns your answer to the `output` variable.
- Search: the search tool should be used whenever you need to find information. It can be used to find information about everything
- Final Answer: the final answer tool must be used to respond to the user. You must use this when you have decided on an answer.

TOOL USAGE

Let's get started. The users query is as follows.
"""

import json

def format_output(text: str):
    full_json_str = '{\n"tool_name": '+text
    full_json_str = full_json_str.strip()
    if full_json_str.endswith("```"):
        full_json_str = full_json_str[:-3]
    return json.loads(full_json_str)

from duckduckgo_search import DDGS

def use_tool(action: dict):
    tool_name = action["tool_name"]
    if tool_name == "Final Answer":
        return "Assistant: "+action["input"]
    elif tool_name == "Calculator":
        # exec() does not reliably create local variables inside a function,
        # so run the generated code in an explicit namespace and read `output` from it
        namespace = {}
        exec(action["input"], namespace)
        return f"Tool Output: {namespace['output']}"
    elif tool_name == "Search":
        contexts = []
        with DDGS() as ddgs:
            results = ddgs.text(
                action["input"],
                region="wt-wt", safesearch="on",
                max_results=3
            )
            for r in results:
                contexts.append(r['body'])
        info = "\n---\n".join(contexts)
        return f"Tool Output: {info}"
    else:
        # otherwise just assume final answer
        return "Assistant: "+action["input"]


def run_agent(query: str):
    res = generate_text(query)
    action_dict = format_output(res[0]["generated_text"])
    response = use_tool(action_dict)
    full_text = f"{query}{res[0]['generated_text']}\n{response}"
    return response, full_text


query = input(">: ")

input_prompt = instruction_format(system_message, query)

out = run_agent(input_prompt)
print(out)

second_step = out[1]+"""
Assistant: ```json
{
    "tool_name": """

out = run_agent(second_step)

print(out[0])

This code sets up a basic AI agent. Note that Python libraries such as LangChain or LlamaIndex could also be used to build the agent. The custom tools (and the corresponding system prompt) can also be altered for different functionalities; a sketch of adding an extra tool follows the JSON examples below.

Also, replace the TOOL USAGE placeholder in the system prompt above with:

To use these tools you must always respond in JSON format containing `"tool_name"` and `"input"` key-value pairs. For example, to answer the question, "what is the square root of 51?" you must use the calculator tool like so:

```json
{
    "tool_name": "Calculator",
    "input": "from math import sqrt; output = sqrt(51)"
}
```

Or to answer the question "who is the current president of the USA?" you must respond:

```json
{
    "tool_name": "Search",
    "input": "current president of USA"
}
```

Remember, even when answering to the user, you must still use this JSON format! If you'd like to ask how the user is doing you must write:

```json
{
    "tool_name": "Final Answer",
    "input": "How are you today?"
}
```
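
As noted above, the tool set can be extended. The sketch below adds a purely hypothetical "Word Counter" tool: it would need a matching "- Word Counter: ..." line in the system prompt's tool list, plus a dispatch branch, here wrapped around the existing use_tool function.

def word_counter(text: str) -> int:
    return len(text.split())

def use_tool_extended(action: dict):
    # Dispatch the hypothetical extra tool; otherwise fall back to the original tools
    if action["tool_name"] == "Word Counter":
        return f"Tool Output: {word_counter(action['input'])}"
    return use_tool(action)   # Calculator, Search, Final Answer as defined above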

Example 2: Integration with Functions of Multiple Parameters

In this example, we build a finance agent with tools for compound interest, present value of an annuity, and Capital Asset Pricing Model calculations.

#pip install transformers accelerate bitsandbytes
import torch
import transformers
auth_token = "There goes the auth token" # Since this is a private model, you must use your auth token to access the model and the tokenizer
model_name = "arhanovich/Enteli-49B"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name, use_default_system_prompt=False, use_auth_token=auth_token)
model = transformers.AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32, device_map='auto',local_files_only=False, load_in_4bit=True, use_auth_token=auth_token)


def generate_text(query):
    system_message = """
    <s>[INST]You are a helpful AI assistant, you are an agent capable of using a variety of tools to answer a question. Here are a few of the tools available to you:
    
    - Compound Interest: Calculate the future value of an investment with compound interest. :param principal: Initial amount of money invested (principal) :param rate: Annual interest rate (as a decimal) :param periods: Number of periods the money is invested for :return: Future value of the investment.
    - Present Value Annuity: Calculate the present value of an annuity :param payment: The fixed payment amount per period :param rate: Discount rate per period (as a decimal).:param periods: Total number of periods :return: Present value of the annuity.
    - Capital Asset Pricing: Calculate the expected return of an asset using the Capital Asset Pricing Model (CAPM) :param expected_market_return: Expected return of the market :param risk_free_rate: Risk-free rate of return :param beta: Beta of the asset :return: Expected return of the asset.
    - Final Answer: the final answer tool must be used to respond to the user. You must use this when you have decided on an answer. :param answer:Your final answer
    
    To use these tools you must always respond in JSON format containing `"tool_name"` and `"input"` key-value pairs. 
    
    For example, to answer the question, "Suppose you invest $5,000 in a savings account offering an annual interest rate of 4%. How much money will be in the account after 10 years if the interest is compounded annually?" you must use the tool like so:
    
    ```json
    {
        "tool_name": "Compound Interest",
        "input": "principal=5000, rate=0.04, periods=10"
    }
    ```
    
    Or to answer the question "You are considering an investment that will pay you $1,000 per year for the next 5 years. If your discount rate is 3%, what is the present value of these future payments?" you must respond:
    
    ```json
    {
        "tool_name": "Present Value Annuity",
        "input": "payment=1000, rate=0.03, periods=5"
    }
    ```
    
    To answer the question "An asset has a beta of 1.2. The risk-free rate is 2%, and the expected market return is 8%. What is the expected return on this asset according to the CAPM?" use the tool like that
    ```json
    {
        "tool_name": "Capital Asset Pricing",
        "input": "expected_market_return=0.08, risk_free_rate=0.02, beta=1.2"
    }
    ```
    
    Remember, even when answering the user, you must still use this JSON format! For example, if the Present Value Annuity tool gave an output like: 4987.76
    
    ```json
    {
        "tool_name": "Final Answer",
        "input": "answer=The Present Value of the Annuity is 4987.76"
    }
    ```
    
    Let's get started. The user's query is as follows. You must always give your answer in JSON format!!!
    User: """
    
    full_prompt = system_message + query + "[/INST]" 
    
    input_ids = tokenizer(full_prompt, return_tensors="pt").input_ids.to("cuda")
    
    generation_output = model.generate(input_ids=input_ids, max_new_tokens=1024, do_sample=True, temperature=0.6, top_p=0.9, top_k=50)
    answer = str(tokenizer.decode(generation_output[0], skip_special_tokens=True))
    answer = answer.split("[/INST]")[-1].strip()
    return answer

import json
import re
def format_output(text: str):
    # Find the JSON part in the text
    start = text.find("{")
    end = text.rfind("}") + 1
    if start == -1 or end == -1:
        raise ValueError("JSON string not found in the text")

    # Extract the JSON string
    json_str = text[start:end]

    # Parse the JSON string
    try:
        json_obj = json.loads(json_str)
    except json.JSONDecodeError:
        # Fall back to extracting a plain-text answer, returning the same
        # (tool_name, params) shape that run_agent expects
        match = re.search(r'"answer":\s*"([^"]+)"', text)
        if match:
            return "Final Answer", {"answer": match.group(1)}
        else:
            raise ValueError("Answer not found")


    # Ensure the necessary keys are present
    if "tool_name" not in json_obj or "input" not in json_obj:
        raise ValueError("Required keys ('tool_name', 'input') are missing in the JSON")

    # Extract and parse the parameters
    try:
        parameters_str = json_obj["input"]
        params = dict(param.split("=") for param in parameters_str.split(", "))

        # Convert parameter values to appropriate type (int, float, or leave as string)
        def convert_value(v):
            try:
                return float(v) if '.' in v else int(v)
            except ValueError:
                return v  # If conversion to int or float fails, return the string as is

        params = {k: convert_value(v) for k, v in params.items()}
    except Exception as e:
        raise ValueError(f"Error parsing parameters: {e}")

    return json_obj["tool_name"], params


def compound_interest(principal, rate, periods):
    """
    Calculate the future value of an investment with compound interest.
    :param principal: Initial amount of money invested (principal).
    :param rate: Annual interest rate (as a decimal).
    :param periods: Number of periods the money is invested for.
    :return: Future value of the investment.
    """
    return principal * (1 + rate) ** periods
    
def present_value_annuity(payment, rate, periods):
    """
    Calculate the present value of an annuity.
    :param payment: The fixed payment amount per period.
    :param rate: Discount rate per period (as a decimal).
    :param periods: Total number of periods.
    :return: Present value of the annuity.
    """
    return payment * ((1 - (1 + rate) ** -periods) / rate)

def capm(expected_market_return, risk_free_rate, beta):
    """
    Calculate the expected return of an asset using the Capital Asset Pricing Model (CAPM).
    :param expected_market_return: Expected return of the market.
    :param risk_free_rate: Risk-free rate of return.
    :param beta: Beta of the asset.
    :return: Expected return of the asset.
    """
    return risk_free_rate + beta * (expected_market_return - risk_free_rate)
    
    
def final_answer(answer):
    return answer



def use_tool(tool_name, params):
    if tool_name == "Final Answer":
        result = final_answer(**params)
        return "Assistant:" + result
        
    elif tool_name == "Capital Asset Pricing":
        result = capm(**params)
        return "Tool Output:" + str(result)
        
    elif tool_name == "Present Value Annuity":
        result = present_value_annuity(**params)
        return "Tool Output:" + str(result)
    elif tool_name == "Compound Interest":
        result = compound_interest(**params)
        return "Tool Output:" + str(result)
        
    else:
        return "Assistant: An error occurred"


def run_agent(query: str):
    res = generate_text(query)
    print(res)
    tool_name, params = format_output(res)
    response = use_tool(tool_name, params)
    full_text = f"{query}{res}\n{response}"
    return response, full_text


query= input(">: ")
out = run_agent(query)
print(f"Result: {out[0]}")

# You can feed the tool output back to the model and obtain the final answer using the same two-step logic as in the previous example

Deployment

Enteli-49B, a sophisticated AI model, necessitates a minimum of 95GB of VRAM for optimal operation. It runs efficiently on dual A100 80GB systems, where each A100 is paired with 80GB of VRAM, 117GB of RAM, and 12 vCPUs.

The model is compatible with virtual machines, with affordable options available through runpod.io. On average, the model processes and outputs a total of 500 tokens in approximately 35 seconds when utilizing a dual A100 80GB setup.

In the context of this model, 'tokens' represent fragments of words. During the initial processing phase, the input is segmented into these tokens, which may consist of partial words, spaces, or even sub-words. For the English language, a single token is roughly equivalent to three-quarters of a word.
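
If you want to check how many tokens a given text consumes, the tokenizer loaded in the usage example above can be reused:

# Reuses the `tokenizer` loaded in the usage example above
text = "Enteli-49B segments input text into sub-word tokens before processing it."
num_tokens = len(tokenizer(text).input_ids)
print(num_tokens, "tokens for", len(text.split()), "words")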

The cost for one-hour usage of a dual A100 80GB system on Runpod is approximately 4 USD. Consequently, processing 1,000,000 tokens (equivalent to around 750,000 words) would incur a cost of about 75 USD. However, this approach allows for processing only one prompt at a time and presents challenges in GPU management. Additionally, time-based GPU rental can lead to inefficiencies, as the model may not be in constant use. Thus, employing services like Runpod might not be the most user-friendly option for consumers.
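
As a quick sanity check on those figures (assuming the quoted throughput of 500 tokens per 35 seconds and roughly 4 USD per hour):

tokens_per_hour = 500 / 35 * 3600            # ~51,400 tokens per hour at 500 tokens / 35 s
hours_per_million = 1_000_000 / tokens_per_hour
cost_per_million = hours_per_million * 4.0   # ~4 USD per hour for a dual A100 80GB machine
print(f"{cost_per_million:.0f} USD per 1M tokens")   # ~78 USD, in line with the ~75 USD figure above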

Fortunately, at EnteliMind, we have access to extensive, dedicated computational servers equipped with numerous GPUs. We aim to offer an API service where you are billed based on token usage. Last month, our usage amounted to approximately 948,750,000 tokens, costing us 7590 USD. From this data, we deduce that the cost for processing 1,000,000 tokens (about 750,000 words) is 8 USD. Therefore, we are prepared to offer you our API service at a rate of 8 USD per 1 million tokens, subsequent to the purchase of our Enteli-49B AI model.

Future Work

The abilities of Enteli-49B can be amplified with the following methods.

1-) Building More Complex AI Agents and Swarms

This is probably the cheapest and easiest method, yet it produces the best results. The abilities of Enteli-49B can be expanded with many custom tools (functions) integrated into it. This opens wide avenues for innovation, in finance for example, by combining the AI with any imaginable tool. Another advanced method is building an AI agent swarm, where different AIs, each with its own tools, talk and negotiate with each other to solve intricate problems. This may sound hard to implement, but projects such as CrewAI and Autogen have simplified the process immensely.

Langchain: https://python.langchain.com/docs/get_started/introduction
CrewAI: https://github.com/joaomdmoura/crewAI
Autogen: https://github.com/microsoft/autogen

2-) Fine-Tuning

Fine-tuning is a crucial step in enhancing the capabilities of pre-trained large language models (LLMs) for specific tasks or domains. Initially, these models are trained on vast and diverse datasets, equipping them with a broad understanding of language and its various applications. However, this general training doesn't provide the model with deep expertise in particular areas or specialized tasks.

To address this, fine-tuning comes into play. It involves adjusting the model's parameters further, but this time using a smaller, domain-specific dataset. This process is akin to giving the model a "mini-education" in a particular field or task, allowing it to become more adept and efficient in that area.

During fine-tuning, the model is exposed to examples that are closely related to the specific task at hand. This exposure helps the model to grasp the subtleties and nuances of the domain, which might not have been covered during its initial training. For instance, a model trained on a general dataset may have a basic understanding of medical terminology, but through fine-tuning with medical texts, it can develop a much more refined and accurate understanding of this domain.

The result of fine-tuning is a more specialized version of the language model, tailored to perform better in specific applications. It effectively narrows the gap between a general-purpose model and a specialized tool, unlocking new possibilities and enhancing the model's performance in targeted tasks. This makes fine-tuning an invaluable process for realizing the full potential of LLMs in various domains and applications.

One of the most well-known techniques is PEFT (Parameter-Efficient Fine-Tuning). It is a library for efficiently adapting large pretrained models to various downstream applications without fine-tuning all of a model's parameters, which is prohibitively costly. PEFT methods fine-tune only a small number of (extra) model parameters - significantly decreasing computational and storage costs - while yielding performance comparable to a fully fine-tuned model. This makes it practical to train and store large language models (LLMs) on consumer hardware.

PEFT: https://huggingface.co/docs/peft/index
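
As an illustration, a LoRA-style PEFT setup for Enteli-49B might look like the sketch below. The target module names are an assumption for a Mixtral-style model, and the auth token is a placeholder as in the earlier examples.

#pip install peft transformers accelerate bitsandbytes
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

auth_token = "There goes the auth token"   # placeholder, as in the earlier examples
model = AutoModelForCausalLM.from_pretrained(
    "arhanovich/Enteli-49B", load_in_4bit=True, device_map="auto", use_auth_token=auth_token
)

lora_config = LoraConfig(
    r=16,                                   # rank of the low-rank adapter matrices
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # assumption: attention projections of a Mixtral-style model
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()     # only the small LoRA adapters are trainable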
