### Configuration

The following YAML configuration was used to produce this model:

```yaml
slices:
  - sources:
      - model: appvoid/arco-2
        layer_range: [0, 12]
  - sources:
      - model: h2oai/h2o-danube3-500m-base
        layer_range: [12, 16]
merge_method: passthrough
dtype: float16
```

## Model Card: archeon (514M parameters)

### Model Details

- **Architecture:** A 514M-parameter causal language model built as a passthrough layer merge (see the configuration above): layers 0-12 come from appvoid/arco-2 and layers 12-16 from h2oai/h2o-danube3-500m-base. The model targets commonsense reasoning and general knowledge, and has undergone multiple rounds of evaluation on ARC Challenge, HellaSwag, PIQA, and Winogrande.
- **Model Size:** 514M parameters

### Use Case and Intended Applications

This model is designed for tasks requiring:

- **Commonsense Reasoning:** Understanding and predicting everyday physical and linguistic scenarios.
- **Text Comprehension:** Handling tasks that require completing or understanding real-world descriptions and ambiguous situations.
- **General Knowledge:** Reasoning through questions that draw on broad knowledge domains, such as multiple-choice exams.

### Training Data

The passthrough merge adds no new training data; the model's knowledge comes from the corpora its two source models were trained on. Evaluation focused on the following areas:

- **Physical Reasoning:** PIQA probes reasoning about everyday physical situations and solutions.
- **Commonsense and Ambiguous Reasoning:** HellaSwag and Winogrande test whether the model can make sense of events or situations that demand commonsense understanding.
- **General Knowledge:** ARC Challenge poses multiple-choice questions that test broad reasoning skills.

### Evaluation Results

The model was evaluated across a range of tasks. Below are the final evaluation results (after removing GSM8k):

| Parameters | Model     | MMLU  | ARC-C | HellaSwag | PIQA  | Winogrande | Average |
|------------|-----------|-------|-------|-----------|-------|------------|---------|
| 500M       | qwen 2    | 44.13 | 28.92 | 49.05     | 69.31 | 56.99      | 49.68   |
| 500M       | qwen 2.5  | 47.29 | 31.83 | 52.17     | 70.29 | 57.06      | 51.72   |
| 1.24B      | llama 3.2 | 36.75 | 36.18 | 63.70     | 74.54 | 60.54      | 54.34   |
| 514M       | archeon   | NA    | 32.34 | 47.80     | 74.37 | 62.12      | 54.16   |

Note: archeon's average is taken over the four benchmarks it was scored on, since MMLU was not evaluated.

- **ARC Challenge (32.34):** Outperforms the 500M qwen models on general-knowledge questions, though it trails the larger llama 3.2.
- **HellaSwag (47.80):** The weakest result relative to the peers in the table; commonsense sequence completion leaves room for improvement.
- **PIQA (74.37):** The model excels at physical reasoning, effectively matching llama 3.2 (74.54) at well under half the parameter count.
- **Winogrande (62.12):** The best score in the table, indicating competitive linguistic reasoning.

### Key Strengths

1. **Physical Reasoning:** PIQA performance is on par with a model more than twice its size, showing a solid grasp of everyday physical interactions.
2. **Linguistic Reasoning:** Winogrande, which tests pronoun disambiguation and ambiguity resolution, is the model's strongest relative result, topping every model in the table.

### Key Weaknesses

1. **General Knowledge (ARC Challenge):** Reasonable for its size, but the model lags larger models on more challenging general-knowledge questions.
2. **Commonsense Completion (HellaSwag):** The lowest HellaSwag score in the table; predicting the next sequence of events is a relative weak spot.
3. **Math Reasoning:** GSM8k results were excluded due to poor performance, indicating a potential area for future improvement with further fine-tuning.
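### Verifying the Merge

The YAML block at the top of this card is a mergekit configuration (a `passthrough` slice merge) and can be applied with mergekit's `mergekit-yaml` command. As a quick sanity check on the resulting checkpoint, the sketch below loads it with `transformers` and counts parameters, which should land near the advertised 514M. The repo id is a hypothetical placeholder, not a confirmed location for this model.

```python
# Sanity-check sketch: load the merged checkpoint and count parameters.
# "your-username/archeon" is a hypothetical placeholder repo id.
import torch
from transformers import AutoModelForCausalLM

model_id = "your-username/archeon"  # hypothetical; substitute the real repo or a local path

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# Layers 0-12 of arco-2 plus layers 12-16 of h2o-danube3-500m-base
# should yield roughly 514M parameters in total.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")
```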
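### Reproducing the Evaluation

The benchmarks in the table correspond to standard EleutherAI lm-evaluation-harness tasks (`arc_challenge`, `hellaswag`, `piqa`, `winogrande`, `mmlu`). The card does not state which harness or settings produced the scores, so the following is a reproduction sketch under that assumption, not the documented evaluation setup.

```python
# Hypothetical reproduction of the table with lm-evaluation-harness (v0.4+).
# The actual settings behind the card's numbers are not documented.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-username/archeon,dtype=float16",  # hypothetical repo id
    tasks=["arc_challenge", "hellaswag", "piqa", "winogrande"],  # MMLU was not run for archeon
    batch_size=8,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```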
### Recommendations for Improvement

- **Fine-Tuning on Mathematical Reasoning:** To improve on GSM8k and other math-heavy tasks, consider fine-tuning on datasets such as MathQA or MATH.
- **Enhanced General Knowledge:** To lift performance on general-knowledge tasks (ARC Challenge), additional fine-tuning on datasets such as SQuAD, TriviaQA, or other large knowledge corpora would be beneficial.

### Model Usage

This model is well-suited to NLP tasks where commonsense and physical reasoning are required, such as:

- Answering multiple-choice questions (e.g., exam preparation, automated tutoring).
- Text completion (e.g., continuing a described sequence of events).
- Commonsense AI applications (e.g., chatbot responses that require real-world understanding).

A minimal loading-and-generation sketch appears at the end of this card.

### Limitations

- **Mathematical Reasoning:** The model struggles with tasks requiring numerical problem solving or complex logical reasoning in math.
- **Context-Specific Fine-Tuning:** The model may require additional fine-tuning for specialized tasks outside its current scope (e.g., legal reasoning, scientific document comprehension).

### Ethical Considerations

This model inherits the limitations and biases of the data its source models were trained on. It may exhibit biases present in general-knowledge corpora and may not perform well in niche domains unless explicitly fine-tuned for such tasks.
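### Example Usage

The sketch below is the minimal loading-and-generation example referenced under Model Usage: plain greedy text completion with `transformers`. The repo id is again a hypothetical placeholder, and the prompt is just an illustrative PIQA-style physical-reasoning cue.

```python
# Minimal text-completion example; the repo id is a hypothetical placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-username/archeon"  # hypothetical; substitute the real repo or a local path
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# A PIQA-style physical-reasoning prompt, the kind of task where the model scores well.
prompt = "To open a stuck jar lid, you can"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```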