{ "cells": [ { "cell_type": "markdown", "id": "7eb11cd7-6200-45f9-822f-45b8b392c4bd", "metadata": {}, "source": [ "# Setup" ] }, { "cell_type": "markdown", "id": "4ca90151-0b89-4870-8213-9d49cb68c555", "metadata": {}, "source": [ "## Config\n", "Set the tokens based on the numbers in [03-poe-token-count-exploration.ipynb](03-poe-token-count-exploration.ipynb). I like to leave a little buffer in case an explanation runs over." ] }, { "cell_type": "code", "execution_count": 1, "id": "5238c6e9-9425-4ced-a16a-998e775e7342", "metadata": {}, "outputs": [], "source": [ "INPUT_TOKENS = 300\n", "OUTPUT_TOKENS = 1650\n", "\n", "INPUT_DATASET = 'derek-thomas/labeled-multiple-choice-explained-falcon-tokenized'\n", "OUTPUT_DATASET = 'derek-thomas/labeled-multiple-choice-explained-falcon-results'\n", "BASE_MODEL = 'tiiuae/Falcon3-7B-Instruct'" ] }, { "cell_type": "markdown", "id": "f4eca659-f11f-4d25-886a-9d7af4f38411", "metadata": {}, "source": [ "# Setup\n", "Here we create the pydantic models for each of our experiments. Note that because of how field names are specified in pydantic, we need to use an `alias` and `populate_by_name`. Since our `Final Answer` is always a letter from a to h, we can use an enumeration." 
] }, { "cell_type": "code", "execution_count": 2, "id": "c5367700-0e9d-435b-875a-02a73b292ade", "metadata": {}, "outputs": [], "source": [ "from pydantic import BaseModel, Field\n", "from typing import List\n", "from enum import Enum\n", "import json\n", "\n", "\n", "class FinalAnswerEnum(str, Enum):\n", " a = \"a\"\n", " b = \"b\"\n", " c = \"c\"\n", " d = \"d\"\n", " e = \"e\"\n", " f = \"f\"\n", " g = \"g\"\n", " h = \"h\"\n", "\n", "class RFAModel(BaseModel):\n", " reasoning: str = Field(...)\n", " final_answer: FinalAnswerEnum = Field(...)\n", "\n", " class Config:\n", " populate_by_name = True\n", " \n", "class FARModel(BaseModel):\n", " final_answer: FinalAnswerEnum = Field(...)\n", " reasoning: str = Field(...)\n", "\n", " class Config:\n", " populate_by_name = True\n", " \n", "class FAModel(BaseModel):\n", " final_answer: FinalAnswerEnum = Field(...)\n", "\n", " class Config:\n", " populate_by_name = True" ] }, { "cell_type": "markdown", "id": "7e0f51c0-c4f7-4299-9a24-a4a90d4a9f2a", "metadata": {}, "source": [ "We generated lots of experiments in [derek-thomas/labeled-multiple-choice-explained-falcon-tokenized](https://huggingface.co/datasets/derek-thomas/labeled-multiple-choice-explained-falcon-tokenized/viewer?row=0). Now we will aggregate everything we need in `experiments` for convenience." 
] }, { "cell_type": "code", "execution_count": 3, "id": "5d0bd22f-293e-4c15-9dfe-8070553f42b5", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/plain": [ "'derek-thomas/falcon-v03-poe-RFA-falcon,derek-thomas/falcon-v03-poe-FAR-falcon,derek-thomas/falcon-v03-poe-RFA-gpt3-5,derek-thomas/falcon-v03-poe-FAR-gpt3-5,derek-thomas/falcon-v03-poe-FA'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "experiments = {\n", " 'RFA-falcon': {\n", " 'pydantic': RFAModel,\n", " \"lora\": \"derek-thomas/falcon-v03-poe-RFA-falcon\",\n", " \"column\": 'user_prompt_RFA',\n", " },\n", " 'FAR-falcon': {\n", " 'pydantic': FARModel,\n", " \"lora\": \"derek-thomas/falcon-v03-poe-FAR-falcon\",\n", " \"column\": 'user_prompt_FAR',\n", " },\n", " 'RFA-gpt3-5': {\n", " 'pydantic': RFAModel,\n", " \"lora\": \"derek-thomas/falcon-v03-poe-RFA-gpt3-5\",\n", " \"column\": 'user_prompt_RFA',\n", " },\n", " 'FAR-gpt3-5': {\n", " 'pydantic': FARModel,\n", " \"lora\": \"derek-thomas/falcon-v03-poe-FAR-gpt3-5\",\n", " \"column\": 'user_prompt_FAR',\n", " },\n", " 'FA': {\n", " 'pydantic': FAModel,\n", " \"lora\": \"derek-thomas/falcon-v03-poe-FA\",\n", " \"column\": 'user_prompt_FA',\n", " },\n", " 'base': {\n", " 'pydantic': FAModel,\n", " \"lora\": None,\n", " \"column\": 'user_prompt_FA',\n", " },\n", "}\n", "\n", "LORAS_STRING = ','.join([v['lora'] for _, v in experiments.items() if v and v.get('lora') is not None])\n", "LORAS_STRING" ] }, { "cell_type": "code", "execution_count": 4, "id": "6f8826fb-76ea-464f-8146-262bda0b58bc", "metadata": { "tags": [] }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "c10cc7b616e2475f8d25dd3967b1ed79", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox(children=(HTML(value='
=2.4.2 to get ordering of json outputs\n", " \"env\": {\n", " \"LORA_ADAPTERS\": LORAS_STRING,\n", " \"MAX_BATCH_PREFILL_TOKENS\": str(20*INPUT_TOKENS),\n", " \"MAX_INPUT_TOKENS\": str(INPUT_TOKENS), \n", " \"MAX_TOTAL_TOKENS\": str(INPUT_TOKENS + OUTPUT_TOKENS), \n", " \"DISABLE_CUSTOM_KERNELS\": 'false',\n", " \"MODEL_ID\": \"/repository\"\n", " },\n", " }\n", " \n", " secrets = {\n", " \"HF_TOKEN\": get_token()\n", " }\n", " \n", " # Creating the inference endpoint\n", " endpoint = create_inference_endpoint(\n", " name=name,\n", " namespace=namespace,\n", " repository=BASE_MODEL,\n", " framework=\"pytorch\",\n", " accelerator=\"gpu\",\n", " instance_size=\"x1\",\n", " instance_type=\"nvidia-l4\",\n", " region=\"us-east-1\",\n", " vendor=\"aws\",\n", " min_replica=8,\n", " max_replica=8,\n", " task=\"text-generation\",\n", " custom_image=custom_image,\n", " secrets=secrets\n", " )\n", " \n", " endpoint.wait()\n", " print(\"Your model is ready to use!\")\n", " return endpoint" ] }, { "cell_type": "code", "execution_count": 9, "id": "000b907a-224d-4dbf-aa0d-e0dbee1b8787", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Your model is ready to use!\n", "CPU times: user 8.29 ms, sys: 747 μs, total: 9.03 ms\n", "Wall time: 98.6 ms\n" ] } ], "source": [ "%%time\n", "endpoint = get_my_endpoint('prompt-order-experiment')" ] }, { "cell_type": "markdown", "id": "5708f348-c11e-4b66-aeff-93f5ec08ab49", "metadata": {}, "source": [ "## Manual Evaluation\n", "Since we haven't seen our models in use yet, it's a good time to check them out!" ] }, { "cell_type": "markdown", "id": "328dd842-eaa4-470c-b761-7c403b453321", "metadata": {}, "source": [ "### Reasoning Final Answer\n", "In both falcon and gpt3-5 we should see the **Reasoning** first and then the **Final Answer** in the prompt and the responses." 
] }, { "cell_type": "code", "execution_count": 10, "id": "47af6191-7765-4047-bd8f-64aadb08434e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'content': 'Answer the Question and include your reasoning and the final answer in a json like: {\"reasoning\": , \"final_answer\": }.',\n", " 'role': 'system'},\n", " {'content': 'Question: What are busses used for?\\nAnswer Choices: (a) Protective shelter (b) Transporting humans (c) Help other species benefit (d) Transporting airplanes (e) A backbone (f) Communication (g) Safe operation (h) Safe driving',\n", " 'role': 'user'}]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "key = 'RFA-falcon'\n", "user_prompt_RFA = df.iloc[0][experiments[key]['column']]\n", "user_prompt_RFA" ] }, { "cell_type": "code", "execution_count": 11, "id": "1f976218-f33c-4db3-9797-3935e121e6b2", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'reasoning': 'Busses are primarily designed to transport people from one location to another. They are a common mode of public transportation used by many for commuting, school, work, and other activities. None of the other choices directly relate to the main function of a bus.',\n", " 'final_answer': 'b'}" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "response = endpoint.client.chat_completion(\n", " messages=user_prompt_RFA,\n", " max_tokens=INPUT_TOKENS + OUTPUT_TOKENS,\n", " model=experiments[key]['lora'],\n", " response_format={\"type\": \"json\", \"value\": experiments[key]['pydantic'].schema()},\n", ")\n", "json.loads(response.choices[0].message.content)" ] }, { "cell_type": "code", "execution_count": 12, "id": "222e33b7-0158-44f8-8848-da5318e699b4", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'reasoning': \"Busses are large vehicles designed to transport people from one place to another. 
They operate on fixed routes and schedules, offering a convenient mode of public transportation for many individuals. The choice of 'Transporting humans' best encapsulates the primary function of busses, as they are not intended for carrying other items or species, nor are they part of an airplane's structure.\",\n", " 'final_answer': 'b'}" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "key = 'RFA-gpt3-5'\n", "response = endpoint.client.chat_completion(\n", " messages=user_prompt_RFA,\n", " max_tokens=INPUT_TOKENS + OUTPUT_TOKENS,\n", " model=experiments[key]['lora'],\n", " response_format={\"type\": \"json\", \"value\": experiments[key]['pydantic'].schema()},\n", ")\n", "json.loads(response.choices[0].message.content)" ] }, { "cell_type": "markdown", "id": "138fd46d-896f-4b4b-90d3-d3fd7075f149", "metadata": {}, "source": [ "### Final Answer Reasoning \n", "In both falcon and gpt3-5 we should see the **Final Answer** first and then the **Reasoning** in the prompt and the responses." 
] }, { "cell_type": "code", "execution_count": 13, "id": "ec3e574b-f63f-4513-a6ae-335136543a8c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'content': 'Answer the Question and include your Final Answer and the Reasoning in a json like: {\"final_answer\": , \"reasoning\": }.',\n", " 'role': 'system'},\n", " {'content': 'Question: What are busses used for?\\nAnswer Choices: (a) Protective shelter (b) Transporting humans (c) Help other species benefit (d) Transporting airplanes (e) A backbone (f) Communication (g) Safe operation (h) Safe driving',\n", " 'role': 'user'}]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "key = 'FAR-gpt3-5'\n", "user_prompt_FAR = df.iloc[0][experiments[key]['column']]\n", "user_prompt_FAR" ] }, { "cell_type": "code", "execution_count": 14, "id": "24f30a15-5ec0-4f26-b32f-b4ccb429e6f9", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'final_answer': 'b',\n", " 'reasoning': 'Buses are vehicles primarily used for transporting humans from one place to another. They provide a convenient and efficient way for people to travel together on public transit. The other options are not accurate representations of the main purpose of a bus. 
Protective shelter, help other species benefit, transport airplanes, backbone, communication, safe operation, and safe driving are not the primary functions of a bus.'}" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "response = endpoint.client.chat_completion(\n", " messages=user_prompt_FAR,\n", " max_tokens=INPUT_TOKENS + OUTPUT_TOKENS,\n", " model=experiments[key]['lora'],\n", " response_format={\"type\": \"json\", \"value\": experiments[key]['pydantic'].schema()},\n", ")\n", "json.loads(response.choices[0].message.content)" ] }, { "cell_type": "code", "execution_count": 15, "id": "32536844-211d-4856-983c-d5787734d420", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'final_answer': 'b',\n", " 'reasoning': 'Busses are primarily used for transporting humans from one place to another, making option (b) the most accurate choice among the given answers.'}" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "key = 'FAR-falcon'\n", "response = endpoint.client.chat_completion(\n", " messages=user_prompt_FAR,\n", " max_tokens=INPUT_TOKENS + OUTPUT_TOKENS,\n", " model=experiments[key]['lora'],\n", " response_format={\"type\": \"json\", \"value\": experiments[key]['pydantic'].schema()},\n", ")\n", "json.loads(response.choices[0].message.content)" ] }, { "cell_type": "markdown", "id": "5482cf12-880c-4112-ba87-059a03a3f466", "metadata": {}, "source": [ "### Final Answer \n", "Here we should just see the **Final Answer** and no **Reasoning**." 
] }, { "cell_type": "code", "execution_count": 16, "id": "71a9f634-319c-40c2-8f66-18e282732320", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'content': 'Answer the Question and include your Final Answer in a json like: {\"final_answer\": }.',\n", " 'role': 'system'},\n", " {'content': 'Question: What are busses used for?\\nAnswer Choices: (a) Protective shelter (b) Transporting humans (c) Help other species benefit (d) Transporting airplanes (e) A backbone (f) Communication (g) Safe operation (h) Safe driving',\n", " 'role': 'user'}]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "key = 'FA'\n", "user_prompt_FA = df.iloc[0][experiments[key]['column']]\n", "user_prompt_FA" ] }, { "cell_type": "code", "execution_count": 17, "id": "1cded37d-b907-4f4d-9b8c-c2167c6ba213", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'final_answer': 'b'}" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "response = endpoint.client.chat_completion(\n", " messages=user_prompt_FA,\n", " max_tokens=INPUT_TOKENS + OUTPUT_TOKENS,\n", " model=experiments[key]['lora'],\n", " response_format={\"type\": \"json\", \"value\": experiments[key]['pydantic'].schema()},\n", ")\n", "json.loads(response.choices[0].message.content)\n" ] }, { "cell_type": "markdown", "id": "122b8563-3220-43ac-9e0d-ef84ebcbb9e1", "metadata": {}, "source": [ "## Evaluation Loop\n", "I set the batch prefill budget to 20x the input tokens and used 8 replicas, so I should have capacity for ~160 parallel requests. I'm only using 128, but it should be pretty fast." 
] }, { "cell_type": "code", "execution_count": 18, "id": "379cd952-0ef4-41c4-be3b-5e7ac97e9d78", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "85d5f7f69b8c426f90ae2ac79935d1ee", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Processing requests: 0%| | 0/10098 [00:00= MAX_RETRIES:\n", " raise e # If we've exhausted retries, re-raise the error\n", " else:\n", " print(f\"Error: {e}. Retrying... ({retries}/{MAX_RETRIES})\")\n", " await asyncio.sleep(BACKOFF_TIME) # Wait before retrying\n", "\n", "# Function to process a single conversation type asynchronously\n", "async def process_conversation_type(conversation_type, model_info, df, tokenizer, async_client):\n", " response_column = f\"responses_{conversation_type.replace('-','_')}\"\n", " responses = [] # Temporary list to hold responses for the current conversation type\n", "\n", " tasks = []\n", " for _, item in df.iterrows():\n", " prompt = item.get(model_info[\"column\"])\n", " tasks.append(fetch_response_async(async_client, prompt, model_info[\"lora\"], model_info[\"pydantic\"]))\n", "\n", " # Wait for all tasks to complete\n", " responses = await asyncio.gather(*tasks)\n", "\n", " # If responses are strings, use them directly; otherwise, extract 'generated_text'\n", " try:\n", " df[response_column] = [resp[\"generated_text\"] for resp in responses]\n", " except TypeError: # Fallback in case responses are raw strings\n", " df[response_column] = responses\n", "\n", "# Main function to handle all conversation types\n", "async def main(df, models, tokenizer, async_client):\n", " global progress_bar\n", " total_requests = len(df) * len(models) # Total number of requests across all conversation types\n", " progress_bar = tqdm(total=total_requests, desc=\"Processing requests\")\n", "\n", " tasks = []\n", " for conversation_type, model_info in models.items():\n", " tasks.append(process_conversation_type(conversation_type, model_info, df, tokenizer, 
async_client))\n", " await asyncio.gather(*tasks)\n", "\n", " progress_bar.close() # Close the progress bar when done\n", "\n", "# Define parameters and run\n", "await main(df, experiments, tokenizer, endpoint.async_client)" ] }, { "cell_type": "markdown", "id": "d562e0b4-96ae-4259-925d-4d62b8c49641", "metadata": {}, "source": [ "It took `00:17:02`. Not bad! That should be around `$1.14` total at `$80/gpu/hr`." ] }, { "cell_type": "code", "execution_count": 28, "id": "8f81466e-80fb-4915-9c68-dfbe168e052b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "InferenceEndpoint(name='prompt-order-experiment', namespace='HF-test-lab', repository='tiiuae/Falcon3-7B-Instruct', status='paused', url=None)" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "endpoint.pause()" ] }, { "cell_type": "code", "execution_count": 19, "id": "f5a79dad-2475-4324-8a2b-77b33e9c0822", "metadata": {}, "outputs": [], "source": [ "df_backup = df.copy()" ] }, { "cell_type": "code", "execution_count": 20, "id": "3473d555-927a-49ab-8e15-9097ed455c48", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
topicquestion_textanswer_keygpt3_5_reasoningfalcon_reasoninganswer_choicesuser_prompt_RFAconversation_RFA_gpt3_5conversation_RFA_falconuser_prompt_FARconversation_FAR_gpt3_5conversation_FAR_falconuser_prompt_FAconversation_FAresponses_RFA_falconresponses_FAR_falconresponses_RFA_gpt3_5responses_baseresponses_FAresponses_FAR_gpt3_5
0TransportationWhat are busses used for?ba) Protective shelter: This option is incorrec...(a) Protective shelter - \\nErroneous. Busses a...(a) Protective shelter (b) Transporting humans...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...{\"reasoning\": \"Busses are primarily used for t...{\"final_answer\": \"b\", \"reasoning\": \"Busses are...{\"reasoning\": \"Busses are vehicles used primar...{\"final_answer\": \"b\"}{\"final_answer\": \"b\"}{\"final_answer\": \"b\", \"reasoning\": \"Busses are...
1Climate changeWhich of the following does not contribute to ...ga) Nucleus of a cell: This option is not relat...(a) Nucleus of a cell: This option is incorrec...(a) Nucleus of a cell (b) Flying in a plane (c...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...{\"reasoning\": \"Nucleus of a cell (a) does not ...{\"final_answer\": \"a\", \"reasoning\": \"The nucleu...{\"reasoning\": \"The question asks which of the ...{\"final_answer\": \"g\"}{\"final_answer\": \"g\"}{\"final_answer\": \"d\", \"reasoning\": \"The questi...
2PhotographyWhat uses electrical energy converted from che...ba) Sunlight: Sunlight is a form of energy that...(a) Sunlight: Sunlight is a form of energy tha...(a) Sunlight (b) Cameras (c) Cells (d) Buses (...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...{\"reasoning\": \"Cells convert chemical energy s...{\"final_answer\": \"c\", \"reasoning\": \"Cells, spe...{\"reasoning\": \"Cells use electrical energy for...{\"final_answer\": \"c\"}{\"final_answer\": \"f\"}{\"final_answer\": \"f\", \"reasoning\": \"Cars use e...
3MicrobiologyBacteria causes what to be harmed?aNow, let's go through each option and explain ...1. **Plants (a) - Correct Answer:**\\n - Bact...(a) Plants (b) Electronics (c) Fossils (d) Hum...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...{\"reasoning\": \"Bacteria can cause harm to vari...{\"final_answer\": \"d\", \"reasoning\": \"Bacteria c...{\"reasoning\": \"Bacteria can harm various livin...{\"final_answer\": \"d\"}{\"final_answer\": \"d\"}{\"final_answer\": \"d\", \"reasoning\": \"The questi...
4BiologyPlants and snakes live _.?ab) Important habitats: This option is incorrec...**Answer: (a) Almost everywhere**\\n\\n**Explana...(a) Almost everywhere (b) Important habitats (...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...{\"reasoning\": \"Plants and snakes, as different...{\"final_answer\": \"g\", \"reasoning\": \"The correc...{\"reasoning\": \"Plants and snakes are both comm...{\"final_answer\": \"g\"}{\"final_answer\": \"a\"}{\"final_answer\": \"f\", \"reasoning\": \"Plants and...
...............................................................
1678BiologyNew resources required for creation can be red...ga) Mining: Mining involves extracting minerals...(a) Mining: Mining is the process of extractin...(a) Mining (b) Mutations (c) Fossil fuels (d) ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...{\"reasoning\": \"Recycling (g) is the option tha...{\"final_answer\": \"g\", \"reasoning\": \"Recycling ...{\"reasoning\": \"New resources required for crea...{\"final_answer\": \"g\"}{\"final_answer\": \"g\"}{\"final_answer\": \"g\", \"reasoning\": \"The correc...
1679BiologyA drought dehydrates an entire what?da) Body water: This option is incorrect becaus...The correct answer is (d) Environment. \\n\\nNow...(a) Body water (b) Dried fruit (c) Bodily wate...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...{\"reasoning\": \"A drought is a period of abnorm...{\"final_answer\": \"d\", \"reasoning\": \"A drought ...{\"reasoning\": \"Drought is a long-term lack of ...{\"final_answer\": \"d\"}{\"final_answer\": \"d\"}{\"final_answer\": \"d\", \"reasoning\": \"A drought ...
1680BiologyAn animal requires ingestion to do what?ea) Aerobic capacity: This option is not logica...(a) Aerobic capacity: This refers to an animal...(a) Aerobic capacity (b) Die (c) Water conserv...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...{\"reasoning\": \"Ingestion is the process of tak...{\"final_answer\": \"e\", \"reasoning\": \"Ingestion ...{\"reasoning\": \"Ingestion is the process by whi...{\"final_answer\": \"e\"}{\"final_answer\": \"c\"}{\"final_answer\": \"e\", \"reasoning\": \"Animals re...
1681BiologyUltraviolet light can cause what?ba) Ultraviolet light does not cause heat energ...Let's examine each option and determine why so...(a) Heat energy (b) Skin cancer (c) Killing in...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...{\"reasoning\": \"Ultraviolet light is known to h...{\"final_answer\": \"b\", \"reasoning\": \"Ultraviole...{\"reasoning\": \"Ultraviolet (UV) light is a typ...{\"final_answer\": \"b\"}{\"final_answer\": \"b\"}{\"final_answer\": \"b\", \"reasoning\": \"Ultraviole...
1682Physical activityWhat can increase a body's strength?ca) Four limbs: This option is not correct beca...(a) Four limbs: Having four limbs doesn't dire...(a) Four limbs (b) Disease (c) Running (d) Bic...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...[{'content': 'Answer the Question and include ...{\"reasoning\": \"Among the choices provided, run...{\"final_answer\": \"c\", \"reasoning\": \"Running is...{\"reasoning\": \"Running involves physical activ...{\"final_answer\": \"c\"}{\"final_answer\": \"f\"}{\"final_answer\": \"c\", \"reasoning\": \"a) Four li...
\n", "

1683 rows × 20 columns

\n", "
" ], "text/plain": [ " topic question_text \\\n", "0 Transportation What are busses used for? \n", "1 Climate change Which of the following does not contribute to ... \n", "2 Photography What uses electrical energy converted from che... \n", "3 Microbiology Bacteria causes what to be harmed? \n", "4 Biology Plants and snakes live _.? \n", "... ... ... \n", "1678 Biology New resources required for creation can be red... \n", "1679 Biology A drought dehydrates an entire what? \n", "1680 Biology An animal requires ingestion to do what? \n", "1681 Biology Ultraviolet light can cause what? \n", "1682 Physical activity What can increase a body's strength? \n", "\n", " answer_key gpt3_5_reasoning \\\n", "0 b a) Protective shelter: This option is incorrec... \n", "1 g a) Nucleus of a cell: This option is not relat... \n", "2 b a) Sunlight: Sunlight is a form of energy that... \n", "3 a Now, let's go through each option and explain ... \n", "4 a b) Important habitats: This option is incorrec... \n", "... ... ... \n", "1678 g a) Mining: Mining involves extracting minerals... \n", "1679 d a) Body water: This option is incorrect becaus... \n", "1680 e a) Aerobic capacity: This option is not logica... \n", "1681 b a) Ultraviolet light does not cause heat energ... \n", "1682 c a) Four limbs: This option is not correct beca... \n", "\n", " falcon_reasoning \\\n", "0 (a) Protective shelter - \\nErroneous. Busses a... \n", "1 (a) Nucleus of a cell: This option is incorrec... \n", "2 (a) Sunlight: Sunlight is a form of energy tha... \n", "3 1. **Plants (a) - Correct Answer:**\\n - Bact... \n", "4 **Answer: (a) Almost everywhere**\\n\\n**Explana... \n", "... ... \n", "1678 (a) Mining: Mining is the process of extractin... \n", "1679 The correct answer is (d) Environment. \\n\\nNow... \n", "1680 (a) Aerobic capacity: This refers to an animal... \n", "1681 Let's examine each option and determine why so... \n", "1682 (a) Four limbs: Having four limbs doesn't dire... 
\n", "\n", " answer_choices \\\n", "0 (a) Protective shelter (b) Transporting humans... \n", "1 (a) Nucleus of a cell (b) Flying in a plane (c... \n", "2 (a) Sunlight (b) Cameras (c) Cells (d) Buses (... \n", "3 (a) Plants (b) Electronics (c) Fossils (d) Hum... \n", "4 (a) Almost everywhere (b) Important habitats (... \n", "... ... \n", "1678 (a) Mining (b) Mutations (c) Fossil fuels (d) ... \n", "1679 (a) Body water (b) Dried fruit (c) Bodily wate... \n", "1680 (a) Aerobic capacity (b) Die (c) Water conserv... \n", "1681 (a) Heat energy (b) Skin cancer (c) Killing in... \n", "1682 (a) Four limbs (b) Disease (c) Running (d) Bic... \n", "\n", " user_prompt_RFA \\\n", "0 [{'content': 'Answer the Question and include ... \n", "1 [{'content': 'Answer the Question and include ... \n", "2 [{'content': 'Answer the Question and include ... \n", "3 [{'content': 'Answer the Question and include ... \n", "4 [{'content': 'Answer the Question and include ... \n", "... ... \n", "1678 [{'content': 'Answer the Question and include ... \n", "1679 [{'content': 'Answer the Question and include ... \n", "1680 [{'content': 'Answer the Question and include ... \n", "1681 [{'content': 'Answer the Question and include ... \n", "1682 [{'content': 'Answer the Question and include ... \n", "\n", " conversation_RFA_gpt3_5 \\\n", "0 [{'content': 'Answer the Question and include ... \n", "1 [{'content': 'Answer the Question and include ... \n", "2 [{'content': 'Answer the Question and include ... \n", "3 [{'content': 'Answer the Question and include ... \n", "4 [{'content': 'Answer the Question and include ... \n", "... ... \n", "1678 [{'content': 'Answer the Question and include ... \n", "1679 [{'content': 'Answer the Question and include ... \n", "1680 [{'content': 'Answer the Question and include ... \n", "1681 [{'content': 'Answer the Question and include ... \n", "1682 [{'content': 'Answer the Question and include ... 
\n", "\n", " conversation_RFA_falcon \\\n", "0 [{'content': 'Answer the Question and include ... \n", "1 [{'content': 'Answer the Question and include ... \n", "2 [{'content': 'Answer the Question and include ... \n", "3 [{'content': 'Answer the Question and include ... \n", "4 [{'content': 'Answer the Question and include ... \n", "... ... \n", "1678 [{'content': 'Answer the Question and include ... \n", "1679 [{'content': 'Answer the Question and include ... \n", "1680 [{'content': 'Answer the Question and include ... \n", "1681 [{'content': 'Answer the Question and include ... \n", "1682 [{'content': 'Answer the Question and include ... \n", "\n", " user_prompt_FAR \\\n", "0 [{'content': 'Answer the Question and include ... \n", "1 [{'content': 'Answer the Question and include ... \n", "2 [{'content': 'Answer the Question and include ... \n", "3 [{'content': 'Answer the Question and include ... \n", "4 [{'content': 'Answer the Question and include ... \n", "... ... \n", "1678 [{'content': 'Answer the Question and include ... \n", "1679 [{'content': 'Answer the Question and include ... \n", "1680 [{'content': 'Answer the Question and include ... \n", "1681 [{'content': 'Answer the Question and include ... \n", "1682 [{'content': 'Answer the Question and include ... \n", "\n", " conversation_FAR_gpt3_5 \\\n", "0 [{'content': 'Answer the Question and include ... \n", "1 [{'content': 'Answer the Question and include ... \n", "2 [{'content': 'Answer the Question and include ... \n", "3 [{'content': 'Answer the Question and include ... \n", "4 [{'content': 'Answer the Question and include ... \n", "... ... \n", "1678 [{'content': 'Answer the Question and include ... \n", "1679 [{'content': 'Answer the Question and include ... \n", "1680 [{'content': 'Answer the Question and include ... \n", "1681 [{'content': 'Answer the Question and include ... \n", "1682 [{'content': 'Answer the Question and include ... 
\n", "\n", " conversation_FAR_falcon \\\n", "0 [{'content': 'Answer the Question and include ... \n", "1 [{'content': 'Answer the Question and include ... \n", "2 [{'content': 'Answer the Question and include ... \n", "3 [{'content': 'Answer the Question and include ... \n", "4 [{'content': 'Answer the Question and include ... \n", "... ... \n", "1678 [{'content': 'Answer the Question and include ... \n", "1679 [{'content': 'Answer the Question and include ... \n", "1680 [{'content': 'Answer the Question and include ... \n", "1681 [{'content': 'Answer the Question and include ... \n", "1682 [{'content': 'Answer the Question and include ... \n", "\n", " user_prompt_FA \\\n", "0 [{'content': 'Answer the Question and include ... \n", "1 [{'content': 'Answer the Question and include ... \n", "2 [{'content': 'Answer the Question and include ... \n", "3 [{'content': 'Answer the Question and include ... \n", "4 [{'content': 'Answer the Question and include ... \n", "... ... \n", "1678 [{'content': 'Answer the Question and include ... \n", "1679 [{'content': 'Answer the Question and include ... \n", "1680 [{'content': 'Answer the Question and include ... \n", "1681 [{'content': 'Answer the Question and include ... \n", "1682 [{'content': 'Answer the Question and include ... \n", "\n", " conversation_FA \\\n", "0 [{'content': 'Answer the Question and include ... \n", "1 [{'content': 'Answer the Question and include ... \n", "2 [{'content': 'Answer the Question and include ... \n", "3 [{'content': 'Answer the Question and include ... \n", "4 [{'content': 'Answer the Question and include ... \n", "... ... \n", "1678 [{'content': 'Answer the Question and include ... \n", "1679 [{'content': 'Answer the Question and include ... \n", "1680 [{'content': 'Answer the Question and include ... \n", "1681 [{'content': 'Answer the Question and include ... \n", "1682 [{'content': 'Answer the Question and include ... 
\n", "\n", " responses_RFA_falcon \\\n", "0 {\"reasoning\": \"Busses are primarily used for t... \n", "1 {\"reasoning\": \"Nucleus of a cell (a) does not ... \n", "2 {\"reasoning\": \"Cells convert chemical energy s... \n", "3 {\"reasoning\": \"Bacteria can cause harm to vari... \n", "4 {\"reasoning\": \"Plants and snakes, as different... \n", "... ... \n", "1678 {\"reasoning\": \"Recycling (g) is the option tha... \n", "1679 {\"reasoning\": \"A drought is a period of abnorm... \n", "1680 {\"reasoning\": \"Ingestion is the process of tak... \n", "1681 {\"reasoning\": \"Ultraviolet light is known to h... \n", "1682 {\"reasoning\": \"Among the choices provided, run... \n", "\n", " responses_FAR_falcon \\\n", "0 {\"final_answer\": \"b\", \"reasoning\": \"Busses are... \n", "1 {\"final_answer\": \"a\", \"reasoning\": \"The nucleu... \n", "2 {\"final_answer\": \"c\", \"reasoning\": \"Cells, spe... \n", "3 {\"final_answer\": \"d\", \"reasoning\": \"Bacteria c... \n", "4 {\"final_answer\": \"g\", \"reasoning\": \"The correc... \n", "... ... \n", "1678 {\"final_answer\": \"g\", \"reasoning\": \"Recycling ... \n", "1679 {\"final_answer\": \"d\", \"reasoning\": \"A drought ... \n", "1680 {\"final_answer\": \"e\", \"reasoning\": \"Ingestion ... \n", "1681 {\"final_answer\": \"b\", \"reasoning\": \"Ultraviole... \n", "1682 {\"final_answer\": \"c\", \"reasoning\": \"Running is... \n", "\n", " responses_RFA_gpt3_5 \\\n", "0 {\"reasoning\": \"Busses are vehicles used primar... \n", "1 {\"reasoning\": \"The question asks which of the ... \n", "2 {\"reasoning\": \"Cells use electrical energy for... \n", "3 {\"reasoning\": \"Bacteria can harm various livin... \n", "4 {\"reasoning\": \"Plants and snakes are both comm... \n", "... ... \n", "1678 {\"reasoning\": \"New resources required for crea... \n", "1679 {\"reasoning\": \"Drought is a long-term lack of ... \n", "1680 {\"reasoning\": \"Ingestion is the process by whi... 
\n", "1681 {\"reasoning\": \"Ultraviolet (UV) light is a typ... \n", "1682 {\"reasoning\": \"Running involves physical activ... \n", "\n", " responses_base responses_FA \\\n", "0 {\"final_answer\": \"b\"} {\"final_answer\": \"b\"} \n", "1 {\"final_answer\": \"g\"} {\"final_answer\": \"g\"} \n", "2 {\"final_answer\": \"c\"} {\"final_answer\": \"f\"} \n", "3 {\"final_answer\": \"d\"} {\"final_answer\": \"d\"} \n", "4 {\"final_answer\": \"g\"} {\"final_answer\": \"a\"} \n", "... ... ... \n", "1678 {\"final_answer\": \"g\"} {\"final_answer\": \"g\"} \n", "1679 {\"final_answer\": \"d\"} {\"final_answer\": \"d\"} \n", "1680 {\"final_answer\": \"e\"} {\"final_answer\": \"c\"} \n", "1681 {\"final_answer\": \"b\"} {\"final_answer\": \"b\"} \n", "1682 {\"final_answer\": \"c\"} {\"final_answer\": \"f\"} \n", "\n", " responses_FAR_gpt3_5 \n", "0 {\"final_answer\": \"b\", \"reasoning\": \"Busses are... \n", "1 {\"final_answer\": \"d\", \"reasoning\": \"The questi... \n", "2 {\"final_answer\": \"f\", \"reasoning\": \"Cars use e... \n", "3 {\"final_answer\": \"d\", \"reasoning\": \"The questi... \n", "4 {\"final_answer\": \"f\", \"reasoning\": \"Plants and... \n", "... ... \n", "1678 {\"final_answer\": \"g\", \"reasoning\": \"The correc... \n", "1679 {\"final_answer\": \"d\", \"reasoning\": \"A drought ... \n", "1680 {\"final_answer\": \"e\", \"reasoning\": \"Animals re... \n", "1681 {\"final_answer\": \"b\", \"reasoning\": \"Ultraviole... \n", "1682 {\"final_answer\": \"c\", \"reasoning\": \"a) Four li... 
\n", "\n", "[1683 rows x 20 columns]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 21, "id": "8619f9f5-9fe4-433e-b524-51c2b12e8d12", "metadata": {}, "outputs": [], "source": [ "def extract_final_answer(response):\n", "    # Parse the JSON response; fall back to the sentinel 'x' when it cannot be parsed\n", "    try:\n", "        answer = json.loads(response).get(\"final_answer\")\n", "    except (json.JSONDecodeError, TypeError, AttributeError):\n", "        answer = 'x'\n", "    return answer\n", "\n", "# Create new columns for predictions\n", "df['predictions_base'] = df['responses_base'].apply(extract_final_answer)\n", "df['predictions_FA'] = df['responses_FA'].apply(extract_final_answer)\n", "df['predictions_RFA_falcon'] = df['responses_RFA_falcon'].apply(extract_final_answer)\n", "df['predictions_FAR_falcon'] = df['responses_FAR_falcon'].apply(extract_final_answer)\n", "df['predictions_RFA_gpt3_5'] = df['responses_RFA_gpt3_5'].apply(extract_final_answer)\n", "df['predictions_FAR_gpt3_5'] = df['responses_FAR_gpt3_5'].apply(extract_final_answer)\n" ] }, { "cell_type": "code", "execution_count": 22, "id": "271b402f-6696-4a9c-94ba-a6071c0c2252", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'predictions_base': 0.0,\n", " 'predictions_FA': 0.0,\n", " 'predictions_RFA_falcon': 0.0,\n", " 'predictions_FAR_falcon': 0.0,\n", " 'predictions_RFA_gpt3_5': 0.0,\n", " 'predictions_FAR_gpt3_5': 0.0}" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Share of responses per experiment that could not be parsed (sentinel 'x')\n", "prediction_cols = ['predictions_base',\n", "                   'predictions_FA',\n", "                   'predictions_RFA_falcon',\n", "                   'predictions_FAR_falcon',\n", "                   'predictions_RFA_gpt3_5',\n", "                   'predictions_FAR_gpt3_5']\n", "percentages = {\n", "    col: (df[col] == 'x').mean() * 100\n", "    for col in prediction_cols\n", "}\n", "percentages" ] }, { "cell_type": "code", "execution_count": 25, "id": "938cf2a3-2fed-42a3-82ec-a56cb0ea9f37", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Base: \t\t\t\t\t\t55.02%\n", "Final Answer: 
\t\t\t\t\t53.71%\n", "Reasoning and then the Final Answer (Falcon): \t56.98%\n", "Final Answer and then the Reasoning (Falcon): \t54.37%\n", "Reasoning and then the Final Answer (GPT-3.5): \t57.52%\n", "Final Answer and then the Reasoning (GPT-3.5): \t56.21%\n" ] } ], "source": [ "from sklearn.metrics import accuracy_score\n", "\n", "print(f\"Base: \\t\\t\\t\\t\\t\\t{round(accuracy_score(y_true=df['answer_key'], y_pred=df['predictions_base']) * 100, 2)}%\")\n", "print(f\"Final Answer: \\t\\t\\t\\t\\t{round(accuracy_score(y_true=df['answer_key'], y_pred=df['predictions_FA']) * 100, 2)}%\")\n", "print(f\"Reasoning and then the Final Answer (Falcon): \\t{round(accuracy_score(y_true=df['answer_key'], y_pred=df['predictions_RFA_falcon']) * 100, 2)}%\")\n", "print(f\"Final Answer and then the Reasoning (Falcon): \\t{round(accuracy_score(y_true=df['answer_key'], y_pred=df['predictions_FAR_falcon']) * 100, 2)}%\")\n", "print(f\"Reasoning and then the Final Answer (GPT-3.5): \\t{round(accuracy_score(y_true=df['answer_key'], y_pred=df['predictions_RFA_gpt3_5']) * 100, 2)}%\")\n", "print(f\"Final Answer and then the Reasoning (GPT-3.5): \\t{round(accuracy_score(y_true=df['answer_key'], y_pred=df['predictions_FAR_gpt3_5']) * 100, 2)}%\")" ] }, { "cell_type": "code", "execution_count": 26, "id": "83aae472-513b-43c3-9ee8-64d4cda775e0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " topic question_text \\\n", "0 Transportation What are busses used for? \n", "1 Climate change Which of the following does not contribute to ... \n", "2 Photography What uses electrical energy converted from che... \n", "3 Microbiology Bacteria causes what to be harmed? \n", "4 Biology Plants and snakes live _.? \n", "... ... ... \n", "1678 Biology New resources required for creation can be red... \n", "1679 Biology A drought dehydrates an entire what? \n", "1680 Biology An animal requires ingestion to do what? \n", "1681 Biology Ultraviolet light can cause what? \n", "1682 Physical activity What can increase a body's strength? \n", "\n", " answer_key gpt3_5_reasoning \\\n", "0 b a) Protective shelter: This option is incorrec... \n", "1 g a) Nucleus of a cell: This option is not relat... \n", "2 b a) Sunlight: Sunlight is a form of energy that... \n", "3 a Now, let's go through each option and explain ... \n", "4 a b) Important habitats: This option is incorrec... \n", "... ... ... \n", "1678 g a) Mining: Mining involves extracting minerals... \n", "1679 d a) Body water: This option is incorrect becaus... \n", "1680 e a) Aerobic capacity: This option is not logica... \n", "1681 b a) Ultraviolet light does not cause heat energ... \n", "1682 c a) Four limbs: This option is not correct beca... \n", "\n", " falcon_reasoning \\\n", "0 (a) Protective shelter - \\nErroneous. Busses a... \n", "1 (a) Nucleus of a cell: This option is incorrec... \n", "2 (a) Sunlight: Sunlight is a form of energy tha... \n", "3 1. **Plants (a) - Correct Answer:**\\n - Bact... \n", "4 **Answer: (a) Almost everywhere**\\n\\n**Explana... \n", "... ... \n", "1678 (a) Mining: Mining is the process of extractin... \n", "1679 The correct answer is (d) Environment. \\n\\nNow... \n", "1680 (a) Aerobic capacity: This refers to an animal... \n", "1681 Let's examine each option and determine why so... \n", "1682 (a) Four limbs: Having four limbs doesn't dire... 
\n", "\n", " answer_choices \\\n", "0 (a) Protective shelter (b) Transporting humans... \n", "1 (a) Nucleus of a cell (b) Flying in a plane (c... \n", "2 (a) Sunlight (b) Cameras (c) Cells (d) Buses (... \n", "3 (a) Plants (b) Electronics (c) Fossils (d) Hum... \n", "4 (a) Almost everywhere (b) Important habitats (... \n", "... ... \n", "1678 (a) Mining (b) Mutations (c) Fossil fuels (d) ... \n", "1679 (a) Body water (b) Dried fruit (c) Bodily wate... \n", "1680 (a) Aerobic capacity (b) Die (c) Water conserv... \n", "1681 (a) Heat energy (b) Skin cancer (c) Killing in... \n", "1682 (a) Four limbs (b) Disease (c) Running (d) Bic... \n", "\n", " user_prompt_RFA \\\n", "0 [{'content': 'Answer the Question and include ... \n", "1 [{'content': 'Answer the Question and include ... \n", "2 [{'content': 'Answer the Question and include ... \n", "3 [{'content': 'Answer the Question and include ... \n", "4 [{'content': 'Answer the Question and include ... \n", "... ... \n", "1678 [{'content': 'Answer the Question and include ... \n", "1679 [{'content': 'Answer the Question and include ... \n", "1680 [{'content': 'Answer the Question and include ... \n", "1681 [{'content': 'Answer the Question and include ... \n", "1682 [{'content': 'Answer the Question and include ... \n", "\n", " conversation_RFA_gpt3_5 \\\n", "0 [{'content': 'Answer the Question and include ... \n", "1 [{'content': 'Answer the Question and include ... \n", "2 [{'content': 'Answer the Question and include ... \n", "3 [{'content': 'Answer the Question and include ... \n", "4 [{'content': 'Answer the Question and include ... \n", "... ... \n", "1678 [{'content': 'Answer the Question and include ... \n", "1679 [{'content': 'Answer the Question and include ... \n", "1680 [{'content': 'Answer the Question and include ... \n", "1681 [{'content': 'Answer the Question and include ... \n", "1682 [{'content': 'Answer the Question and include ... 
\n", "\n", " conversation_RFA_falcon \\\n", "0 [{'content': 'Answer the Question and include ... \n", "1 [{'content': 'Answer the Question and include ... \n", "2 [{'content': 'Answer the Question and include ... \n", "3 [{'content': 'Answer the Question and include ... \n", "4 [{'content': 'Answer the Question and include ... \n", "... ... \n", "1678 [{'content': 'Answer the Question and include ... \n", "1679 [{'content': 'Answer the Question and include ... \n", "1680 [{'content': 'Answer the Question and include ... \n", "1681 [{'content': 'Answer the Question and include ... \n", "1682 [{'content': 'Answer the Question and include ... \n", "\n", " user_prompt_FAR ... \\\n", "0 [{'content': 'Answer the Question and include ... ... \n", "1 [{'content': 'Answer the Question and include ... ... \n", "2 [{'content': 'Answer the Question and include ... ... \n", "3 [{'content': 'Answer the Question and include ... ... \n", "4 [{'content': 'Answer the Question and include ... ... \n", "... ... ... \n", "1678 [{'content': 'Answer the Question and include ... ... \n", "1679 [{'content': 'Answer the Question and include ... ... \n", "1680 [{'content': 'Answer the Question and include ... ... \n", "1681 [{'content': 'Answer the Question and include ... ... \n", "1682 [{'content': 'Answer the Question and include ... ... \n", "\n", " responses_RFA_gpt3_5 \\\n", "0 {\"reasoning\": \"Busses are vehicles used primar... \n", "1 {\"reasoning\": \"The question asks which of the ... \n", "2 {\"reasoning\": \"Cells use electrical energy for... \n", "3 {\"reasoning\": \"Bacteria can harm various livin... \n", "4 {\"reasoning\": \"Plants and snakes are both comm... \n", "... ... \n", "1678 {\"reasoning\": \"New resources required for crea... \n", "1679 {\"reasoning\": \"Drought is a long-term lack of ... \n", "1680 {\"reasoning\": \"Ingestion is the process by whi... \n", "1681 {\"reasoning\": \"Ultraviolet (UV) light is a typ... 
\n", "1682 {\"reasoning\": \"Running involves physical activ... \n", "\n", " responses_base responses_FA \\\n", "0 {\"final_answer\": \"b\"} {\"final_answer\": \"b\"} \n", "1 {\"final_answer\": \"g\"} {\"final_answer\": \"g\"} \n", "2 {\"final_answer\": \"c\"} {\"final_answer\": \"f\"} \n", "3 {\"final_answer\": \"d\"} {\"final_answer\": \"d\"} \n", "4 {\"final_answer\": \"g\"} {\"final_answer\": \"a\"} \n", "... ... ... \n", "1678 {\"final_answer\": \"g\"} {\"final_answer\": \"g\"} \n", "1679 {\"final_answer\": \"d\"} {\"final_answer\": \"d\"} \n", "1680 {\"final_answer\": \"e\"} {\"final_answer\": \"c\"} \n", "1681 {\"final_answer\": \"b\"} {\"final_answer\": \"b\"} \n", "1682 {\"final_answer\": \"c\"} {\"final_answer\": \"f\"} \n", "\n", " responses_FAR_gpt3_5 predictions_base \\\n", "0 {\"final_answer\": \"b\", \"reasoning\": \"Busses are... b \n", "1 {\"final_answer\": \"d\", \"reasoning\": \"The questi... g \n", "2 {\"final_answer\": \"f\", \"reasoning\": \"Cars use e... c \n", "3 {\"final_answer\": \"d\", \"reasoning\": \"The questi... d \n", "4 {\"final_answer\": \"f\", \"reasoning\": \"Plants and... g \n", "... ... ... \n", "1678 {\"final_answer\": \"g\", \"reasoning\": \"The correc... g \n", "1679 {\"final_answer\": \"d\", \"reasoning\": \"A drought ... d \n", "1680 {\"final_answer\": \"e\", \"reasoning\": \"Animals re... e \n", "1681 {\"final_answer\": \"b\", \"reasoning\": \"Ultraviole... b \n", "1682 {\"final_answer\": \"c\", \"reasoning\": \"a) Four li... c \n", "\n", " predictions_FA predictions_RFA_falcon predictions_FAR_falcon \\\n", "0 b b b \n", "1 g a a \n", "2 f c c \n", "3 d d d \n", "4 a a g \n", "... ... ... ... \n", "1678 g g g \n", "1679 d d d \n", "1680 c e e \n", "1681 b g b \n", "1682 f c c \n", "\n", " predictions_RFA_gpt3_5 predictions_FAR_gpt3_5 \n", "0 b b \n", "1 g d \n", "2 c f \n", "3 d d \n", "4 a f \n", "... ... ... 
\n", "1678 g g \n", "1679 d d \n", "1680 e e \n", "1681 f b \n", "1682 c c \n", "\n", "[1683 rows x 26 columns]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 27, "id": "45c08dd4-0b98-4e0f-b487-549f60518a4e", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "4a4a20a26dc649a3a6cebdb7856fbd4b", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Uploading the dataset shards: 0%| | 0/1 [00:00