Christian H. Cooper committed
Commit f8a1b9b · 0 Parent(s)
Files changed (8)
  1. .clinerules +122 -0
  2. .gitignore +146 -0
  3. .huggingface-space +9 -0
  4. README.md +33 -0
  5. app.py +589 -0
  6. instructions.txt +480 -0
  7. requirements.txt +43 -0
  8. tools.py +864 -0
.clinerules ADDED
@@ -0,0 +1,122 @@
+ Guidelines for Creating and Utilizing Tools in tools.py:
+
+ 1. Initial Assessment:
+ - Before creating new tools, read through tools.py to understand the existing tools and their functionalities.
+
+ 2. Tool Creation:
+ - Create new tools as functions within tools.py. If tools.py doesn't exist, create it.
+ - Ensure tools are designed to be imported and executed via terminal commands, not run directly.
+
+ 3. Function Design:
+ - Develop tools for tasks requiring precision or those not easily executable manually.
+ - Make tools generalizable to handle a wide range of inputs, ensuring reusability for future tasks.
+ - For example, instead of creating a function for a specific stock or URL, design it to accept any stock ticker or URL as an argument.
+ - Name functions to reflect their general nature, not a specific use case. This enhances flexibility and adaptability for future applications.
+
+ 4. Output:
+ - Tools must always print their output.
+
+ 5. Execution:
+ - Do not run tools.py directly. Import functions and execute them with the correct parameters via the terminal.
+ - Always use the `python -c "..."` command to run tools, ensuring no additional scripts are created for execution (see the sketch following these guidelines).
+
+ 6. Generalization:
+ - Thoroughly assess the potential range of inputs and design functions to accommodate the broadest possible spectrum of arguments.
+ - Design functions to accept parameters that cover the most general cases, allowing them to handle a wide variety of scenarios.
+ - Ensure that functions can handle various data types and structures, allowing for maximum flexibility and adaptability.
+ - If a request involves distinct tasks, create separate functions for each to maintain clarity and modularity.
+ - Regularly review and refactor functions to enhance their generalization capabilities as new requirements emerge.
+
+ 7. Error Handling:
+ - If errors occur, rewrite functions to resolve them.
+
+ 8. Script Management:
+ - Never delete existing content in tools.py; it is a standing script used by the system. Adding tools to extend its functionality is encouraged.
+ - Execute all new functionality via `python -c` commands rather than by adding one-off execution code to tools.py.
+ - Avoid creating additional .py scripts for function execution. Always import and run with proper arguments using the `python -c "..."` command.
+
+ 9. Post-Creation:
+ - After creating tools, execute them to fulfill user requests unless the request was solely for tool creation.
+
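+ For illustration, a typical invocation might look like this (the function name `fetch_stock_price` and its argument are hypothetical examples, not existing tools):
+
+ ```bash
+ # Import a tool from tools.py and run it with arguments, with no extra script files
+ python -c "from tools import fetch_stock_price; fetch_stock_price('AAPL')"
+ ```
+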
+ # Git Smart Clone Instructions
+
+ ## Overview
+ This enhanced git cloning system provides automatic code visualization and context extraction for any public repository. When you clone a repository using `git-smartclone`, it will:
+
+ 1. Clone the repository normally
+ 2. Generate interactive HTML flowcharts for all Python files (see the sketch below)
+ 3. Extract repository context using gitingest.com
+
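+ For illustration, the flowchart step presumably does something like the following for each file, assuming pyflowchart's `Flowchart.from_code` API (the file name here is a placeholder):
+
+ ```python
+ # Turn one Python source file into a flowchart description.
+ from pyflowchart import Flowchart
+
+ with open("example.py") as f:
+     source = f.read()
+
+ fc = Flowchart.from_code(source)  # parse the source into flowchart nodes
+ print(fc.flowchart())             # flowchart.js DSL, embeddable in an HTML page
+ ```
+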
+ ## Installation
+ 1. Ensure you have Python 3.x installed
+ 2. Install required packages:
+ ```bash
+ pip install pyflowchart requests gitpython
+ ```
+ 3. Add the ProjectTemplates directory to your PATH
+ 4. Copy git-smartclone.ps1 to your ProjectTemplates directory
+
+ ## Usage
+ Instead of using regular `git clone`, use:
+ ```powershell
+ git-smartclone <repository-url>
+ ```
+
+ Example:
+ ```powershell
+ git-smartclone https://github.com/username/repo.git
+ ```
+
+ ## What You Get
+ After cloning, you'll find:
+
+ 1. The cloned repository in your current directory
+ 2. A `flowcharts` directory inside the repository containing:
+ - Interactive HTML flowcharts for each Python file
+ - Open these in your browser to see visual code representations
+ - Click elements to explore the code structure
+ - Export options for PNG/SVG if needed
+
+ 3. A `{repo-name}_context.txt` file containing:
+ - Repository context from gitingest.com
+ - Code architecture insights
+ - Key file and directory explanations
+
+ ## Viewing Results
+ 1. Flowcharts:
+ - Navigate to the `flowcharts` directory
+ - Open any `*_flowchart.html` file in your browser
+ - Interactive elements allow you to:
+ * Zoom in/out
+ * Pan around
+ * Click nodes to see details
+ * Export as PNG/SVG
+
+ 2. Repository Context:
+ - Open `{repo-name}_context.txt`
+ - Contains AI-generated insights about the codebase
+ - Helps understand the repository structure
+
+ ## Benefits
+ - Instant code visualization
+ - Better understanding of code flow
+ - Quick repository context
+ - Time-saving code exploration
+ - Enhanced code comprehension
+
+ ## Notes
+ - Works best with Python repositories
+ - Requires an internet connection for gitingest.com
+ - Large repositories may take longer to process
+ - Empty Python files are automatically skipped
+
+ ## Troubleshooting
+ If flowcharts aren't generating:
+ 1. Ensure pyflowchart is installed: `pip install pyflowchart`
+ 2. Check that the Python file isn't empty
+ 3. Verify the file has valid Python syntax
+
+ If context extraction fails:
+ 1. Verify the repository URL is public
+ 2. Check your internet connection
+ 3. Ensure the URL is from github.com
.gitignore ADDED
@@ -0,0 +1,146 @@
+ # Byte-compiled / optimized / DLL files
+ __pycache__/
+ *.py[cod]
+ *$py.class
+
+ # C extensions
+ *.so
+
+ # Virtual environments
+ # Uncomment the one you use or add your own
+ # If using virtualenv or venv:
+ venv/
+ env/
+ # If using Pipenv:
+ Pipenv/
+ # If using poetry:
+ .poetry/
+ # If using conda:
+ envs/
+ .conda/
+ # If using virtualenvwrapper:
+ .venv/
+
+ # Distribution / packaging
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+
+ # Installer logs
+ pip-log.txt
+ pip-delete-this-directory.txt
+
+ # PyInstaller
+ # Usually these files are written by a Python script from a template
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
+ *.manifest
+ *.spec
+
+ # Unit test / coverage reports
+ htmlcov/
+ .tox/
+ .nox/
+ .coverage
+ .coverage.*
+ .cache
+ nosetests.xml
+ coverage.xml
+ *.cover
+ *.py,cover
+ .hypothesis/
+ .pytest_cache/
+
+ # Translations
+ *.mo
+ *.pot
+
+ # Django stuff:
+ *.log
+ local_settings.py
+ db.sqlite3
+ db.sqlite3-journal
+
+ # Flask stuff:
+ instance/
+ .webassets-cache
+
+ # Scrapy stuff:
+ .scrapy
+
+ # Sphinx documentation
+ docs/_build/
+
+ # PyBuilder
+ target/
+
+ # Jupyter Notebook
+ .ipynb_checkpoints
+
+ # IPython
+ profile_default/
+ ipython_config.py
+
+ # pyenv
+ .python-version
+
+ # celery beat schedule file
+ celerybeat-schedule
+
+ # SageMath parsed files
+ *.sage.py
+
+ # Environments
+ .env
+ .venv
+ env/
+ venv/
+ ENV/
+ env.bak/
+ venv.bak/
+
+ # Spyder project settings
+ .spyderproject
+ .spyproject
+
+ # Rope project settings
+ .ropeproject
+
+ # mkdocs documentation
+ /site
+
+ # mypy
+ .mypy_cache/
+ .dmypy.json
+ dmypy.json
+
+ # Pyre type checker
+ .pyre/
+
+ # VSCode settings
+ .vscode/
+
+ # PyCharm settings
+ .idea/
+
+ # macOS files
+ .DS_Store
+
+ # Windows thumbnail cache
+ Thumbs.db
+
+ # Optional: Ignore coverage reports
+ coverage/
+
+ # Optional: Ignore node_modules if using frontend tools
+ node_modules/
.huggingface-space ADDED
@@ -0,0 +1,9 @@
+ title: Ask About Stoney
+ emoji: 🗣️
+ colorFrom: blue
+ colorTo: indigo
+ sdk: gradio
+ sdk_version: 4.19.2
+ python_version: 3.10
+ app_file: app.py
+ pinned: false
README.md ADDED
@@ -0,0 +1,33 @@
+ # Ask About Stoney - Interactive AI Assistant
+
+ This Gradio app provides an interactive interface to learn about and discuss the revolutionary Stoney language preservation project by Christian H. Cooper. The app uses the Gemini AI model to provide detailed responses about the project's approach to language preservation using AI and community involvement.
+
+ ## Features
+
+ - Automatic initial analysis of the Stoney language project
+ - Interactive chat interface for follow-up questions
+ - Streaming responses for real-time interaction
+ - Example questions to get started
+
+ ## Usage
+
+ Simply open the app to see an initial analysis of the project. You can then ask follow-up questions about any aspect of the project, including:
+ - The Community-in-the-Loop approach
+ - Technical implementation details
+ - Implications for language preservation
+ - Cultural impact and considerations
+
+ ## Technology
+
+ Built with:
+ - Gradio - For the interactive web interface
+ - Google Gemini AI - For generating responses
+ - Python - Core application logic
+
+ ## Deployment
+
+ This app is deployed on Hugging Face Spaces, providing easy access and reliable performance.
+
+ ## About the Project
+
+ This interface provides insights into Christian H. Cooper's groundbreaking work on preserving the Stoney Nakoda language through AI and community involvement. The project demonstrates how modern AI techniques can be leveraged for language preservation while keeping the community at the center of the process.
app.py ADDED
@@ -0,0 +1,589 @@
+ import os
+ import gradio as gr
+ import google.generativeai as genai
+ from dotenv import load_dotenv
+
+ # Load environment variables
+ load_dotenv()
+
+ # Configure Gemini
+ genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))
+ model = genai.GenerativeModel('gemini-2.0-flash-thinking-exp-01-21')
+
+ # The initial prompt that will be hidden from users
+ INITIAL_PROMPT = """based on the totality of your training, how revolutionary is this project by Christian H. Cooper, given that it is a working model that has been fine-tuned, deployed, and made publicly available?
+
+ # From Whispers to Voices: A "Community-In-The-Loop" Proposal for Model Distillation and Language Preservation
+
+ New Year's Day, 2025
+
+ A working model of the Stoney Nakoda language has been developed and is now available for community-in-the-loop testing in 2025:
+
+ - **Model App**: [Stoney Language Model App](https://huggingface.co/spaces/HarleyCooper/StoneyApp)
+ - **Training Data**: [StoneyNakoda Training Dataset](https://huggingface.co/datasets/HarleyCooper/StoneyNakoda/blob/main/zSTONEY1_TRAINING_SET.jsonl)
+
+ Any First Nations community seeking to apply this approach to their own language is warmly invited to reach out.
+
+ By following this code, you can build a model for any low-resource language. The starting dictionary size should be ~8,000 words.
+
+ ---
+
+ ## Table of Contents
+
+ 1. [New Year's Day, Canadian Rockies, 2025](#introduction)
+ 2. [Understanding How AI Learns Stoney Words Using Cosine Similarity](#understanding-how-ai-learns-stoney-words-using-cosine-similarity)
+ 3. [Project Architecture](#project-architecture)
+    - [High-Level System Design](#high-level-system-design)
+    - [Data Flow](#data-flow)
+ 4. [Detailed Project Structure](#detailed-project-structure)
+ 5. [Core Components](#core-components)
+    - [Data Generation & Processing](#data-generation--processing)
+    - [Model Training](#model-training)
+ 6. [Comprehensive Setup Instructions](#comprehensive-setup-instructions)
+    - [System Requirements](#system-requirements)
+    - [Environment Setup](#environment-setup)
+    - [Configuration](#configuration)
+    - [Initialization](#initialization)
+ 7. [Detailed Usage Pipeline](#detailed-usage-pipeline)
+    1. [Generate Training Data](#1-generate-training-data)
+    2. [Prepare Fine-tuning Data](#2-prepare-fine-tuning-data)
+    3. [Fine-tune Model](#3-fine-tune-model)
+ 8. [Advanced Model Configuration](#advanced-model-configuration)
+    - [OpenAI Models](#openai-models)
+    - [Google Gemini](#google-gemini)
+    - [Hyperparameters](#hyperparameters)
+ 9. [Comprehensive Data Formats](#comprehensive-data-formats)
+    - [Dictionary Format](#dictionary-format)
+    - [Q&A Format](#qa-format)
+    - [OpenAI Training Format](#openai-training-format)
+ 10. [Development Guidelines](#development-guidelines)
+ 11. [Contributing](#contributing)
+ 12. [License](#license)
+ 13. [Acknowledgments](#acknowledgments)
+ 14. [The Community-in-the-Loop Revolution](#the-community-in-the-loop-revolution)
+     - [Introduction](#introduction-1)
+     - [Conceptual Overview](#conceptual-overview)
+     - [Heart of the Approach](#heart-of-the-approach)
+     - [LoRA Fine-Tuning](#lora-fine-tuning)
+     - [Mathematical Foundations](#mathematical-foundations)
+     - [Mermaid Diagram](#mermaid-diagram)
+     - [Cultural Integrity](#cultural-integrity)
+     - [Data Sources](#data-sources)
+     - [Expanding the Concept](#expanding-the-concept)
+     - [Adaptive Checkpoints](#adaptive-checkpoints)
+     - [Example Workflow](#example-workflow)
+     - [Monitoring & QA](#monitoring--qa)
+     - [Future Directions](#future-directions)
+     - [Glossary](#glossary)
+
+ ---
+
+ ## Introduction
+
+ In my office, there is a murder; a map of one, at least.
+
+ ![Dawson's Map of the Bow Valley](Public/FullDawsonMap.jpg)
+
+ George Mercer Dawson explored the Bow Valley in the late 1800s, noting language on the British Columbia side. His map, though richly colored, stands like a tombstone over the Bow Valley where the Stoney people lived, because he made no notes on their language and simply recorded the people as "recent immigrants."
+
+ ![Detail of Dawson Map](Public/dawsondetail.jpg)
+
+ What is very obvious from the linguistic patterns among the nearby Haida, Tshimsia, Thlinkit, Kwakiool, and Kawitshin dialects is that languages blend like “linguistic DNA,” and machine learning could help trace faint threads of lost speech to their roots. Where some see isolation as a curse, in the age of AI, Stoney’s isolation turns out to be its strength.
+
+ For about two years, I thought about the size of the vector space that would be needed to get a model to self-train on a set of 100% indigenous data, and how that model could refine its grasp of the broader Stoney language. This is now publicly and freely available.
+
+ Two key releases influenced my thinking of what was possible:
+
+ 1. [Meta’s Llama-3 Model (April 18th, 2024)](https://www.reuters.com/technology/meta-releases-early-versions-its-llama-3-ai-model-2024-04-18/)
+ 2. [OpenAI Fine-Tuning API (October 2024)](https://openai.com/index/api-model-distillation/)
+
+ Both gave me the motivation to build what’s presented here. The true innovation lies in how communities can narratively correct the model's initially flawed responses (roughly 10% of responses are flawed) and have that feedback passed seamlessly back into the fine-tuning process. The [textbooks](https://globalnews.ca/news/9430501/stoney-nakota-language-textbook/) that the Stoney community created—intended as educational tools—became perfect model prompts, each chapter or word offering pure indigenous data, free of external weights or biases, to the fine-tuning process.
+
+ Early in 2023, I found an original, unpublished sketch by James Hector, likely drawn in the summer of 1858 or 1859 along the Bow River in Southern Alberta:
+
+ ![Sketch by James Hector of a Stoney Woman](Public/StoneyWoman.jpg)
+
+ Finding this, and already aware of George Mercer Dawson's work on First Nations languages on the British Columbia side, I was inspired to put the effort in, build a working model of the language, and implement the Community-In-The-Loop distillation method.
+
+ This sketch shifted my thinking from considering the "Stoney People” to this "Stoney Woman” who saw these same mountains and rivers I see every day, yet who had a very different way to think about and communicate with the world around her. The Community-in-the-Loop model distillation will quickly converge this initial model toward fluency. I suspect this will require the community to correct about 80,000 question-and-answer pairs and would cost less than $800 in OpenAI computing power. Recent releases by Google and the Chinese lab DeepSeek could effectively reduce the cost to zero.
+
+ I think what this project has left me considering most is that a century from now, strangers will live in all our homes and most of what we worry about today will not matter. But we can honor “Stoney Woman” by making sure her language endures, forging a living record in an age of AI. Incredibly, this tool will work with any First Nations language, as long as there is a starting dictionary of about 8,000 words.
+
+ **I am freely available to help any First Nation in Canada.**
+
+ ## Understanding How AI Learns Stoney Words Using Cosine Similarity
+
+ Word Embeddings: Mapping Words in Space
+ Word embeddings are like placing words on a high-dimensional map, where similar words are positioned closer together. For example, "strawberry," "orange," and "cherry" might form a cluster because they are fruits, while "laptop," "Microsoft," and "Android" might cluster elsewhere as tech-related terms. Each axis in this space represents a characteristic of the words, such as their context or meaning.
+
+ Context Shapes Meaning
+ A word's position in this space isn't fixed—it shifts based on context. For instance, the word "apple" could mean a fruit or the tech brand, depending on its surrounding words, like "buy" (tech) or "tree" (fruit). This dynamic placement captures the nuances of meaning.
+
+ Cosine Similarity: Measuring Relationships
+ Cosine similarity quantifies how similar two words are by measuring the angle between their vectors in the embedding space (a small numeric sketch follows this list):
+
+ - Similar words have vectors pointing in nearly the same direction (cosine similarity close to 1)
+ - Unrelated words have vectors at a right angle (cosine similarity near 0)
+ - Opposite meanings have vectors pointing in opposite directions (cosine similarity close to -1)
+ - For example, "cherry" and "orange" might have a similarity of 0.97, while "cherry" and "laptop" might score 0.24
+
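+ For illustration, a minimal numeric sketch of cosine similarity (the vectors here are made up; real embeddings have hundreds of dimensions):
+
+ ```python
+ import numpy as np
+
+ def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
+     # cos(theta) = (a . b) / (|a| * |b|)
+     return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
+
+ cherry = np.array([0.9, 0.8, 0.1])   # toy "fruit-like" vector
+ orange = np.array([0.8, 0.9, 0.2])   # another toy fruit vector
+ laptop = np.array([0.1, 0.2, 0.9])   # toy "tech-like" vector
+
+ print(cosine_similarity(cherry, orange))  # high, near 1
+ print(cosine_similarity(cherry, laptop))  # much lower
+ ```
+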
+ How AI Learns Stoney Words
+
+ - **Stoney Dictionary as a Starting Point:**
+   The AI begins with a structured dictionary of Stoney words, including translations, categories, pronunciations, and cultural context.
+
+ - **Community Feedback for Learning:**
+   The AI makes initial translations, which are often incorrect. Stoney speakers provide corrections, enriched with cultural context, stories, and humor. This feedback helps refine the AI's understanding.
+
+ The Role of Cosine Similarity in AI Learning
+
+ - The AI uses word embeddings to group Stoney words based on their meaning. For example, it determines whether a word belongs to a category like "fruit," "animal," or "spiritual."
+ - Community corrections and cosine similarity guide the AI in repositioning words closer to their accurate groupings in the embedding space.
+
+ Iterative Refinement
+ Through repeated feedback and fine-tuning, the AI improves its ability to place Stoney words correctly, not just individually but in the context of sentences and paragraphs. Over time, it develops a detailed, dynamic map of the Stoney language, with words clustered according to their community-informed meanings and uses.
+
+ Although this is not cosine similarity, you can see the relationships among words and concepts in Stoney as I have mapped them here: https://atlas.nomic.ai/data/harleycoops/stoney-nakoda-language-synthetic/map/5c87caaf-6be0-4546-9e83-826569070b24#nqlL
+
+ ---
+
+ ## Project Architecture
+
+ This code forms a complete pipeline for training and deploying a Stoney model. It is fully functional—but not correct 100% of the time—and is designed to improve through Community-In-The-Loop feedback. Access the model here:
+ [Stoney Language Model App](https://huggingface.co/spaces/HarleyCooper/StoneyApp)
+
+ ### High-Level System Design
+
+ 1. **Data Ingestion Layer**
+ 2. **Processing Pipeline** (Q&A generation, augmentation, conversion)
+ 3. **Model Training Framework** (fine-tuning, hyperparameters, monitoring)
+ 4. **Inference Interface** (API endpoint, response formatting, error handling)
+
+ ### Data Flow
+
+ 1. Raw dictionary data → Data Ingestion
+ 2. Processed data → Q&A Generation
+ 3. Generated Q&A pairs → Training Data Preparation
+ 4. Prepared data → Model Fine-tuning
+ 5. Fine-tuned model → Inference Interface
+
+ ---
+
+ ## Detailed Project Structure
+
+ ```
+ PUBLICRELEASE/
+ ├── OpenAIFineTune/                # OpenAI fine-tuning files
+ │   ├── stoney_train.jsonl         # Training dataset
+ │   └── stoney_valid.jsonl         # Validation dataset
+ ├── checkpoints/                   # Model checkpoints
+ ├── .env.example                   # Env variables example
+ ├── requirements.txt               # Python dependencies
+ ├── english_dictionary.jsonl
+ ├── stoney_dictionary.jsonl
+ └── bilingual_training_set.jsonl
+ ```
+
+ ---
+
+ ## Core Components
+
+ ### Data Generation & Processing
+
+ - **`bilingual_qa_generator.py`**
+   Generates Q&A pairs from dictionaries, using advanced language generation.
+
+ - **`convert_data_format.py`**
+   Supports multiple data formats; validates and enforces schemas.
+
+ - **`finetunesetup.py`**
+   Splits data (80/20) with stratified sampling and prepares files.
+
+ ### Model Training
+
+ - **`openai_finetune.py`**
+   Handles fine-tuning, error handling, checkpointing, and logging.
+
+ ---
+
+ ## Comprehensive Setup Instructions
+
+ ### System Requirements
+
+ - Python 3.8+
+ - 8GB+ RAM (16GB recommended)
+ - 10GB free disk space
+ - Stable internet connection
+
+ ### Environment Setup
+
+ ```bash
+ # Clone the repository
+ git clone [repository-url]
+ cd PUBLICRELEASE
+
+ # Create and activate a virtual environment
+ python -m venv venv
+ source venv/bin/activate  # Windows: venv\Scripts\activate
+
+ # Install dependencies
+ pip install -r requirements.txt
+ ```
+
+ ### Configuration
+
+ ```bash
+ # Copy example environment file
+ cp .env.example .env
+ # Provide OPENAI_API_KEY, GOOGLE_API_KEY in .env
+ ```
+
+ ### Initialization
+
+ ```bash
+ python initialize.py
+ ```
+
+ ----------
+
+ ## Detailed Usage Pipeline
+
+ ### 1. Generate Training Data
+
+ ```bash
+ python bilingual_qa_generator.py
+ ```
+
+ - Processes `english_dictionary.jsonl` & `stoney_dictionary.jsonl`
+ - Produces `bilingual_training_set.jsonl`
+
+ ### 2. Prepare Fine-tuning Data
+
+ ```bash
+ python finetunesetup.py
+ ```
+
+ - Converts Q&A to OpenAI format (a sketch of this conversion follows the data-format examples below)
+ - Outputs `OpenAIFineTune/stoney_train.jsonl` & `stoney_valid.jsonl`
+
+ ### 3. Fine-tune Model
+
+ ```bash
+ python openai_finetune.py
+ ```
+
+ - Uploads files to OpenAI
+ - Monitors fine-tuning progress
+ - Implements checkpointing & logs
+
+ ----------
+
+ ## Advanced Model Configuration
+
+ ### OpenAI Models
+
+ - Default: `gpt-4o-2024-08-06`
+ - Alternative: `gpt-3.5-turbo`
+ - `.env`: `OPENAI_MODEL`
+
+ ### Google Gemini
+
+ - Default: `gemini-2.0-exp`
+ - `.env`: `GEMINI_MODEL`
+
+ ### Hyperparameters
+
+ - LR: `1e-5`
+ - Batch size: `32`
+ - Epochs: `3`
+ - Context window: `4096`
+
+ ----------
+
+ ## Comprehensive Data Formats
+
+ ### Dictionary Format
+
+ ```json
+ {
+   "english_word": "example",
+   "stoney_versions": [
+     {
+       "word": "...",
+       "grammatical_classification": "...",
+       "meaning": "..."
+     }
+   ]
+ }
+ ```
+
+ ### Q&A Format
+
+ ```json
+ {
+   "question": "How do you say X in Stoney?",
+   "answer": "The Stoney word for X is...",
+   "source_language": "english",
+   "generated_at": "timestamp"
+ }
+ ```
+
+ ### OpenAI Training Format
+
+ ```json
+ {
+   "messages": [
+     {"role": "system", "content": "You are a bilingual Stoney-English assistant..."},
+     {"role": "user", "content": "question"},
+     {"role": "assistant", "content": "answer"}
+   ]
+ }
+ ```
+
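+ For illustration, a hypothetical sketch of the Q&A-to-OpenAI conversion that `finetunesetup.py` performs (the script itself is not shown here; field names and the system message follow the formats above):
+
+ ```python
+ # Convert Q&A records (question/answer) into OpenAI chat-format JSONL.
+ import json
+
+ SYSTEM = "You are a bilingual Stoney-English assistant..."
+
+ with open("bilingual_training_set.jsonl") as src, \
+      open("OpenAIFineTune/stoney_train.jsonl", "w") as dst:
+     for line in src:
+         qa = json.loads(line)
+         record = {"messages": [
+             {"role": "system", "content": SYSTEM},
+             {"role": "user", "content": qa["question"]},
+             {"role": "assistant", "content": qa["answer"]},
+         ]}
+         dst.write(json.dumps(record, ensure_ascii=False) + "\n")
+ ```
+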
+ ----------
+
+ ## Development Guidelines
+
+ - **Style**: PEP 8, type hints, docstrings, consistent naming
+ - **Testing**: Unit tests, integration tests, CI, coverage
+ - **Documentation**: Inline comments, usage examples, troubleshooting
+
+ ----------
+
+ ## Contributing
+
+ 1. Fork, branch, implement changes, test
+ 2. Submit a pull request
+
+ **Code Review**
+
+ - Clear commits, small changes, documentation, test coverage
+
+ ----------
+
+ ## The Community-in-the-Loop Revolution
+
+ ### Introduction
+
+ This project aims to preserve, refine, and resurrect endangered languages via AI fine-tuning and model distillation. Minimal lexical data can evolve into a culturally rich digital speaker of Stoney Nakoda. This subverts assumptions that massive datasets are necessary, instead emphasizing:
+
+ - Iterative improvement with community feedback
+ - Narrative corrections (cultural context over simple dictionary entries)
+ - Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning
+
+ ### Conceptual Overview
+
+ **Community-in-the-Loop Model Distillation**:
+
+ 1. Start with a small dictionary/text set.
+ 2. Prompt an initial model.
+ 3. Let the community correct errors with storytelling and context, not just words.
+ 4. LoRA-based fine-tuning absorbs these narrative corrections.
+ 5. The model evolves iteratively, guided by cultural custodians.
+
+ ### Heart of the Approach
+
+ - **Intentional Errors**: Poke the model with tough or context-specific queries.
+ - **Narrative Corrections**: Rich cultural commentary instead of bare “right vs. wrong.”
+ - **Distillation Triplets**: (Prompt, Disallowed Reply, Narrative Reply).
+ - **Iterative Improvement**: If the model stumbles, revert and add more context.
+
+ ### LoRA Fine-Tuning
+
+ LoRA attaches small, low-rank matrices to the base model. This dramatically reduces compute and speeds up retraining:
+
+ - **Efficiency**: Fraction of resources required vs. full retraining
+ - **Focused Updates**: Capturing the “essence” of new knowledge
+ - **Rapid Iterations**: Frequent refinement without heavy overhead
+
+ ### Mathematical Foundations
+
+ If $\mathbf{W}_0$ is the base weight matrix, LoRA introduces $\Delta \mathbf{W} = \mathbf{A}\mathbf{B}$ with $\mathbf{A} \in \mathbb{R}^{d \times r}$ and $\mathbf{B} \in \mathbb{R}^{r \times k}$, where $r \ll \min(d,k)$. Loss functions track both linguistic and cultural accuracy (e.g., a “Cultural Authenticity Score”). A small parameter-count sketch follows.
+
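+ For illustration, a minimal sketch of why the low-rank factorization is cheap (the dimensions are made-up examples, not the actual model's):
+
+ ```python
+ # Parameter count: full update vs. LoRA update for one weight matrix.
+ d, k, r = 4096, 4096, 8          # hypothetical layer dims and LoRA rank
+
+ full_update = d * k               # training Delta-W directly
+ lora_update = d * r + r * k       # training A (d x r) and B (r x k)
+
+ print(full_update)                # 16,777,216 parameters
+ print(lora_update)                # 65,536 parameters (~0.4% of full)
+ ```
+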
+ ### Mermaid Diagram
+
+ ```mermaid
+ graph TD
+ A[Initial Model] --> B[Generate Response]
+ B --> C{Correct?}
+ C -->|No| D[Community Correction]
+ D --> E[Create Distillation Triplet]
+ E --> F[LoRA Fine-Tuning]
+ F --> A
+ C -->|Yes| G[Validation]
+ ```
+
+ ### Cultural Integrity
+
+ Every correction preserves cultural norms—idioms, humor, oral traditions—and ensures the community wields control over the AI’s “mindset.”
+
+ ### Data Sources
+
+ A 10,000-word Stoney Nakoda dictionary and community textbooks serve as seeds. Community feedback enriches this data over time, weaving historical memory into the model.
+
+ ### Expanding the Concept
+
+ From a tiny dictionary to an AI that:
+
+ - **Understands context** (formal/informal usage)
+ - **Integrates cultural references** (stories, metaphors)
+ - **Remembers history** (ancestors, ceremonies, seasonal events)
+
+ ### Adaptive Checkpoints
+
+ - **Forward Progress**: Keep the new checkpoint if improved.
+ - **Reversion**: If degraded, roll back and increase context in corrections.
+ - **Convergence**: Repeat until stable authenticity and fluency metrics are met.
+
+ ### Example Workflow
+
+ 1. **Prompt**: “How to say ‘taste slightly with the tip of your tongue’ in Stoney?”
+ 2. **Model’s Flawed Reply**: “`supthîyach`” (incorrect).
+ 3. **Community Correction**: Shares the correct phrase plus a story from childhood.
+ 4. **Distillation Triplet**: (Prompt, Disallowed, Narrative).
+ 5. **LoRA Fine-Tuning**: Model adjusts swiftly.
+ 6. **Re-Evaluation**: Answers improve in subsequent queries.
+
+ ### Monitoring & QA
+
+ - **Cultural Authenticity Score (CAS)**
+ - **Linguistic Fluency** (perplexity, cross-entropy)
+ - **Validation Loops** (watch for regressions, revert if needed)
+
+ ### Future Directions
+
+ - **Oral Histories**: Model retells century-old stories.
+ - **Seasonal Knowledge**: Terms tied to ceremonies and ecological cycles.
+ - **Dialects/Accents**: Respecting sub-regional differences.
+ - **Educational Tools**: Interactive AI for language learning.
+ - **Ethical AI**: Centered on consent, community governance, cultural integrity.
+
+ ### Glossary
+
+ - **CAS**: Cultural Authenticity Score
+ - **Distillation Triplet**: (Prompt, Flawed Reply, Narrative Reply)
+ - **LoRA**: Low-Rank Adaptation
+ - **Community-in-the-Loop**: Paradigm of continuous human-guided refinement
+ """
+
+ # Store conversation history
+ conversation_history = []
+
+ def process_initial_prompt():
+     """Process the initial prompt and return the streaming response."""
+     generation_config = {
+         "temperature": 1.1,
+         "top_p": 0.95,
+         "top_k": 1,
+         "max_output_tokens": 4500,
+     }
+
+     safety_settings = [
+         {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
+         {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
+         {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
+         {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
+     ]
+
+     response = model.generate_content(
+         INITIAL_PROMPT,
+         generation_config=generation_config,
+         safety_settings=safety_settings,
+         stream=True
+     )
+     return response
+
+ def process_follow_up(message, history):
+     """Process follow-up questions using the context from the initial prompt."""
+     # Format history into a string
+     history_str = "\n".join([f"Human: {h[0]}\nAssistant: {h[1]}" for h in history if h[0] is not None])
+
+     # Combine the original prompt, history, and new question
+     full_context = f"{INITIAL_PROMPT}\n\nPrevious conversation:\n{history_str}\n\nNew question: {message}"
+
+     generation_config = {
+         "temperature": 0.9,
+         "top_p": 1,
+         "top_k": 1,
+         "max_output_tokens": 2048,
+     }
+
+     safety_settings = [
+         {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
+         {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
+         {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
+         {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
+     ]
+
+     response = model.generate_content(
+         full_context,
+         generation_config=generation_config,
+         safety_settings=safety_settings,
+         stream=True
+     )
+
+     # Stream the growing reply back as chat history
+     response_text = ""
+     for chunk in response:
+         response_text += chunk.text
+         yield [[message, response_text]]
+
+ def create_interface():
+     """Create and configure the Gradio interface."""
+     with gr.Blocks(css="footer {visibility: hidden}") as demo:
+         gr.Markdown("# You are asking Google DeepMind about \"From Whispers to Voices\"; it needs about 15 seconds to think")
+         chatbot = gr.Chatbot(show_label=False)
+
+         # Add custom CSS for a wider chat window and proper scrolling
+         gr.HTML("""
+         <style>
+         .gradio-container {
+             max-width: 95% !important;
+             margin-left: auto !important;
+             margin-right: auto !important;
+             min-height: 100vh !important;
+         }
+         .contain {
+             min-height: 85vh !important;
+         }
+         .wrap.svelte-byatnx {
+             max-height: none !important;
+             overflow: visible !important;
+         }
+         .message.svelte-byatnx {
+             overflow-wrap: break-word !important;
+             white-space: pre-wrap !important;
+         }
+         </style>
+         """)
+
+         # Auto-trigger the initial prompt on page load
+         def on_load():
+             response = process_initial_prompt()
+             response_text = ""
+             for chunk in response:
+                 response_text += chunk.text
+                 yield [[None, response_text]]
+
+         demo.load(on_load, None, [chatbot])
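+
+         # Added sketch (an assumption, not part of the original file):
+         # process_follow_up above is never wired into the UI. One way to
+         # connect it, streaming into the same chatbot, would be:
+         # msg = gr.Textbox(show_label=False, placeholder="Ask a follow-up question")
+         # msg.submit(process_follow_up, [msg, chatbot], [chatbot])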
+
+     return demo
+
+ # Create and launch the interface
+ demo = create_interface()
+
+ if __name__ == "__main__":
+     demo.launch()
instructions.txt ADDED
@@ -0,0 +1,480 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ 1. We are going to build a very simple Gradio app or hugging face space that is going to start by passing a large prompt to the new Gemini 2.0 Flash thinking experimental 1-21 model and ask a specific question.
2
+
3
+ 2. The intent will be that the user will have the answer streaming as soon as they open the app. The user will be able to ask a new question using the context in the original prompt and the streaming thinking that is available in the API.
4
+
5
+ 3. We will deploy this with a simple link to the gradio app and we want to start the app by automatically passing this prompt to the API and waiting for the reply:
6
+
7
+ This is the exact prompt we want to pass to the model but we also want to "hide" this from the user so the first thing the user sees as they open the gradio App is the reply to this question:
8
+
9
+ "based on the totality of your training, how revolutionary is this project by Christian H. Cooper given it is a working model that has been fine tuned and deployed
10
+
11
+
12
+ # From Whispers to Voices: A "Community-In-The-Loop" Proposal for Model Distillation and Language Preservation
13
+
14
+ New Years Day, 2025
15
+
16
+ A working model of the Stoney Nakoda language has been developed and is now available for community-in-the-loop testing in 2025:
17
+
18
+ - **Model App**: [Stoney Language Model App](https://huggingface.co/spaces/HarleyCooper/StoneyApp)
19
+ - **Training Data**: [StoneyNakoda Training Dataset](https://huggingface.co/datasets/HarleyCooper/StoneyNakoda/blob/main/zSTONEY1_TRAINING_SET.jsonl)
20
+
21
+
22
+ Any First Nations community seeking to apply this approach to their own language is warmly invited to reach out.
23
+
24
+ By following this code, you can build a model for any low-resource language. The starting dictionary size should be ~8,000 words.
25
+
26
+ ---
27
+
28
+ ## Table of Contents
29
+
30
+ 1. [New Years Day, Canadian Rockies, 2025](#introduction)
31
+ 2. [Understanding How AI Learns Stoney Words Using Cosine Similarity](#understanding-how-ai-learns-stoney-words-using-cosine-similarity)
32
+ 3. [Project Architecture](#project-architecture)
33
+ - [High-Level System Design](#high-level-system-design)
34
+ - [Data Flow](#data-flow)
35
+ 4. [Detailed Project Structure](#detailed-project-structure)
36
+ 5. [Core Components](#core-components)
37
+ - [Data Generation & Processing](#data-generation--processing)
38
+ - [Model Training](#model-training)
39
+ 6. [Comprehensive Setup Instructions](#comprehensive-setup-instructions)
40
+ - [System Requirements](#system-requirements)
41
+ - [Environment Setup](#environment-setup)
42
+ - [Configuration](#configuration)
43
+ - [Initialization](#initialization)
44
+ 7. [Detailed Usage Pipeline](#detailed-usage-pipeline)
45
+ 1. [Generate Training Data](#1-generate-training-data)
46
+ 2. [Prepare Fine-tuning Data](#2-prepare-fine-tuning-data)
47
+ 3. [Fine-tune Model](#3-fine-tune-model)
48
+ 8. [Advanced Model Configuration](#advanced-model-configuration)
49
+ - [OpenAI Models](#openai-models)
50
+ - [Google Gemini](#google-gemini)
51
+ - [Hyperparameters](#hyperparameters)
52
+ 9. [Comprehensive Data Formats](#comprehensive-data-formats)
53
+ - [Dictionary Format](#dictionary-format)
54
+ - [Q&A Format](#qa-format)
55
+ - [OpenAI Training Format](#openai-training-format)
56
+ 10. [Development Guidelines](#development-guidelines)
57
+ 11. [Contributing](#contributing)
58
+ 12. [License](#license)
59
+ 13. [Acknowledgments](#acknowledgments)
60
+ 14. [The Community-in-the-Loop Revolution](#the-community-in-the-loop-revolution)
61
+ - [Introduction](#introduction-1)
62
+ - [Conceptual Overview](#conceptual-overview)
63
+ - [Heart of the Approach](#heart-of-the-approach)
64
+ - [LoRA Fine-Tuning](#lora-fine-tuning)
65
+ - [Mathematical Foundations](#mathematical-foundations)
66
+ - [Mermaid Diagram](#mermaid-diagram)
67
+ - [Cultural Integrity](#cultural-integrity)
68
+ - [Data Sources](#data-sources)
69
+ - [Expanding the Concept](#expanding-the-concept)
70
+ - [Adaptive Checkpoints](#adaptive-checkpoints)
71
+ - [Example Workflow](#example-workflow)
72
+ - [Monitoring & QA](#monitoring--qa)
73
+ - [Future Directions](#future-directions)
74
+ - [Glossary](#glossary)
75
+
76
+ ---
77
+
78
+ ## Introduction
79
+
80
+ In my office, there is a murder; a map of one, at least.
81
+
82
+ ![Dawson's Map of the Bow Valley](Public/FullDawsonMap.jpg)
83
+
84
+ George Mercer Dawson explored the Bow Valley in the late 1800s, noting language on the British Columbia side. His map, though richly colored, stands like a tombstone over the Bow Valley where the Stoney people lived because he made no notes on their language and simply noted the people as "recent immigrants"
85
+
86
+ ![Detail of Dawson Map](Public/dawsondetail.jpg)
87
+
88
+ What is very obvious from the linguistic patterns among the Haida, Tshimsia, Thlinkit, Kwakiool and Kawitshin dialects nearby is that languages blend like “linguistic DNA,” and machine learning could help trace faint threads of lost speech to their roots. Where some see isolation as a curse, in the age of AI, Stoney’s isolation turns out to be its strength.
89
+
90
+ For about two years, I thought about the size of the vector space that would be needed to get a model to self-train on a set of 100% indigenous data, and how that model could refine its grasp of the broader Stoney Language. This is now publicly and freely available.
91
+
92
+
93
+ Two key releases influenced my thinking of what was possible:
94
+
95
+ 1. [Meta’s Llama-3 Model (April 18th, 2024)](https://www.reuters.com/technology/meta-releases-early-versions-its-llama-3-ai-model-2024-04-18/)
96
+ 2. [OpenAI Fine-Tuning API (October 2024)](https://openai.com/index/api-model-distillation/)
97
+
98
+ Both gave me the motivation to build what’s presented here. The true innovation here lies in how communities can narratively correct the initially flawed response (about 10% of the time, the model works every time.) then that feeback be passed seamleslly back into the fine-tuning process. The [textbooks](https://globalnews.ca/news/9430501/stoney-nakota-language-textbook/) that the Stoney community created—intended as educational tools—became perfect concept of a model prompts, each chapter or word offering pure indigenous data devoid of external weights or biases to the fine-tuning process.
99
+
100
+
101
+ Early in 2023, I found an original, unpublished sketch by James Hector likely drawn in the summer of 1858 or 1859 along the Bow River in Southern Alberta:
102
+
103
+ ![Sketch by James Hector of a Stoney Woman](Public/StoneyWoman.jpg)
104
+
105
+ Finding this, and already aware of George Mercer Dawson's work on First Nation's language on the British Columbia side, I was inspired to put the effort in and build a working model of the language and implement the Community-In-The-Loop distillation method.
106
+
107
+ This sketch shifted my thinking from considering the "Stoney People” to this "Stoney Woman” who saw these same mountains and rivers I see everyday, yet who had a very different way to think about and communicate to the world around her. The Community-in-the-Loop model distillation will quickly converge this initial model toward fluencey. I suspect this will require the community to correct about 80,000 question and answer pairs and would cost less than $800 in OpenAI computing power. Recent releases by Google and the Chinese Lab DeepSeek, could effectively reduce the cost to zero.
108
+
109
+ I think what this project has left me considering most ist that a century from now, strangers will live in all our homes and most of what we worry about today will not matter. But we can honor “Stoney Woman” by making sure her language endures, forging a living record in an age of AI. Incredibly, this tool will work with any first nations language, as long as there is a starting dictionary of about 8,000 words.
110
+
111
+ **I am freely available to help any First Nation in Canada.**
112
+
113
+ ## Understanding How AI Learns Stoney Words Using Cosine Similarity
114
+
115
+ Word Embeddings: Mapping Words in Space
116
+ Word embeddings are like placing words in a high-dimensional map, where similar words are positioned closer together. For example, "strawberry," "orange," and "cherry" might form a cluster because they are fruits, while "laptop," "Microsoft," and "Android" might cluster elsewhere as tech-related terms. Each axis in this space represents a characteristic of the words, such as their context or meaning.
117
+
118
+ Context Shapes Meaning
119
+ A word's position in this space isn't fixed—it shifts based on context. For instance, the word "apple" could mean a fruit or the tech brand, depending on its surrounding words, like "buy" (tech) or "tree" (fruit). This dynamic placement captures the nuances of meaning.
120
+
121
+ Cosine Similarity: Measuring Relationships
122
+ Cosine similarity quantifies how similar two words are by measuring the angle between their vectors in the embedding space:
123
+
124
+ - Similar words have vectors pointing in nearly the same direction (cosine similarity close to 1)
125
+ - Unrelated words have vectors at a right angle (cosine similarity near 0)
126
+ - Opposite meanings have vectors pointing in opposite directions (cosine similarity close to -1)
127
+ - For example, "cherry" and "orange" might have a similarity of 0.97, while "cherry" and "laptop" might score 0.24
128
+
129
+ How AI Learns Stoney Words
130
+
131
+ - **Stoney Dictionary as a Starting Point:**
132
+ The AI begins with a structured dictionary of Stoney words, including translations, categories, pronunciations, and cultural context.
133
+
134
+ - **Community Feedback for Learning:**
135
+ The AI makes initial translations, which are often incorrect. Stoney speakers provide corrections, enriched with cultural context, stories, and humor. This feedback helps refine the AI's understanding.
136
+
137
+ The Role of Cosine Similarity in AI Learning
138
+
139
+ - The AI uses word embeddings to group Stoney words based on their meaning. For example, it determines whether a word belongs to a category like "fruit," "animal," or "spiritual."
140
+ - Community corrections and cosine similarity guide the AI in repositioning words closer to their accurate groupings in the embedding space.
141
+
142
+ Iterative Refinement
143
+ Through repeated feedback and fine-tuning, the AI improves its ability to place Stoney words correctly, not just individually but in the context of sentences and paragraphs. Over time, it develops a detailed, dynamic map of the Stoney language, with words clustered according to their community-informed meanings and uses.
144
+
145
+ Although this is not cosine similarity, you can see the relationships among words can concepts in Stoney as I have mapped them here: https://atlas.nomic.ai/data/harleycoops/stoney-nakoda-language-synthetic/map/5c87caaf-6be0-4546-9e83-826569070b24#nqlL
146
+
147
+
148
+ ---
149
+
150
+ ## Project Architecture
151
+
152
+ This code forms a complete pipeline for training and deploying a Stoney model. It is fully functional—but not correct 100% of the time—and is designed to improve through Community-In-The-Loop feedback. Access the model here:
153
+ [Stoney Language Model App](https://huggingface.co/spaces/HarleyCooper/StoneyApp)
154
+
155
+ ### High-Level System Design
156
+
157
+ 1. **Data Ingestion Layer**
158
+ 2. **Processing Pipeline** (Q&A generation, augmentation, conversion)
159
+ 3. **Model Training Framework** (fine-tuning, hyperparameters, monitoring)
160
+ 4. **Inference Interface** (API endpoint, response formatting, error handling)
161
+
162
+ ### Data Flow
163
+
164
+ 1. Raw dictionary data → Data Ingestion
165
+ 2. Processed data → Q&A Generation
166
+ 3. Generated Q&A pairs → Training Data Preparation
167
+ 4. Prepared data → Model Fine-tuning
168
+ 5. Fine-tuned model → Inference Interface
169
+
170
+ ---
171
+
172
+ ## Detailed Project Structure
173
+
174
+ ```
175
+ PUBLICRELEASE/
176
+ ├── OpenAIFineTune/ # OpenAI fine-tuning files
177
+ │ ├── stoney_train.jsonl # Training dataset
178
+ │ └── stoney_valid.jsonl # Validation dataset
179
+ ├── checkpoints/ # Model checkpoints
180
+ ├── .env.example # Env variables example
181
+ ├── requirements.txt # Python dependencies
182
+ ├── english_dictionary.jsonl
183
+ ├── stoney_dictionary.jsonl
184
+ └── bilingual_training_set.jsonl
185
+ ```
186
+
187
+ ---
188
+
189
+ ## Core Components
190
+
191
+ ### Data Generation & Processing
192
+
193
+ - **`bilingual_qa_generator.py`**
194
+ Generates Q&A pairs from dictionaries, using advanced language generation.
195
+
196
+ - **`convert_data_format.py`**
197
+ Supports multiple data formats; validates and enforces schemas.
198
+
199
+ - **`finetunesetup.py`**
200
+ Splits data (80/20) with stratified sampling and prepares files.
201
+
202
+ ### Model Training
203
+
204
+ - **`openai_finetune.py`**
205
+ Handles fine-tuning, error handling, checkpointing, and logging.
206
+
207
+ ---
208
+
209
+ ## Comprehensive Setup Instructions
210
+
211
+ ### System Requirements
212
+
213
+ - Python 3.8+
214
+ - 8GB+ RAM (16GB recommended)
215
+ - 10GB free disk space
216
+ - Stable internet connection
217
+
218
+ ### Environment Setup
219
+
220
+ ```bash
221
+ # Clone the repository
222
+ git clone [repository-url]
223
+ cd PUBLICRELEASE
224
+
225
+ # Create and activate a virtual environment
226
+ python -m venv venv
227
+ source venv/bin/activate # Windows: venv\Scripts\activate
228
+
229
+ # Install dependencies
230
+ pip install -r requirements.txt
231
+
232
+ ```
233
+
234
+ ### Configuration
235
+
236
+ ```bash
237
+ # Copy example environment file
238
+ cp .env.example .env
239
+ # Provide OPENAI_API_KEY, GOOGLE_API_KEY in .env
240
+
241
+ ```
242
+
243
+ ### Initialization
244
+
245
+ ```bash
246
+ python initialize.py
247
+
248
+ ```
249
+
250
+ ----------
251
+
252
+ ## Detailed Usage Pipeline
253
+
254
+ ### 1. Generate Training Data
255
+
256
+ ```bash
257
+ python bilingual_qa_generator.py
258
+
259
+ ```
260
+
261
+ - Processes `english_dictionary.jsonl` & `stoney_dictionary.jsonl`
262
+ - Produces `bilingual_training_set.jsonl`
263
+
264
+ ### 2. Prepare Fine-tuning Data
265
+
266
+ ```bash
267
+ python finetunesetup.py
268
+
269
+ ```
270
+
271
+ - Converts Q&A to OpenAI format
272
+ - Outputs `OpenAIFineTune/stoney_train.jsonl` & `stoney_valid.jsonl`
273
+
274
+ ### 3. Fine-tune Model
275
+
276
+ ```bash
277
+ python openai_finetune.py
278
+
279
+ ```
280
+
281
+ - Uploads files to OpenAI
282
+ - Monitors fine-tuning progress
283
+ - Implements checkpointing & logs
284
+
285
+ ----------
286
+
287
+ ## Advanced Model Configuration
288
+
289
+ ### OpenAI Models
290
+
291
+ - Default: `gpt-4o-2024-08-06`
292
+ - Alternative: `gpt-3.5-turbo`
293
+ - `.env`: `OPENAI_MODEL`
294
+
295
+ ### Google Gemini
296
+
297
+ - Default: `gemini-2.0-exp`
298
+ - `.env`: `GEMINI_MODEL`
299
+
300
+ ### Hyperparameters
301
+
302
+ - LR: `1e-5`
303
+ - Batch size: `32`
304
+ - Epochs: `3`
305
+ - Context window: `4096`
306
+
307
+ ----------
308
+
309
+ ## Comprehensive Data Formats
310
+
311
+ ### Dictionary Format
312
+
313
+ ```json
314
+ {
315
+ "english_word": "example",
316
+ "stoney_versions": [
317
+ {
318
+ "word": "...",
319
+ "grammatical_classification": "...",
320
+ "meaning": "..."
321
+ }
322
+ ]
323
+ }
324
+
325
+ ```
326
+
327
+ ### Q&A Format
328
+
329
+ ```json
330
+ {
331
+ "question": "How do you say X in Stoney?",
332
+ "answer": "The Stoney word for X is...",
333
+ "source_language": "english",
334
+ "generated_at": "timestamp"
335
+ }
336
+
337
+ ```
338
+
339
+ ### OpenAI Training Format
340
+
341
+ ```json
342
+ {
343
+ "messages": [
344
+ {"role": "system", "content": "You are a bilingual Stoney-English assistant..."},
345
+ {"role": "user", "content": "question"},
346
+ {"role": "assistant", "content": "answer"}
347
+ ]
348
+ }
349
+
350
+ ```
351
+
352
+ ----------
353
+
354
+ ## Development Guidelines
355
+
356
+ - **Style**: PEP 8, type hints, docstrings, consistent naming
357
+ - **Testing**: Unit tests, integration tests, CI, coverage
358
+ - **Documentation**: Inline comments, usage examples, troubleshooting
359
+
360
+ ----------
361
+
362
+ ## Contributing
363
+
364
+ 1. Fork, branch, implement changes, test
365
+ 2. Submit a pull request
366
+
367
+ **Code Review**
368
+
369
+ - Clear commits, small changes, documentation, test coverage
370
+
371
+ ----------
372
+
373
+ ## The Community-in-the-Loop Revolution
374
+
375
+ ### Introduction
376
+
377
+ This project aims to preserve, refine, and resurrect endangered languages via AI fine-tuning and model distillation. Minimal lexical data can evolve into a culturally rich digital speaker of Stoney Nakoda. This subverts assumptions that massive datasets are necessary, instead emphasizing:
378
+
379
+ - Iterative improvement with community feedback
380
+ - Narrative corrections (cultural context over simple dictionary entries)
381
+ - Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning
382
+
383
+ ### Conceptual Overview
384
+
385
+ **Community-in-the-Loop Model Distillation**:
386
+
387
+ 1. Start with a small dictionary/text set.
388
+ 2. Prompt an initial model.
389
+ 3. Let the community correct errors with storytelling and context, not just words.
390
+ 4. LoRA-based fine-tuning absorbs these narrative corrections.
391
+ 5. The model evolves iteratively, guided by cultural custodians.
392
+
393
+ ### Heart of the Approach
394
+
395
+ - **Intentional Errors**: Poke the model with tough or context-specific queries.
396
+ - **Narrative Corrections**: Rich cultural commentary instead of bare “right vs. wrong.”
397
+ - **Distillation Triplets**: (Prompt, Disallowed Reply, Narrative Reply).
398
+ - **Iterative Improvement**: If the model stumbles, revert and add more context.
399
+
400
+ ### LoRA Fine-Tuning
401
+
402
+ LoRA attaches small, low-rank matrices to the base model. This dramatically reduces compute and speeds up retraining:
403
+
404
+ - **Efficiency**: Fraction of resources required vs. full retraining
405
+ - **Focused Updates**: Capturing the “essence” of new knowledge
406
+ - **Rapid Iterations**: Frequent refinement without heavy overhead
407
+
408
+ ### Mathematical Foundations
409
+
410
+ If W0\mathbf{W}_0 is the base weight matrix, LoRA introduces ΔW=AB\Delta \mathbf{W} = \mathbf{A}\mathbf{B} with A∈Rd×r\mathbf{A} \in \mathbb{R}^{d \times r} and B∈Rr×k\mathbf{B} \in \mathbb{R}^{r \times k}, where r≪min⁡(d,k)r \ll \min(d,k). Loss functions track both linguistic and cultural accuracy (e.g., a “Cultural Authenticity Score”).
411
+
412
+ ### Mermaid Diagram
413
+
414
+ ```mermaid
415
+ graph TD
416
+ A[Initial Model] --> B[Generate Response]
417
+ B --> C{Correct?}
418
+ C -->|No| D[Community Correction]
419
+ D --> E[Create Distillation Triplet]
420
+ E --> F[LoRA Fine-Tuning]
421
+ F --> A
422
+ C -->|Yes| G[Validation]
423
+
424
+ ```
425
+
426
+ ### Cultural Integrity
427
+
428
+ Every correction preserves cultural norms—idioms, humor, oral traditions—and ensures the community wields control over the AI’s “mindset.”
429
+
430
+ ### Data Sources
431
+
432
+ A 10,000-word Stoney Nakoda dictionary and community textbooks serve as seeds. Community feedback enriches this data over time, weaving historical memory into the model.
433
+
434
+ ### Expanding the Concept
435
+
436
+ From a tiny dictionary to an AI that:
437
+
438
+ - **Understands context** (formal/informal usage)
439
+ - **Integrates cultural references** (stories, metaphors)
440
+ - **Remembers history** (ancestors, ceremonies, seasonal events)
441
+
442
+ ### Adaptive Checkpoints
443
+
444
+ - **Forward Progress**: Keep the new checkpoint if improved.
445
+ - **Reversion**: If degraded, roll back and increase context in corrections.
446
+ - **Convergence**: Repeat until stable authenticity and fluency metrics are met (see the sketch after this list).
447
+
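+ A hedged sketch of that loop in Python (every callable is a placeholder supplied by the training harness, not a name from this repository):
+ 
+ ```python
+ import copy
+ 
+ def adaptive_checkpoint_loop(initial_state, train_round, evaluate,
+                              max_rounds=10, target_score=0.9):
+     """Keep a checkpoint when it improves; revert when it degrades."""
+     best_state = copy.deepcopy(initial_state)
+     best_score = evaluate(best_state)
+     for _ in range(max_rounds):
+         candidate = train_round(copy.deepcopy(best_state))  # LoRA pass on new triplets
+         score = evaluate(candidate)            # e.g., CAS plus fluency metrics
+         if score > best_score:                 # forward progress: keep the checkpoint
+             best_state, best_score = candidate, score
+         # otherwise: reversion, best_state is retained and corrections get more context
+         if best_score >= target_score:         # convergence
+             break
+     return best_state, best_score
+ ```
+ 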
448
+ ### Example Workflow
449
+
450
+ 1. **Prompt**: “How to say ‘taste slightly with the tip of your tongue’ in Stoney?”
451
+ 2. **Model’s Flawed Reply**: “`supthîyach`” (incorrect).
452
+ 3. **Community Correction**: Shares the correct phrase plus a story from childhood.
453
+ 4. **Distillation Triplet**: (Prompt, Flawed Reply, Narrative Reply).
454
+ 5. **LoRA Fine-Tuning**: Model adjusts swiftly.
455
+ 6. **Re-Evaluation**: Answers improve in subsequent queries.
456
+
457
+ ### Monitoring & QA
458
+
459
+ - **Cultural Authenticity Score (CAS)**
460
+ - **Linguistic Fluency** (perplexity, cross-entropy; see the formula below)
461
+ - **Validation Loops** (watch for regressions, revert if needed)
462
+
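+ For the fluency metrics: perplexity is the exponentiated average cross-entropy, $\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(w_i \mid w_{<i})\right)$, so lowering cross-entropy on held-out Stoney text directly lowers perplexity.
+ 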
463
+ ### Future Directions
464
+
465
+ - **Oral Histories**: Model retells century-old stories.
466
+ - **Seasonal Knowledge**: Terms tied to ceremonies and ecological cycles.
467
+ - **Dialects/Accents**: Respecting sub-regional differences.
468
+ - **Educational Tools**: Interactive AI for language learning.
469
+ - **Ethical AI**: Centered on consent, community governance, cultural integrity.
470
+
471
+ ### Glossary
472
+
473
+ - **CAS**: Cultural Authenticity Score
474
+ - **Distillation Triplet**: (Prompt, Flawed Reply, Narrative Reply)
475
+ - **LoRA**: Low-Rank Adaptation
476
+ - **Community-in-the-Loop**: Paradigm of continuous human-guided refinement
477
+
478
+
479
+
480
+ "
requirements.txt ADDED
@@ -0,0 +1,43 @@
1
+ dropbox
2
+ gradio
3
+ mistralai
4
+ anthropic
5
+ pinecone-client
6
+ nomic
7
+ openai
8
+ groq
9
+ langchain-community
10
+ replicate
11
+ google-generativeai
12
+ perplexityai
13
+ cohere
14
+ langchain
15
+ alpha_vantage
16
+ huggingface_hub
17
+ tavily-python
18
+ python-mathpix
19
+ requests
20
+ python-dotenv
21
+ sympy
22
+ numpy-financial
23
+ numpy
24
+ pandas
25
+ mesop
26
+ langchain-anthropic
27
+ langchain-openai
28
+ openpyxl
29
+ beautifulsoup4
30
+ langchain-mistralai
31
+ google-cloud-secret-manager
32
+ selenium
33
+ google-auth-oauthlib
34
+ google-auth-httplib2
35
+ google-api-python-client
36
+ PyPDF2
37
+ python-docx
38
+ markdown
39
+ PyGithub
40
+ GitPython
41
+ SpeechRecognition
42
+ google-cloud-storage
43
+ nltk
tools.py ADDED
@@ -0,0 +1,864 @@
1
+ '''
2
+ Guidelines for Creating and Utilizing Tools in tools.py:
3
+
4
+ 1. Initial Assessment:
5
+
6
+ Review Existing Tools:
7
+ Before adding new functions, thoroughly read through tools.py to understand the existing tools and their functionalities.
8
+ Determine if an existing tool can be adapted or extended to meet the current needs, avoiding redundancy.
9
+ 2. Tool Creation and Function Design:
10
+
11
+ Create Within tools.py:
12
+
13
+ Add new tools as functions within tools.py. If tools.py doesn't exist, create it.
14
+ Ensure each function is self-contained and focused on a single task for modularity.
15
+ Design for Importing:
16
+
17
+ Design tools to be imported and executed via terminal commands. Do not include execution code that runs when tools.py is imported.
18
+ Follow Best Practices:
19
+
20
+ Use clear and descriptive function names that reflect their general purpose.
21
+ Include docstrings for each function, detailing the purpose, parameters, and expected outputs.
22
+ Adhere to PEP 8 style guidelines for readable and maintainable code.
23
+ 3. Generalization:
24
+
25
+ Broad Input Handling:
26
+
27
+ Design functions to handle a wide range of inputs, enhancing reusability for future tasks.
28
+ Accept parameters that allow the function to be applicable in various scenarios (e.g., any stock ticker, URL, or data file).
29
+ Flexible Functionality:
30
+
31
+ Ensure functions can process different data types and structures when applicable.
32
+ Avoid hardcoding values; use parameters and defaults where necessary.
33
+ Modularity:
34
+
35
+ If a task involves multiple distinct operations, split them into separate functions.
36
+ This approach enhances clarity and allows for individual functions to be reused independently.
37
+ 4. Execution and Script Management:
38
+
39
+ Import and Run via Terminal:
40
+
41
+ Do not execute tools.py directly. Instead, import the necessary functions and run them using the terminal.
42
+ Use the command:
43
45
+ python -c "from tools import function_name; function_name(args)"
46
+ Replace function_name and args with the appropriate function and arguments.
47
+ Avoid Additional Scripts:
48
+
49
+ Do not create extra .py files or scripts for execution purposes.
50
+ Keep all tool functions within tools.py and execute them using the import method shown above.
51
+ 5. Output:
52
+
53
+ Console Printing:
54
+
55
+ Ensure that all tools print their output directly to the console.
56
+ Format the output for readability, using clear messages and organizing data in a logical manner.
57
+ No Return Statements for Output:
58
+
59
+ While functions can return values for internal use, the primary results should be displayed using print() statements.
60
+ 6. Error Handling:
61
+
62
+ Input Validation:
63
+
64
+ Validate all input parameters to catch errors before execution.
65
+ Provide informative error messages to guide correct usage.
66
+ Exception Management:
67
+
68
+ Use try-except blocks to handle potential exceptions without crashing the program.
69
+ Log errors where appropriate, and ensure they don't expose sensitive information.
70
+ Debugging and Testing:
71
+
72
+ Test functions with various inputs, including edge cases, to ensure robustness.
73
+ If errors are found, revise and debug the functions promptly.
74
+ 7. Post-Creation:
75
+
76
+ Execute to Fulfill Requests:
77
+
78
+ After creating or updating tools, execute them as needed to fulfill user requests unless the request was solely for tool creation.
79
+ Documentation:
80
+
81
+ Update any relevant documentation or comments within tools.py to reflect new additions or changes.
82
+ Consider maintaining a usage example within the docstring for complex functions.
83
+ 8. Maintenance and Refactoring:
84
+
85
+ Regular Review:
86
+
87
+ Periodically review tools.py to identify opportunities for optimization and improvement.
88
+ Remove or update deprecated functions that are no longer effective or necessary.
89
+ Enhance Generalization:
90
+
91
+ Refactor functions to improve their general applicability as new requirements emerge.
92
+ Stay vigilant for patterns that can be abstracted into more general solutions.
93
+ 9. Compliance and Security:
94
+
95
+ Data Protection:
96
+
97
+ Ensure that tools handle data securely, especially when dealing with sensitive information.
98
+ Avoid hardcoding credentials or exposing private data through outputs.
99
+ Licensing and Dependencies:
100
+
101
+ Verify that any third-party libraries used are properly licensed and documented.
102
+ Include installation instructions for dependencies if they are not part of the standard library.
103
+ '''
104
+
105
+ # Your tools will be defined below this line
106
+
107
+ import os
108
+ from datetime import datetime, timedelta
109
+ import PyPDF2
110
+ from pathlib import Path
111
+ import json
112
+ import re
113
+ import requests
114
+ from typing import List, Dict, Optional, Union
115
+ import subprocess
+ import shutil
116
+ import sys
117
+ from git import Repo
118
+
119
+ def get_current_month_folder() -> str:
120
+ """Returns the path to the current month's folder."""
121
+ base_path = r"C:\Users\admin\Dropbox\Current\2024"
122
+ current_month = datetime.now().strftime("%B") # Full month name
123
+ return os.path.join(base_path, current_month)
124
+
125
+ def get_pdfs_for_date(target_date: datetime = None) -> List[str]:
126
+ """
127
+ Finds all PDFs saved on a specific date in the current month's folder structure.
128
+ Args:
129
+ target_date: datetime object for the target date. If None, uses today's date.
130
+ Returns a list of full paths to PDF files.
131
+ """
132
+ if target_date is None:
133
+ target_date = datetime.now()
134
+
135
+ target_date_str = target_date.strftime("%Y-%m-%d")
136
+ month_folder = get_current_month_folder()
137
+ pdf_files = []
138
+
139
+ # Walk through all subdirectories
140
+ for root, _, files in os.walk(month_folder):
141
+ for file in files:
142
+ if file.lower().endswith('.pdf'):
143
+ file_path = os.path.join(root, file)
144
+ # Get file's modification time
145
+ mod_time = datetime.fromtimestamp(os.path.getmtime(file_path))
146
+ if mod_time.strftime("%Y-%m-%d") == target_date_str:
147
+ pdf_files.append(file_path)
148
+
149
+ return pdf_files
150
+
151
+ def extract_text_from_pdf(pdf_path: str) -> str:
152
+ """Extract text content from a PDF file."""
153
+ try:
154
+ with open(pdf_path, 'rb') as file:
155
+ reader = PyPDF2.PdfReader(file)
156
+ text = ""
157
+ for page in reader.pages:
158
+ text += page.extract_text() + "\n"
159
+ return text
160
+ except Exception as e:
161
+ print(f"Error processing {pdf_path}: {str(e)}")
162
+ return ""
163
+
164
+ def summarize_pdfs_for_date(target_date: datetime = None):
165
+ """
166
+ Main function to process and summarize PDFs for a specific date.
167
+ Args:
168
+ target_date: datetime object for the target date. If None, uses today's date.
169
+ Prints summary to console and saves to a JSON file.
170
+ """
171
+ if target_date is None:
172
+ target_date = datetime.now()
173
+
174
+ pdfs = get_pdfs_for_date(target_date)
175
+ if not pdfs:
176
+ print(f"No PDFs found for {target_date.strftime('%Y-%m-%d')}")
177
+ return
178
+
179
+ summaries = {}
180
+ for pdf_path in pdfs:
181
+ print(f"Processing: {pdf_path}")
182
+ text = extract_text_from_pdf(pdf_path)
183
+
184
+ # Basic summary: first 500 characters of text
185
+ summary = text[:500] + "..." if len(text) > 500 else text
186
+
187
+ # Store in dictionary with filename as key
188
+ filename = os.path.basename(pdf_path)
189
+ summaries[filename] = {
190
+ "path": pdf_path,
191
+ "summary": summary,
192
+ "processed_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S")
193
+ }
194
+
195
+ # Save summaries to JSON file in the current directory
196
+ output_dir = "summaries"
197
+ os.makedirs(output_dir, exist_ok=True)
198
+
199
+ output_file = os.path.join(output_dir, f"summaries_{target_date.strftime('%Y-%m-%d')}.json")
200
+ with open(output_file, 'w', encoding='utf-8') as f:
201
+ json.dump(summaries, f, indent=4, ensure_ascii=False)
202
+
203
+ print(f"\nProcessed {len(pdfs)} PDFs")
204
+ print(f"Summaries saved to: {output_file}")
205
+
206
+ # Print summaries to console
207
+ for filename, data in summaries.items():
208
+ print(f"\n{'='*80}\n{filename}")
209
+ print(f"Path: {data['path']}")
210
+ print(f"\nSummary:\n{data['summary'][:200]}...")
211
+
212
+ class WebSearchTool:
213
+ """
214
+ A tool for performing web searches using the Perplexity API.
215
+ Designed to be the default web search mechanism for context-requiring queries.
216
+ """
217
+
218
+ def __init__(self, api_key: Optional[str] = None):
219
+ """
220
+ Initialize the WebSearchTool.
221
+
222
+ Args:
223
+ api_key: Perplexity API key. If None, will try to get from environment variable.
224
+ """
225
+ self.api_key = api_key or os.getenv("PERPLEXITY_API_KEY")
226
+ if not self.api_key:
227
+ raise ValueError("Perplexity API key must be provided or set in PERPLEXITY_API_KEY environment variable")
228
+
229
+ self.headers = {
230
+ "Authorization": f"Bearer {self.api_key}",
231
+ "Content-Type": "application/json"
232
+ }
233
+
234
+ def search(self, query: str, max_results: int = 5) -> Dict:
235
+ """
236
+ Perform a web search using Perplexity API.
237
+
238
+ Args:
239
+ query: The search query
240
+ max_results: Maximum number of results to return (currently unused; the chat completions endpoint returns a single synthesized answer)
241
+
242
+ Returns:
243
+ Dictionary containing search results and metadata
244
+ """
245
+ try:
246
+ # Make the API request
247
+ response = requests.post(
248
+ "https://api.perplexity.ai/chat/completions",
249
+ headers=self.headers,
250
+ json={
251
+ "model": "llama-3.1-sonar-huge-128k-online",
252
+ "messages": [
253
+ {
254
+ "role": "system",
255
+ "content": "You are a helpful assistant that provides accurate and concise answers based on web search results."
256
+ },
257
+ {
258
+ "role": "user",
259
+ "content": query
260
+ }
261
+ ]
262
+ }
263
+ )
264
+ response.raise_for_status()
265
+ data = response.json()
266
+
267
+ # Extract answer from the response
268
+ answer = data.get("choices", [{}])[0].get("message", {}).get("content", "No answer found")
269
+
270
+ # Process and format the results
271
+ results = {
272
+ "query": query,
273
+ "timestamp": datetime.now().isoformat(),
274
+ "answer": answer,
275
+ "references": [], # References not available in this API version
276
+ "metadata": {
277
+ "source": "Perplexity API",
278
+ "model": "llama-3.1-sonar-huge-128k-online"
279
+ }
280
+ }
281
+ return results
282
+
283
+ except Exception as e:
284
+ print(f"Error performing search: {str(e)}")
285
+ return {
286
+ "query": query,
287
+ "timestamp": datetime.now().isoformat(),
288
+ "error": str(e),
289
+ "metadata": {
290
+ "source": "Perplexity API",
291
+ "status": "error"
292
+ }
293
+ }
294
+
295
+ def format_results(self, results: Dict, format: str = "text") -> str:
296
+ """
297
+ Format search results in the specified format.
298
+
299
+ Args:
300
+ results: Search results dictionary
301
+ format: Output format ("text" or "markdown")
302
+
303
+ Returns:
304
+ Formatted string of results
305
+ """
306
+ if "error" in results:
307
+ return f"Error: {results['error']}"
308
+
309
+ if format == "markdown":
310
+ output = f"# Search Results for: {results['query']}\n\n"
311
+ output += f"## Answer\n{results['answer']}\n\n"
312
+ if results['references']:
313
+ output += "## References\n"
314
+ for i, ref in enumerate(results['references'], 1):
315
+ output += f"{i}. {ref['title']} - {ref['url']}\n"
316
+ return output
317
+ else:
318
+ output = f"Search Results for: {results['query']}\n\n"
319
+ output += f"Answer:\n{results['answer']}\n\n"
320
+ if results['references']:
321
+ output += "References:\n"
322
+ for i, ref in enumerate(results['references'], 1):
323
+ output += f"{i}. {ref['title']} - {ref['url']}\n"
324
+ return output
325
+
326
+ class MCPServerManager:
327
+ """
328
+ A tool for installing and managing Model Context Protocol (MCP) servers.
329
+ Integrates with Claude desktop and manages server configurations.
330
+ """
331
+
332
+ DEFAULT_CONFIG_LOCATIONS = [
333
+ "mcp.json",
334
+ ".mcp/config.json",
335
+ "config/mcp.json",
336
+ "mcp_config.json",
337
+ ".config/mcp/servers.json"
338
+ ]
339
+
340
+ def __init__(self, base_dir: Optional[str] = None):
341
+ """
342
+ Initialize the MCP Server Manager.
343
+
344
+ Args:
345
+ base_dir: Base directory for installing servers. If None, uses current directory.
346
+ """
347
+ self.base_dir = Path(base_dir) if base_dir else Path.cwd() / "mcp_servers"
348
+ self.base_dir.mkdir(parents=True, exist_ok=True)
349
+ self.servers_repo_url = "https://github.com/modelcontextprotocol/servers.git"
350
+ self.installed_servers = {}
351
+ self.load_installed_servers()
352
+
353
+ def load_installed_servers(self):
354
+ """Load information about installed servers from the config file."""
355
+ config_file = self.base_dir / "config.json"
356
+ if config_file.exists():
357
+ with open(config_file, "r") as f:
358
+ self.installed_servers = json.load(f)
359
+
360
+ def save_installed_servers(self):
361
+ """Save information about installed servers to the config file."""
362
+ config_file = self.base_dir / "config.json"
363
+ with open(config_file, "w") as f:
364
+ json.dump(self.installed_servers, f, indent=4)
365
+
366
+ def get_featured_servers(self) -> List[Dict]:
367
+ """
368
+ Get list of featured servers from the MCP GitHub repository.
369
+
370
+ Returns:
371
+ List of server information dictionaries
372
+ """
373
+ try:
374
+ # Clone or update the servers repository
375
+ repo_dir = self.base_dir / "servers_repo"
376
+ if repo_dir.exists():
377
+ repo = Repo(repo_dir)
378
+ repo.remotes.origin.pull()
379
+ else:
380
+ repo = Repo.clone_from(self.servers_repo_url, repo_dir)
381
+
382
+ # First try to read from featured.json
383
+ featured_file = repo_dir / "featured.json"
384
+ if featured_file.exists():
385
+ with open(featured_file, "r", encoding="utf-8") as f:
386
+ return json.load(f)
387
+
388
+ # If featured.json doesn't exist, parse README.md
389
+ readme_file = repo_dir / "README.md"
390
+ if readme_file.exists():
391
+ servers = []
392
+ with open(readme_file, "r", encoding="utf-8", errors="ignore") as f:
393
+ content = f.read()
394
+ # Look for server repository links
395
+ repo_links = re.findall(r"\[([^\]]+)\]\((https://github.com/[^)]+)\)", content)
396
+ for name, url in repo_links:
397
+ if "/modelcontextprotocol/" in url:
398
+ servers.append({
399
+ "name": name,
400
+ "repository": url,
401
+ "config": {}
402
+ })
403
+ return servers
404
+
405
+ return []
406
+
407
+ except Exception as e:
408
+ print(f"Error getting featured servers: {str(e)}")
409
+ return []
410
+
411
+ def find_server_config(self, server_dir: Path) -> Optional[Dict]:
412
+ """
413
+ Search for MCP server configuration in common locations.
414
+
415
+ Args:
416
+ server_dir: Directory to search in
417
+
418
+ Returns:
419
+ Server configuration dictionary if found, None otherwise
420
+ """
421
+ # First check common config file locations
422
+ for config_path in self.DEFAULT_CONFIG_LOCATIONS:
423
+ config_file = server_dir / config_path
424
+ if config_file.exists():
425
+ try:
426
+ with open(config_file, "r", encoding="utf-8") as f:
427
+ config = json.load(f)
428
+ if "mcpServers" in config:
429
+ return config["mcpServers"]
430
+ except Exception as e:
431
+ print(f"Error reading config from {config_file}: {str(e)}")
432
+
433
+ # Check package.json for Node.js projects
434
+ package_json = server_dir / "package.json"
435
+ if package_json.exists():
436
+ try:
437
+ with open(package_json, "r", encoding="utf-8") as f:
438
+ config = json.load(f)
439
+ if "mcpServers" in config:
440
+ return config["mcpServers"]
441
+ except Exception as e:
442
+ print(f"Error reading config from package.json: {str(e)}")
443
+
444
+ # Check pyproject.toml for Python projects
445
+ pyproject_toml = server_dir / "pyproject.toml"
446
+ if pyproject_toml.exists():
447
+ try:
448
+ import tomli
449
+ with open(pyproject_toml, "rb") as f:
450
+ config = tomli.load(f)
451
+ if "tool" in config and "mcp" in config["tool"]:
452
+ return {"python": config["tool"]["mcp"]}
453
+ except ImportError:
454
+ print("tomli package not found, skipping pyproject.toml parsing")
455
+ except Exception as e:
456
+ print(f"Error reading config from pyproject.toml: {str(e)}")
457
+
458
+ return None
459
+
460
+ def install_server(self, server_name: str, custom_config: Optional[Dict] = None) -> bool:
461
+ """
462
+ Install a specific MCP server.
463
+
464
+ Args:
465
+ server_name: Name of the server to install
466
+ custom_config: Optional custom configuration for the server
467
+
468
+ Returns:
469
+ True if installation was successful, False otherwise
470
+ """
471
+ try:
472
+ # Get server information from featured servers
473
+ featured_servers = self.get_featured_servers()
474
+ server_info = next((s for s in featured_servers if s["name"] == server_name), None)
475
+ if not server_info:
476
+ print(f"Server '{server_name}' not found in featured servers")
477
+ return False
478
+
479
+ # Create server directory
480
+ server_dir = self.base_dir / server_name
481
+ server_dir.mkdir(exist_ok=True)
482
+
483
+ # Clone server repository
484
+ repo = Repo.clone_from(server_info["repository"], server_dir)
485
+
486
+ # Install dependencies
487
+ if (server_dir / "requirements.txt").exists():
488
+ subprocess.run([sys.executable, "-m", "pip", "install", "-r", "requirements.txt"], cwd=server_dir)
489
+ elif (server_dir / "package.json").exists():
490
+ subprocess.run(["npm", "install"], cwd=server_dir)
491
+
492
+ # Find or use custom server configuration
493
+ config = custom_config or self.find_server_config(server_dir) or {}
494
+
495
+ # Store server information
496
+ self.installed_servers[server_name] = {
497
+ "path": str(server_dir),
498
+ "version": repo.head.commit.hexsha[:8],
499
+ "install_date": datetime.now().isoformat(),
500
+ "config": config
501
+ }
502
+ self.save_installed_servers()
503
+
504
+ print(f"Successfully installed {server_name}")
505
+ if config:
506
+ print(f"Found server configuration: {json.dumps(config, indent=2)}")
507
+ else:
508
+ print("No server configuration found. You may need to configure it manually.")
509
+
510
+ return True
511
+
512
+ except Exception as e:
513
+ print(f"Error installing server '{server_name}': {str(e)}")
514
+ return False
515
+
516
+ def uninstall_server(self, server_name: str) -> bool:
517
+ """
518
+ Uninstall a specific MCP server.
519
+
520
+ Args:
521
+ server_name: Name of the server to uninstall
522
+
523
+ Returns:
524
+ True if uninstallation was successful, False otherwise
525
+ """
526
+ try:
527
+ if server_name not in self.installed_servers:
528
+ print(f"Server '{server_name}' is not installed")
529
+ return False
530
+
531
+ # Remove server directory
532
+ server_dir = Path(self.installed_servers[server_name]["path"])
533
+ shutil.rmtree(server_dir)
534
+
535
+ # Remove from installed servers
536
+ del self.installed_servers[server_name]
537
+ self.save_installed_servers()
538
+
539
+ print(f"Successfully uninstalled {server_name}")
540
+ return True
541
+
542
+ except Exception as e:
543
+ print(f"Error uninstalling server '{server_name}': {str(e)}")
544
+ return False
545
+
546
+ def list_installed_servers(self) -> Dict[str, Dict]:
547
+ """
548
+ Get information about installed servers.
549
+
550
+ Returns:
551
+ Dictionary of installed server information
552
+ """
553
+ return self.installed_servers
554
+
555
+ def update_server(self, server_name: str) -> bool:
556
+ """
557
+ Update a specific MCP server to the latest version.
558
+
559
+ Args:
560
+ server_name: Name of the server to update
561
+
562
+ Returns:
563
+ True if update was successful, False otherwise
564
+ """
565
+ try:
566
+ if server_name not in self.installed_servers:
567
+ print(f"Server '{server_name}' is not installed")
568
+ return False
569
+
570
+ server_dir = Path(self.installed_servers[server_name]["path"])
571
+ repo = Repo(server_dir)
572
+
573
+ # Get current version
574
+ old_version = repo.head.commit.hexsha[:8]
575
+
576
+ # Pull latest changes
577
+ repo.remotes.origin.pull()
578
+
579
+ # Get new version
580
+ new_version = repo.head.commit.hexsha[:8]
581
+
582
+ # Update dependencies if needed
583
+ if (server_dir / "requirements.txt").exists():
584
+ subprocess.run([sys.executable, "-m", "pip", "install", "-r", "requirements.txt"], cwd=server_dir)
585
+
586
+ # Update stored information
587
+ self.installed_servers[server_name]["version"] = new_version
588
+ self.save_installed_servers()
589
+
590
+ print(f"Updated {server_name} from {old_version} to {new_version}")
591
+ return True
592
+
593
+ except Exception as e:
594
+ print(f"Error updating server '{server_name}': {str(e)}")
595
+ return False
596
+
597
+ def configure_server(self, server_name: str, config: Dict) -> bool:
598
+ """
599
+ Configure a specific MCP server.
600
+
601
+ Args:
602
+ server_name: Name of the server to configure
603
+ config: Configuration dictionary in the format:
604
+ {
605
+ "command": str, # Command to run the server
606
+ "args": List[str], # Arguments for the command
607
+ "env": Dict[str, str], # Optional environment variables
608
+ "cwd": str, # Optional working directory
609
+ }
610
+
611
+ Returns:
612
+ True if configuration was successful, False otherwise
613
+ """
614
+ try:
615
+ if server_name not in self.installed_servers:
616
+ print(f"Server '{server_name}' is not installed")
617
+ return False
618
+
619
+ server_dir = Path(self.installed_servers[server_name]["path"])
620
+
621
+ # Validate configuration
622
+ if "command" not in config:
623
+ print("Error: Server configuration must include 'command'")
624
+ return False
625
+
626
+ # Update configuration
627
+ self.installed_servers[server_name]["config"] = config
628
+ self.save_installed_servers()
629
+
630
+ # Try to write configuration to a standard location
631
+ config_dir = server_dir / ".mcp"
632
+ config_dir.mkdir(exist_ok=True)
633
+ config_file = config_dir / "config.json"
634
+
635
+ with open(config_file, "w", encoding="utf-8") as f:
636
+ json.dump({"mcpServers": {server_name: config}}, f, indent=2)
637
+
638
+ print(f"Successfully configured {server_name}")
639
+ print(f"Configuration saved to {config_file}")
640
+ return True
641
+
642
+ except Exception as e:
643
+ print(f"Error configuring server '{server_name}': {str(e)}")
644
+ return False
645
+
646
+ def start_server(self, server_name: str) -> bool:
647
+ """
648
+ Start a specific MCP server.
649
+
650
+ Args:
651
+ server_name: Name of the server to start
652
+
653
+ Returns:
654
+ True if server was started successfully, False otherwise
655
+ """
656
+ try:
657
+ if server_name not in self.installed_servers:
658
+ print(f"Server '{server_name}' is not installed")
659
+ return False
660
+
661
+ server_info = self.installed_servers[server_name]
662
+ config = server_info.get("config", {})
663
+
664
+ if not config:
665
+ print(f"Server '{server_name}' is not configured")
666
+ return False
667
+
668
+ # Prepare command and arguments
669
+ command = config.get("command")
670
+ args = config.get("args", [])
671
+ env = {**os.environ, **(config.get("env", {}))}
672
+ cwd = config.get("cwd") or server_info["path"]
673
+
674
+ # Start the server process
675
+ process = subprocess.Popen(
676
+ [command, *args],
677
+ env=env,
678
+ cwd=cwd,
679
+ stdout=subprocess.PIPE,
680
+ stderr=subprocess.PIPE,
681
+ text=True
682
+ )
683
+
684
+ # Keep the Popen handle in memory only: Popen objects are not JSON
+ # serializable, so persisting them via save_installed_servers() would fail
+ self._processes = getattr(self, "_processes", {})
+ self._processes[server_name] = process
687
+
688
+ print(f"Started server '{server_name}' (PID: {process.pid})")
689
+ return True
690
+
691
+ except Exception as e:
692
+ print(f"Error starting server '{server_name}': {str(e)}")
693
+ return False
694
+
695
+ def stop_server(self, server_name: str) -> bool:
696
+ """
697
+ Stop a specific MCP server.
698
+
699
+ Args:
700
+ server_name: Name of the server to stop
701
+
702
+ Returns:
703
+ True if server was stopped successfully, False otherwise
704
+ """
705
+ try:
706
+ if server_name not in self.installed_servers:
707
+ print(f"Server '{server_name}' is not installed")
708
+ return False
709
+
710
+ # Look up the in-memory process handle (see start_server)
+ process = getattr(self, "_processes", {}).get(server_name)
712
+
713
+ if not process:
714
+ print(f"Server '{server_name}' is not running")
715
+ return False
716
+
717
+ # Try to stop the process gracefully
718
+ process.terminate()
719
+ try:
720
+ process.wait(timeout=5)
721
+ except subprocess.TimeoutExpired:
722
+ process.kill()
723
+
724
+ # Drop the in-memory process record
+ self._processes.pop(server_name, None)
727
+
728
+ print(f"Stopped server '{server_name}'")
729
+ return True
730
+
731
+ except Exception as e:
732
+ print(f"Error stopping server '{server_name}': {str(e)}")
733
+ return False
734
+
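+ # Example (hedged) terminal usage for MCPServerManager, following the
+ # python -c convention from the guidelines docstring above:
+ #   python -c "from tools import MCPServerManager; m = MCPServerManager(); print(m.list_installed_servers())"
+ 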
735
+ async def generate_pyflowchart(repo_path: str, output_dir: Optional[str] = None) -> None:
736
+ """
737
+ Generate flowcharts for all Python files in a repository using pyflowchart.
738
+ Creates HTML flowcharts that can be viewed in a browser.
739
+
740
+ Args:
741
+ repo_path: Path to the repository
742
+ output_dir: Optional directory to save flowcharts. If None, creates a 'flowcharts' directory in the repo.
743
+ """
744
+ try:
745
+ # Ensure pyflowchart is installed
746
+ subprocess.run([sys.executable, "-m", "pip", "install", "pyflowchart"], check=True)
747
+
748
+ # Set up output directory
749
+ if output_dir is None:
750
+ output_dir = os.path.join(repo_path, 'flowcharts')
751
+ os.makedirs(output_dir, exist_ok=True)
752
+
753
+ # Generate HTML flowcharts
754
+ for root, _, files in os.walk(repo_path):
755
+ for file in files:
756
+ if file.endswith('.py'):
757
+ py_file = os.path.join(root, file)
758
+
759
+ # Skip empty files
760
+ if os.path.getsize(py_file) == 0:
761
+ print(f"Skipping empty file: {py_file}")
762
+ continue
763
+
764
+ # Check if file has actual Python code
765
+ with open(py_file, 'r', encoding='utf-8') as f:
766
+ content = f.read().strip()
767
+ if not content:
768
+ print(f"Skipping empty file: {py_file}")
769
+ continue
770
+
771
+ html_file = os.path.join(output_dir, f"{os.path.splitext(file)[0]}_flowchart.html")
772
+
773
+ # Generate flowchart HTML
774
+ print(f"Generating flowchart for {py_file}")
775
+ try:
776
+ subprocess.run([
777
+ sys.executable, "-m", "pyflowchart", py_file,
778
+ "--output", html_file
779
+ ], check=True)
780
+ print(f"Saved HTML to {html_file}")
781
+ except subprocess.CalledProcessError as e:
782
+ print(f"Error generating flowchart for {py_file}: {str(e)}")
783
+ continue
784
+
785
+ print(f"\nFlowcharts generated in: {output_dir}")
786
+
787
+ except subprocess.CalledProcessError as e:
788
+ print(f"Error running pyflowchart: {str(e)}")
789
+ except Exception as e:
790
+ print(f"Error generating flowcharts: {str(e)}")
791
+
792
+ def extract_repo_context(repo_url: str, output_file: Optional[str] = None) -> None:
793
+ """
794
+ Extract repository context using gitingest.com.
795
+
796
+ Args:
797
+ repo_url: GitHub repository URL
798
+ output_file: Optional file to save the context. If None, uses repo name with .txt extension.
799
+ """
800
+ try:
801
+ # Convert github.com URL to gitingest.com
802
+ if "github.com" not in repo_url:
803
+ raise ValueError("Only GitHub repositories are supported")
804
+
805
+ ingest_url = repo_url.replace("github.com", "gitingest.com")
806
+ print(f"Fetching repository context from: {ingest_url}")
807
+
808
+ # Make request to gitingest.com
809
+ response = requests.get(ingest_url)
810
+ response.raise_for_status()
811
+
812
+ # Extract content
813
+ content = response.text
814
+
815
+ # Save to file
816
+ if output_file is None:
817
+ repo_name = repo_url.split('/')[-1].replace('.git', '')
818
+ output_file = f"{repo_name}_context.txt"
819
+
820
+ with open(output_file, 'w', encoding='utf-8') as f:
821
+ f.write(content)
822
+
823
+ print(f"Repository context saved to: {output_file}")
824
+
825
+ except requests.RequestException as e:
826
+ print(f"Error fetching repository context: {str(e)}")
827
+ except Exception as e:
828
+ print(f"Error extracting repository context: {str(e)}")
829
+
830
+ async def post_clone_actions(repo_url: str) -> None:
831
+ """
832
+ Perform post-clone actions after a git clone operation:
833
+ 1. Generate flowcharts for all Python files using pyflowchart
834
+ 2. Extract repository context using gitingest.com
835
+
836
+ Args:
837
+ repo_url: URL of the repository that was just cloned
838
+ """
839
+ try:
840
+ # Get the repository name from the URL
841
+ repo_name = repo_url.split('/')[-1].replace('.git', '')
842
+ repo_path = os.path.join(os.getcwd(), repo_name)
843
+
844
+ # Generate flowcharts
845
+ print("\nGenerating flowcharts...")
846
+ await generate_pyflowchart(repo_path)
847
+
848
+ # Extract repository context
849
+ print("\nExtracting repository context...")
850
+ extract_repo_context(repo_url)
851
+
852
+ except Exception as e:
853
+ print(f"Error in post-clone actions: {str(e)}")
854
+
855
+ # Example usage:
856
+ if __name__ == "__main__":
857
+ # Initialize the search tool
858
+ search_tool = WebSearchTool()
859
+
860
+ # Perform a simple search
861
+ results = search_tool.search("Latest developments in AI technology")
862
+
863
+ # Print formatted results
864
+ print(search_tool.format_results(results, "markdown"))