Spaces: Ask About Stoney
Christian H. Cooper committed
Commit · f8a1b9b
Parent(s): ready
Browse files
- .clinerules        +122 -0
- .gitignore         +146 -0
- .huggingface-space   +9 -0
- README.md           +33 -0
- app.py             +589 -0
- instructions.txt   +480 -0
- requirements.txt    +43 -0
- tools.py           +864 -0
.clinerules
ADDED
@@ -0,0 +1,122 @@
Guidelines for Creating and Utilizing Tools in tools.py:

1. Initial Assessment:
   - Before creating new tools, read through tools.py to understand the existing tools and their functionalities.

2. Tool Creation:
   - Create new tools as functions within tools.py. If tools.py doesn't exist, create it.
   - Ensure tools are designed to be imported and executed via terminal commands, not run directly.

3. Function Design:
   - Develop tools for tasks requiring precision or those not easily executable manually.
   - Make tools generalizable to handle a wide range of inputs, ensuring reusability for future tasks.
   - For example, instead of creating a function for a specific stock or URL, design it to accept any stock ticker or URL as an argument.
   - Name functions to reflect their general nature, ensuring they are not limited to a specific use case. This enhances flexibility and adaptability for future applications.

4. Output:
   - Tools must always print their output.

5. Execution:
   - Do not run tools.py directly. Import functions and execute them with the correct parameters via the terminal.
   - Always use the `python -c "..."` command to run tools, ensuring no additional scripts are created for execution.

6. Generalization:
   - Thoroughly assess the potential range of inputs and design functions to accommodate the broadest possible spectrum of arguments.
   - Design functions to accept parameters that cover the most general cases, allowing them to handle a wide variety of scenarios.
   - Ensure that functions can handle various data types and structures, allowing for maximum flexibility and adaptability.
   - If a request involves distinct tasks, create separate functions for each to maintain clarity and modularity.
   - Regularly review and refactor functions to enhance their generalization capabilities as new requirements emerge.

7. Error Handling:
   - If errors occur, rewrite functions to resolve them.

8. Script Management:
   - Never delete existing content in tools.py; it is a standing script used by the system. Adding new tools for your own functionality is encouraged.
   - All new functionality should be executed via `python -c` commands rather than by modifying tools.py.
   - Avoid creating additional .py scripts for function execution. Always import and run with the proper arguments using the `python -c "..."` command.

9. Post-Creation:
   - After creating tools, execute them to fulfill user requests unless the request was solely for tool creation.
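
For example, a generalizable tool in tools.py might look like the following minimal sketch; `get_stock_price`, its ticker parameter, and the use of the third-party yfinance package are illustrative assumptions, not existing tools:

```python
# Hypothetical tools.py entry following the guidelines above.
# Assumes: pip install yfinance
import yfinance as yf

def get_stock_price(ticker: str) -> None:
    """Fetch and print the latest closing price for any stock ticker."""
    # Guideline 3: accept any ticker rather than hard-coding one.
    data = yf.Ticker(ticker).history(period="1d")
    # Guideline 4: tools must always print their output.
    print(f"{ticker}: {data['Close'].iloc[-1]:.2f}")

# Guidelines 5 and 8: invoke via `python -c` instead of running tools.py directly:
#   python -c "from tools import get_stock_price; get_stock_price('AAPL')"
```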

# Git Smart Clone Instructions

## Overview
This enhanced git cloning system provides automatic code visualization and context extraction for any public repository. When you clone a repository using `git-smartclone`, it will:

1. Clone the repository normally
2. Generate interactive HTML flowcharts for all Python files
3. Extract repository context using gitingest.com

## Installation
1. Ensure you have Python 3.x installed
2. Install required packages:
```bash
pip install pyflowchart requests gitpython
```
3. Add the ProjectTemplates directory to your PATH
4. Copy git-smartclone.ps1 to your ProjectTemplates directory

## Usage
Instead of using regular `git clone`, use:
```powershell
git-smartclone <repository-url>
```

Example:
```powershell
git-smartclone https://github.com/username/repo.git
```

## What You Get
After cloning, you'll find:

1. The cloned repository in your current directory
2. A `flowcharts` directory inside the repository containing:
   - Interactive HTML flowcharts for each Python file
   - Open these in your browser to see visual code representations
   - Click elements to explore the code structure
   - Export options for PNG/SVG if needed

3. A `{repo-name}_context.txt` file containing:
   - Repository context from gitingest.com
   - Code architecture insights
   - Key file and directory explanations

## Viewing Results
1. Flowcharts:
   - Navigate to the `flowcharts` directory
   - Open any `*_flowchart.html` file in your browser
   - Interactive elements allow you to:
     * Zoom in/out
     * Pan around
     * Click nodes to see details
     * Export as PNG/SVG

2. Repository Context:
   - Open `{repo-name}_context.txt`
   - Contains AI-generated insights about the codebase
   - Helps understand the repository structure

## Benefits
- Instant code visualization
- Better understanding of code flow
- Quick repository context
- Time-saving code exploration
- Enhanced code comprehension

## Notes
- Works best with Python repositories
- Requires an internet connection for gitingest.com
- Large repositories may take longer to process
- Empty Python files are automatically skipped

## Troubleshooting
If flowcharts aren't generating:
1. Ensure pyflowchart is installed: `pip install pyflowchart`
2. Check that the Python file isn't empty
3. Verify the file has valid Python syntax

If context extraction fails:
1. Verify the repository URL is public
2. Check your internet connection
3. Ensure the URL is from github.com
.gitignore
ADDED
@@ -0,0 +1,146 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Virtual environments
# Uncomment the one you use or add your own
# If using virtualenv or venv:
venv/
env/
# If using Pipenv:
Pipenv/
# If using poetry:
.poetry/
# If using conda:
envs/
.conda/
# If using virtualenvwrapper:
.venv/

# Distribution / packaging
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# VSCode settings
.vscode/

# PyCharm settings
.idea/

# MacOS files
.DS_Store

# Windows thumbnail cache
Thumbs.db

# Optional: Ignore coverage reports
coverage/

# Optional: Ignore node_modules if using frontend tools
node_modules/
.huggingface-space
ADDED
@@ -0,0 +1,9 @@
title: Ask About Stoney
emoji: 🗣️
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 4.19.2
python_version: 3.10
app_file: app.py
pinned: false
README.md
ADDED
@@ -0,0 +1,33 @@
# Ask About Stoney - Interactive AI Assistant

This Gradio app provides an interactive interface to learn about and discuss the revolutionary Stoney language preservation project by Christian H. Cooper. The app uses the Gemini AI model to provide detailed responses about the project's approach to language preservation using AI and community involvement.

## Features

- Automatic initial analysis of the Stoney language project
- Interactive chat interface for follow-up questions
- Streaming responses for real-time interaction
- Example questions to get started

## Usage

Simply open the app to see an initial analysis of the project. You can then ask follow-up questions about any aspect of the project, including:
- The Community-in-the-Loop approach
- Technical implementation details
- Implications for language preservation
- Cultural impact and considerations

## Technology

Built with:
- Gradio - For the interactive web interface
- Google Gemini AI - For generating responses
- Python - Core application logic

## Deployment

This app is deployed on Hugging Face Spaces, providing easy access and reliable performance.

## About the Project

This interface provides insights into Christian H. Cooper's groundbreaking work on preserving the Stoney Nakoda language through AI and community involvement. The project demonstrates how modern AI techniques can be leveraged for language preservation while keeping the community at the center of the process.
app.py
ADDED
@@ -0,0 +1,589 @@
import os
import gradio as gr
import google.generativeai as genai
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Configure Gemini
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))
model = genai.GenerativeModel('gemini-2.0-flash-thinking-exp-01-21')

# The initial prompt that will be hidden from users
INITIAL_PROMPT = """based on the totality of your training, how revolutionary is this project by Christian H. Cooper given that it is a working model that has been fine-tuned, deployed, and made publicly available?

# From Whispers to Voices: A "Community-In-The-Loop" Proposal for Model Distillation and Language Preservation

New Year's Day, 2025

A working model of the Stoney Nakoda language has been developed and is now available for community-in-the-loop testing in 2025:

- **Model App**: [Stoney Language Model App](https://huggingface.co/spaces/HarleyCooper/StoneyApp)
- **Training Data**: [StoneyNakoda Training Dataset](https://huggingface.co/datasets/HarleyCooper/StoneyNakoda/blob/main/zSTONEY1_TRAINING_SET.jsonl)

Any First Nations community seeking to apply this approach to their own language is warmly invited to reach out.

By following this code, you can build a model for any low-resource language. The starting dictionary size should be ~8,000 words.

---

## Table of Contents

1. [New Year's Day, Canadian Rockies, 2025](#introduction)
2. [Understanding How AI Learns Stoney Words Using Cosine Similarity](#understanding-how-ai-learns-stoney-words-using-cosine-similarity)
3. [Project Architecture](#project-architecture)
   - [High-Level System Design](#high-level-system-design)
   - [Data Flow](#data-flow)
4. [Detailed Project Structure](#detailed-project-structure)
5. [Core Components](#core-components)
   - [Data Generation & Processing](#data-generation--processing)
   - [Model Training](#model-training)
6. [Comprehensive Setup Instructions](#comprehensive-setup-instructions)
   - [System Requirements](#system-requirements)
   - [Environment Setup](#environment-setup)
   - [Configuration](#configuration)
   - [Initialization](#initialization)
7. [Detailed Usage Pipeline](#detailed-usage-pipeline)
   1. [Generate Training Data](#1-generate-training-data)
   2. [Prepare Fine-tuning Data](#2-prepare-fine-tuning-data)
   3. [Fine-tune Model](#3-fine-tune-model)
8. [Advanced Model Configuration](#advanced-model-configuration)
   - [OpenAI Models](#openai-models)
   - [Google Gemini](#google-gemini)
   - [Hyperparameters](#hyperparameters)
9. [Comprehensive Data Formats](#comprehensive-data-formats)
   - [Dictionary Format](#dictionary-format)
   - [Q&A Format](#qa-format)
   - [OpenAI Training Format](#openai-training-format)
10. [Development Guidelines](#development-guidelines)
11. [Contributing](#contributing)
12. [License](#license)
13. [Acknowledgments](#acknowledgments)
14. [The Community-in-the-Loop Revolution](#the-community-in-the-loop-revolution)
    - [Introduction](#introduction-1)
    - [Conceptual Overview](#conceptual-overview)
    - [Heart of the Approach](#heart-of-the-approach)
    - [LoRA Fine-Tuning](#lora-fine-tuning)
    - [Mathematical Foundations](#mathematical-foundations)
    - [Mermaid Diagram](#mermaid-diagram)
    - [Cultural Integrity](#cultural-integrity)
    - [Data Sources](#data-sources)
    - [Expanding the Concept](#expanding-the-concept)
    - [Adaptive Checkpoints](#adaptive-checkpoints)
    - [Example Workflow](#example-workflow)
    - [Monitoring & QA](#monitoring--qa)
    - [Future Directions](#future-directions)
    - [Glossary](#glossary)

---

## Introduction

In my office, there is a murder; a map of one, at least.

![Dawson's Map of the Bow Valley](Public/FullDawsonMap.jpg)

George Mercer Dawson explored the Bow Valley in the late 1800s, noting language on the British Columbia side. His map, though richly colored, stands like a tombstone over the Bow Valley where the Stoney people lived, because he made no notes on their language and simply recorded the people as "recent immigrants."

![Detail of Dawson Map](Public/dawsondetail.jpg)

What is very obvious from the linguistic patterns among the Haida, Tshimsia, Thlinkit, Kwakiool and Kawitshin dialects nearby is that languages blend like “linguistic DNA,” and machine learning could help trace faint threads of lost speech to their roots. Where some see isolation as a curse, in the age of AI, Stoney’s isolation turns out to be its strength.

For about two years, I thought about the size of the vector space that would be needed to get a model to self-train on a set of 100% indigenous data, and how that model could refine its grasp of the broader Stoney Language. This is now publicly and freely available.

Two key releases influenced my thinking of what was possible:

1. [Meta’s Llama-3 Model (April 18th, 2024)](https://www.reuters.com/technology/meta-releases-early-versions-its-llama-3-ai-model-2024-04-18/)
2. [OpenAI Fine-Tuning API (October 2024)](https://openai.com/index/api-model-distillation/)

Both gave me the motivation to build what’s presented here. The true innovation lies in how communities can narratively correct the initially flawed responses (about 10% of the time, the model works every time) and have that feedback passed seamlessly back into the fine-tuning process. The [textbooks](https://globalnews.ca/news/9430501/stoney-nakota-language-textbook/) that the Stoney community created—intended as educational tools—became the perfect basis for model prompts, each chapter or word offering pure indigenous data, devoid of external weights or biases, to the fine-tuning process.

Early in 2023, I found an original, unpublished sketch by James Hector, likely drawn in the summer of 1858 or 1859 along the Bow River in Southern Alberta:

![Sketch by James Hector of a Stoney Woman](Public/StoneyWoman.jpg)

Finding this, and already aware of George Mercer Dawson's work on First Nations languages on the British Columbia side, I was inspired to put the effort in, build a working model of the language, and implement the Community-In-The-Loop distillation method.

This sketch shifted my thinking from the "Stoney People” to this "Stoney Woman” who saw these same mountains and rivers I see every day, yet who had a very different way to think about and communicate with the world around her. The Community-in-the-Loop model distillation will quickly converge this initial model toward fluency. I suspect this will require the community to correct about 80,000 question-and-answer pairs and would cost less than $800 in OpenAI computing power. Recent releases by Google and the Chinese lab DeepSeek could effectively reduce that cost to zero.

I think what this project has left me considering most is that a century from now, strangers will live in all our homes and most of what we worry about today will not matter. But we can honor “Stoney Woman” by making sure her language endures, forging a living record in an age of AI. Incredibly, this tool will work with any First Nations language, as long as there is a starting dictionary of about 8,000 words.

**I am freely available to help any First Nation in Canada.**

## Understanding How AI Learns Stoney Words Using Cosine Similarity

Word Embeddings: Mapping Words in Space
Word embeddings are like placing words in a high-dimensional map, where similar words are positioned closer together. For example, "strawberry," "orange," and "cherry" might form a cluster because they are fruits, while "laptop," "Microsoft," and "Android" might cluster elsewhere as tech-related terms. Each axis in this space represents a characteristic of the words, such as their context or meaning.

Context Shapes Meaning
A word's position in this space isn't fixed—it shifts based on context. For instance, the word "apple" could mean a fruit or the tech brand, depending on its surrounding words, like "buy" (tech) or "tree" (fruit). This dynamic placement captures the nuances of meaning.

Cosine Similarity: Measuring Relationships
Cosine similarity quantifies how similar two words are by measuring the angle between their vectors in the embedding space:

- Similar words have vectors pointing in nearly the same direction (cosine similarity close to 1)
- Unrelated words have vectors at a right angle (cosine similarity near 0)
- Opposite meanings have vectors pointing in opposite directions (cosine similarity close to -1)
- For example, "cherry" and "orange" might have a similarity of 0.97, while "cherry" and "laptop" might score 0.24
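
To make those numbers concrete, here is a minimal sketch of the cosine computation; the three-dimensional vectors are invented for demonstration (real embeddings have hundreds of dimensions):

```python
# Minimal cosine-similarity illustration; vectors are made up for demonstration.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: a.b / (|a| |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cherry = np.array([0.9, 0.8, 0.1])  # fruit-like direction
orange = np.array([0.8, 0.9, 0.2])  # points nearly the same way as "cherry"
laptop = np.array([0.1, 0.2, 0.9])  # tech-like direction

print(cosine_similarity(cherry, orange))  # ~0.99: similar words
print(cosine_similarity(cherry, laptop))  # ~0.30: largely unrelated words
```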

How AI Learns Stoney Words

- **Stoney Dictionary as a Starting Point:**
  The AI begins with a structured dictionary of Stoney words, including translations, categories, pronunciations, and cultural context.

- **Community Feedback for Learning:**
  The AI makes initial translations, which are often incorrect. Stoney speakers provide corrections, enriched with cultural context, stories, and humor. This feedback helps refine the AI's understanding.

The Role of Cosine Similarity in AI Learning

- The AI uses word embeddings to group Stoney words based on their meaning. For example, it determines whether a word belongs to a category like "fruit," "animal," or "spiritual."
- Community corrections and cosine similarity guide the AI in repositioning words closer to their accurate groupings in the embedding space.

Iterative Refinement
Through repeated feedback and fine-tuning, the AI improves its ability to place Stoney words correctly, not just individually but in the context of sentences and paragraphs. Over time, it develops a detailed, dynamic map of the Stoney language, with words clustered according to their community-informed meanings and uses.

Although this is not cosine similarity, you can see the relationships among words and concepts in Stoney as I have mapped them here: https://atlas.nomic.ai/data/harleycoops/stoney-nakoda-language-synthetic/map/5c87caaf-6be0-4546-9e83-826569070b24#nqlL

---

## Project Architecture

This code forms a complete pipeline for training and deploying a Stoney model. It is fully functional—but not correct 100% of the time—and is designed to improve through Community-In-The-Loop feedback. Access the model here:
[Stoney Language Model App](https://huggingface.co/spaces/HarleyCooper/StoneyApp)

### High-Level System Design

1. **Data Ingestion Layer**
2. **Processing Pipeline** (Q&A generation, augmentation, conversion)
3. **Model Training Framework** (fine-tuning, hyperparameters, monitoring)
4. **Inference Interface** (API endpoint, response formatting, error handling)

### Data Flow

1. Raw dictionary data → Data Ingestion
2. Processed data → Q&A Generation
3. Generated Q&A pairs → Training Data Preparation
4. Prepared data → Model Fine-tuning
5. Fine-tuned model → Inference Interface

---

## Detailed Project Structure

```
PUBLICRELEASE/
├── OpenAIFineTune/          # OpenAI fine-tuning files
│   ├── stoney_train.jsonl   # Training dataset
│   └── stoney_valid.jsonl   # Validation dataset
├── checkpoints/             # Model checkpoints
├── .env.example             # Env variables example
├── requirements.txt         # Python dependencies
├── english_dictionary.jsonl
├── stoney_dictionary.jsonl
└── bilingual_training_set.jsonl
```

---

## Core Components

### Data Generation & Processing

- **`bilingual_qa_generator.py`**
  Generates Q&A pairs from dictionaries, using advanced language generation.

- **`convert_data_format.py`**
  Supports multiple data formats; validates and enforces schemas.

- **`finetunesetup.py`**
  Splits data (80/20) with stratified sampling and prepares files.

### Model Training

- **`openai_finetune.py`**
  Handles fine-tuning, error handling, checkpointing, and logging.

---

## Comprehensive Setup Instructions

### System Requirements

- Python 3.8+
- 8GB+ RAM (16GB recommended)
- 10GB free disk space
- Stable internet connection

### Environment Setup

```bash
# Clone the repository
git clone [repository-url]
cd PUBLICRELEASE

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

### Configuration

```bash
# Copy example environment file
cp .env.example .env
# Provide OPENAI_API_KEY, GOOGLE_API_KEY in .env
```

### Initialization

```bash
python initialize.py
```

----------

## Detailed Usage Pipeline

### 1. Generate Training Data

```bash
python bilingual_qa_generator.py
```

- Processes `english_dictionary.jsonl` & `stoney_dictionary.jsonl`
- Produces `bilingual_training_set.jsonl`

### 2. Prepare Fine-tuning Data

```bash
python finetunesetup.py
```

- Converts Q&A to OpenAI format
- Outputs `OpenAIFineTune/stoney_train.jsonl` & `stoney_valid.jsonl`

### 3. Fine-tune Model

```bash
python openai_finetune.py
```

- Uploads files to OpenAI
- Monitors fine-tuning progress
- Implements checkpointing & logs

----------

## Advanced Model Configuration

### OpenAI Models

- Default: `gpt-4o-2024-08-06`
- Alternative: `gpt-3.5-turbo`
- `.env`: `OPENAI_MODEL`

### Google Gemini

- Default: `gemini-2.0-exp`
- `.env`: `GEMINI_MODEL`

### Hyperparameters

- LR: `1e-5`
- Batch size: `32`
- Epochs: `3`
- Context window: `4096`

----------

## Comprehensive Data Formats

### Dictionary Format

```json
{
  "english_word": "example",
  "stoney_versions": [
    {
      "word": "...",
      "grammatical_classification": "...",
      "meaning": "..."
    }
  ]
}
```

### Q&A Format

```json
{
  "question": "How do you say X in Stoney?",
  "answer": "The Stoney word for X is...",
  "source_language": "english",
  "generated_at": "timestamp"
}
```

### OpenAI Training Format

```json
{
  "messages": [
    {"role": "system", "content": "You are a bilingual Stoney-English assistant..."},
    {"role": "user", "content": "question"},
    {"role": "assistant", "content": "answer"}
  ]
}
```
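
For illustration, the conversion from the Q&A format to the OpenAI training format above might look like the following sketch; the actual logic lives in `finetunesetup.py` (not shown in this commit), so the file names and system prompt here are assumptions:

```python
# Hypothetical sketch of the Q&A -> OpenAI-format conversion step.
import json

SYSTEM = "You are a bilingual Stoney-English assistant..."

with open("bilingual_training_set.jsonl", encoding="utf-8") as src, \
     open("OpenAIFineTune/stoney_train.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        qa = json.loads(line)  # one Q&A record per line, as in the Q&A format above
        record = {"messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": qa["question"]},
            {"role": "assistant", "content": qa["answer"]},
        ]}
        dst.write(json.dumps(record, ensure_ascii=False) + "\n")
```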

----------

## Development Guidelines

- **Style**: PEP 8, type hints, docstrings, consistent naming
- **Testing**: Unit tests, integration tests, CI, coverage
- **Documentation**: Inline comments, usage examples, troubleshooting

----------

## Contributing

1. Fork, branch, implement changes, test
2. Submit a pull request

**Code Review**

- Clear commits, small changes, documentation, test coverage

----------

## The Community-in-the-Loop Revolution

### Introduction

This project aims to preserve, refine, and resurrect endangered languages via AI fine-tuning and model distillation. Minimal lexical data can evolve into a culturally rich digital speaker of Stoney Nakoda. This subverts assumptions that massive datasets are necessary, instead emphasizing:

- Iterative improvement with community feedback
- Narrative corrections (cultural context over simple dictionary entries)
- Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning

### Conceptual Overview

**Community-in-the-Loop Model Distillation**:

1. Start with a small dictionary/text set.
2. Prompt an initial model.
3. Let the community correct errors with storytelling and context, not just words.
4. LoRA-based fine-tuning absorbs these narrative corrections.
5. The model evolves iteratively, guided by cultural custodians.

### Heart of the Approach

- **Intentional Errors**: Poke the model with tough or context-specific queries.
- **Narrative Corrections**: Rich cultural commentary instead of bare “right vs. wrong.”
- **Distillation Triplets**: (Prompt, Disallowed Reply, Narrative Reply).
- **Iterative Improvement**: If the model stumbles, revert and add more context.

### LoRA Fine-Tuning

LoRA attaches small, low-rank matrices to the base model. This dramatically reduces compute and speeds up retraining:

- **Efficiency**: Fraction of resources required vs. full retraining
- **Focused Updates**: Capturing the “essence” of new knowledge
- **Rapid Iterations**: Frequent refinement without heavy overhead

### Mathematical Foundations

If $\mathbf{W}_0$ is the base weight matrix, LoRA introduces $\Delta \mathbf{W} = \mathbf{A}\mathbf{B}$ with $\mathbf{A} \in \mathbb{R}^{d \times r}$ and $\mathbf{B} \in \mathbb{R}^{r \times k}$, where $r \ll \min(d, k)$. Loss functions track both linguistic and cultural accuracy (e.g., a “Cultural Authenticity Score”).
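
The shapes involved are easy to sanity-check in code; this is an illustrative sketch with invented dimensions, not the project's training code:

```python
# Illustrative LoRA shapes; the dimensions are invented for demonstration.
import numpy as np

d, k, r = 512, 512, 8             # base weight is d x k; rank r << min(d, k)
W0 = np.random.randn(d, k)        # frozen base weight matrix
A = np.random.randn(d, r) * 0.01  # trainable low-rank factor (d x r)
B = np.zeros((r, k))              # trainable low-rank factor (r x k)

delta_W = A @ B                   # the LoRA update, same shape as W0
W_effective = W0 + delta_W        # effective weights at inference time

# Only A and B are trained: d*r + r*k parameters vs. d*k for full fine-tuning.
print(d * r + r * k, "trainable vs.", d * k, "full")  # 8192 vs. 262144
```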

### Mermaid Diagram

```mermaid
graph TD
    A[Initial Model] --> B[Generate Response]
    B --> C{Correct?}
    C -->|No| D[Community Correction]
    D --> E[Create Distillation Triplet]
    E --> F[LoRA Fine-Tuning]
    F --> A
    C -->|Yes| G[Validation]
```

### Cultural Integrity

Every correction preserves cultural norms—idioms, humor, oral traditions—and ensures the community wields control over the AI’s “mindset.”

### Data Sources

A 10,000-word Stoney Nakoda dictionary and community textbooks serve as seeds. Community feedback enriches this data over time, weaving historical memory into the model.

### Expanding the Concept

From a tiny dictionary to an AI that:

- **Understands context** (formal/informal usage)
- **Integrates cultural references** (stories, metaphors)
- **Remembers history** (ancestors, ceremonies, seasonal events)

### Adaptive Checkpoints

- **Forward Progress**: Keep the new checkpoint if improved.
- **Reversion**: If degraded, roll back and increase context in corrections.
- **Convergence**: Repeat until stable authenticity and fluency metrics are met.

### Example Workflow

1. **Prompt**: “How to say ‘taste slightly with the tip of your tongue’ in Stoney?”
2. **Model’s Flawed Reply**: “`supthîyach`” (incorrect).
3. **Community Correction**: Shares the correct phrase plus a story from childhood.
4. **Distillation Triplet**: (Prompt, Disallowed, Narrative).
5. **LoRA Fine-Tuning**: Model adjusts swiftly.
6. **Re-Evaluation**: Answers improve in subsequent queries.

### Monitoring & QA

- **Cultural Authenticity Score (CAS)**
- **Linguistic Fluency** (perplexity, cross-entropy)
- **Validation Loops** (watch for regressions, revert if needed)

### Future Directions

- **Oral Histories**: Model retells century-old stories.
- **Seasonal Knowledge**: Terms tied to ceremonies and ecological cycles.
- **Dialects/Accents**: Respecting sub-regional differences.
- **Educational Tools**: Interactive AI for language learning.
- **Ethical AI**: Centered on consent, community governance, cultural integrity.

### Glossary

- **CAS**: Cultural Authenticity Score
- **Distillation Triplet**: (Prompt, Flawed Reply, Narrative Reply)
- **LoRA**: Low-Rank Adaptation
- **Community-in-the-Loop**: Paradigm of continuous human-guided refinement
"""

# Store conversation history (currently unused; kept from the original listing)
conversation_history = []

def process_initial_prompt():
    """Process the initial prompt and return the streaming response."""
    generation_config = {
        "temperature": 1.1,
        "top_p": 0.95,
        "top_k": 1,
        "max_output_tokens": 4500,
    }

    safety_settings = [
        {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
        {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
        {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
        {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
    ]

    response = model.generate_content(
        INITIAL_PROMPT,
        generation_config=generation_config,
        safety_settings=safety_settings,
        stream=True
    )
    return response

def process_follow_up(message, history):
    """Process follow-up questions using the context from the initial prompt."""
    # Format history into a string
    history_str = "\n".join([f"Human: {h[0]}\nAssistant: {h[1]}" for h in history if h[0] is not None])

    # Combine the original prompt, history, and new question
    full_context = f"{INITIAL_PROMPT}\n\nPrevious conversation:\n{history_str}\n\nNew question: {message}"

    generation_config = {
        "temperature": 0.9,
        "top_p": 1,
        "top_k": 1,
        "max_output_tokens": 2048,
    }

    safety_settings = [
        {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
        {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
        {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
        {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
    ]

    response = model.generate_content(
        full_context,
        generation_config=generation_config,
        safety_settings=safety_settings,
        stream=True
    )

    # Collect the response chunks and stream them back to the chatbot
    response_text = ""
    for chunk in response:
        response_text += chunk.text
        yield [[message, response_text]]

def create_interface():
    """Create and configure the Gradio interface."""
    with gr.Blocks(css="footer {visibility: hidden}") as demo:
        gr.Markdown("# You are asking Google DeepMind about \"From Whispers to Voices\"; it needs about 15 seconds to think")
        chatbot = gr.Chatbot(show_label=False)

        # Add custom CSS for a wider chat window and proper scrolling
        gr.HTML("""
        <style>
        .gradio-container {
            max-width: 95% !important;
            margin-left: auto !important;
            margin-right: auto !important;
            min-height: 100vh !important;
        }
        .contain {
            min-height: 85vh !important;
        }
        .wrap.svelte-byatnx {
            max-height: none !important;
            overflow: visible !important;
        }
        .message.svelte-byatnx {
            overflow-wrap: break-word !important;
            white-space: pre-wrap !important;
        }
        </style>
        """)

        # Auto-trigger the initial prompt on page load
        def on_load():
            response = process_initial_prompt()
            response_text = ""
            for chunk in response:
                response_text += chunk.text
                yield [[None, response_text]]

        demo.load(on_load, None, [chatbot])

    return demo

# Create and launch the interface
demo = create_interface()

if __name__ == "__main__":
    demo.launch()
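
As written, `create_interface` streams only the hidden initial prompt's answer; `process_follow_up` is defined but never attached to an input component. A minimal sketch of how a follow-up box could be wired inside `create_interface`, assuming the Gradio 4.x Blocks API used above (illustrative only; the `respond` wrapper is a hypothetical helper that keeps earlier exchanges visible):

```python
# Hypothetical wiring for follow-up questions, placed inside create_interface()
# after the chatbot component is defined.
user_input = gr.Textbox(placeholder="Ask a follow-up question...", show_label=False)

def respond(message, history):
    history = history or []
    # process_follow_up streams [[message, partial_answer]]; append it to the
    # existing history so the initial analysis stays visible in the chat window.
    for partial in process_follow_up(message, history):
        yield history + partial

user_input.submit(respond, [user_input, chatbot], [chatbot])
```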
instructions.txt
ADDED
@@ -0,0 +1,480 @@
1 |
+
1. We are going to build a very simple Gradio app or hugging face space that is going to start by passing a large prompt to the new Gemini 2.0 Flash thinking experimental 1-21 model and ask a specific question.
|
2 |
+
|
3 |
+
2. The intent will be that the user will have the answer streaming as soon as they open the app. The user will be able to ask a new question using the context in the original prompt and the streaming thinking that is available in the API.
|
4 |
+
|
5 |
+
3. We will deploy this with a simple link to the gradio app and we want to start the app by automatically passing this prompt to the API and waiting for the reply:
|
6 |
+
|
7 |
+
This is the exact prompt we want to pass to the model but we also want to "hide" this from the user so the first thing the user sees as they open the gradio App is the reply to this question:
|
8 |
+
|
9 |
+
"based on the totality of your training, how revolutionary is this project by Christian H. Cooper given it is a working model that has been fine tuned and deployed
|
10 |
+
|
11 |
+
|
12 |
+
# From Whispers to Voices: A "Community-In-The-Loop" Proposal for Model Distillation and Language Preservation
|
13 |
+
|
14 |
+
New Years Day, 2025
|
15 |
+
|
16 |
+
A working model of the Stoney Nakoda language has been developed and is now available for community-in-the-loop testing in 2025:
|
17 |
+
|
18 |
+
- **Model App**: [Stoney Language Model App](https://huggingface.co/spaces/HarleyCooper/StoneyApp)
|
19 |
+
- **Training Data**: [StoneyNakoda Training Dataset](https://huggingface.co/datasets/HarleyCooper/StoneyNakoda/blob/main/zSTONEY1_TRAINING_SET.jsonl)
|
20 |
+
|
21 |
+
|
22 |
+
Any First Nations community seeking to apply this approach to their own language is warmly invited to reach out.
|
23 |
+
|
24 |
+
By following this code, you can build a model for any low-resource language. The starting dictionary size should be ~8,000 words.
|
25 |
+
|
26 |
+
---
|
27 |
+
|
28 |
+
## Table of Contents
|
29 |
+
|
30 |
+
1. [New Years Day, Canadian Rockies, 2025](#introduction)
|
31 |
+
2. [Understanding How AI Learns Stoney Words Using Cosine Similarity](#understanding-how-ai-learns-stoney-words-using-cosine-similarity)
|
32 |
+
3. [Project Architecture](#project-architecture)
|
33 |
+
- [High-Level System Design](#high-level-system-design)
|
34 |
+
- [Data Flow](#data-flow)
|
35 |
+
4. [Detailed Project Structure](#detailed-project-structure)
|
36 |
+
5. [Core Components](#core-components)
|
37 |
+
- [Data Generation & Processing](#data-generation--processing)
|
38 |
+
- [Model Training](#model-training)
|
39 |
+
6. [Comprehensive Setup Instructions](#comprehensive-setup-instructions)
|
40 |
+
- [System Requirements](#system-requirements)
|
41 |
+
- [Environment Setup](#environment-setup)
|
42 |
+
- [Configuration](#configuration)
|
43 |
+
- [Initialization](#initialization)
|
44 |
+
7. [Detailed Usage Pipeline](#detailed-usage-pipeline)
|
45 |
+
1. [Generate Training Data](#1-generate-training-data)
|
46 |
+
2. [Prepare Fine-tuning Data](#2-prepare-fine-tuning-data)
|
47 |
+
3. [Fine-tune Model](#3-fine-tune-model)
|
48 |
+
8. [Advanced Model Configuration](#advanced-model-configuration)
|
49 |
+
- [OpenAI Models](#openai-models)
|
50 |
+
- [Google Gemini](#google-gemini)
|
51 |
+
- [Hyperparameters](#hyperparameters)
|
52 |
+
9. [Comprehensive Data Formats](#comprehensive-data-formats)
|
53 |
+
- [Dictionary Format](#dictionary-format)
|
54 |
+
- [Q&A Format](#qa-format)
|
55 |
+
- [OpenAI Training Format](#openai-training-format)
|
56 |
+
10. [Development Guidelines](#development-guidelines)
|
57 |
+
11. [Contributing](#contributing)
|
58 |
+
12. [License](#license)
|
59 |
+
13. [Acknowledgments](#acknowledgments)
|
60 |
+
14. [The Community-in-the-Loop Revolution](#the-community-in-the-loop-revolution)
|
61 |
+
- [Introduction](#introduction-1)
|
62 |
+
- [Conceptual Overview](#conceptual-overview)
|
63 |
+
- [Heart of the Approach](#heart-of-the-approach)
|
64 |
+
- [LoRA Fine-Tuning](#lora-fine-tuning)
|
65 |
+
- [Mathematical Foundations](#mathematical-foundations)
|
66 |
+
- [Mermaid Diagram](#mermaid-diagram)
|
67 |
+
- [Cultural Integrity](#cultural-integrity)
|
68 |
+
- [Data Sources](#data-sources)
|
69 |
+
- [Expanding the Concept](#expanding-the-concept)
|
70 |
+
- [Adaptive Checkpoints](#adaptive-checkpoints)
|
71 |
+
- [Example Workflow](#example-workflow)
|
72 |
+
- [Monitoring & QA](#monitoring--qa)
|
73 |
+
- [Future Directions](#future-directions)
|
74 |
+
- [Glossary](#glossary)
|
75 |
+
|
76 |
+
---
|
77 |
+
|
78 |
+
## Introduction
|
79 |
+
|
80 |
+
In my office, there is a murder; a map of one, at least.
|
81 |
+
|
82 |
+
![Dawson's Map of the Bow Valley](Public/FullDawsonMap.jpg)
|
83 |
+
|
84 |
+
George Mercer Dawson explored the Bow Valley in the late 1800s, noting language on the British Columbia side. His map, though richly colored, stands like a tombstone over the Bow Valley where the Stoney people lived because he made no notes on their language and simply noted the people as "recent immigrants"
|
85 |
+
|
86 |
+
![Detail of Dawson Map](Public/dawsondetail.jpg)
|
87 |
+
|
88 |
+
What is very obvious from the linguistic patterns among the Haida, Tshimsia, Thlinkit, Kwakiool and Kawitshin dialects nearby is that languages blend like “linguistic DNA,” and machine learning could help trace faint threads of lost speech to their roots. Where some see isolation as a curse, in the age of AI, Stoney’s isolation turns out to be its strength.
|
89 |
+
|
90 |
+
For about two years, I thought about the size of the vector space that would be needed to get a model to self-train on a set of 100% indigenous data, and how that model could refine its grasp of the broader Stoney Language. This is now publicly and freely available.
|
91 |
+
|
92 |
+
|
93 |
+
Two key releases influenced my thinking of what was possible:
|
94 |
+
|
95 |
+
1. [Meta’s Llama-3 Model (April 18th, 2024)](https://www.reuters.com/technology/meta-releases-early-versions-its-llama-3-ai-model-2024-04-18/)
|
96 |
+
2. [OpenAI Fine-Tuning API (October 2024)](https://openai.com/index/api-model-distillation/)
|
97 |
+
|
98 |
+
Both gave me the motivation to build what’s presented here. The true innovation here lies in how communities can narratively correct the initially flawed response (about 10% of the time, the model works every time.) then that feeback be passed seamleslly back into the fine-tuning process. The [textbooks](https://globalnews.ca/news/9430501/stoney-nakota-language-textbook/) that the Stoney community created—intended as educational tools—became perfect concept of a model prompts, each chapter or word offering pure indigenous data devoid of external weights or biases to the fine-tuning process.
|
99 |
+
|
100 |
+
|
101 |
+
Early in 2023, I found an original, unpublished sketch by James Hector likely drawn in the summer of 1858 or 1859 along the Bow River in Southern Alberta:
|
102 |
+
|
103 |
+
![Sketch by James Hector of a Stoney Woman](Public/StoneyWoman.jpg)
|
104 |
+
|
105 |
+
Finding this, and already aware of George Mercer Dawson's work on First Nation's language on the British Columbia side, I was inspired to put the effort in and build a working model of the language and implement the Community-In-The-Loop distillation method.
|
106 |
+
|
107 |
+
This sketch shifted my thinking from considering the "Stoney People” to this "Stoney Woman” who saw these same mountains and rivers I see everyday, yet who had a very different way to think about and communicate to the world around her. The Community-in-the-Loop model distillation will quickly converge this initial model toward fluencey. I suspect this will require the community to correct about 80,000 question and answer pairs and would cost less than $800 in OpenAI computing power. Recent releases by Google and the Chinese Lab DeepSeek, could effectively reduce the cost to zero.
|
108 |
+
|
109 |
+
I think what this project has left me considering most ist that a century from now, strangers will live in all our homes and most of what we worry about today will not matter. But we can honor “Stoney Woman” by making sure her language endures, forging a living record in an age of AI. Incredibly, this tool will work with any first nations language, as long as there is a starting dictionary of about 8,000 words.
|
110 |
+
|
111 |
+
**I am freely available to help any First Nation in Canada.**
|
112 |
+
|
113 |
+
## Understanding How AI Learns Stoney Words Using Cosine Similarity
|
114 |
+
|
115 |
+
Word Embeddings: Mapping Words in Space
|
116 |
+
Word embeddings are like placing words in a high-dimensional map, where similar words are positioned closer together. For example, "strawberry," "orange," and "cherry" might form a cluster because they are fruits, while "laptop," "Microsoft," and "Android" might cluster elsewhere as tech-related terms. Each axis in this space represents a characteristic of the words, such as their context or meaning.
|
117 |
+
|
118 |
+
Context Shapes Meaning
|
119 |
+
A word's position in this space isn't fixed—it shifts based on context. For instance, the word "apple" could mean a fruit or the tech brand, depending on its surrounding words, like "buy" (tech) or "tree" (fruit). This dynamic placement captures the nuances of meaning.
|
120 |
+
|
121 |
+
Cosine Similarity: Measuring Relationships
|
122 |
+
Cosine similarity quantifies how similar two words are by measuring the angle between their vectors in the embedding space:
|
123 |
+
|
124 |
+
- Similar words have vectors pointing in nearly the same direction (cosine similarity close to 1)
|
125 |
+
- Unrelated words have vectors at a right angle (cosine similarity near 0)
|
126 |
+
- Opposite meanings have vectors pointing in opposite directions (cosine similarity close to -1)
|
127 |
+
- For example, "cherry" and "orange" might have a similarity of 0.97, while "cherry" and "laptop" might score 0.24
|
128 |
+
|
129 |
+
How AI Learns Stoney Words
|
130 |
+
|
131 |
+
- **Stoney Dictionary as a Starting Point:**
|
132 |
+
The AI begins with a structured dictionary of Stoney words, including translations, categories, pronunciations, and cultural context.
|
133 |
+
|
134 |
+
- **Community Feedback for Learning:**
|
135 |
+
The AI makes initial translations, which are often incorrect. Stoney speakers provide corrections, enriched with cultural context, stories, and humor. This feedback helps refine the AI's understanding.
|
136 |
+
|
137 |
+
The Role of Cosine Similarity in AI Learning
|
138 |
+
|
139 |
+
- The AI uses word embeddings to group Stoney words based on their meaning. For example, it determines whether a word belongs to a category like "fruit," "animal," or "spiritual."
|
140 |
+
- Community corrections and cosine similarity guide the AI in repositioning words closer to their accurate groupings in the embedding space.
|
141 |
+
|
142 |
+
Iterative Refinement
|
143 |
+
Through repeated feedback and fine-tuning, the AI improves its ability to place Stoney words correctly, not just individually but in the context of sentences and paragraphs. Over time, it develops a detailed, dynamic map of the Stoney language, with words clustered according to their community-informed meanings and uses.
|
144 |
+
|
145 |
+
Although this is not cosine similarity, you can see the relationships among words can concepts in Stoney as I have mapped them here: https://atlas.nomic.ai/data/harleycoops/stoney-nakoda-language-synthetic/map/5c87caaf-6be0-4546-9e83-826569070b24#nqlL
|
146 |
+
|
147 |
+
|
148 |
+
---
|
149 |
+
|
150 |
+
## Project Architecture
|
151 |
+
|
152 |
+
This code forms a complete pipeline for training and deploying a Stoney model. It is fully functional—but not correct 100% of the time—and is designed to improve through Community-In-The-Loop feedback. Access the model here:
|
153 |
+
[Stoney Language Model App](https://huggingface.co/spaces/HarleyCooper/StoneyApp)
|
154 |
+
|
155 |
+
### High-Level System Design
|
156 |
+
|
157 |
+
1. **Data Ingestion Layer**
|
158 |
+
2. **Processing Pipeline** (Q&A generation, augmentation, conversion)
|
159 |
+
3. **Model Training Framework** (fine-tuning, hyperparameters, monitoring)
|
160 |
+
4. **Inference Interface** (API endpoint, response formatting, error handling)
|
161 |
+
|
162 |
+
### Data Flow
|
163 |
+
|
164 |
+
1. Raw dictionary data → Data Ingestion
|
165 |
+
2. Processed data → Q&A Generation
|
166 |
+
3. Generated Q&A pairs → Training Data Preparation
|
167 |
+
4. Prepared data → Model Fine-tuning
|
168 |
+
5. Fine-tuned model → Inference Interface
|
169 |
+
|
170 |
+
---

## Detailed Project Structure

```
PUBLICRELEASE/
├── OpenAIFineTune/            # OpenAI fine-tuning files
│   ├── stoney_train.jsonl     # Training dataset
│   └── stoney_valid.jsonl     # Validation dataset
├── checkpoints/               # Model checkpoints
├── .env.example               # Env variables example
├── requirements.txt           # Python dependencies
├── english_dictionary.jsonl
├── stoney_dictionary.jsonl
└── bilingual_training_set.jsonl
```
---

## Core Components

### Data Generation & Processing

- **`bilingual_qa_generator.py`**
  Generates Q&A pairs from dictionaries, using advanced language generation.

- **`convert_data_format.py`**
  Supports multiple data formats; validates and enforces schemas.

- **`finetunesetup.py`**
  Splits data (80/20) with stratified sampling and prepares files.

### Model Training

- **`openai_finetune.py`**
  Handles fine-tuning, error handling, checkpointing, and logging.
---

## Comprehensive Setup Instructions

### System Requirements

- Python 3.8+
- 8GB+ RAM (16GB recommended)
- 10GB free disk space
- Stable internet connection

### Environment Setup

```bash
# Clone the repository
git clone [repository-url]
cd PUBLICRELEASE

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

### Configuration

```bash
# Copy example environment file
cp .env.example .env
# Provide OPENAI_API_KEY, GOOGLE_API_KEY in .env
```

### Initialization

```bash
python initialize.py
```
----------

## Detailed Usage Pipeline

### 1. Generate Training Data

```bash
python bilingual_qa_generator.py
```

- Processes `english_dictionary.jsonl` & `stoney_dictionary.jsonl`
- Produces `bilingual_training_set.jsonl`
### 2. Prepare Fine-tuning Data

```bash
python finetunesetup.py
```

- Converts Q&A to OpenAI format
- Outputs `OpenAIFineTune/stoney_train.jsonl` & `stoney_valid.jsonl`
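At its core this step is a shuffled 80/20 split; a minimal sketch of the idea (the actual script, `finetunesetup.py`, adds stratified sampling and file validation):

```python
import json
import random

# Illustrative 80/20 split of the generated Q&A pairs
with open("bilingual_training_set.jsonl", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f if line.strip()]

random.seed(42)  # reproducible split
random.shuffle(rows)
cut = int(len(rows) * 0.8)
train_rows, valid_rows = rows[:cut], rows[cut:]
print(f"{len(train_rows)} training rows, {len(valid_rows)} validation rows")
```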
### 3. Fine-tune Model

```bash
python openai_finetune.py
```

- Uploads files to OpenAI
- Monitors fine-tuning progress
- Implements checkpointing & logs
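For orientation, the core OpenAI calls look roughly like this sketch (the real script adds error handling, checkpointing, and progress monitoring; see `openai_finetune.py` for the actual implementation):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the prepared training and validation files
train_file = client.files.create(
    file=open("OpenAIFineTune/stoney_train.jsonl", "rb"), purpose="fine-tune")
valid_file = client.files.create(
    file=open("OpenAIFineTune/stoney_valid.jsonl", "rb"), purpose="fine-tune")

# Launch the fine-tuning job against the default model
job = client.fine_tuning.jobs.create(
    training_file=train_file.id,
    validation_file=valid_file.id,
    model="gpt-4o-2024-08-06",
)
print(job.id, job.status)
```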
----------

## Advanced Model Configuration

### OpenAI Models

- Default: `gpt-4o-2024-08-06`
- Alternative: `gpt-3.5-turbo`
- `.env`: `OPENAI_MODEL`

### Google Gemini

- Default: `gemini-2.0-exp`
- `.env`: `GEMINI_MODEL`

### Hyperparameters

- LR: `1e-5`
- Batch size: `32`
- Epochs: `3`
- Context window: `4096`
----------

## Comprehensive Data Formats

### Dictionary Format

```json
{
  "english_word": "example",
  "stoney_versions": [
    {
      "word": "...",
      "grammatical_classification": "...",
      "meaning": "..."
    }
  ]
}
```

### Q&A Format

```json
{
  "question": "How do you say X in Stoney?",
  "answer": "The Stoney word for X is...",
  "source_language": "english",
  "generated_at": "timestamp"
}
```

### OpenAI Training Format

```json
{
  "messages": [
    {"role": "system", "content": "You are a bilingual Stoney-English assistant..."},
    {"role": "user", "content": "question"},
    {"role": "assistant", "content": "answer"}
  ]
}
```
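The conversion between the Q&A format and the OpenAI training format is mechanical; a minimal sketch (the system prompt is abridged here, and the real logic lives in `convert_data_format.py` and `finetunesetup.py`):

```python
import json

def qa_to_openai(qa: dict) -> dict:
    """Wrap one Q&A pair in the chat-message format expected by OpenAI fine-tuning."""
    return {
        "messages": [
            {"role": "system", "content": "You are a bilingual Stoney-English assistant..."},
            {"role": "user", "content": qa["question"]},
            {"role": "assistant", "content": qa["answer"]},
        ]
    }

with open("bilingual_training_set.jsonl", encoding="utf-8") as src, \
     open("OpenAIFineTune/stoney_train.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        if line.strip():
            dst.write(json.dumps(qa_to_openai(json.loads(line)), ensure_ascii=False) + "\n")
```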
----------

## Development Guidelines

- **Style**: PEP 8, type hints, docstrings, consistent naming
- **Testing**: Unit tests, integration tests, CI, coverage
- **Documentation**: Inline comments, usage examples, troubleshooting

----------

## Contributing

1. Fork, branch, implement changes, test
2. Submit a pull request

**Code Review**

- Clear commits, small changes, documentation, test coverage

----------
## The Community-in-the-Loop Revolution

### Introduction

This project aims to preserve, refine, and resurrect endangered languages via AI fine-tuning and model distillation. Minimal lexical data can evolve into a culturally rich digital speaker of Stoney Nakoda. This subverts assumptions that massive datasets are necessary, instead emphasizing:

- Iterative improvement with community feedback
- Narrative corrections (cultural context over simple dictionary entries)
- Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning

### Conceptual Overview

**Community-in-the-Loop Model Distillation**:

1. Start with a small dictionary/text set.
2. Prompt an initial model.
3. Let the community correct errors with storytelling and context, not just words.
4. LoRA-based fine-tuning absorbs these narrative corrections.
5. The model evolves iteratively, guided by cultural custodians.
### Heart of the Approach

- **Intentional Errors**: Poke the model with tough or context-specific queries.
- **Narrative Corrections**: Rich cultural commentary instead of bare “right vs. wrong.”
- **Distillation Triplets**: (Prompt, Disallowed Reply, Narrative Reply); one possible shape is sketched just after this list.
- **Iterative Improvement**: If the model stumbles, revert and add more context.
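A plausible on-disk shape for such a triplet, reusing the example from the workflow below (the field names are illustrative, not the project's actual schema):

```json
{
  "prompt": "How do you say 'taste slightly with the tip of your tongue' in Stoney?",
  "disallowed_reply": "supthîyach",
  "narrative_reply": "The correct word is ..., and here is the story my grandmother told about using it ..."
}
```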
### LoRA Fine-Tuning

LoRA attaches small, low-rank matrices to the base model. This dramatically reduces compute and speeds up retraining:

- **Efficiency**: Fraction of resources required vs. full retraining
- **Focused Updates**: Capturing the “essence” of new knowledge
- **Rapid Iterations**: Frequent refinement without heavy overhead
### Mathematical Foundations

If $\mathbf{W}_0$ is the base weight matrix, LoRA introduces $\Delta \mathbf{W} = \mathbf{A}\mathbf{B}$ with $\mathbf{A} \in \mathbb{R}^{d \times r}$ and $\mathbf{B} \in \mathbb{R}^{r \times k}$, where $r \ll \min(d,k)$. Loss functions track both linguistic and cultural accuracy (e.g., a “Cultural Authenticity Score”).
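A minimal NumPy sketch of the shapes involved (illustration only; real LoRA implementations attach these adapters inside a transformer and train them with gradient descent):

```python
import numpy as np

d, k, r = 512, 512, 8             # r << min(d, k)
W0 = np.random.randn(d, k)        # frozen base weights
A = np.random.randn(d, r) * 0.01  # trainable, small random init
B = np.zeros((r, k))              # trainable, zero init so delta-W starts at zero

def forward(x: np.ndarray) -> np.ndarray:
    """Effective weight is W0 + A @ B; only A and B would be trained."""
    return x @ (W0 + A @ B)

print(forward(np.random.randn(1, d)).shape)  # (1, 512)
print(f"trainable params: {A.size + B.size} vs full retraining: {W0.size}")
```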
### Mermaid Diagram

```mermaid
graph TD
    A[Initial Model] --> B[Generate Response]
    B --> C{Correct?}
    C -->|No| D[Community Correction]
    D --> E[Create Distillation Triplet]
    E --> F[LoRA Fine-Tuning]
    F --> A
    C -->|Yes| G[Validation]
```

### Cultural Integrity

Every correction preserves cultural norms—idioms, humor, oral traditions—and ensures the community wields control over the AI’s “mindset.”

### Data Sources

A 10,000-word Stoney Nakoda dictionary and community textbooks serve as seeds. Community feedback enriches this data over time, weaving historical memory into the model.

### Expanding the Concept

From a tiny dictionary to an AI that:

- **Understands context** (formal/informal usage)
- **Integrates cultural references** (stories, metaphors)
- **Remembers history** (ancestors, ceremonies, seasonal events)

### Adaptive Checkpoints

- **Forward Progress**: Keep the new checkpoint if improved.
- **Reversion**: If degraded, roll back and increase context in corrections.
- **Convergence**: Repeat until stable authenticity and fluency metrics are met.
### Example Workflow

1. **Prompt**: “How to say ‘taste slightly with the tip of your tongue’ in Stoney?”
2. **Model’s Flawed Reply**: “`supthîyach`” (incorrect).
3. **Community Correction**: Shares the correct phrase plus a story from childhood.
4. **Distillation Triplet**: (Prompt, Disallowed, Narrative).
5. **LoRA Fine-Tuning**: Model adjusts swiftly.
6. **Re-Evaluation**: Answers improve in subsequent queries.

### Monitoring & QA

- **Cultural Authenticity Score (CAS)**
- **Linguistic Fluency** (perplexity, cross-entropy)
- **Validation Loops** (watch for regressions, revert if needed)
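For reference, perplexity is the exponential of the model's average cross-entropy on held-out text, so lower is better:

$\text{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(w_i \mid w_{<i})\right)$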
### Future Directions

- **Oral Histories**: Model retells century-old stories.
- **Seasonal Knowledge**: Terms tied to ceremonies and ecological cycles.
- **Dialects/Accents**: Respecting sub-regional differences.
- **Educational Tools**: Interactive AI for language learning.
- **Ethical AI**: Centered on consent, community governance, cultural integrity.

### Glossary

- **CAS**: Cultural Authenticity Score
- **Distillation Triplet**: (Prompt, Flawed Reply, Narrative Reply)
- **LoRA**: Low-Rank Adaptation
- **Community-in-the-Loop**: Paradigm of continuous human-guided refinement

"
requirements.txt
ADDED
@@ -0,0 +1,43 @@
dropbox
gradio
mistralai
anthropic
pinecone-client
nomic
openai
groq
langchain-community
replicate
google-generativeai
perplexityai
cohere
langchain
alpha_vantage
huggingface_hub
tavily-python
python-mathpix
requests
python-dotenv
sympy
numpy-financial
numpy
pandas
mesop
langchain-anthropic
langchain-openai
openpyxl
beautifulsoup4
langchain-mistralai
google-cloud-secret-manager
selenium
google-auth-oauthlib
google-auth-httplib2
google-api-python-client
PyPDF2
python-docx
markdown
Pygithub
GitPython
SpeechRecognition
google-cloud-storage
nltk
tools.py
ADDED
@@ -0,0 +1,864 @@
'''
Guidelines for Creating and Utilizing Tools in tools.py:

1. Initial Assessment:

   Review Existing Tools:
   - Before adding new functions, thoroughly read through tools.py to understand the existing tools and their functionalities.
   - Determine if an existing tool can be adapted or extended to meet the current needs, avoiding redundancy.

2. Tool Creation and Function Design:

   Create Within tools.py:
   - Add new tools as functions within tools.py. If tools.py doesn't exist, create it.
   - Ensure each function is self-contained and focused on a single task for modularity.

   Design for Importing:
   - Design tools to be imported and executed via terminal commands. Do not include execution code that runs when tools.py is imported.

   Follow Best Practices:
   - Use clear and descriptive function names that reflect their general purpose.
   - Include docstrings for each function, detailing the purpose, parameters, and expected outputs.
   - Adhere to PEP 8 style guidelines for readable and maintainable code.

3. Generalization:

   Broad Input Handling:
   - Design functions to handle a wide range of inputs, enhancing reusability for future tasks.
   - Accept parameters that allow the function to be applicable in various scenarios (e.g., any stock ticker, URL, or data file).

   Flexible Functionality:
   - Ensure functions can process different data types and structures when applicable.
   - Avoid hardcoding values; use parameters and defaults where necessary.

   Modularity:
   - If a task involves multiple distinct operations, split them into separate functions.
   - This approach enhances clarity and allows for individual functions to be reused independently.

4. Execution and Script Management:

   Import and Run via Terminal:
   - Do not execute tools.py directly. Instead, import the necessary functions and run them using the terminal.
   - Use the command:
         python -c "from tools import function_name; function_name(args)"
     Replace function_name and args with the appropriate function and arguments.

   Avoid Additional Scripts:
   - Do not create extra .py files or scripts for execution purposes.
   - Keep all tool functions within tools.py and execute them using the import method shown above.

5. Output:

   Console Printing:
   - Ensure that all tools print their output directly to the console.
   - Format the output for readability, using clear messages and organizing data in a logical manner.

   No Return Statements for Output:
   - While functions can return values for internal use, the primary results should be displayed using print() statements.

6. Error Handling:

   Input Validation:
   - Validate all input parameters to catch errors before execution.
   - Provide informative error messages to guide correct usage.

   Exception Management:
   - Use try-except blocks to handle potential exceptions without crashing the program.
   - Log errors where appropriate, and ensure they don't expose sensitive information.

   Debugging and Testing:
   - Test functions with various inputs, including edge cases, to ensure robustness.
   - If errors are found, revise and debug the functions promptly.

7. Post-Creation:

   Execute to Fulfill Requests:
   - After creating or updating tools, execute them as needed to fulfill user requests unless the request was solely for tool creation.

   Documentation:
   - Update any relevant documentation or comments within tools.py to reflect new additions or changes.
   - Consider maintaining a usage example within the docstring for complex functions.

8. Maintenance and Refactoring:

   Regular Review:
   - Periodically review tools.py to identify opportunities for optimization and improvement.
   - Remove or update deprecated functions that are no longer effective or necessary.

   Enhance Generalization:
   - Refactor functions to improve their general applicability as new requirements emerge.
   - Stay vigilant for patterns that can be abstracted into more general solutions.

9. Compliance and Security:

   Data Protection:
   - Ensure that tools handle data securely, especially when dealing with sensitive information.
   - Avoid hardcoding credentials or exposing private data through outputs.

   Licensing and Dependencies:
   - Verify that any third-party libraries used are properly licensed and documented.
   - Include installation instructions for dependencies if they are not part of the standard library.
'''

# Your tools will be defined below this line

import os
import shutil  # used by MCPServerManager.uninstall_server
from datetime import datetime, timedelta
import PyPDF2
from pathlib import Path
import json
import re
import requests
from typing import List, Dict, Optional, Union
import subprocess
import sys
from git import Repo

def get_current_month_folder() -> str:
    """Returns the path to the current month's folder."""
    base_path = r"C:\Users\admin\Dropbox\Current\2024"
    current_month = datetime.now().strftime("%B")  # Full month name
    return os.path.join(base_path, current_month)

def get_pdfs_for_date(target_date: datetime = None) -> List[str]:
    """
    Finds all PDFs saved on a specific date in the current month's folder structure.
    Args:
        target_date: datetime object for the target date. If None, uses today's date.
    Returns a list of full paths to PDF files.
    """
    if target_date is None:
        target_date = datetime.now()

    target_date_str = target_date.strftime("%Y-%m-%d")
    month_folder = get_current_month_folder()
    pdf_files = []

    # Walk through all subdirectories
    for root, _, files in os.walk(month_folder):
        for file in files:
            if file.lower().endswith('.pdf'):
                file_path = os.path.join(root, file)
                # Get file's modification time
                mod_time = datetime.fromtimestamp(os.path.getmtime(file_path))
                if mod_time.strftime("%Y-%m-%d") == target_date_str:
                    pdf_files.append(file_path)

    return pdf_files

def extract_text_from_pdf(pdf_path: str) -> str:
    """Extract text content from a PDF file."""
    try:
        with open(pdf_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)
            text = ""
            for page in reader.pages:
                text += page.extract_text() + "\n"
            return text
    except Exception as e:
        print(f"Error processing {pdf_path}: {str(e)}")
        return ""

def summarize_pdfs_for_date(target_date: datetime = None):
    """
    Main function to process and summarize PDFs for a specific date.
    Args:
        target_date: datetime object for the target date. If None, uses today's date.
    Prints summary to console and saves to a JSON file.
    """
    if target_date is None:
        target_date = datetime.now()

    pdfs = get_pdfs_for_date(target_date)
    if not pdfs:
        print(f"No PDFs found for {target_date.strftime('%Y-%m-%d')}")
        return

    summaries = {}
    for pdf_path in pdfs:
        print(f"Processing: {pdf_path}")
        text = extract_text_from_pdf(pdf_path)

        # Basic summary: first 500 characters of text
        summary = text[:500] + "..." if len(text) > 500 else text

        # Store in dictionary with filename as key
        filename = os.path.basename(pdf_path)
        summaries[filename] = {
            "path": pdf_path,
            "summary": summary,
            "processed_date": datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        }

    # Save summaries to JSON file in the current directory
    output_dir = "summaries"
    os.makedirs(output_dir, exist_ok=True)

    output_file = os.path.join(output_dir, f"summaries_{target_date.strftime('%Y-%m-%d')}.json")
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(summaries, f, indent=4, ensure_ascii=False)

    print(f"\nProcessed {len(pdfs)} PDFs")
    print(f"Summaries saved to: {output_file}")

    # Print summaries to console
    for filename, data in summaries.items():
        print(f"\n{'='*80}\n{filename}")
        print(f"Path: {data['path']}")
        print(f"\nSummary:\n{data['summary'][:200]}...")

class WebSearchTool:
    """
    A tool for performing web searches using the Perplexity API.
    Designed to be the default web search mechanism for context-requiring queries.
    """

    def __init__(self, api_key: Optional[str] = None):
        """
        Initialize the WebSearchTool.

        Args:
            api_key: Perplexity API key. If None, will try to get from environment variable.
        """
        self.api_key = api_key or os.getenv("PERPLEXITY_API_KEY")
        if not self.api_key:
            raise ValueError("Perplexity API key must be provided or set in PERPLEXITY_API_KEY environment variable")

        self.headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json"
        }

    def search(self, query: str, max_results: int = 5) -> Dict:
        """
        Perform a web search using Perplexity API.

        Args:
            query: The search query
            max_results: Maximum number of results to return

        Returns:
            Dictionary containing search results and metadata
        """
        try:
            # Make the API request
            response = requests.post(
                "https://api.perplexity.ai/chat/completions",
                headers=self.headers,
                json={
                    "model": "llama-3.1-sonar-huge-128k-online",
                    "messages": [
                        {
                            "role": "system",
                            "content": "You are a helpful assistant that provides accurate and concise answers based on web search results."
                        },
                        {
                            "role": "user",
                            "content": query
                        }
                    ]
                }
            )
            response.raise_for_status()
            data = response.json()

            # Extract answer from the response
            answer = data.get("choices", [{}])[0].get("message", {}).get("content", "No answer found")

            # Process and format the results
            results = {
                "query": query,
                "timestamp": datetime.now().isoformat(),
                "answer": answer,
                "references": [],  # References not available in this API version
                "metadata": {
                    "source": "Perplexity API",
                    "model": "llama-3.1-sonar-huge-128k-online"
                }
            }
            return results

        except Exception as e:
            print(f"Error performing search: {str(e)}")
            return {
                "query": query,
                "timestamp": datetime.now().isoformat(),
                "error": str(e),
                "metadata": {
                    "source": "Perplexity API",
                    "status": "error"
                }
            }

    def format_results(self, results: Dict, format: str = "text") -> str:
        """
        Format search results in the specified format.

        Args:
            results: Search results dictionary
            format: Output format ("text" or "markdown")

        Returns:
            Formatted string of results
        """
        if "error" in results:
            return f"Error: {results['error']}"

        if format == "markdown":
            output = f"# Search Results for: {results['query']}\n\n"
            output += f"## Answer\n{results['answer']}\n\n"
            if results['references']:
                output += "## References\n"
                for i, ref in enumerate(results['references'], 1):
                    output += f"{i}. {ref['title']} - {ref['url']}\n"
            return output
        else:
            output = f"Search Results for: {results['query']}\n\n"
            output += f"Answer:\n{results['answer']}\n\n"
            if results['references']:
                output += "References:\n"
                for i, ref in enumerate(results['references'], 1):
                    output += f"{i}. {ref['title']} - {ref['url']}\n"
            return output

class MCPServerManager:
    """
    A tool for installing and managing Model Context Protocol (MCP) servers.
    Integrates with Claude desktop and manages server configurations.
    """

    DEFAULT_CONFIG_LOCATIONS = [
        "mcp.json",
        ".mcp/config.json",
        "config/mcp.json",
        "mcp_config.json",
        ".config/mcp/servers.json"
    ]

    def __init__(self, base_dir: Optional[str] = None):
        """
        Initialize the MCP Server Manager.

        Args:
            base_dir: Base directory for installing servers. If None, uses current directory.
        """
        self.base_dir = Path(base_dir) if base_dir else Path.cwd() / "mcp_servers"
        self.base_dir.mkdir(parents=True, exist_ok=True)
        self.servers_repo_url = "https://github.com/modelcontextprotocol/servers.git"
        self.installed_servers = {}
        # Running Popen handles are kept in memory only; they are not JSON-serializable
        self.processes = {}
        self.load_installed_servers()

    def load_installed_servers(self):
        """Load information about installed servers from the config file."""
        config_file = self.base_dir / "config.json"
        if config_file.exists():
            with open(config_file, "r") as f:
                self.installed_servers = json.load(f)

    def save_installed_servers(self):
        """Save information about installed servers to the config file."""
        config_file = self.base_dir / "config.json"
        with open(config_file, "w") as f:
            json.dump(self.installed_servers, f, indent=4)

    def get_featured_servers(self) -> List[Dict]:
        """
        Get list of featured servers from the MCP GitHub repository.

        Returns:
            List of server information dictionaries
        """
        try:
            # Clone or update the servers repository
            repo_dir = self.base_dir / "servers_repo"
            if repo_dir.exists():
                repo = Repo(repo_dir)
                repo.remotes.origin.pull()
            else:
                repo = Repo.clone_from(self.servers_repo_url, repo_dir)

            # First try to read from featured.json
            featured_file = repo_dir / "featured.json"
            if featured_file.exists():
                with open(featured_file, "r", encoding="utf-8") as f:
                    return json.load(f)

            # If featured.json doesn't exist, parse README.md
            readme_file = repo_dir / "README.md"
            if readme_file.exists():
                servers = []
                with open(readme_file, "r", encoding="utf-8", errors="ignore") as f:
                    content = f.read()
                    # Look for server repository links
                    repo_links = re.findall(r"\[([^\]]+)\]\((https://github.com/[^)]+)\)", content)
                    for name, url in repo_links:
                        if "/modelcontextprotocol/" in url:
                            servers.append({
                                "name": name,
                                "repository": url,
                                "config": {}
                            })
                return servers

            return []

        except Exception as e:
            print(f"Error getting featured servers: {str(e)}")
            return []

    def find_server_config(self, server_dir: Path) -> Optional[Dict]:
        """
        Search for MCP server configuration in common locations.

        Args:
            server_dir: Directory to search in

        Returns:
            Server configuration dictionary if found, None otherwise
        """
        # First check common config file locations
        for config_path in self.DEFAULT_CONFIG_LOCATIONS:
            config_file = server_dir / config_path
            if config_file.exists():
                try:
                    with open(config_file, "r", encoding="utf-8") as f:
                        config = json.load(f)
                        if "mcpServers" in config:
                            return config["mcpServers"]
                except Exception as e:
                    print(f"Error reading config from {config_file}: {str(e)}")

        # Check package.json for Node.js projects
        package_json = server_dir / "package.json"
        if package_json.exists():
            try:
                with open(package_json, "r", encoding="utf-8") as f:
                    config = json.load(f)
                    if "mcpServers" in config:
                        return config["mcpServers"]
            except Exception as e:
                print(f"Error reading config from package.json: {str(e)}")

        # Check pyproject.toml for Python projects
        pyproject_toml = server_dir / "pyproject.toml"
        if pyproject_toml.exists():
            try:
                import tomli
                with open(pyproject_toml, "rb") as f:
                    config = tomli.load(f)
                    if "tool" in config and "mcp" in config["tool"]:
                        return {"python": config["tool"]["mcp"]}
            except ImportError:
                print("tomli package not found, skipping pyproject.toml parsing")
            except Exception as e:
                print(f"Error reading config from pyproject.toml: {str(e)}")

        return None

    def install_server(self, server_name: str, custom_config: Optional[Dict] = None) -> bool:
        """
        Install a specific MCP server.

        Args:
            server_name: Name of the server to install
            custom_config: Optional custom configuration for the server

        Returns:
            True if installation was successful, False otherwise
        """
        try:
            # Get server information from featured servers
            featured_servers = self.get_featured_servers()
            server_info = next((s for s in featured_servers if s["name"] == server_name), None)
            if not server_info:
                print(f"Server '{server_name}' not found in featured servers")
                return False

            # Create server directory
            server_dir = self.base_dir / server_name
            server_dir.mkdir(exist_ok=True)

            # Clone server repository
            repo = Repo.clone_from(server_info["repository"], server_dir)

            # Install dependencies
            if (server_dir / "requirements.txt").exists():
                subprocess.run([sys.executable, "-m", "pip", "install", "-r", "requirements.txt"], cwd=server_dir)
            elif (server_dir / "package.json").exists():
                subprocess.run(["npm", "install"], cwd=server_dir)

            # Find or use custom server configuration
            config = custom_config or self.find_server_config(server_dir) or {}

            # Store server information
            self.installed_servers[server_name] = {
                "path": str(server_dir),
                "version": repo.head.commit.hexsha[:8],
                "install_date": datetime.now().isoformat(),
                "config": config
            }
            self.save_installed_servers()

            print(f"Successfully installed {server_name}")
            if config:
                print(f"Found server configuration: {json.dumps(config, indent=2)}")
            else:
                print("No server configuration found. You may need to configure it manually.")

            return True

        except Exception as e:
            print(f"Error installing server '{server_name}': {str(e)}")
            return False

    def uninstall_server(self, server_name: str) -> bool:
        """
        Uninstall a specific MCP server.

        Args:
            server_name: Name of the server to uninstall

        Returns:
            True if uninstallation was successful, False otherwise
        """
        try:
            if server_name not in self.installed_servers:
                print(f"Server '{server_name}' is not installed")
                return False

            # Remove server directory
            server_dir = Path(self.installed_servers[server_name]["path"])
            shutil.rmtree(server_dir)

            # Remove from installed servers
            del self.installed_servers[server_name]
            self.save_installed_servers()

            print(f"Successfully uninstalled {server_name}")
            return True

        except Exception as e:
            print(f"Error uninstalling server '{server_name}': {str(e)}")
            return False

    def list_installed_servers(self) -> Dict[str, Dict]:
        """
        Get information about installed servers.

        Returns:
            Dictionary of installed server information
        """
        return self.installed_servers

    def update_server(self, server_name: str) -> bool:
        """
        Update a specific MCP server to the latest version.

        Args:
            server_name: Name of the server to update

        Returns:
            True if update was successful, False otherwise
        """
        try:
            if server_name not in self.installed_servers:
                print(f"Server '{server_name}' is not installed")
                return False

            server_dir = Path(self.installed_servers[server_name]["path"])
            repo = Repo(server_dir)

            # Get current version
            old_version = repo.head.commit.hexsha[:8]

            # Pull latest changes
            repo.remotes.origin.pull()

            # Get new version
            new_version = repo.head.commit.hexsha[:8]

            # Update dependencies if needed
            if (server_dir / "requirements.txt").exists():
                subprocess.run([sys.executable, "-m", "pip", "install", "-r", "requirements.txt"], cwd=server_dir)

            # Update stored information
            self.installed_servers[server_name]["version"] = new_version
            self.save_installed_servers()

            print(f"Updated {server_name} from {old_version} to {new_version}")
            return True

        except Exception as e:
            print(f"Error updating server '{server_name}': {str(e)}")
            return False

    def configure_server(self, server_name: str, config: Dict) -> bool:
        """
        Configure a specific MCP server.

        Args:
            server_name: Name of the server to configure
            config: Configuration dictionary in the format:
                {
                    "command": str,         # Command to run the server
                    "args": List[str],      # Arguments for the command
                    "env": Dict[str, str],  # Optional environment variables
                    "cwd": str,             # Optional working directory
                }

        Returns:
            True if configuration was successful, False otherwise
        """
        try:
            if server_name not in self.installed_servers:
                print(f"Server '{server_name}' is not installed")
                return False

            server_dir = Path(self.installed_servers[server_name]["path"])

            # Validate configuration
            if "command" not in config:
                print("Error: Server configuration must include 'command'")
                return False

            # Update configuration
            self.installed_servers[server_name]["config"] = config
            self.save_installed_servers()

            # Try to write configuration to a standard location
            config_dir = server_dir / ".mcp"
            config_dir.mkdir(exist_ok=True)
            config_file = config_dir / "config.json"

            with open(config_file, "w", encoding="utf-8") as f:
                json.dump({"mcpServers": {server_name: config}}, f, indent=2)

            print(f"Successfully configured {server_name}")
            print(f"Configuration saved to {config_file}")
            return True

        except Exception as e:
            print(f"Error configuring server '{server_name}': {str(e)}")
            return False

    def start_server(self, server_name: str) -> bool:
        """
        Start a specific MCP server.

        Args:
            server_name: Name of the server to start

        Returns:
            True if server was started successfully, False otherwise
        """
        try:
            if server_name not in self.installed_servers:
                print(f"Server '{server_name}' is not installed")
                return False

            server_info = self.installed_servers[server_name]
            config = server_info.get("config", {})

            if not config:
                print(f"Server '{server_name}' is not configured")
                return False

            # Prepare command and arguments
            command = config.get("command")
            args = config.get("args", [])
            env = {**os.environ, **(config.get("env", {}))}
            cwd = config.get("cwd") or server_info["path"]

            # Start the server process
            process = subprocess.Popen(
                [command, *args],
                env=env,
                cwd=cwd,
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE,
                text=True
            )

            # Keep the Popen handle in memory and persist only the PID
            # (a Popen object cannot be serialized into the JSON config file)
            self.processes[server_name] = process
            self.installed_servers[server_name]["pid"] = process.pid
            self.save_installed_servers()

            print(f"Started server '{server_name}' (PID: {process.pid})")
            return True

        except Exception as e:
            print(f"Error starting server '{server_name}': {str(e)}")
            return False

    def stop_server(self, server_name: str) -> bool:
        """
        Stop a specific MCP server.

        Args:
            server_name: Name of the server to stop

        Returns:
            True if server was stopped successfully, False otherwise
        """
        try:
            if server_name not in self.installed_servers:
                print(f"Server '{server_name}' is not installed")
                return False

            process = self.processes.get(server_name)

            if not process:
                print(f"Server '{server_name}' is not running")
                return False

            # Try to stop the process gracefully
            process.terminate()
            try:
                process.wait(timeout=5)
            except subprocess.TimeoutExpired:
                process.kill()

            # Remove process information
            del self.processes[server_name]
            self.installed_servers[server_name].pop("pid", None)
            self.save_installed_servers()

            print(f"Stopped server '{server_name}'")
            return True

        except Exception as e:
            print(f"Error stopping server '{server_name}': {str(e)}")
            return False

async def generate_pyflowchart(repo_path: str, output_dir: Optional[str] = None) -> None:
    """
    Generate flowcharts for all Python files in a repository using pyflowchart.
    Creates HTML flowcharts that can be viewed in a browser.

    Args:
        repo_path: Path to the repository
        output_dir: Optional directory to save flowcharts. If None, creates a 'flowcharts' directory in the repo.
    """
    try:
        # Ensure pyflowchart is installed
        subprocess.run([sys.executable, "-m", "pip", "install", "pyflowchart"], check=True)

        # Set up output directory
        if output_dir is None:
            output_dir = os.path.join(repo_path, 'flowcharts')
        os.makedirs(output_dir, exist_ok=True)

        # Generate HTML flowcharts
        for root, _, files in os.walk(repo_path):
            for file in files:
                if file.endswith('.py'):
                    py_file = os.path.join(root, file)

                    # Skip empty files
                    if os.path.getsize(py_file) == 0:
                        print(f"Skipping empty file: {py_file}")
                        continue

                    # Check if file has actual Python code
                    with open(py_file, 'r', encoding='utf-8') as f:
                        content = f.read().strip()
                        if not content:
                            print(f"Skipping empty file: {py_file}")
                            continue

                    html_file = os.path.join(output_dir, f"{os.path.splitext(file)[0]}_flowchart.html")

                    # Generate flowchart HTML
                    print(f"Generating flowchart for {py_file}")
                    try:
                        subprocess.run([
                            sys.executable, "-m", "pyflowchart", py_file,
                            "--output", html_file
                        ], check=True)
                        print(f"Saved HTML to {html_file}")
                    except subprocess.CalledProcessError as e:
                        print(f"Error generating flowchart for {py_file}: {str(e)}")
                        continue

        print(f"\nFlowcharts generated in: {output_dir}")

    except subprocess.CalledProcessError as e:
        print(f"Error running pyflowchart: {str(e)}")
    except Exception as e:
        print(f"Error generating flowcharts: {str(e)}")

def extract_repo_context(repo_url: str, output_file: Optional[str] = None) -> None:
    """
    Extract repository context using gitingest.com.

    Args:
        repo_url: GitHub repository URL
        output_file: Optional file to save the context. If None, uses repo name with .txt extension.
    """
    try:
        # Convert github.com URL to gitingest.com
        if "github.com" not in repo_url:
            raise ValueError("Only GitHub repositories are supported")

        ingest_url = repo_url.replace("github.com", "gitingest.com")
        print(f"Fetching repository context from: {ingest_url}")

        # Make request to gitingest.com
        response = requests.get(ingest_url)
        response.raise_for_status()

        # Extract content
        content = response.text

        # Save to file
        if output_file is None:
            repo_name = repo_url.split('/')[-1].replace('.git', '')
            output_file = f"{repo_name}_context.txt"

        with open(output_file, 'w', encoding='utf-8') as f:
            f.write(content)

        print(f"Repository context saved to: {output_file}")

    except requests.RequestException as e:
        print(f"Error fetching repository context: {str(e)}")
    except Exception as e:
        print(f"Error extracting repository context: {str(e)}")

async def post_clone_actions(repo_url: str) -> None:
    """
    Perform post-clone actions after a git clone operation:
    1. Generate flowcharts for all Python files using pyflowchart
    2. Extract repository context using gitingest.com

    Args:
        repo_url: URL of the repository that was just cloned
    """
    try:
        # Get the repository name from the URL
        repo_name = repo_url.split('/')[-1].replace('.git', '')
        repo_path = os.path.join(os.getcwd(), repo_name)

        # Generate flowcharts
        print("\nGenerating flowcharts...")
        await generate_pyflowchart(repo_path)

        # Extract repository context
        print("\nExtracting repository context...")
        extract_repo_context(repo_url)

    except Exception as e:
        print(f"Error in post-clone actions: {str(e)}")

# Example usage:
if __name__ == "__main__":
    # Initialize the search tool
    search_tool = WebSearchTool()

    # Perform a simple search
    results = search_tool.search("Latest developments in AI technology")

    # Print formatted results
    print(search_tool.format_results(results, "markdown"))