Raubachm committed · Commit 601ff24 (verified) · 1 Parent(s): 0d39cc3

Update README.md

Files changed (1): README.md (+179 -0)
---
license: mit
library_name: sentence-transformers
pipeline_tag: text-classification
---
# Model Card

This model borrows from Greg Kamradt’s work here: https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/5_Levels_Of_Text_Splitting.ipynb. The idea is to segment text into semantically coherent chunks. The primary goal of this work is to use sentence-transformers embeddings to represent the meaning of sentences and to detect shifts in meaning that mark potential breakpoints between chunks.

### Model Description

This model aims to segment a text into semantically coherent chunks. It uses sentence-transformers embeddings to represent the meaning of sentences and detects shifts in meaning to identify potential breakpoints between chunks. There are two primary functional changes from Greg Kamradt's excellent original work: 1) it uses sentence-transformers embeddings rather than OpenAI embeddings, providing an entirely open-source implementation of semantic chunking, and 2) it adds functionality to merge smaller chunks with their most semantically similar neighbors to better normalize chunk size.
The goal is to use semantic understanding so the model considers the meaning of text segments rather than relying purely on punctuation or syntax, and to provide flexibility: `breakpoint_percentile_threshold` and `min_chunk_size` can be adjusted to influence the granularity of the chunks.

General Outline:

Preprocessing

- Loading Text: Reads the text from the specified path.
- Sentence Tokenization: Splits the text into a list of individual sentences using nltk's sentence tokenizer.

Semantic Embeddings

- Model Loading: Loads a pre-trained Sentence Transformer model (in this case, 'sentence-transformers/all-mpnet-base-v1').
- Embedding Generation: Converts each sentence into an embedding to represent its meaning.

Sentence Combination

- Combines each sentence with its neighbors to form slightly larger units, helping the model understand the context in which changes of topic are likely to occur.

Breakpoint Identification

- Cosine Distance: Calculates cosine distances between embeddings of the combined sentences. These distances represent the degree of semantic dissimilarity.
- Percentile-Based Threshold: Determines a threshold based on a percentile of the distances (e.g., the 95th percentile), where higher values indicate more significant semantic shifts.
- Locating Breaks: Identifies the indices of distances above the threshold, which mark potential breakpoints between chunks.

Chunk Creation

- Splitting at Breakpoints: Divides the original sentences into chunks based on the identified breakpoints.

Chunk Merging

- Minimum Chunk Size: Defines a minimum number of sentences to consider a chunk valid.
- Similarity-Based Merging: Merges smaller chunks with their most semantically similar neighbor based on cosine similarity between chunk embeddings.

Output

- The model ultimately produces a list of text chunks (`chunks`), each representing a somewhat self-contained, semantically cohesive segment of the original text.
## Usage

Using this chunker is easy when you have [sentence-transformers](https://www.SBERT.net) installed, along with nltk and scikit-learn:

```
pip install -U sentence-transformers nltk scikit-learn
```

Then you can implement it like this:

```python
import nltk
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Make sure the NLTK sentence tokenizer data is available
nltk.download('punkt')

# Text to be chunked
with open("/path to text") as f:
    text = f.read()

# Tokenize the text into sentences
sentences = sent_tokenize(text)

# Generate embeddings for each sentence using the sentence-transformers model of choice
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v1')
embeddings = model.encode(sentences)

# Combine each sentence with its neighbors to add context before measuring semantic shifts
def combine_sentences(sentences, buffer_size=1):
    combined_sentences = []
    for i in range(len(sentences)):
        combined_sentence = ' '.join(sentences[max(0, i - buffer_size):min(len(sentences), i + 1 + buffer_size)])
        combined_sentences.append(combined_sentence)
    return combined_sentences

combined_sentences = combine_sentences(sentences)
combined_embeddings = model.encode(combined_sentences)

# Calculate cosine distances between embeddings of consecutive combined sentences
def calculate_cosine_distances(embeddings):
    distances = []
    for i in range(len(embeddings) - 1):
        similarity = cosine_similarity([embeddings[i]], [embeddings[i + 1]])[0][0]
        distance = 1 - similarity
        distances.append(distance)
    return distances

distances = calculate_cosine_distances(combined_embeddings)

# Identify breakpoints
# Adjust the breakpoint threshold to change the level of dissimilarity between chunk embeddings
# (higher for greater dissimilarity)
breakpoint_percentile_threshold = 95
breakpoint_distance_threshold = np.percentile(distances, breakpoint_percentile_threshold)
breakpoint_indices = [i for i, distance in enumerate(distances) if distance > breakpoint_distance_threshold]

# Create chunks based on breakpoints
chunks = []
start_index = 0
for breakpoint_index in breakpoint_indices:
    chunk = ' '.join(sentences[start_index:breakpoint_index + 1])
    chunks.append(chunk)
    start_index = breakpoint_index + 1
chunks.append(' '.join(sentences[start_index:]))

# Set a minimum number of sentences per chunk
min_chunk_size = 3

# Merge small chunks with their most semantically similar neighbor
def merge_small_chunks_with_neighbors(chunks, embeddings):
    merged_chunks = [chunks[0]]  # Start with the first chunk
    merged_embeddings = [embeddings[0]]  # And its embedding

    for i in range(1, len(chunks) - 1):  # Iterate through chunks, excluding the first and last
        # If the current chunk is small, consider merging it with a neighbor
        if len(chunks[i].split('. ')) < min_chunk_size:
            prev_similarity = cosine_similarity([embeddings[i]], [merged_embeddings[-1]])[0][0]
            next_similarity = cosine_similarity([embeddings[i]], [embeddings[i + 1]])[0][0]

            # Merge with the most similar neighbor
            if prev_similarity > next_similarity:
                merged_chunks[-1] += ' ' + chunks[i]
                merged_embeddings[-1] = (merged_embeddings[-1] + embeddings[i]) / 2
            else:
                chunks[i + 1] = chunks[i] + ' ' + chunks[i + 1]
                embeddings[i + 1] = (embeddings[i] + embeddings[i + 1]) / 2
        else:
            merged_chunks.append(chunks[i])
            merged_embeddings.append(embeddings[i])

    merged_chunks.append(chunks[-1])
    merged_embeddings.append(embeddings[-1])

    return merged_chunks, merged_embeddings

# Generate embeddings for each initial chunk, then merge small chunks into their most similar neighbors
chunk_embeddings = model.encode(chunks)
chunks, chunk_embeddings = merge_small_chunks_with_neighbors(chunks, chunk_embeddings)

# Inspect the first chunk
print(chunks[0])
```
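You can get a quick feel for the segmentation by inspecting the number and size of the resulting chunks. A minimal sketch, reusing the `chunks` list and `sent_tokenize` from the snippet above:

```python
# Summarize the resulting segmentation (assumes `chunks` and `sent_tokenize` from the snippet above)
print(f"Number of chunks: {len(chunks)}")
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {len(sent_tokenize(chunk))} sentences, {len(chunk)} characters")
```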
## Evaluation Results

Testing on various buffer sizes and breakpoint thresholds using the King James Version of the book of Romans (available here: https://quod.lib.umich.edu/cgi/k/kjv/kjv-idx?type=DIV1&byte=5015363). A sketch of how the two metrics below can be computed follows the plots.

Intra-chunk similarity (how similar the sentences within a given chunk are to each other; higher = more semantically similar):

![image/png](https://cdn-uploads.huggingface.co/production/uploads/646f846edcddb2358a256744/JtjTFuh2DhEQCDwkOrrkb.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/646f846edcddb2358a256744/15RQ0-Lu8PvQ1IxxKJsGM.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/646f846edcddb2358a256744/K3XfmdKXyg75n1-77DK0X.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/646f846edcddb2358a256744/ojbecOyGaxDY1PTW1WOY4.png)

Inter-chunk similarity (how similar the respective chunks are to each other; lower = less semantically similar):

![image/png](https://cdn-uploads.huggingface.co/production/uploads/646f846edcddb2358a256744/bI02qstwYwmom5Kfae34N.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/646f846edcddb2358a256744/MhLrbp_AuXJMtbrrPoIiO.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/646f846edcddb2358a256744/Zp-iZF_clPuxHA0CfRiJF.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/646f846edcddb2358a256744/JumSdk0Vxi4zJtiCoA64B.png)
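For reference, intra-chunk and inter-chunk similarity can be measured along the following lines. This is a minimal sketch, assuming the `chunks` list and `model` from the Usage section; it is not necessarily the exact code used to produce the plots above:

```python
import numpy as np
from nltk.tokenize import sent_tokenize
from sklearn.metrics.pairwise import cosine_similarity

# Intra-chunk similarity: mean pairwise cosine similarity of sentence embeddings within each chunk
intra_scores = []
for chunk in chunks:
    chunk_sentences = sent_tokenize(chunk)
    if len(chunk_sentences) < 2:
        continue  # a single-sentence chunk has no sentence pairs to compare
    sims = cosine_similarity(model.encode(chunk_sentences))
    intra_scores.append(sims[np.triu_indices_from(sims, k=1)].mean())  # unique pairs only
print("Mean intra-chunk similarity:", np.mean(intra_scores))

# Inter-chunk similarity: mean cosine similarity between embeddings of different chunks
chunk_sims = cosine_similarity(model.encode(chunks))
print("Mean inter-chunk similarity:", chunk_sims[np.triu_indices_from(chunk_sims, k=1)].mean())
```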
## Citing and Authors

If you find this model helpful, please enjoy it, and give all credit to Greg Kamradt for the idea.