billusanda007 committed on
Commit
9a0151c
·
1 Parent(s): ed66bbb

Upload 5 files

Files changed (5)
  1. Stress identification NLP +0 -0
  2. Stress.csv +0 -0
  3. Untitled.ipynb +1220 -0
  4. app.py +104 -0
  5. tfidf_vectorizer.joblib +3 -0
Stress identification NLP ADDED
Binary file (257 kB).
 
Stress.csv ADDED
The diff for this file is too large to render. See raw diff
 
Untitled.ipynb ADDED
@@ -0,0 +1,1220 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 1,
6
+ "id": "c6c16352",
7
+ "metadata": {},
8
+ "outputs": [],
9
+ "source": [
10
+ "import pandas as pd"
11
+ ]
12
+ },
13
+ {
14
+ "cell_type": "code",
15
+ "execution_count": 2,
16
+ "id": "9364c142",
17
+ "metadata": {},
18
+ "outputs": [],
19
+ "source": [
20
+ "df = pd.read_csv('Stress.csv')"
21
+ ]
22
+ },
23
+ {
24
+ "cell_type": "code",
25
+ "execution_count": 3,
26
+ "id": "d74d44fa",
27
+ "metadata": {},
28
+ "outputs": [
29
+ {
30
+ "data": {
31
+ "text/html": [
32
+ "<div>\n",
33
+ "<style scoped>\n",
34
+ " .dataframe tbody tr th:only-of-type {\n",
35
+ " vertical-align: middle;\n",
36
+ " }\n",
37
+ "\n",
38
+ " .dataframe tbody tr th {\n",
39
+ " vertical-align: top;\n",
40
+ " }\n",
41
+ "\n",
42
+ " .dataframe thead th {\n",
43
+ " text-align: right;\n",
44
+ " }\n",
45
+ "</style>\n",
46
+ "<table border=\"1\" class=\"dataframe\">\n",
47
+ " <thead>\n",
48
+ " <tr style=\"text-align: right;\">\n",
49
+ " <th></th>\n",
50
+ " <th>subreddit</th>\n",
51
+ " <th>post_id</th>\n",
52
+ " <th>sentence_range</th>\n",
53
+ " <th>text</th>\n",
54
+ " <th>label</th>\n",
55
+ " <th>confidence</th>\n",
56
+ " <th>social_timestamp</th>\n",
57
+ " </tr>\n",
58
+ " </thead>\n",
59
+ " <tbody>\n",
60
+ " <tr>\n",
61
+ " <th>0</th>\n",
62
+ " <td>ptsd</td>\n",
63
+ " <td>8601tu</td>\n",
64
+ " <td>(15, 20)</td>\n",
65
+ " <td>He said he had not felt that way before, sugge...</td>\n",
66
+ " <td>1</td>\n",
67
+ " <td>0.8</td>\n",
68
+ " <td>1521614353</td>\n",
69
+ " </tr>\n",
70
+ " <tr>\n",
71
+ " <th>1</th>\n",
72
+ " <td>assistance</td>\n",
73
+ " <td>8lbrx9</td>\n",
74
+ " <td>(0, 5)</td>\n",
75
+ " <td>Hey there r/assistance, Not sure if this is th...</td>\n",
76
+ " <td>0</td>\n",
77
+ " <td>1.0</td>\n",
78
+ " <td>1527009817</td>\n",
79
+ " </tr>\n",
80
+ " <tr>\n",
81
+ " <th>2</th>\n",
82
+ " <td>ptsd</td>\n",
83
+ " <td>9ch1zh</td>\n",
84
+ " <td>(15, 20)</td>\n",
85
+ " <td>My mom then hit me with the newspaper and it s...</td>\n",
86
+ " <td>1</td>\n",
87
+ " <td>0.8</td>\n",
88
+ " <td>1535935605</td>\n",
89
+ " </tr>\n",
90
+ " </tbody>\n",
91
+ "</table>\n",
92
+ "</div>"
93
+ ],
94
+ "text/plain": [
95
+ " subreddit post_id sentence_range \n",
96
+ "0 ptsd 8601tu (15, 20) \\\n",
97
+ "1 assistance 8lbrx9 (0, 5) \n",
98
+ "2 ptsd 9ch1zh (15, 20) \n",
99
+ "\n",
100
+ " text label confidence \n",
101
+ "0 He said he had not felt that way before, sugge... 1 0.8 \\\n",
102
+ "1 Hey there r/assistance, Not sure if this is th... 0 1.0 \n",
103
+ "2 My mom then hit me with the newspaper and it s... 1 0.8 \n",
104
+ "\n",
105
+ " social_timestamp \n",
106
+ "0 1521614353 \n",
107
+ "1 1527009817 \n",
108
+ "2 1535935605 "
109
+ ]
110
+ },
111
+ "execution_count": 3,
112
+ "metadata": {},
113
+ "output_type": "execute_result"
114
+ }
115
+ ],
116
+ "source": [
117
+ "df.head(3)"
118
+ ]
119
+ },
120
+ {
121
+ "cell_type": "code",
122
+ "execution_count": 4,
123
+ "id": "a9ab0f47",
124
+ "metadata": {},
125
+ "outputs": [
126
+ {
127
+ "data": {
128
+ "text/plain": [
129
+ "'He said he had not felt that way before, suggeted I go rest and so ..TRIGGER AHEAD IF YOUI\\'RE A HYPOCONDRIAC LIKE ME: i decide to look up \"feelings of doom\" in hopes of maybe getting sucked into some rabbit hole of ludicrous conspiracy, a stupid \"are you psychic\" test or new age b.s., something I could even laugh at down the road. No, I ended up reading that this sense of doom can be indicative of various health ailments; one of which I am prone to.. So on top of my \"doom\" to my gloom..I am now f\\'n worried about my heart. I do happen to have a physical in 48 hours.'"
130
+ ]
131
+ },
132
+ "execution_count": 4,
133
+ "metadata": {},
134
+ "output_type": "execute_result"
135
+ }
136
+ ],
137
+ "source": [
138
+ "df['text'][0]"
139
+ ]
140
+ },
141
+ {
142
+ "cell_type": "code",
143
+ "execution_count": 5,
144
+ "id": "cddf65aa",
145
+ "metadata": {},
146
+ "outputs": [],
147
+ "source": [
148
+ "import nltk\n",
149
+ "import re\n",
150
+ "from urllib.parse import urlparse\n",
151
+ "from spacy import load\n",
152
+ "from nltk.stem import WordNetLemmatizer\n",
153
+ "from nltk.corpus import stopwords\n",
154
+ "from nltk.tokenize import word_tokenize"
155
+ ]
156
+ },
157
+ {
158
+ "cell_type": "code",
159
+ "execution_count": 6,
160
+ "id": "183c5b14",
161
+ "metadata": {},
162
+ "outputs": [
163
+ {
164
+ "name": "stdout",
165
+ "output_type": "stream",
166
+ "text": [
167
+ "cp: /usr/share/nltk_data/corpora/wordnet2022: No such file or directory\r\n"
168
+ ]
169
+ },
170
+ {
171
+ "name": "stderr",
172
+ "output_type": "stream",
173
+ "text": [
174
+ "[nltk_data] Downloading package omw-1.4 to\n",
175
+ "[nltk_data] /Users/nileshpal/nltk_data...\n",
176
+ "[nltk_data] Package omw-1.4 is already up-to-date!\n",
177
+ "[nltk_data] Downloading package wordnet to\n",
178
+ "[nltk_data] /Users/nileshpal/nltk_data...\n",
179
+ "[nltk_data] Package wordnet is already up-to-date!\n",
180
+ "[nltk_data] Downloading package wordnet2022 to\n",
181
+ "[nltk_data] /Users/nileshpal/nltk_data...\n",
182
+ "[nltk_data] Package wordnet2022 is already up-to-date!\n",
183
+ "[nltk_data] Downloading package punkt to /Users/nileshpal/nltk_data...\n",
184
+ "[nltk_data] Package punkt is already up-to-date!\n",
185
+ "[nltk_data] Downloading package stopwords to\n",
186
+ "[nltk_data] /Users/nileshpal/nltk_data...\n",
187
+ "[nltk_data] Package stopwords is already up-to-date!\n"
188
+ ]
189
+ }
190
+ ],
191
+ "source": [
192
+ "nltk.download('omw-1.4')\n",
193
+ "nltk.download('wordnet') \n",
194
+ "nltk.download('wordnet2022')\n",
195
+ "nltk.download('punkt')\n",
196
+ "nltk.download('stopwords')\n",
197
+ "! cp -rf /usr/share/nltk_data/corpora/wordnet2022 /usr/share/nltk_data/corpora/wordnet"
198
+ ]
199
+ },
200
+ {
201
+ "cell_type": "code",
202
+ "execution_count": 7,
203
+ "id": "473ea714",
204
+ "metadata": {},
205
+ "outputs": [
206
+ {
207
+ "name": "stdout",
208
+ "output_type": "stream",
209
+ "text": [
210
+ "['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', \"you're\", \"you've\", \"you'll\", \"you'd\", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', \"she's\", 'her', 'hers', 'herself', 'it', \"it's\", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', \"that'll\", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', \"don't\", 'should', \"should've\", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', \"aren't\", 'couldn', \"couldn't\", 'didn', \"didn't\", 'doesn', \"doesn't\", 'hadn', \"hadn't\", 'hasn', \"hasn't\", 'haven', \"haven't\", 'isn', \"isn't\", 'ma', 'mightn', \"mightn't\", 'mustn', \"mustn't\", 'needn', \"needn't\", 'shan', \"shan't\", 'shouldn', \"shouldn't\", 'wasn', \"wasn't\", 'weren', \"weren't\", 'won', \"won't\", 'wouldn', \"wouldn't\"]\n"
211
+ ]
212
+ }
213
+ ],
214
+ "source": [
215
+ "lemmatizer = WordNetLemmatizer()\n",
216
+ "stop_words = list(stopwords.words('english'))\n",
217
+ "print(stop_words)"
218
+ ]
219
+ },
220
+ {
221
+ "cell_type": "code",
222
+ "execution_count": 8,
223
+ "id": "d119482b",
224
+ "metadata": {},
225
+ "outputs": [],
226
+ "source": [
227
+ "def textProcess(sent):\n",
228
+ " try:\n",
229
+ " sent = re.sub('[][)(]',' ',sent)\n",
230
+ "\n",
231
+ " sent = [word for word in sent.split() if not urlparse(word).scheme]\n",
232
+ " sent = ' '.join(sent)\n",
233
+ "\n",
234
+ "\n",
235
+ " sent = re.sub(r'\\@\\w+','',sent)\n",
236
+ "\n",
237
+ "\n",
238
+ " sent = re.sub(re.compile(\"<.*?>\"),'',sent)\n",
239
+ "\n",
240
+ " sent = re.sub(\"[^A-Za-z0-9]\",' ',sent)\n",
241
+ "\n",
242
+ " sent = sent.lower()\n",
243
+ " \n",
244
+ " sent = [word.strip() for word in sent.split()]\n",
245
+ " sent = ' '.join(sent)\n",
246
+ "\n",
247
+ " tokens = word_tokenize(sent)\n",
248
+ " \n",
249
+ " for word in tokens.copy():\n",
250
+ " if word in stop_words:\n",
251
+ " tokens.remove(word)\n",
252
+ " \n",
253
+ " sent = [lemmatizer.lemmatize(word) for word in tokens]\n",
254
+ " sent = ' '.join(sent)\n",
255
+ " return sent\n",
256
+ " \n",
257
+ " except Exception as ex:\n",
258
+ " print(sent,\"\\n\")\n",
259
+ " print(\"Error \",ex)"
260
+ ]
261
+ },
262
+ {
263
+ "cell_type": "code",
264
+ "execution_count": 9,
265
+ "id": "f2b42f59",
266
+ "metadata": {},
267
+ "outputs": [],
268
+ "source": [
269
+ "df['processed_text'] = df['text'].apply(lambda text: textProcess(text))"
270
+ ]
271
+ },
272
+ {
273
+ "cell_type": "code",
274
+ "execution_count": 10,
275
+ "id": "074cb312",
276
+ "metadata": {},
277
+ "outputs": [],
278
+ "source": [
279
+ "from sklearn.feature_extraction.text import TfidfVectorizer\n",
280
+ "MIN_DF = 1 "
281
+ ]
282
+ },
283
+ {
284
+ "cell_type": "code",
285
+ "execution_count": 11,
286
+ "id": "39f0bd84",
287
+ "metadata": {},
288
+ "outputs": [
289
+ {
290
+ "data": {
291
+ "text/plain": [
292
+ "array([[0, 0, 0, ..., 0, 0, 0],\n",
293
+ " [0, 0, 0, ..., 0, 0, 0],\n",
294
+ " [0, 0, 0, ..., 0, 0, 0],\n",
295
+ " ...,\n",
296
+ " [0, 0, 0, ..., 0, 0, 0],\n",
297
+ " [0, 0, 0, ..., 0, 0, 0],\n",
298
+ " [0, 0, 0, ..., 0, 0, 0]])"
299
+ ]
300
+ },
301
+ "execution_count": 11,
302
+ "metadata": {},
303
+ "output_type": "execute_result"
304
+ }
305
+ ],
306
+ "source": [
307
+ "from sklearn.feature_extraction.text import CountVectorizer\n",
308
+ "cv = CountVectorizer(min_df=MIN_DF)\n",
309
+ "cv_df = cv.fit_transform(df['processed_text'])\n",
310
+ "cv_df.toarray()"
311
+ ]
312
+ },
313
+ {
314
+ "cell_type": "code",
315
+ "execution_count": 12,
316
+ "id": "3c43aa15",
317
+ "metadata": {},
318
+ "outputs": [
319
+ {
320
+ "data": {
321
+ "text/html": [
322
+ "<div>\n",
323
+ "<style scoped>\n",
324
+ " .dataframe tbody tr th:only-of-type {\n",
325
+ " vertical-align: middle;\n",
326
+ " }\n",
327
+ "\n",
328
+ " .dataframe tbody tr th {\n",
329
+ " vertical-align: top;\n",
330
+ " }\n",
331
+ "\n",
332
+ " .dataframe thead th {\n",
333
+ " text-align: right;\n",
334
+ " }\n",
335
+ "</style>\n",
336
+ "<table border=\"1\" class=\"dataframe\">\n",
337
+ " <thead>\n",
338
+ " <tr style=\"text-align: right;\">\n",
339
+ " <th></th>\n",
340
+ " <th>00</th>\n",
341
+ " <th>000</th>\n",
342
+ " <th>02</th>\n",
343
+ " <th>06</th>\n",
344
+ " <th>10</th>\n",
345
+ " <th>100</th>\n",
346
+ " <th>1000</th>\n",
347
+ " <th>100kg</th>\n",
348
+ " <th>100mg</th>\n",
349
+ " <th>100x</th>\n",
350
+ " <th>...</th>\n",
351
+ " <th>zines</th>\n",
352
+ " <th>zinsser</th>\n",
353
+ " <th>zip</th>\n",
354
+ " <th>zofran</th>\n",
355
+ " <th>zoloft</th>\n",
356
+ " <th>zombie</th>\n",
357
+ " <th>zone</th>\n",
358
+ " <th>zoo</th>\n",
359
+ " <th>zuko</th>\n",
360
+ " <th>zumba</th>\n",
361
+ " </tr>\n",
362
+ " </thead>\n",
363
+ " <tbody>\n",
364
+ " <tr>\n",
365
+ " <th>0</th>\n",
366
+ " <td>0</td>\n",
367
+ " <td>0</td>\n",
368
+ " <td>0</td>\n",
369
+ " <td>0</td>\n",
370
+ " <td>0</td>\n",
371
+ " <td>0</td>\n",
372
+ " <td>0</td>\n",
373
+ " <td>0</td>\n",
374
+ " <td>0</td>\n",
375
+ " <td>0</td>\n",
376
+ " <td>...</td>\n",
377
+ " <td>0</td>\n",
378
+ " <td>0</td>\n",
379
+ " <td>0</td>\n",
380
+ " <td>0</td>\n",
381
+ " <td>0</td>\n",
382
+ " <td>0</td>\n",
383
+ " <td>0</td>\n",
384
+ " <td>0</td>\n",
385
+ " <td>0</td>\n",
386
+ " <td>0</td>\n",
387
+ " </tr>\n",
388
+ " <tr>\n",
389
+ " <th>1</th>\n",
390
+ " <td>0</td>\n",
391
+ " <td>0</td>\n",
392
+ " <td>0</td>\n",
393
+ " <td>0</td>\n",
394
+ " <td>0</td>\n",
395
+ " <td>0</td>\n",
396
+ " <td>0</td>\n",
397
+ " <td>0</td>\n",
398
+ " <td>0</td>\n",
399
+ " <td>0</td>\n",
400
+ " <td>...</td>\n",
401
+ " <td>0</td>\n",
402
+ " <td>0</td>\n",
403
+ " <td>0</td>\n",
404
+ " <td>0</td>\n",
405
+ " <td>0</td>\n",
406
+ " <td>0</td>\n",
407
+ " <td>0</td>\n",
408
+ " <td>0</td>\n",
409
+ " <td>0</td>\n",
410
+ " <td>0</td>\n",
411
+ " </tr>\n",
412
+ " <tr>\n",
413
+ " <th>2</th>\n",
414
+ " <td>0</td>\n",
415
+ " <td>0</td>\n",
416
+ " <td>0</td>\n",
417
+ " <td>0</td>\n",
418
+ " <td>0</td>\n",
419
+ " <td>0</td>\n",
420
+ " <td>0</td>\n",
421
+ " <td>0</td>\n",
422
+ " <td>0</td>\n",
423
+ " <td>0</td>\n",
424
+ " <td>...</td>\n",
425
+ " <td>0</td>\n",
426
+ " <td>0</td>\n",
427
+ " <td>0</td>\n",
428
+ " <td>0</td>\n",
429
+ " <td>0</td>\n",
430
+ " <td>0</td>\n",
431
+ " <td>0</td>\n",
432
+ " <td>0</td>\n",
433
+ " <td>0</td>\n",
434
+ " <td>0</td>\n",
435
+ " </tr>\n",
436
+ " </tbody>\n",
437
+ "</table>\n",
438
+ "<p>3 rows × 10267 columns</p>\n",
439
+ "</div>"
440
+ ],
441
+ "text/plain": [
442
+ " 00 000 02 06 10 100 1000 100kg 100mg 100x ... zines zinsser \n",
443
+ "0 0 0 0 0 0 0 0 0 0 0 ... 0 0 \\\n",
444
+ "1 0 0 0 0 0 0 0 0 0 0 ... 0 0 \n",
445
+ "2 0 0 0 0 0 0 0 0 0 0 ... 0 0 \n",
446
+ "\n",
447
+ " zip zofran zoloft zombie zone zoo zuko zumba \n",
448
+ "0 0 0 0 0 0 0 0 0 \n",
449
+ "1 0 0 0 0 0 0 0 0 \n",
450
+ "2 0 0 0 0 0 0 0 0 \n",
451
+ "\n",
452
+ "[3 rows x 10267 columns]"
453
+ ]
454
+ },
455
+ "execution_count": 12,
456
+ "metadata": {},
457
+ "output_type": "execute_result"
458
+ }
459
+ ],
460
+ "source": [
461
+ "cv_df = pd.DataFrame(cv_df.toarray(),columns=cv.get_feature_names_out())\n",
462
+ "cv_df.head(3)"
463
+ ]
464
+ },
465
+ {
466
+ "cell_type": "code",
467
+ "execution_count": 13,
468
+ "id": "10526ca5",
469
+ "metadata": {},
470
+ "outputs": [
471
+ {
472
+ "data": {
473
+ "text/plain": [
474
+ "array([[0., 0., 0., ..., 0., 0., 0.],\n",
475
+ " [0., 0., 0., ..., 0., 0., 0.],\n",
476
+ " [0., 0., 0., ..., 0., 0., 0.],\n",
477
+ " ...,\n",
478
+ " [0., 0., 0., ..., 0., 0., 0.],\n",
479
+ " [0., 0., 0., ..., 0., 0., 0.],\n",
480
+ " [0., 0., 0., ..., 0., 0., 0.]])"
481
+ ]
482
+ },
483
+ "execution_count": 13,
484
+ "metadata": {},
485
+ "output_type": "execute_result"
486
+ }
487
+ ],
488
+ "source": [
489
+ "tf = TfidfVectorizer(min_df=MIN_DF)\n",
490
+ "tf_df = tf.fit_transform(df['processed_text'])\n",
491
+ "tf_df.toarray()"
492
+ ]
493
+ },
494
+ {
495
+ "cell_type": "code",
496
+ "execution_count": 14,
497
+ "id": "2de0d172",
498
+ "metadata": {},
499
+ "outputs": [
500
+ {
501
+ "data": {
502
+ "text/plain": [
503
+ "['tfidf_vectorizer.joblib']"
504
+ ]
505
+ },
506
+ "execution_count": 14,
507
+ "metadata": {},
508
+ "output_type": "execute_result"
509
+ }
510
+ ],
511
+ "source": [
512
+ "import joblib\n",
513
+ "joblib.dump(tf, 'tfidf_vectorizer.joblib')"
514
+ ]
515
+ },
516
+ {
517
+ "cell_type": "code",
518
+ "execution_count": 15,
519
+ "id": "869aefb8",
520
+ "metadata": {},
521
+ "outputs": [
522
+ {
523
+ "data": {
524
+ "text/html": [
525
+ "<div>\n",
526
+ "<style scoped>\n",
527
+ " .dataframe tbody tr th:only-of-type {\n",
528
+ " vertical-align: middle;\n",
529
+ " }\n",
530
+ "\n",
531
+ " .dataframe tbody tr th {\n",
532
+ " vertical-align: top;\n",
533
+ " }\n",
534
+ "\n",
535
+ " .dataframe thead th {\n",
536
+ " text-align: right;\n",
537
+ " }\n",
538
+ "</style>\n",
539
+ "<table border=\"1\" class=\"dataframe\">\n",
540
+ " <thead>\n",
541
+ " <tr style=\"text-align: right;\">\n",
542
+ " <th></th>\n",
543
+ " <th>00</th>\n",
544
+ " <th>000</th>\n",
545
+ " <th>02</th>\n",
546
+ " <th>06</th>\n",
547
+ " <th>10</th>\n",
548
+ " <th>100</th>\n",
549
+ " <th>1000</th>\n",
550
+ " <th>100kg</th>\n",
551
+ " <th>100mg</th>\n",
552
+ " <th>100x</th>\n",
553
+ " <th>...</th>\n",
554
+ " <th>zines</th>\n",
555
+ " <th>zinsser</th>\n",
556
+ " <th>zip</th>\n",
557
+ " <th>zofran</th>\n",
558
+ " <th>zoloft</th>\n",
559
+ " <th>zombie</th>\n",
560
+ " <th>zone</th>\n",
561
+ " <th>zoo</th>\n",
562
+ " <th>zuko</th>\n",
563
+ " <th>zumba</th>\n",
564
+ " </tr>\n",
565
+ " </thead>\n",
566
+ " <tbody>\n",
567
+ " <tr>\n",
568
+ " <th>0</th>\n",
569
+ " <td>0.0</td>\n",
570
+ " <td>0.0</td>\n",
571
+ " <td>0.0</td>\n",
572
+ " <td>0.0</td>\n",
573
+ " <td>0.0</td>\n",
574
+ " <td>0.0</td>\n",
575
+ " <td>0.0</td>\n",
576
+ " <td>0.0</td>\n",
577
+ " <td>0.0</td>\n",
578
+ " <td>0.0</td>\n",
579
+ " <td>...</td>\n",
580
+ " <td>0.0</td>\n",
581
+ " <td>0.0</td>\n",
582
+ " <td>0.0</td>\n",
583
+ " <td>0.0</td>\n",
584
+ " <td>0.0</td>\n",
585
+ " <td>0.0</td>\n",
586
+ " <td>0.0</td>\n",
587
+ " <td>0.0</td>\n",
588
+ " <td>0.0</td>\n",
589
+ " <td>0.0</td>\n",
590
+ " </tr>\n",
591
+ " <tr>\n",
592
+ " <th>1</th>\n",
593
+ " <td>0.0</td>\n",
594
+ " <td>0.0</td>\n",
595
+ " <td>0.0</td>\n",
596
+ " <td>0.0</td>\n",
597
+ " <td>0.0</td>\n",
598
+ " <td>0.0</td>\n",
599
+ " <td>0.0</td>\n",
600
+ " <td>0.0</td>\n",
601
+ " <td>0.0</td>\n",
602
+ " <td>0.0</td>\n",
603
+ " <td>...</td>\n",
604
+ " <td>0.0</td>\n",
605
+ " <td>0.0</td>\n",
606
+ " <td>0.0</td>\n",
607
+ " <td>0.0</td>\n",
608
+ " <td>0.0</td>\n",
609
+ " <td>0.0</td>\n",
610
+ " <td>0.0</td>\n",
611
+ " <td>0.0</td>\n",
612
+ " <td>0.0</td>\n",
613
+ " <td>0.0</td>\n",
614
+ " </tr>\n",
615
+ " <tr>\n",
616
+ " <th>2</th>\n",
617
+ " <td>0.0</td>\n",
618
+ " <td>0.0</td>\n",
619
+ " <td>0.0</td>\n",
620
+ " <td>0.0</td>\n",
621
+ " <td>0.0</td>\n",
622
+ " <td>0.0</td>\n",
623
+ " <td>0.0</td>\n",
624
+ " <td>0.0</td>\n",
625
+ " <td>0.0</td>\n",
626
+ " <td>0.0</td>\n",
627
+ " <td>...</td>\n",
628
+ " <td>0.0</td>\n",
629
+ " <td>0.0</td>\n",
630
+ " <td>0.0</td>\n",
631
+ " <td>0.0</td>\n",
632
+ " <td>0.0</td>\n",
633
+ " <td>0.0</td>\n",
634
+ " <td>0.0</td>\n",
635
+ " <td>0.0</td>\n",
636
+ " <td>0.0</td>\n",
637
+ " <td>0.0</td>\n",
638
+ " </tr>\n",
639
+ " </tbody>\n",
640
+ "</table>\n",
641
+ "<p>3 rows × 10267 columns</p>\n",
642
+ "</div>"
643
+ ],
644
+ "text/plain": [
645
+ " 00 000 02 06 10 100 1000 100kg 100mg 100x ... zines \n",
646
+ "0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 \\\n",
647
+ "1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 \n",
648
+ "2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 \n",
649
+ "\n",
650
+ " zinsser zip zofran zoloft zombie zone zoo zuko zumba \n",
651
+ "0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
652
+ "1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
653
+ "2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 \n",
654
+ "\n",
655
+ "[3 rows x 10267 columns]"
656
+ ]
657
+ },
658
+ "execution_count": 15,
659
+ "metadata": {},
660
+ "output_type": "execute_result"
661
+ }
662
+ ],
663
+ "source": [
664
+ "tf_df = pd.DataFrame(tf_df.toarray(),columns=tf.get_feature_names_out())\n",
665
+ "tf_df.head(3)"
666
+ ]
667
+ },
668
+ {
669
+ "cell_type": "code",
670
+ "execution_count": 16,
671
+ "id": "95bddea7",
672
+ "metadata": {},
673
+ "outputs": [
674
+ {
675
+ "data": {
676
+ "text/html": [
677
+ "<div>\n",
678
+ "<style scoped>\n",
679
+ " .dataframe tbody tr th:only-of-type {\n",
680
+ " vertical-align: middle;\n",
681
+ " }\n",
682
+ "\n",
683
+ " .dataframe tbody tr th {\n",
684
+ " vertical-align: top;\n",
685
+ " }\n",
686
+ "\n",
687
+ " .dataframe thead th {\n",
688
+ " text-align: right;\n",
689
+ " }\n",
690
+ "</style>\n",
691
+ "<table border=\"1\" class=\"dataframe\">\n",
692
+ " <thead>\n",
693
+ " <tr style=\"text-align: right;\">\n",
694
+ " <th></th>\n",
695
+ " <th>00</th>\n",
696
+ " <th>000</th>\n",
697
+ " <th>02</th>\n",
698
+ " <th>06</th>\n",
699
+ " <th>10</th>\n",
700
+ " <th>100</th>\n",
701
+ " <th>1000</th>\n",
702
+ " <th>100kg</th>\n",
703
+ " <th>100mg</th>\n",
704
+ " <th>100x</th>\n",
705
+ " <th>...</th>\n",
706
+ " <th>zines</th>\n",
707
+ " <th>zinsser</th>\n",
708
+ " <th>zip</th>\n",
709
+ " <th>zofran</th>\n",
710
+ " <th>zoloft</th>\n",
711
+ " <th>zombie</th>\n",
712
+ " <th>zone</th>\n",
713
+ " <th>zoo</th>\n",
714
+ " <th>zuko</th>\n",
715
+ " <th>zumba</th>\n",
716
+ " </tr>\n",
717
+ " </thead>\n",
718
+ " <tbody>\n",
719
+ " <tr>\n",
720
+ " <th>count</th>\n",
721
+ " <td>2838.000000</td>\n",
722
+ " <td>2838.000000</td>\n",
723
+ " <td>2838.000000</td>\n",
724
+ " <td>2838.000000</td>\n",
725
+ " <td>2838.000000</td>\n",
726
+ " <td>2838.000000</td>\n",
727
+ " <td>2838.000000</td>\n",
728
+ " <td>2838.000000</td>\n",
729
+ " <td>2838.000000</td>\n",
730
+ " <td>2838.000000</td>\n",
731
+ " <td>...</td>\n",
732
+ " <td>2838.000000</td>\n",
733
+ " <td>2838.000000</td>\n",
734
+ " <td>2838.000000</td>\n",
735
+ " <td>2838.000000</td>\n",
736
+ " <td>2838.000000</td>\n",
737
+ " <td>2838.000000</td>\n",
738
+ " <td>2838.000000</td>\n",
739
+ " <td>2838.000000</td>\n",
740
+ " <td>2838.000000</td>\n",
741
+ " <td>2838.000000</td>\n",
742
+ " </tr>\n",
743
+ " <tr>\n",
744
+ " <th>mean</th>\n",
745
+ " <td>0.000452</td>\n",
746
+ " <td>0.000548</td>\n",
747
+ " <td>0.000109</td>\n",
748
+ " <td>0.000069</td>\n",
749
+ " <td>0.003342</td>\n",
750
+ " <td>0.001784</td>\n",
751
+ " <td>0.000576</td>\n",
752
+ " <td>0.000106</td>\n",
753
+ " <td>0.000115</td>\n",
754
+ " <td>0.000067</td>\n",
755
+ " <td>...</td>\n",
756
+ " <td>0.000078</td>\n",
757
+ " <td>0.000072</td>\n",
758
+ " <td>0.000204</td>\n",
759
+ " <td>0.000078</td>\n",
760
+ " <td>0.000715</td>\n",
761
+ " <td>0.000126</td>\n",
762
+ " <td>0.000245</td>\n",
763
+ " <td>0.000089</td>\n",
764
+ " <td>0.000054</td>\n",
765
+ " <td>0.000040</td>\n",
766
+ " </tr>\n",
767
+ " <tr>\n",
768
+ " <th>std</th>\n",
769
+ " <td>0.011158</td>\n",
770
+ " <td>0.009998</td>\n",
771
+ " <td>0.005801</td>\n",
772
+ " <td>0.003671</td>\n",
773
+ " <td>0.021379</td>\n",
774
+ " <td>0.017215</td>\n",
775
+ " <td>0.011156</td>\n",
776
+ " <td>0.005624</td>\n",
777
+ " <td>0.004534</td>\n",
778
+ " <td>0.003573</td>\n",
779
+ " <td>...</td>\n",
780
+ " <td>0.004145</td>\n",
781
+ " <td>0.003841</td>\n",
782
+ " <td>0.007786</td>\n",
783
+ " <td>0.004157</td>\n",
784
+ " <td>0.011797</td>\n",
785
+ " <td>0.004733</td>\n",
786
+ " <td>0.006851</td>\n",
787
+ " <td>0.004754</td>\n",
788
+ " <td>0.002873</td>\n",
789
+ " <td>0.002105</td>\n",
790
+ " </tr>\n",
791
+ " <tr>\n",
792
+ " <th>min</th>\n",
793
+ " <td>0.000000</td>\n",
794
+ " <td>0.000000</td>\n",
795
+ " <td>0.000000</td>\n",
796
+ " <td>0.000000</td>\n",
797
+ " <td>0.000000</td>\n",
798
+ " <td>0.000000</td>\n",
799
+ " <td>0.000000</td>\n",
800
+ " <td>0.000000</td>\n",
801
+ " <td>0.000000</td>\n",
802
+ " <td>0.000000</td>\n",
803
+ " <td>...</td>\n",
804
+ " <td>0.000000</td>\n",
805
+ " <td>0.000000</td>\n",
806
+ " <td>0.000000</td>\n",
807
+ " <td>0.000000</td>\n",
808
+ " <td>0.000000</td>\n",
809
+ " <td>0.000000</td>\n",
810
+ " <td>0.000000</td>\n",
811
+ " <td>0.000000</td>\n",
812
+ " <td>0.000000</td>\n",
813
+ " <td>0.000000</td>\n",
814
+ " </tr>\n",
815
+ " <tr>\n",
816
+ " <th>25%</th>\n",
817
+ " <td>0.000000</td>\n",
818
+ " <td>0.000000</td>\n",
819
+ " <td>0.000000</td>\n",
820
+ " <td>0.000000</td>\n",
821
+ " <td>0.000000</td>\n",
822
+ " <td>0.000000</td>\n",
823
+ " <td>0.000000</td>\n",
824
+ " <td>0.000000</td>\n",
825
+ " <td>0.000000</td>\n",
826
+ " <td>0.000000</td>\n",
827
+ " <td>...</td>\n",
828
+ " <td>0.000000</td>\n",
829
+ " <td>0.000000</td>\n",
830
+ " <td>0.000000</td>\n",
831
+ " <td>0.000000</td>\n",
832
+ " <td>0.000000</td>\n",
833
+ " <td>0.000000</td>\n",
834
+ " <td>0.000000</td>\n",
835
+ " <td>0.000000</td>\n",
836
+ " <td>0.000000</td>\n",
837
+ " <td>0.000000</td>\n",
838
+ " </tr>\n",
839
+ " <tr>\n",
840
+ " <th>50%</th>\n",
841
+ " <td>0.000000</td>\n",
842
+ " <td>0.000000</td>\n",
843
+ " <td>0.000000</td>\n",
844
+ " <td>0.000000</td>\n",
845
+ " <td>0.000000</td>\n",
846
+ " <td>0.000000</td>\n",
847
+ " <td>0.000000</td>\n",
848
+ " <td>0.000000</td>\n",
849
+ " <td>0.000000</td>\n",
850
+ " <td>0.000000</td>\n",
851
+ " <td>...</td>\n",
852
+ " <td>0.000000</td>\n",
853
+ " <td>0.000000</td>\n",
854
+ " <td>0.000000</td>\n",
855
+ " <td>0.000000</td>\n",
856
+ " <td>0.000000</td>\n",
857
+ " <td>0.000000</td>\n",
858
+ " <td>0.000000</td>\n",
859
+ " <td>0.000000</td>\n",
860
+ " <td>0.000000</td>\n",
861
+ " <td>0.000000</td>\n",
862
+ " </tr>\n",
863
+ " <tr>\n",
864
+ " <th>75%</th>\n",
865
+ " <td>0.000000</td>\n",
866
+ " <td>0.000000</td>\n",
867
+ " <td>0.000000</td>\n",
868
+ " <td>0.000000</td>\n",
869
+ " <td>0.000000</td>\n",
870
+ " <td>0.000000</td>\n",
871
+ " <td>0.000000</td>\n",
872
+ " <td>0.000000</td>\n",
873
+ " <td>0.000000</td>\n",
874
+ " <td>0.000000</td>\n",
875
+ " <td>...</td>\n",
876
+ " <td>0.000000</td>\n",
877
+ " <td>0.000000</td>\n",
878
+ " <td>0.000000</td>\n",
879
+ " <td>0.000000</td>\n",
880
+ " <td>0.000000</td>\n",
881
+ " <td>0.000000</td>\n",
882
+ " <td>0.000000</td>\n",
883
+ " <td>0.000000</td>\n",
884
+ " <td>0.000000</td>\n",
885
+ " <td>0.000000</td>\n",
886
+ " </tr>\n",
887
+ " <tr>\n",
888
+ " <th>max</th>\n",
889
+ " <td>0.348349</td>\n",
890
+ " <td>0.327600</td>\n",
891
+ " <td>0.309059</td>\n",
892
+ " <td>0.195550</td>\n",
893
+ " <td>0.259369</td>\n",
894
+ " <td>0.267281</td>\n",
895
+ " <td>0.310333</td>\n",
896
+ " <td>0.299611</td>\n",
897
+ " <td>0.215011</td>\n",
898
+ " <td>0.190366</td>\n",
899
+ " <td>...</td>\n",
900
+ " <td>0.220793</td>\n",
901
+ " <td>0.204637</td>\n",
902
+ " <td>0.336077</td>\n",
903
+ " <td>0.221471</td>\n",
904
+ " <td>0.306537</td>\n",
905
+ " <td>0.183347</td>\n",
906
+ " <td>0.268149</td>\n",
907
+ " <td>0.253283</td>\n",
908
+ " <td>0.153067</td>\n",
909
+ " <td>0.112136</td>\n",
910
+ " </tr>\n",
911
+ " </tbody>\n",
912
+ "</table>\n",
913
+ "<p>8 rows × 10267 columns</p>\n",
914
+ "</div>"
915
+ ],
916
+ "text/plain": [
917
+ " 00 000 02 06 10 \n",
918
+ "count 2838.000000 2838.000000 2838.000000 2838.000000 2838.000000 \\\n",
919
+ "mean 0.000452 0.000548 0.000109 0.000069 0.003342 \n",
920
+ "std 0.011158 0.009998 0.005801 0.003671 0.021379 \n",
921
+ "min 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
922
+ "25% 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
923
+ "50% 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
924
+ "75% 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
925
+ "max 0.348349 0.327600 0.309059 0.195550 0.259369 \n",
926
+ "\n",
927
+ " 100 1000 100kg 100mg 100x ... \n",
928
+ "count 2838.000000 2838.000000 2838.000000 2838.000000 2838.000000 ... \\\n",
929
+ "mean 0.001784 0.000576 0.000106 0.000115 0.000067 ... \n",
930
+ "std 0.017215 0.011156 0.005624 0.004534 0.003573 ... \n",
931
+ "min 0.000000 0.000000 0.000000 0.000000 0.000000 ... \n",
932
+ "25% 0.000000 0.000000 0.000000 0.000000 0.000000 ... \n",
933
+ "50% 0.000000 0.000000 0.000000 0.000000 0.000000 ... \n",
934
+ "75% 0.000000 0.000000 0.000000 0.000000 0.000000 ... \n",
935
+ "max 0.267281 0.310333 0.299611 0.215011 0.190366 ... \n",
936
+ "\n",
937
+ " zines zinsser zip zofran zoloft \n",
938
+ "count 2838.000000 2838.000000 2838.000000 2838.000000 2838.000000 \\\n",
939
+ "mean 0.000078 0.000072 0.000204 0.000078 0.000715 \n",
940
+ "std 0.004145 0.003841 0.007786 0.004157 0.011797 \n",
941
+ "min 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
942
+ "25% 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
943
+ "50% 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
944
+ "75% 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
945
+ "max 0.220793 0.204637 0.336077 0.221471 0.306537 \n",
946
+ "\n",
947
+ " zombie zone zoo zuko zumba \n",
948
+ "count 2838.000000 2838.000000 2838.000000 2838.000000 2838.000000 \n",
949
+ "mean 0.000126 0.000245 0.000089 0.000054 0.000040 \n",
950
+ "std 0.004733 0.006851 0.004754 0.002873 0.002105 \n",
951
+ "min 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
952
+ "25% 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
953
+ "50% 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
954
+ "75% 0.000000 0.000000 0.000000 0.000000 0.000000 \n",
955
+ "max 0.183347 0.268149 0.253283 0.153067 0.112136 \n",
956
+ "\n",
957
+ "[8 rows x 10267 columns]"
958
+ ]
959
+ },
960
+ "execution_count": 16,
961
+ "metadata": {},
962
+ "output_type": "execute_result"
963
+ }
964
+ ],
965
+ "source": [
966
+ "tf_df.describe()\n"
967
+ ]
968
+ },
969
+ {
970
+ "cell_type": "code",
971
+ "execution_count": 17,
972
+ "id": "b31e122a",
973
+ "metadata": {},
974
+ "outputs": [
975
+ {
976
+ "data": {
977
+ "text/plain": [
978
+ "(2838, 10267)"
979
+ ]
980
+ },
981
+ "execution_count": 17,
982
+ "metadata": {},
983
+ "output_type": "execute_result"
984
+ }
985
+ ],
986
+ "source": [
987
+ "tf_df.shape\n"
988
+ ]
989
+ },
990
+ {
991
+ "cell_type": "code",
992
+ "execution_count": 18,
993
+ "id": "3fabc741",
994
+ "metadata": {},
995
+ "outputs": [],
996
+ "source": [
997
+ "from sklearn.model_selection import train_test_split\n",
998
+ "from sklearn.linear_model import LogisticRegression"
999
+ ]
1000
+ },
1001
+ {
1002
+ "cell_type": "code",
1003
+ "execution_count": 19,
1004
+ "id": "2fce3cf9",
1005
+ "metadata": {},
1006
+ "outputs": [],
1007
+ "source": [
1008
+ "import warnings\n",
1009
+ "warnings.filterwarnings('ignore')"
1010
+ ]
1011
+ },
1012
+ {
1013
+ "cell_type": "code",
1014
+ "execution_count": 20,
1015
+ "id": "a40e9acb",
1016
+ "metadata": {},
1017
+ "outputs": [
1018
+ {
1019
+ "data": {
1020
+ "text/plain": [
1021
+ "((2128, 10267), (710,))"
1022
+ ]
1023
+ },
1024
+ "execution_count": 20,
1025
+ "metadata": {},
1026
+ "output_type": "execute_result"
1027
+ }
1028
+ ],
1029
+ "source": [
1030
+ "X_train,X_test,y_train,y_test = train_test_split(cv_df,df['label'],stratify=df['label'])\n",
1031
+ "X_train.shape,y_test.shape"
1032
+ ]
1033
+ },
1034
+ {
1035
+ "cell_type": "code",
1036
+ "execution_count": 21,
1037
+ "id": "3b57e047",
1038
+ "metadata": {},
1039
+ "outputs": [
1040
+ {
1041
+ "data": {
1042
+ "text/plain": [
1043
+ "(0.9976503759398496, 0.7619718309859155)"
1044
+ ]
1045
+ },
1046
+ "execution_count": 21,
1047
+ "metadata": {},
1048
+ "output_type": "execute_result"
1049
+ }
1050
+ ],
1051
+ "source": [
1052
+ "model_lr = LogisticRegression().fit(X_train,y_train)\n",
1053
+ "model_lr.score(X_train,y_train),model_lr.score(X_test,y_test)"
1054
+ ]
1055
+ },
1056
+ {
1057
+ "cell_type": "code",
1058
+ "execution_count": 22,
1059
+ "id": "999b308a",
1060
+ "metadata": {},
1061
+ "outputs": [
1062
+ {
1063
+ "data": {
1064
+ "text/plain": [
1065
+ "((2128, 10267), (710,))"
1066
+ ]
1067
+ },
1068
+ "execution_count": 22,
1069
+ "metadata": {},
1070
+ "output_type": "execute_result"
1071
+ }
1072
+ ],
1073
+ "source": [
1074
+ "X_train1,X_test1,y_train1,y_test1 = train_test_split(tf_df,df['label'],stratify=df['label'])\n",
1075
+ "X_train1.shape,y_test1.shape"
1076
+ ]
1077
+ },
1078
+ {
1079
+ "cell_type": "code",
1080
+ "execution_count": 23,
1081
+ "id": "0017098c",
1082
+ "metadata": {},
1083
+ "outputs": [
1084
+ {
1085
+ "data": {
1086
+ "text/plain": [
1087
+ "(0.9060150375939849, 0.7788732394366197)"
1088
+ ]
1089
+ },
1090
+ "execution_count": 23,
1091
+ "metadata": {},
1092
+ "output_type": "execute_result"
1093
+ }
1094
+ ],
1095
+ "source": [
1096
+ "model_lr = LogisticRegression().fit(X_train1,y_train1)\n",
1097
+ "model_lr.score(X_train1,y_train1),model_lr.score(X_test1,y_test1)"
1098
+ ]
1099
+ },
1100
+ {
1101
+ "cell_type": "code",
1102
+ "execution_count": 24,
1103
+ "id": "12ec4b9d",
1104
+ "metadata": {},
1105
+ "outputs": [
1106
+ {
1107
+ "data": {
1108
+ "text/plain": [
1109
+ "0.9002818886539817"
1110
+ ]
1111
+ },
1112
+ "execution_count": 24,
1113
+ "metadata": {},
1114
+ "output_type": "execute_result"
1115
+ }
1116
+ ],
1117
+ "source": [
1118
+ "model = LogisticRegression().fit(tf_df,df['label'])\n",
1119
+ "model.score(tf_df,df['label'])"
1120
+ ]
1121
+ },
1122
+ {
1123
+ "cell_type": "code",
1124
+ "execution_count": 25,
1125
+ "id": "0555b0d5",
1126
+ "metadata": {},
1127
+ "outputs": [],
1128
+ "source": [
1129
+ "def predictor(text):\n",
1130
+ " processed = textProcess(text)\n",
1131
+ " embedded_words = tf.transform([processed])\n",
1132
+ " res = model.predict(embedded_words)\n",
1133
+ " if res[0] == 1:\n",
1134
+ " res = \"this person is in stress\"\n",
1135
+ " else:\n",
1136
+ " res = \"this person is not in stress\"\n",
1137
+ " return res"
1138
+ ]
1139
+ },
1140
+ {
1141
+ "cell_type": "code",
1142
+ "execution_count": 26,
1143
+ "id": "2ad39b0f",
1144
+ "metadata": {},
1145
+ "outputs": [
1146
+ {
1147
+ "data": {
1148
+ "text/plain": [
1149
+ "['Stress identification NLP']"
1150
+ ]
1151
+ },
1152
+ "execution_count": 26,
1153
+ "metadata": {},
1154
+ "output_type": "execute_result"
1155
+ }
1156
+ ],
1157
+ "source": [
1158
+ "import joblib\n",
1159
+ "joblib.dump(model,\"Stress identification NLP\")"
1160
+ ]
1161
+ },
1162
+ {
1163
+ "cell_type": "code",
1164
+ "execution_count": 27,
1165
+ "id": "24f77186",
1166
+ "metadata": {},
1167
+ "outputs": [],
1168
+ "source": [
1169
+ "text = \"feeling wonderful\""
1170
+ ]
1171
+ },
1172
+ {
1173
+ "cell_type": "code",
1174
+ "execution_count": 28,
1175
+ "id": "9d8c2171",
1176
+ "metadata": {},
1177
+ "outputs": [
1178
+ {
1179
+ "name": "stdout",
1180
+ "output_type": "stream",
1181
+ "text": [
1182
+ "this person is not in stress\n"
1183
+ ]
1184
+ }
1185
+ ],
1186
+ "source": [
1187
+ "print(predictor(text))"
1188
+ ]
1189
+ },
1190
+ {
1191
+ "cell_type": "code",
1192
+ "execution_count": null,
1193
+ "id": "5b3b9972",
1194
+ "metadata": {},
1195
+ "outputs": [],
1196
+ "source": []
1197
+ }
1198
+ ],
1199
+ "metadata": {
1200
+ "kernelspec": {
1201
+ "display_name": "Python 3 (ipykernel)",
1202
+ "language": "python",
1203
+ "name": "python3"
1204
+ },
1205
+ "language_info": {
1206
+ "codemirror_mode": {
1207
+ "name": "ipython",
1208
+ "version": 3
1209
+ },
1210
+ "file_extension": ".py",
1211
+ "mimetype": "text/x-python",
1212
+ "name": "python",
1213
+ "nbconvert_exporter": "python",
1214
+ "pygments_lexer": "ipython3",
1215
+ "version": "3.8.16"
1216
+ }
1217
+ },
1218
+ "nbformat": 4,
1219
+ "nbformat_minor": 5
1220
+ }
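A minimal sketch of how the two artifacts saved above can be exercised outside the notebook (the file names come from this commit; the inline cleaning step is a simplified stand-in for the notebook's textProcess, not the exact pipeline):

import re
import joblib

# Load the artifacts uploaded in this commit (assumes they sit in the working directory)
model = joblib.load("Stress identification NLP")
vectorizer = joblib.load("tfidf_vectorizer.joblib")

def predict_stress(text: str) -> str:
    # Simplified stand-in for textProcess(): keep alphanumerics and lowercase
    cleaned = re.sub("[^A-Za-z0-9]", " ", text).lower()
    features = vectorizer.transform([cleaned])
    return "in stress" if model.predict(features)[0] == 1 else "not in stress"

print(predict_stress("feeling wonderful"))  # the notebook predicts this example as not in stress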
app.py ADDED
@@ -0,0 +1,104 @@
1
+ import streamlit as st
2
+ import joblib
3
+ import numpy as np
4
+ from sklearn.feature_extraction.text import TfidfVectorizer
5
+ # Import necessary libraries
6
+ import re
7
+ from urllib.parse import urlparse
8
+ from nltk.tokenize import word_tokenize
9
+ from nltk.corpus import stopwords
10
+ from nltk.stem import WordNetLemmatizer
11
+
12
+ # Initialize NLTK resources
13
+ stop_words = set(stopwords.words("english")) # Create a set of English stopwords
14
+ lemmatizer = WordNetLemmatizer() # Initialize the WordNet Lemmatizer
15
+
16
+ # Define a function for text processing
17
+ def textProcess(sent):
18
+ try:
19
+ if sent is None: # Check if the input is None
20
+ return "" # Return an empty string if input is None
21
+
22
+ # Remove square brackets, parentheses, and other special characters
23
+ sent = re.sub('[][)(]', ' ', sent)
24
+
25
+ # Tokenize the text into words
26
+ sent = [word for word in sent.split() if not urlparse(word).scheme]
27
+
28
+ # Join the words back into a sentence
29
+ sent = ' '.join(sent)
30
+
31
+ # Remove Twitter usernames (words starting with @)
32
+ sent = re.sub(r'\@\w+', '', sent)
33
+
34
+ # Remove HTML tags using regular expression
35
+ sent = re.sub(re.compile("<.*?>"), '', sent)
36
+
37
+ # Remove non-alphanumeric characters (keep only letters and numbers)
38
+ sent = re.sub("[^A-Za-z0-9]", ' ', sent)
39
+
40
+ # Convert text to lowercase
41
+ sent = sent.lower()
42
+
43
+ # Split the text into words, strip whitespace, and join them back into a sentence
44
+ sent = [word.strip() for word in sent.split()]
45
+ sent = ' '.join(sent)
46
+
47
+ # Tokenize the text again
48
+ tokens = word_tokenize(sent)
49
+
50
+ # Remove stop words
51
+ for word in tokens.copy():
52
+ if word in stop_words:
53
+ tokens.remove(word)
54
+
55
+ # Lemmatize the remaining words
56
+ sent = [lemmatizer.lemmatize(word) for word in tokens]
57
+
58
+ # Join the lemmatized words back into a sentence
59
+ sent = ' '.join(sent)
60
+
61
+ # Return the processed text
62
+ return sent
63
+
64
+ except Exception as ex:
65
+ print(sent, "\n")
66
+ print("Error ", ex)
67
+ return "" # Return an empty string in case of an error
68
+
69
+ # Inference setup: load the artifacts saved by the training notebook
70
+
71
+ # Load the pre-trained model from joblib
72
+ model = joblib.load('Stress identification NLP')
73
+
74
+ # Load the TF-IDF vectorizer used during training
75
+ tfidf_vectorizer = joblib.load('tfidf_vectorizer.joblib')
76
+
77
+ # Define the Streamlit web app
78
+ def main():
79
+ st.title("Stress Predictor Web App")
80
+ st.write("Enter some text to predict if the person is in stress or not.")
81
+
82
+ # Input text box
83
+ user_input = st.text_area("Enter text here:")
84
+
85
+ if st.button("Predict"):
86
+ if user_input:
87
+ # Process the input text
88
+ processed_text = textProcess(user_input)
89
+
90
+ # Use the same TF-IDF vectorizer to transform the input text
91
+ tfidf_text = tfidf_vectorizer.transform([processed_text])
92
+
93
+ # Make predictions using the loaded model
94
+ prediction = model.predict(tfidf_text)[0]
95
+
96
+ if prediction == 1:
97
+ result = "This person is in stress."
98
+ else:
99
+ result = "This person is not in stress."
100
+
101
+ st.write(result)
102
+
103
+ if __name__ == '__main__':
104
+ main()
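Note that app.py tokenizes with NLTK but never downloads the corpora it relies on, so a fresh environment likely needs the notebook's downloads repeated once before running `streamlit run app.py`. A sketch of that one-time setup, assuming a standard NLTK install:

import nltk

# One-time download of the resources app.py's preprocessing touches
for pkg in ("punkt", "stopwords", "wordnet", "omw-1.4"):
    nltk.download(pkg)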
tfidf_vectorizer.joblib ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b1411ab31d137fe43a4bda6ea175f16f706714b0ee14ea779455d00bd2f7df5d
3
+ size 298535