eduar03yauri committed
Commit c35a712 (1 parent: 357d236)
Update README.md
README.md
CHANGED
@@ -19,15 +19,18 @@ tags:
 
 ## Description
 
-Sent2vec can be used directly for English texts.
-
-
-
+Sent2vec can be used directly for English texts: it is enough to download the library and pass in the text to be encoded, since most
+of these models were trained on English. However, since this work deals with text in Spanish, it was necessary
+to train the model from scratch in that language. Training was carried out on the generated corpus ([in this repository](https://huggingface.co/datasets/oeg/CelebA_Sent2Vect_Sp))
+with the following process:
+- A corpus of Spanish sentences describing the characteristics of each face in the CelebA dataset was generated.
 A total of 192,209 sentences are available for training.
-- Apply a
-- the
-- 
-
+- Pre-processing was applied to remove accents. _Stopwords_ and connectors were retained as part of the sentence structure during training.
+- The _Sent2vec_ and _FastText_ libraries were installed and the parameters configured. The parameters were fixed empirically after several
+  tests: 4,800-dimensional feature vectors, 5,000 epochs, 200 threads, 2 n-grams and a learning rate of 0.05.
+
+Training took a total of 7 hours with all CPUs working at maximum performance.
+The result is a file with the _bin_ extension, which can be downloaded from this repository.
 
 ## How to use
 
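The pre-processing and training parameters listed in the Description text can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's script: the file names are hypothetical, the accent removal here also turns `ñ` into `n` (the README does not say how that character was handled), and the flag names follow the command-line example in the Sent2vec project's README, with "2 n-grams" assumed to correspond to `-wordNgrams 2`.

```python
import unicodedata
from typing import List


def strip_accents(text: str) -> str:
    """Remove combining accent marks; every word, including stopwords, is kept."""
    decomposed = unicodedata.normalize("NFD", text)
    # Assumption: all combining marks are dropped, so "ñ" also becomes "n".
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def sent2vec_train_args(corpus_path: str, output_prefix: str) -> List[str]:
    """Build the argument list for a Sent2vec training run.

    The values are the ones quoted in the README; the flag names are assumed
    to match the Sent2vec command-line interface.
    """
    return [
        "./fasttext", "sent2vec",
        "-input", corpus_path,
        "-output", output_prefix,
        "-dim", "4800",
        "-epoch", "5000",
        "-thread", "200",
        "-wordNgrams", "2",
        "-lr", "0.05",
    ]


if __name__ == "__main__":
    print(strip_accents("La mujer tiene el pelo castaño y está sonriendo"))
    # Hypothetical file names, for illustration only.
    print(" ".join(sent2vec_train_args("celeba_sentences_es.txt", "sent2vec_celeba_es")))
```

The argument list is only built and printed here; passing it to `subprocess.run` would start the actual training, which the README reports took about 7 hours.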