Commit c35a712 by eduar03yauri
1 parent: 357d236

Update README.md

Files changed (1): README.md (+11 -8)
@@ -19,15 +19,18 @@ tags:
 
  ## Description
 
- Sent2vec can be used directly for English texts. However, since this work is used with Spanish text, it has been necessary to train it
- previously using the generated corpus ([in this repository](https://huggingface.co/datasets/oeg/CelebA_Sent2Vect_Sp)) with the following process:
- - Initial preprocessing of the Spanish corpus. For this purpose, a new file has been developed in which each of the entries of the original
- corpus is saved and the other components, such as the names of the image it describes and symbols, are removed.
+ Sent2vec can be used directly for English texts. For this purpose, all you have to do is download the library and enter the text to be coded, since most
+ of these algorithms were trained using English as the original language. However, since this work is used with text in Spanish, it has been necessary
+ to train it from zero in this new language. This training was carried out using the generated corpus ([in this repository](https://huggingface.co/datasets/oeg/CelebA_Sent2Vect_Sp))
+ with the following process:
+ - A corpus composed of a set of descriptive sentences of characteristics of each of the faces of the CelebA dataset in Spanish has been generated.
  A total of 192,209 sentences are available for training.
- - Apply a second pre-processing consisting of removing accents. _stopwords_ and connectors were retained as part of
- - the sentence structure during training.
- - Configure the libraries, e.g., _Sent2vec_ and _FastText_, and the parameters. The parameters have been set empirically,
- being: 4,800 feature vector dimension, 5,000 epochs, 200 threads, 2 n-grams, and 0.05 learning rate.
+ - Apply a pre-processing consisting of removing accents. _stopwords_ and connectors were retained as part of the sentence structure during training.
+ - Install the libraries _Sent2vec_ and _FastText_, and configure the parameters. The parameters have been fixed empirically after several
+ - tests, being: 4,800 dimensions of feature vectors, 5,000 epochs, 200 threads, 2 n-grams and a learning rate of 0.05.
+
+ In this context, the total training time lasted 7 hours working with all CPUs at maximum performance.
+ As a result, it generates a _bin_ extension file which can be downloaded from this repository.
 
  ## How to use
 
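
The process described in the added README lines (accent removal, then training with the listed parameters) could be sketched roughly as follows. This is an illustration under assumptions, not the author's actual script: the file names `corpus_es.txt` and `celeba_sent2vec_es` are hypothetical, the trainer is assumed to be the `fasttext` binary built from the Sent2vec repository, and the accent stripping shown here removes every combining mark, so `ñ` also becomes `n`. The README does not say whether the original preprocessing did that.

```python
import unicodedata
from typing import List


def strip_accents(text: str) -> str:
    """Remove combining diacritical marks; all other characters are kept.

    Assumption: every combining mark is dropped, so "ñ" becomes "n".
    Stopwords and connectors are left in place, as the README describes.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", stripped)


def build_train_command(corpus_path: str, output_prefix: str) -> List[str]:
    """Assemble a Sent2vec training command with the README's parameters.

    The command is only built, not executed. Paths are hypothetical.
    """
    return [
        "./fasttext", "sent2vec",
        "-input", corpus_path,
        "-output", output_prefix,
        "-dim", "4800",       # feature vector dimension
        "-epoch", "5000",     # training epochs
        "-thread", "200",     # worker threads
        "-wordNgrams", "2",   # n-grams
        "-lr", "0.05",        # learning rate
    ]


if __name__ == "__main__":
    print(strip_accents("La mujer tiene el pelo castaño y está sonriendo."))
    print(" ".join(build_train_command("corpus_es.txt", "celeba_sent2vec_es")))
```

Building the command as a list rather than a single string keeps it ready for `subprocess.run` without shell quoting, should someone want to launch the training from Python.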