Mark7549 committed
Commit 01b39fb · 1 Parent(s): 9ee85ae

updated FAQ

Files changed (1)
  1. app.py +158 -12
app.py CHANGED
@@ -104,7 +104,7 @@ st.markdown(
 with st.sidebar:
     st.image('images/AGALMA_logo_v2.png')
     # st.markdown('# ἄγαλμα | AGALMA')
-    selected = option_menu('ἄγαλμα | AGALMA', ["App", "About", "FAQ", "License"],
+    selected = option_menu('ἄγαλμα | AGALMA', ["App", "About", "FAQ", "Subcorpora", "License"],
         menu_icon="menu", default_index=0, orientation="vertical", styles=styles_vertical)

 if selected == "App":
@@ -404,22 +404,168 @@ if selected == "FAQ":



-    with st.expander(r"$\textsf{\Large Which models is this interface based on?}$"):
+    with st.expander(r"$\textsf{\Large What is this interface based on?}$"):
         st.write(
-            "This interface is based on five language models. \
-            Language models are statistical models of language, \
-            which store statistical information about word co-occurrence during the training phase. \
-            During training they process a corpus of texts in the target language(s). \
-            Once trained, models can be used to extract information about the language \
-            (in this interface, we focus on the extraction of semantic information) or to perform specific linguistic tasks. \
-            The models on which this interface is based are Word Embedding models."
+            "This interface is based on language models. Language models are probability distributions over \
+            words or word sequences, which store statistical information about word co-occurrences. \
+            This happens during the training phase, in which the models process a corpus of texts in the \
+            target language(s). Once trained, linguistic information can be extracted from the models, or \
+            the models can be used to perform specific linguistic tasks. In this interface, we focus on the \
+            extraction of semantic information. To that end, we created five models, corresponding to five \
+            time slices. The models on which this interface is based are so-called Word Embedding \
+            models (the specific architecture is called Word2Vec)."
         )

+    with st.expander(r"$\textsf{\Large What are Word Embeddings?}$"):
+        st.write(
+            "Word Embeddings are representations of words obtained via language modelling. More specifically, \
+            they are strings of numbers (called *vectors*) produced by a language model to \
+            represent each word in the training corpus in a multi-dimensional space. Words that are more \
+            similar in meaning will be closer to one another in this vector space (or semantic space) than \
+            words that are less similar in meaning. The term *word embeddings* is often used as a \
+            synonym of *predict models*, a type of language model introduced by Mikolov *et al.* (2013) \
+            with the Word2Vec architecture. This interface is built upon Word2Vec models."
+        )
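As a concrete illustration of what these embeddings look like, here is a minimal sketch using gensim; the model file name and the query word are assumed placeholders, not paths taken from this app:

```python
# Illustrative sketch: inspecting one slice model as a gensim Word2Vec model.
# "classical.model" and the query word are assumed placeholders.
from gensim.models import Word2Vec

model = Word2Vec.load("classical.model")   # load one diachronic slice
vector = model.wv["πατήρ"]                 # the embedding (vector) of a word
print(vector.shape)                        # -> (30,) for 30-dimensional models
print(vector[:5])                          # first few components of the vector
```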
+
     with st.expander(r"$\textsf{\Large Which corpus was used to train the models?}$"):
+        st.markdown('''
+            The five models on which this interface is based were trained on five diachronic slices of the \
+            Diorisis Ancient Greek Corpus, which is ‘a digital collection of ancient Greek texts (from \
+            Homer to the early fifth century AD) compiled for linguistic analyses’ (Vatri & McGillivray \
+            2018: 55). The Diorisis corpus contains a subset of the texts that can be found in the \
+            Thesaurus Linguae Graecae. More information about the works and authors included in each \
+            subcorpus is available [here]. '''
+        )
+
+    with st.expander(r"$\textsf{\Large How was the corpus divided into time slices?}$"):
         st.write(
-            "The five models on which this interface is based were trained on five slices of the Diorisis Ancient Greek Corpus (Vatri & McGillivray 2018)."
-        )
-
+            "The texts in the corpus were divided according to chronology. We tried to strike a balance \
+            between respecting the traditional division of Ancient Greek literature into periods and \
+            having slices of roughly comparable size. The division is the following: \
+            \
+            Archaic: beginning-500 BCE; Classical: 499-324 BCE; Hellenistic: 323-0 BCE; Early Roman: \
+            1-250 CE; Late Roman: 251-500 CE."
+        )
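Purely as an illustration of the slice boundaries listed above, a hypothetical lookup table; the dictionary name and helper function are not part of this app, and years are signed, with negative values for BCE:

```python
# Illustrative sketch: the five time slices as a lookup table (assumed, not from app.py).
TIME_SLICES = {
    "Archaic":     (None, -500),   # beginning of the corpus to 500 BCE
    "Classical":   (-499, -324),
    "Hellenistic": (-323, 0),
    "Early Roman": (1, 250),
    "Late Roman":  (251, 500),
}

def slice_for_year(year: int) -> str:
    """Return the time slice a (signed) year falls into."""
    for name, (start, end) in TIME_SLICES.items():
        if (start is None or year >= start) and year <= end:
            return name
    raise ValueError(f"year {year} is outside the covered range")
```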
+
+    with st.expander(r"$\textsf{\Large What are the theoretical assumptions behind distributional semantic models, such as Word Embeddings?}$"):
+        st.write(
+            "Computational semantics is based on the Distributional Hypothesis. According to this \
+            hypothesis, words used in similar lexical contexts (contexts of words surrounding them) will \
+            have a similar meaning. This hypothesis was famously summarized by J.R. Firth as ‘you \
+            shall know a word by the company it keeps’ (1957: xx). Phrased differently, this \
+            means that two words that occur in similar lexical contexts are probably semantically \
+            related. The words that occur in the most similar lexical contexts are referred to as \
+            nearest neighbours. This does not necessarily mean, though, that these words ever \
+            occur together. A detailed introduction to distributional semantics can be found in the book \
+            *Distributional Semantics* (Lenci & Sahlgren 2023: 3-25)."
+        )
+
+    with st.expander(r"$\textsf{\Large What are the nearest neighbours?}$"):
+        st.write(
+            "Word vectors can be used as coordinates to represent words in a geometric space, called \
+            *semantic space*. Words with similar vectors, occurring in similar contexts, are closer in the \
+            space. The nearest neighbours to a word are the words closest to it in the semantic space. \
+            Words that are close in the space are not necessarily synonyms; rather, they are in a relationship of \
+            semantic relatedness, i.e. they belong to the same semantic area. An example of neighbours \
+            in the space could be: *star – moon – sun – cloud – plane – fly – blue*."
+        )
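A minimal sketch of how nearest neighbours are typically extracted from a trained model with gensim (the model path and target word are assumed placeholders):

```python
# Illustrative sketch: nearest-neighbour extraction with gensim (paths/words are placeholders).
from gensim.models import Word2Vec

model = Word2Vec.load("classical.model")
# The ten words whose vectors are most similar (by cosine similarity) to the target word's vector.
for word, score in model.wv.most_similar("πατήρ", topn=10):
    print(f"{word}\t{score:.3f}")
```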
+
+    with st.expander(r"$\textsf{\Large Are the nearest neighbours the same as concordances?}$"):
+        st.write(
+            "No. The nearest neighbours to a target word do not necessarily occur together with it in the \
+            same context, but each of them will be found in similar lexical contexts. For example, my \
+            colleague Pete and I may often go to the same type of conferences and meet the same \
+            group of people there, but it is quite possible that Pete and I never go to the same \
+            conference at the same time. Pete and I are similar, but do not necessarily spend time \
+            together. The extraction of the nearest neighbours with word embeddings is thus different \
+            from finding concordances. The nearest neighbours cannot be extracted manually with \
+            close-reading methods."
+        )
+
+    with st.expander(r"$\textsf{\Large Which framework and parameters were used to train the models?}$"):
+        st.write(
+            "The Word2Vec models were trained using the CADE framework (Bianchi *et al.* 2020), a \
+            technique that makes word embeddings trained on different corpus slices directly comparable \
+            without requiring a separate space-alignment step. CADE was used with the following parameters: \
+            size=30, siter=5, diter=5, workers=4, sg=0, ns=20. The chosen architecture was the \
+            Continuous Bag-of-Words (the sg=0 setting above). The context taken into account for each word \
+            is the 5 words before and the 5 words after the target word."
+        )
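A hedged sketch of what training with these parameters could look like using the CADE package; the corpus file names are assumed placeholders, not the files actually used:

```python
# Illustrative sketch: CADE training with the parameters listed above (file names are assumed).
from cade.cade import CADE

aligner = CADE(size=30, siter=5, diter=5, workers=4, sg=0, ns=20)

# The "compass" is trained on the concatenation of all slices; it fixes a shared reference
# so that embeddings trained on the individual slices end up directly comparable.
aligner.train_compass("diorisis_all.txt", overwrite=False)

# Each diachronic slice is then trained against the compass and saved as a gensim model.
for slice_file in ["archaic.txt", "classical.txt", "hellenistic.txt",
                   "early_roman.txt", "late_roman.txt"]:
    aligner.train_slice(slice_file, save=True)
```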
+
+    with st.expander(r"$\textsf{\Large What is the cosine similarity value?}$"):
+        st.write(
+            "The cosine similarity is a measure of the distance between two words in the semantic space. \
+            More precisely, the cosine similarity is the cosine of the angle between the two vectors in the \
+            multi-dimensional space. The value ranges from -1 to 1. The higher the value of the cosine \
+            similarity (the closer it is to 1), the closer two words are in the semantic space. For example, \
+            according to our model, the cosine similarity value of πατήρ and μήτηρ in the Classical period \
+            is 0.93, relatively high, as we might expect for these obviously related words, while the \
+            cosine similarity value of a random pair like πατήρ and τράπεζα in the same time slice is \
+            0.12, considerably lower."
+        )
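The cosine similarity described above is simply cos(u, v) = u·v / (|u| |v|); a minimal sketch, with the model path as an assumed placeholder (gensim's similarity computes the same quantity):

```python
# Illustrative sketch: cosine similarity between two word vectors (paths are placeholders).
import numpy as np
from gensim.models import Word2Vec

model = Word2Vec.load("classical.model")
u, v = model.wv["πατήρ"], model.wv["μήτηρ"]

cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(round(float(cosine), 2))                                   # e.g. 0.93 for this pair
print(round(float(model.wv.similarity("πατήρ", "μήτηρ")), 2))    # gensim gives the same measure
```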
+
+    with st.expander(r"$\textsf{\Large What are the 3D representations?}$"):
+        st.write(
+            "The 3D representation is a way to visualize the semantic space graphically; the method used \
+            on this website is called t-SNE. Semantic spaces are multi-dimensional, with as many \
+            dimensions as there are numbers in each vector. The embeddings used for this interface only have 30 \
+            dimensions. A 3D representation reduces the dimensions to 3, to allow for graphic \
+            representation. Even though 3D representations are an effective means of making a semantic space \
+            visible, **they are not 100% accurate**, since the visualization shows a reduction of the 30 \
+            dimensions. We thus advise not to base any conclusions on the graphic representation only, \
+            but to rely on nearest-neighbour extraction and on cosine similarity."
+        )
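A minimal sketch of the dimensionality reduction described above, using scikit-learn's t-SNE; the model path and word list are assumed placeholders:

```python
# Illustrative sketch: reducing 30-dimensional vectors to 3 dimensions with t-SNE.
import numpy as np
from gensim.models import Word2Vec
from sklearn.manifold import TSNE

model = Word2Vec.load("classical.model")             # assumed placeholder path
words = ["πατήρ", "μήτηρ", "υἱός", "θυγάτηρ"]        # assumed example words
vectors = np.array([model.wv[w] for w in words])      # shape: (len(words), 30)

# perplexity must be smaller than the number of points being projected
coords_3d = TSNE(n_components=3, perplexity=2, random_state=0).fit_transform(vectors)
print(coords_3d.shape)                                # -> (4, 3)
```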
+
+    with st.expander(r"$\textsf{\Large Is the information stored by Word Embeddings reliable?}$"):
+        st.write(
+            "The information stored in word embeddings is solely based on the training corpus. This \
+            means that our models have no additional knowledge of the Ancient Greek language and \
+            culture. All information extracted from a model thus reflects word co-occurrences, and word \
+            meaning, in its specific training corpus. \
+            \
+            Please take into account that the results for words occurring very rarely may be inaccurate. \
+            Language modelling works on a statistical basis, so a word with only a few occurrences \
+            may not provide enough evidence to obtain reliable results. It has also been observed that an \
+            extremely high word frequency can affect the results. It often happens that the nearest \
+            neighbours to words occurring very often are other high-frequency words, such as stop \
+            words (e.g., prepositions, articles, particles)."
+        )
+
+    with st.expander(r"$\textsf{\Large What if I obtain 'strange' results?}$"):
+        st.write(
+            "For the reasons mentioned above, word embeddings are not always reliable \
+            methods of semantic investigation. Interpretation of the results is always needed to decide \
+            whether the results at hand are real patterns present in the corpus, which could reveal \
+            interesting phenomena, or just noise in the data."
+        )
+
+    with st.expander(r"$\textsf{\Large How can word embeddings help us study semantic change?}$"):
+        st.write(
+            "Cosine similarity can be computed between vectors of the same word in different time slices. \
+            The higher the cosine similarity, the more similar the usage of a word is in the two considered \
+            time slices. If the cosine similarity between a word’s vectors in two consecutive time slices is \
+            particularly low, there is a chance that semantic change happened at that point in time. The \
+            analysis of the nearest neighbours to the target word in the two slices can help clarify whether \
+            change actually happened, and what its direction is."
+        )
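A sketch of the cross-slice comparison described above; the model paths and target word are assumed placeholders, and the comparison is meaningful only because CADE aligns the slice models:

```python
# Illustrative sketch: comparing the same word across two aligned slice models.
import numpy as np
from gensim.models import Word2Vec

classical = Word2Vec.load("classical.model")        # assumed placeholder paths
hellenistic = Word2Vec.load("hellenistic.model")

u, v = classical.wv["λόγος"], hellenistic.wv["λόγος"]
similarity = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(f"cross-slice cosine similarity: {similarity:.2f}")   # low values hint at semantic change

# The nearest neighbours in each slice help interpret whether and how the meaning shifted.
print(classical.wv.most_similar("λόγος", topn=5))
print(hellenistic.wv.most_similar("λόγος", topn=5))
```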
+
+    st.markdown("""
+    ## References
+
+    Bianchi, F., Di Carlo, V., Nicoli, P., & Palmonari, M. (2020). Compass-aligned distributional
+    embeddings for studying semantic differences across corpora. *arXiv preprint arXiv:2004.06519*.
+
+    Lenci, A., & Sahlgren, M. (2023). *Distributional Semantics*. Cambridge University Press.
+
+    Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word
+    representations in vector space. *arXiv preprint arXiv:1301.3781*.
+
+    Vatri, A., & McGillivray, B. (2018). The Diorisis Ancient Greek Corpus: Linguistics and
+    literature. *Research Data Journal for the Humanities and Social Sciences*, 3(1), 55-65.
+    """)
+
+

 if selected == "License":
     st.markdown("""