SIG on Iranian languages

Recent Activity

kargaranamir updated a Space 1 day ago
sigiranic/README
kargaranamir published a Space 1 day ago
sigiranic/README

sigiranic's activity

kargaranamir
posted an update 8 months ago
Introducing GlotCC: a new 2TB corpus based on an early 2024 CommonCrawl snapshot with data for 1000+ languages.

🤗 corpus v1: cis-lmu/GlotCC-V1
🐱 pipeline v3: https://github.com/cisnlp/GlotCC

More details? Stay tuned for our upcoming paper.
More data? In the next version, we plan to include additional snapshots of CommonCrawl.

Limitation: Because low-resource languages appear far less often in the crawl than high-resource ones, some very low-resource languages have only a few sentences available. By contrast, English accounts for 750GB in this version, and the top 200 languages are still well represented in our data (see plot attached; we write an index label for every 20 languages, meaning the 10th index is the 200th language).
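To make the plot's indexing concrete, here is a minimal sketch of the arithmetic described above. The function name `tick_to_rank` and the constant `LANGUAGES_PER_TICK` are hypothetical, introduced only for illustration; the only facts taken from the post are the step of 20 languages per index and the 1-based counting implied by "the 10th index is the 200th language".

```python
# Assumption: index labels are 1-based and spaced every 20 languages,
# as stated in the post ("the 10th index is the 200th language").
LANGUAGES_PER_TICK = 20


def tick_to_rank(tick_index: int, step: int = LANGUAGES_PER_TICK) -> int:
    """Return the language rank marked by a 1-based index label on the plot."""
    if tick_index < 1:
        raise ValueError("tick_index is 1-based and must be >= 1")
    return tick_index * step


if __name__ == "__main__":
    # The 10th index label corresponds to the 200th language.
    print(tick_to_rank(10))
```

Reading the plot this way, the 1st index marks the 20th language and the 10th index marks the 200th.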
kargaranamir
posted an update 12 months ago