ParsBERT
A monolingual language model based on Google’s BERT architecture.
🤗 ParsBERT: Transformer-based Model for Persian Language Understanding
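For a quick start, here is a minimal usage sketch with the Hugging Face `transformers` library; the checkpoint ID `HooshvareLab/bert-base-parsbert-uncased` is an assumption for illustration and may differ from the checkpoint released with this repo.

```python
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint ID; substitute the checkpoint actually
# released with this repo if it differs.
model_id = "HooshvareLab/bert-base-parsbert-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Encode a short Persian sentence and run it through the encoder.
inputs = tokenizer("زبان فارسی", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```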
ParsBERT was trained on a massive amount of public corpora (Persian Wikipedia dumps, MirasText) and six other manually crawled text datasets from various types of websites:
- BigBang Page (scientific)
- Chetor (lifestyle)
- Eligasht (itinerary)
- Digikala (digital magazine)
- Ted Talks (general conversational)
- Books (novels, storybooks, and short stories from the old to the contemporary era)
As part of the ParsBERT methodology, an extensive pre-processing step combining POS tagging and WordPiece segmentation was carried out to bring the corpora into a proper format.
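As a small illustration of the WordPiece side of that pipeline, the sketch below shows how the tokenizer splits words into subword pieces; it reuses the same assumed checkpoint ID as above, and the POS-tagging stage, which relies on an external tagger, is not reproduced here.

```python
from transformers import AutoTokenizer

# Same assumed checkpoint ID as in the quick-start sketch above.
tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")

# WordPiece keeps in-vocabulary words whole and splits
# out-of-vocabulary words into subword units prefixed with "##".
print(tokenizer.tokenize("او به دانشگاه می‌رود"))
```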
See the rest of the repo for more details.
Paper link: https://doi.org/10.1007/s11063-021-10528-4
References
2020
- ParsBERT: Transformer-based Model for Persian Language Understanding. Neural Processing Letters, 2020.