ParsBERT
A monolingual language model based on Google’s BERT architecture.
🤗 ParsBERT: Transformer-based Model for Persian Language Understanding
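For a quick start, here is a minimal usage sketch with the Hugging Face `transformers` library; the checkpoint ID `HooshvareLab/bert-base-parsbert-uncased` is an assumption for illustration and may differ from the checkpoint released with this repo.

```python
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint ID; substitute the checkpoint actually
# released with this repo if it differs.
model_id = "HooshvareLab/bert-base-parsbert-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Encode a short Persian sentence and run it through the encoder.
inputs = tokenizer("زبان فارسی", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```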
ParsBERT was trained on a massive amount of public corpora (Persian Wikipedia dumps, MirasText) and six other manually crawled text datasets from various types of websites:
- BigBang Page (scientific)
- Chetor (lifestyle)
- Eligasht (itinerary)
- Digikala (digital magazine)
- Ted Talks (general conversational)
- Books (novels, storybooks, and short stories from the old to the contemporary era)
As part of the ParsBERT methodology, an extensive pre-processing step combining POS tagging and WordPiece segmentation was carried out to bring the corpora into a proper format.
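As a small illustration of the WordPiece side of that pipeline, the sketch below shows how the tokenizer splits words into subword pieces; it reuses the same assumed checkpoint ID as above, and the POS-tagging stage, which relies on an external tagger, is not reproduced here.

```python
from transformers import AutoTokenizer

# Same assumed checkpoint ID as in the quick-start sketch above.
tokenizer = AutoTokenizer.from_pretrained("HooshvareLab/bert-base-parsbert-uncased")

# WordPiece keeps in-vocabulary words whole and splits
# out-of-vocabulary words into subword units prefixed with "##".
print(tokenizer.tokenize("او به دانشگاه می‌رود"))
```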
See the rest of the repo for more details.
Paper link: https://doi.org/10.1007/s11063-021-10528-4
References
2020
- ParsBERT: Transformer-based Model for Persian Language Understanding. Neural Processing Letters, 2020.