Zabanshenas
is a solution for identifying the most likely language of a piece of written text.
Zabanshenas
A Transformer-based solution for identifying the most likely language of a written document/text. Zabanshenas is a Persian word that has two meanings:
- A person who studies linguistics.
- A way to identify the type of written language.
In this repository, I will use another perspective in creating a language detection model using Transformers. Nowadays, Transformers have played a massive role in Natural Language Processing fields. In short, Transformers uses an attention mechanism to boost the speed and extract a high level of information (abstraction).
There are plenty of ways, solutions, and packages to find the language of a written piece of text or document. All of them have their pros and cons. Some able to detect faster and support as many languages as possible. However, in this case, I intend to use Transformers to understand similar groups of languages and cover 235 languages thanks to WiLI-2018 and the Transformer architecture.
This model can detect a written language in three different stages: paragraph, sentence, and subset of text between three and four tokens.