Certainly Releases the First-Ever Norwegian BERT Model, Improves the Danish Model, and Starts a Model Zoo Initiative
Certainly’s open-source Danish BERT Model has sparked quite a bit of interest. Danish newspaper Børsen wrote an article about it, and many Danish data scientists have been involved in discussions about it on GitHub.
Many of our customers here at Certainly are also running experiments and are using the models for different projects.
Today, Certainly’s data science team has released the first-ever BERT model trained on Norwegian language data.
The main aim of this release is to help data scientists in Norway build state-of-the-art Natural Language Processing solutions.
We encourage Norwegian data scientists and managers to reach out to us, just as the Danish community did.
Today, we are also releasing an improved version of the Danish model. You can find both the updated Danish BERT model and the new Norwegian BERT model in the same GitHub repository.
Why Release a Norwegian BERT Model?
The Norwegian language is only spoken in Norway, where there are approximately 4.6 million native speakers.
As with Danish, this means that the language is often overlooked when Natural Language Processing tools are built.
By open-sourcing a Norwegian BERT model, we hope to help the community build their own Natural Language Processing solutions.
Conversational AI from Certainly supports Norwegian out of the box.
By using our prebuilt intents for Norwegian, it’s easy to build a personalized, state-of-the-art chatbot.
How Do We Train the Models?
We train BERT models on a new kind of computer chip called a TPU, short for Tensor Processing Unit.
This kind of chip is excellent at “Tensor” operations, which makes it well suited to training Deep Neural Networks.
In the same way that a “Vector” is a list of numbers and a “Matrix” is a rectangle of numbers, a “Tensor” is just a fancy word for a box of numbers.
A 1-dimensional tensor is a vector, a 2-dimensional tensor is a matrix, and anything with more dimensions, such as a box, is simply called a tensor.
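To make the terminology concrete, here is a minimal sketch in plain Python that represents tensors as nested lists and reports how many dimensions each one has. The helper name `tensor_shape` is an illustrative assumption and not part of any library; real training code would use a dedicated tensor library rather than nested lists.

```python
def tensor_shape(tensor):
    """Return the shape of a regular nested list as a tuple of sizes.

    Assumes every sub-list at the same depth has the same length.
    """
    shape = []
    current = tensor
    # Walk down the first element at each level, recording each length.
    while isinstance(current, list):
        shape.append(len(current))
        if not current:
            break
        current = current[0]
    return tuple(shape)


vector = [1, 2, 3]                            # 1 dimension: a list of numbers
matrix = [[1, 2, 3], [4, 5, 6]]               # 2 dimensions: a rectangle of numbers
box = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]    # 3 dimensions: a box of numbers

for name, value in [("vector", vector), ("matrix", matrix), ("box", box)]:
    shape = tensor_shape(value)
    print(name, "has", len(shape), "dimension(s) and shape", shape)
```

The number of entries in the shape is the number of dimensions, so the vector, the matrix, and the box have one, two, and three dimensions respectively.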
Renting Google’s TPUs – which is the only way to access them – costs a lot of money.
Because of that cost, it is important to make the algorithms run as fast as possible.
Where Does the Training Data Come From?
We use text fetched from the internet to train our BERT models.
The non-profit organization Common Crawl periodically gathers huge amounts of data from the internet.
By automatically detecting the language of the text, we can create a data set of Norwegian data.
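The idea of filtering by detected language can be sketched as follows. This is only a toy illustration under stated assumptions: the word lists, the scoring rule, and the names `COMMON_WORDS`, `guess_language`, and `keep_norwegian` are all hypothetical, and the article does not say which language detector is actually used. A production pipeline would rely on a proper language identification model.

```python
# Tiny, hand-picked word lists; purely illustrative, not a real language model.
COMMON_WORDS = {
    "norwegian": {"og", "ikke", "jeg", "det", "er", "på", "som", "til", "av", "en"},
    "english": {"the", "and", "not", "is", "of", "to", "in", "that", "it", "a"},
}


def guess_language(text):
    """Return the language whose common words appear most often, or None."""
    words = text.lower().split()
    scores = {
        language: sum(word in vocabulary for word in words)
        for language, vocabulary in COMMON_WORDS.items()
    }
    # On a tie, max() keeps the first language listed in COMMON_WORDS.
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def keep_norwegian(documents):
    """Keep only the documents guessed to be Norwegian."""
    return [doc for doc in documents if guess_language(doc) == "norwegian"]


documents = [
    "Jeg er ikke sikker på det",
    "The weather is nice and it is not cold",
    "Det er en fin dag og jeg er glad",
]
norwegian_only = keep_norwegian(documents)
print(norwegian_only)
```

Counting common words is far too crude to separate closely related languages such as Norwegian and Danish, which is one reason a real detector is needed at this step.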
Because reading through such vast amounts of data takes a long time, we run our algorithms on multiple computers at once.
We also make sure that the algorithms themselves are extremely fast!
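One simple way to split such work across several machines is round-robin sharding, where each worker processes every n-th input file. The sketch below is an assumption for illustration only: the file names and the function `shard_for_worker` are hypothetical, and the article does not describe how the work is actually distributed.

```python
def shard_for_worker(file_names, worker_index, num_workers):
    """Return the files assigned to one worker using round-robin sharding."""
    return [
        name
        for position, name in enumerate(file_names)
        if position % num_workers == worker_index
    ]


# Hypothetical input files; real crawl data uses different names and formats.
crawl_files = [f"crawl-part-{i:03d}.txt" for i in range(10)]
num_workers = 3

assignments = [
    shard_for_worker(crawl_files, worker, num_workers)
    for worker in range(num_workers)
]
for worker, files in enumerate(assignments):
    print("worker", worker, "processes", len(files), "files")
```

Because each file lands in exactly one shard, the workers never need to coordinate while they run, and their outputs can simply be concatenated afterwards.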
What Are We Going to Do Next?
Now that we have released a Norwegian model, we are going to target other Nordic languages, including Swedish and Finnish.
However, since NLP research is progressing so rapidly, it is becoming increasingly challenging to maintain a repository of models that are up to date with state-of-the-art research.
That is why we have decided to pick a different strategy.
Rather than releasing more European models, we are going to release our data sets formatted for training new BERT models in many different languages.
Importantly, we hope that we can get the European NLP community to help us train models that are up to date with state-of-the-art General Purpose Language Models.
Please share this article and remember to check the blog regularly for updates on our new Model Zoo initiative!
Article written by Jens Dahl Møllerhøj