Training Naive Bayes… Really Fast
Performance tuning in Julia
In a recent lecture I demonstrated to my students how a Multinomial Naive Bayes (MNB) model can be used for document classification. As an example I used the Enron Email Dataset in order to create a spam filter based on such a model. The version of the dataset used consists of 33,716 emails, categorised as “spam” or “ham” (i.e. not spam).
We chose the MultinomialNBClassifier from Julia’s MLJ.jl package and, for data preparation, the CountTransformer from the same ecosystem (it lives in the MLJText package). I was quite surprised that it took more than 30 minutes to train this classifier using the whole dataset (on an Apple M3, 16 GB RAM).
Typically only a part of a dataset is used for training as the rest is needed for testing. Using just 70% of the dataset for this purpose still took more than 10 minutes. The 33,716 emails are admittedly more than a simple textbook example, but on the other hand NB models are known for low training costs.
Therefore I began to investigate why it takes so long and if there are ways to make things faster. In the following sections I will present the performance tuning measures I’ve applied and the speedup which could be achieved. These measures are not very specific to this problem and thus should also be applicable in other situations.
Note: All implementations and benchmarks are done using Julia 1.10.3 on an M3 MacBook Pro with 16 GB RAM. The utilised Julia packages are MLJ 0.20.0, TextAnalysis 0.7.5, Metal 1.1.0, CategoricalArrays 0.10.8 and BenchmarkTools 1.5.0.
Training a Multinomial Naive Bayes model
But first, let me introduce the main steps necessary to train an MNB, in order to understand which algorithm has to be optimised. There are:
- a data preparation step which converts the documents (in our case the emails) to an adequate data structure (a so-called document term matrix; DTM) and
- the actual training step where the DTM is aggregated into a vector for each class (spam or ham)
Data Preparation
Documents for use with an MNB are represented as “bags of words”. I.e. the order of the words within the document is considered irrelevant and only the number of occurrences of each word is stored. So the sentence “the cow eats grass” is in this representation equivalent to “eats the cow grass” or “grass eats the cow”.
In order to convert all documents into this form using a memory efficient representation, a dictionary of all words that occur in the documents is created (it’s basically an array of all words). Let’s say we have the following documents D1, D2 and D3:
- D1: “the grey cat lies on the grass”
- D2: “the cow eats grass”
- D3: “the cat is grey”
Then the dictionary is as follows: [“the”, “grey”, “cat”, “lies”, “on”, “grass”, “cow”, “eats”, “is”] as there are nine different words in the three documents.
Each document is then represented as an array of the same length as the dictionary and each element within this array is the number of occurrences of the corresponding word in the dictionary. So D1, D2 and D3 will have the form:
- D1: [2, 1, 1, 1, 1, 1, 0, 0, 0] as e.g. the first word in the dictionary (“the”) occurs twice, the second word (“grey”) occurs once and so on
- D2: [1, 0, 0, 0, 0, 1, 1, 1, 0]
- D3: [1, 1, 1, 0, 0, 0, 0, 0, 1]
If we combine these arrays into a matrix (one row for each document), we get the above-mentioned document term matrix (DTM). In our case it is a 3 x 9 matrix, as we have three documents and a dictionary consisting of nine different words.
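To make this concrete, here is a minimal sketch of how such a DTM could be built by hand for the three toy documents (for illustration only; for the real dataset we will of course use a library):

docs = [["the","grey","cat","lies","on","the","grass"],
        ["the","cow","eats","grass"],
        ["the","cat","is","grey"]]
dictionary = unique(reduce(vcat, docs))   # ["the","grey","cat","lies","on","grass","cow","eats","is"]
dtm = [count(==(word), doc) for doc in docs, word in dictionary]
# a 3 x 9 Matrix{Int64}; the first row is [2 1 1 1 1 1 0 0 0], as derived above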
Training
The training of the MNB consists basically of adding all the document vectors, separated by class. I.e. in our spam-example we have to add all document vectors for “spam” and all for “ham” resulting in two vectors, each containing the summarised word frequencies for the respective class.
If we assume that the documents D1 and D3 are “ham” and D2 is “spam”, we would get the following results:
- “ham” word frequencies: [3, 2, 2, 1, 1, 1, 0, 0, 1]
- “spam” word frequencies: [1, 0, 0, 0, 0, 1, 1, 1, 0]
In a complete training step for a MNB there is some additional post-processing of these numbers, but the “expensive” part, which we want to optimise, is the aggregation of the DTM as shown here.
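For the toy example this aggregation boils down to a few lines (a sketch which reuses the dtm matrix and the class assignment from above):

labels = ["ham", "spam", "ham"]   # D1 and D3 are "ham", D2 is "spam"
ham_freqs  = vec(sum(dtm[labels .== "ham",  :], dims=1))   # [3, 2, 2, 1, 1, 1, 0, 0, 1]
spam_freqs = vec(sum(dtm[labels .== "spam", :], dims=1))   # [1, 0, 0, 0, 0, 1, 1, 1, 0]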
Starting with the Enron dataset
Data Preparation
I created the DTM for the Enron dataset using the CountTransformer (which is loaded via MLJ from the MLJText package) with the following function:
using MLJ   # provides @load, machine, fit! and transform

function transform_docs(doc_list)
    CountTransformer = @load CountTransformer pkg=MLJText
    trans_machine = machine(CountTransformer(), doc_list)
    fit!(trans_machine)
    return(MLJ.transform(trans_machine, doc_list))
end
The input doc_list to this function is an array of the tokenised emails, i.e. each mail has been split into an array of single-word strings (using TextAnalysis.tokenize()).
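Roughly, this preprocessing step could look as follows; raw_mails is a hypothetical vector holding the email bodies as plain strings (the exact tokenizer signature may differ slightly):

using TextAnalysis
# raw_mails is an assumed input; each mail becomes a vector of word strings
doc_list = [TextAnalysis.tokenize(mail) for mail in raw_mails]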
This results in a 33,716 x 159,093 matrix, as there are 33,716 emails and the dictionary consists of 159,093 different words. This is a matrix with more than 5.3 billion elements. Surprisingly the creation of the DTM took less than a minute. So the focus of the performance tuning will be exclusively on the training step.
As the majority of elements of a DTM are 0, a so-called sparse matrix is used to store them in a memory efficient way (in Julia this is the SparseMatrixCSC type).
To be exact, the CountTransformer produces a data structure of type LinearAlgebra.Adjoint{Int64,SparseMatrixCSC{Int64, Int64}}. We will come to this special structure later.
Training
Training the MultinomialNBClassifier is then done as follows, with X containing the DTM and y being the array of spam/ham labels (as a CategoricalArray, since MLJ classifiers expect this type):
MultinomialNBClassifier = @load MultinomialNBClassifier pkg=NaiveBayes
nb_classifier = MultinomialNBClassifier()
nb_machine = machine(nb_classifier, X, y)
fit!(nb_machine, verbosity=0)
The call to fit! does the actual training and took more than 30 minutes for all Enron mails and more than 10 minutes for a 70%-subset.
In order to focus on the analysis and optimisation of the training step, I’m starting with my own implementation of a function that does the above-mentioned aggregation of all document vectors into two vectors containing the summarised word frequencies for “spam” and “ham”. The respective code of the MultinomialNBClassifier has too many dependencies, which makes it much less suitable for demonstrating the following optimisation steps.
A first baseline approach for this function (called count_words) looks as follows:
using CategoricalArrays   # provides levels() and levelcode()

function count_words_base(X::AbstractMatrix{Int64}, y)
    ndocs = size(X,1)            # number of documents
    nwords = size(X,2)           # number of words in dictionary
    ncats = length(levels(y))    # number of categories in `y`
    wcounts = ones(Int64, ncats, nwords)  # matrix for storing the word counts by category
    for col in 1:nwords
        for doc in 1:ndocs
            if y[doc] == "ham"
                wcounts[1,col] += X[doc, col]
            else
                wcounts[2,col] += X[doc, col]
            end
        end
    end
    return(wcounts)
end
Applied to X and y it takes 241.076 seconds to complete.
To reduce the runtime of the test runs and to prevent memory from becoming the decisive factor, I’ve done all further tests (unless stated otherwise) on a part of the DTM (called Xpart) limited to the first 10,000 columns (i.e. a 33,716 x 10,000 matrix).
For this reduced DTM count_words_base needs 20.363 seconds to complete.
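The runtimes reported here can be measured with BenchmarkTools (listed in the package versions above); a typical measurement looks like this:

using BenchmarkTools
@btime count_words_base($Xpart, $y)   # `$` interpolates the globals, so their lookup is not measured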
OPT1: Use the right data structures in the right way
An important aspect of performance tuning is the choice of data structures and the question of whether they are used in the most efficient manner.
Column-first Storage
In this sense, count_words_base already uses an optimisation. In Julia a matrix is stored in column-first (column-major) order, i.e. the elements of each column are stored close to each other in memory. So iterating over all elements of one column is faster than iterating over the elements of a row. Therefore the inner loop in count_words_base iterates over a column of X.
Column-first order storage is common practice in Julia. It also holds e.g. for a SparseMatrixCSC or a DataFrame. But it’s always a good idea to check which storage order a data structure uses.
CategoricalArrays
The if-statement in count_words_base is executed for each element of the DTM. So it surely helps to optimise this part of the function.
The parameter y is not a “normal” array which would store the words “ham” or “spam” 33,716 times. It is a CategoricalArray which stores these two words exactly once and uses internally an array of integers to store the 33,716 different “ham” and “spam” values (which are represented by the numbers 1 and 2). We can access this numerical representation using the function levelcode. So y[1] results in “ham”, whereas levelcode(y[1]) gives us 1.
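A minimal illustration of this behaviour (y_small is just a tiny stand-in for y):

using CategoricalArrays
y_small = categorical(["ham", "spam", "ham"])
y_small[1]              # the CategoricalValue "ham"
levelcode(y_small[1])   # 1
levelcode(y_small[2])   # 2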
Therefore we can replace the whole if-statement by the following single line (resulting in the first optimised version count_words_01):
wcounts[levelcode(y[doc]),col] += X[doc, col]
This gives us a runtime of 18.006 s which is an improvement of about 10%.
A more efficient Matrix
Often memory efficient data structures are less efficient when it comes to accessing their elements. So I suspected that a (dense) matrix (i.e. a 2-dimensional Array) might be more performant than the sparse matrix used for the DTM.
As a point of reference I created a dense matrix Xref (filled with random numbers) of the same size as Xpart: Xref = rand(0:9, 33716, 10000).
This matrix has the following runtimes:
- count_words_base: 2.378 s
- count_words_01: 0.942 s
So there must be a real problem with the DTM produced by CountTransformer. With the dense matrix, the baseline implementation alone gives us a speedup of more than 8x, and the optimisation used in count_words_01 is even more effective in this case, reducing the runtime to less than half of the baseline number!
As mentioned above, CountTransformer doesn’t produce an actual SparseMatrixCSC but a LinearAlgebra.Adjoint{Int64,SparseMatrixCSC{Int64, Int64}}, i.e. the sparse matrix is wrapped in another structure. This could be a problem. Therefore I tried to extract the actual sparse matrix … which proved to be difficult and expensive: it takes almost 17 s to do this!
But the resulting “pure” sparse matrix is much more efficient:
- count_words_base: 3.22 s
- count_words_01: 1.435 s
As we have to add almost 17 s for the extraction to these numbers, this doesn’t really improve the process as a whole. So I was looking for alternatives and found one within the TextAnalysis package, which also has a function to create a DTM. The creation is as performant as with CountTransformer, but it produces a “pure” sparse matrix directly.
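A sketch of this alternative, assuming doc_list again holds the tokenised mails (the names are those of the TextAnalysis API):

using TextAnalysis
crps = Corpus(TokenDocument.(doc_list))   # wrap each token vector in a TokenDocument
update_lexicon!(crps)                     # build the dictionary of all words
X = dtm(DocumentTermMatrix(crps))         # a plain SparseMatrixCSC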
So we get the runtime numbers for the sparse matrix without having to add another 17 s. This results in a speedup at this point of 20.363/1.435 = 14.2.
OPT2: Multithreading
With Julia it is relatively easy to use multithreading, especially in our case, where we iterate over a data structure and each iteration accesses a different part of it. So each iteration can potentially be executed in a separate thread without having to worry about data access conflicts.
In this setting we just have to put the macro @threads in front of the for-statement and Julia does the rest for us, i.e. it distributes the different iterations to the threads available on a particular machine. As the M3 chip has eight cores, I’ve set the JULIA_NUM_THREADS environment variable to 8 and changed the for-loop part of the count_words function as follows (resulting in the next optimised version count_words_02):
@threads for col in 1:nwords
    for doc in 1:ndocs
        wcounts[levelcode(y[doc]),col] += X[doc, col]
    end
end
This gives us a runtime of 231 ms which is a speedup of 20.363/0.231 = 88.2.
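Whether the threads are actually available can be checked from within Julia:

using Base.Threads
nthreads()   # 8, if JULIA_NUM_THREADS=8 was set before starting Julia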
OPT3: GPU and Matrix Operations
Getting even more performance is often achieved by using the GPU. But this can only be done if the algorithm fits the rather special computing structure of a GPU. Ideally your algorithm should be made up of vector and matrix operations. So let’s explore whether our count_words function can be adapted this way.
Filtering Rows
Our example from above with just three documents D1, D2 and D3 is perhaps a good starting point to get a better understanding. X and y for that simple example are as follows:
X = [2 1 1 1 1 1 0 0 0;
     1 0 0 0 0 1 1 1 0;
     1 1 1 0 0 0 0 0 1]

y = ["ham", "spam", "ham"]
The function count_words adds the numbers in the columns, but only for specific rows. In this example, rows 1 and 3 are added first (for “ham”) and then row 2 (for “spam”). I.e. we need sort of a filter for the rows and then we can just sum up the columns.
In Julia it is possible to index an array using a BitArray (or any array of Bool values). I.e. X[Bool[1,0,1],:] will give us rows 1 and 3 of X and X[Bool[0,1,0],:] gives us row 2. We can get these “filters”, if we replace “ham” and “spam” in y by ones and zeros and convert it to the following matrix:
yb = [1 0;
0 1;
1 0]
So yb[:,1] would be the first filter and yb[:,2] the second one.
For the spam model we can convert the CategoricalArray y to such a bit matrix with the following function (y.refs is the internal representation using just integers):
function y_as_bitmatrix(y)
    spam = y.refs .== 2
    ham  = y.refs .== 1
    return([ham spam])   # Bit-Matrix (one column per category)
end
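Applied to a small stand-in for y this gives:

using CategoricalArrays
y_small = categorical(["ham", "spam", "ham"])
y_as_bitmatrix(y_small)   # 3x2 BitMatrix with columns (ham, spam): [1 0; 0 1; 1 0]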
Using this representation of y we can implement count_words as follows:
function count_words_03(X::AbstractMatrix{Int64}, y::BitMatrix)
    nwords = size(X,2)    # number of words in dictionary
    ncats = size(y,2)     # number of categories in `y`
    wcounts = ones(Int64, ncats, nwords)  # matrix for storing the word counts by category
    for cat in 1:ncats
        @threads for col in 1:nwords
            wcounts[cat,col] = sum(X[y[:,cat],col])
        end
    end
    return(wcounts)
end
This variant has a runtime of 652 ms (on CPU), so it is not faster than our last version above, but we are still exploring.
Dot Product
Let’s go back again to the simple three document example:
X = [2 1 1 1 1 1 0 0 0;
     1 0 0 0 0 1 1 1 0;
     1 1 1 0 0 0 0 0 1]

yb = [1 0;
      0 1;
      1 0]
We can also achieve our goal if we compute the dot product of each column of X first with the first column of yb and then with the second column of yb. This leads to count_words_04:
using LinearAlgebra   # provides `dot`

function count_words_04(X::AbstractMatrix{Int64}, y::BitMatrix)
    nwords = size(X,2)    # number of words in dictionary
    ncats = size(y,2)     # number of categories in `y`
    wcounts = ones(Int64, ncats, nwords)  # matrix for storing the word counts by category
    for cat in 1:ncats
        @threads for col in 1:nwords
            wcounts[cat,col] = dot(X[:,col], y[:,cat])
        end
    end
    return(wcounts)
end
This results in a runtime of 4.96 ms (on CPU) which is now a speedup of 20.363/0.00496 = 4,105.4!
This drastic improvement needs perhaps some explanation. Two things go here hand in hand:
- Vector operations like the dot product are super optimised in Julia relying on proven libraries like BLAS.
- The sparse matrix type is very efficient in this context. Our dense reference matrix Xref needs 455.7 ms for the same operation, i.e. here the sparse DTM is considerably faster.
Matrix Multiplication
Taking the ideas from above a bit further we can represent yb in its transposed form as follows:
ybt = [1 0 1;
       0 1 0]

X = [2 1 1 1 1 1 0 0 0;
     1 0 0 0 0 1 1 1 0;
     1 1 1 0 0 0 0 0 1]
This depiction makes the shortest and probably most elegant version of count_words more or less obvious. It is just a matrix multiplication:
function count_words_05(X::AbstractMatrix{Int64}, y::BitMatrix)
    transpose(y) * X
end
It is also the fastest version with 1.105 ms, leading to a speedup of 20.363/0.001105 = 18,428!
Multithreading is here implicit as the underlying BLAS library is by default multithreaded. The number of threads used can be obtained by BLAS.get_num_threads().
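For reference, querying (and, if needed, adjusting) the BLAS thread count looks like this:

using LinearAlgebra
BLAS.get_num_threads()    # number of threads the BLAS library currently uses
BLAS.set_num_threads(8)   # can be changed explicitly if desired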
Moreover, this solution scales well. Applied to the complete dataset (the 33,716 x 159,093 matrix X), it takes 13.57 ms to complete. This is a speedup of 241.076/0.01357 = 17,765.
OPT4: GPU
Finally, applying the last variant to the GPU can be done using the Metal.jl package. For this purpose the matrices only have to be converted to the corresponding Metal array type using the mtl function:
const mtl_Xpart = mtl(Xpart)
const mtl_yb = mtl(yb)
The count_words variant for the GPU is, apart from the data types, the same as above:
function count_words_06(X::MtlMatrix, y::MtlMatrix)
    transpose(y) * X
end
Its runtime is only 0.306 ms. But copying the data to the GPU (using mtl) takes much longer than the time gained by running the algorithm on the GPU. So it’s not really faster.
Apart from that, the Metal-package for Apple silicon GPUs is not quite as mature as e.g. CUDA.jl. This becomes visible when trying to convert the large matrix X to a metal array: The conversion stops with an error message.
Conclusion
Of course not every algorithm can be converted to such a concise variant as we have in count_words_05. But even the more “classic” implementation count_words_04 is more than 4,000 times faster than our starting point. Many of the performance tuning measures presented in this article can be applied to other functions too. Apart from this, I would recommend that anyone who wants to get more performance out of a Julia program follow the “Performance Tips” in the Julia documentation.