Fast String Matching in Python
We start by loading the set of company names from the SEC EDGAR database:

import pandas as pd

pd.set_option('display.max_colwidth', -1)
names = pd.read_csv('data/sec_edgar_company_info.csv')
print('The shape: %d x %d' % names.shape)
names.head()
   Line Number  Company Name                        Company CIK Key
0  1            !J INC                              1438823
1  2            #1 A LIFESAFER HOLDINGS, INC.       1509607
2  3            #1 ARIZONA DISCOUNT PROPERTIES LLC  1457512
3  4            #1 PAINTBALL CORP                   1433777
4  5            $ LLC                               1427189
TF-IDF
TF-IDF is a method to generate features from text by multiplying the frequency of a term (usually a
word) in a document (the Term Frequency, or TF) by the importance (the Inverse Document
Frequency, or IDF) of the same term in an entire corpus. The IDF factor weights down less important
words (e.g. "the", "it", "and") and weights up words that occur in only a few documents. IDF is calculated as:
IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
An example (from www.tfidf.com/):
Consider a document containing 100 words in which the word cat appears 3 times. The term
frequency (i.e., TF) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and
the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., IDF) is
calculated as log(10,000,000 / 1,000) = 4 (using a base-10 logarithm). Thus, the TF-IDF weight is the
product of these quantities: 0.03 * 4 = 0.12.
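The arithmetic of this example is easy to check in Python (a minimal sketch; note that the example above uses a base-10 logarithm):

import math

tf = 3 / 100                          # term frequency of "cat"
idf = math.log10(10_000_000 / 1_000)  # inverse document frequency: 4.0
print(tf * idf)                       # 0.12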
TF-IDF is very useful in text classification and text clustering. It is used to transform documents
into numeric vectors that can easily be compared.
N-Grams
While the terms in TF-IDF are usually words, this is not a necessity. In our case using words as
terms wouldn’t help us much, as most company names only contain one or two words. This is why
we will use n-grams: sequences of N contiguous items, in this case characters. The following
function cleans a string and generates all n-grams in this string:
import re
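# The body of the ngrams function did not survive extraction; the sketch below is an
# assumption consistent with the description above: it strips a few noisy characters
# and returns all overlapping character n-grams of the cleaned string.
def ngrams(string, n=3):
    string = re.sub(r'[,-./]', r'', string)           # remove punctuation that adds noise
    ngrams = zip(*[string[i:] for i in range(n)])     # all windows of n consecutive characters
    return [''.join(ngram) for ngram in ngrams]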
ngrams('!J INC')
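The step that turns the company names into a TF-IDF matrix is not shown above. A minimal sketch of that step uses scikit-learn's TfidfVectorizer with the ngrams function as analyzer (the names company_names and vectorizer are assumptions; tf_idf_matrix is the variable used further below):

from sklearn.feature_extraction.text import TfidfVectorizer

company_names = names['Company Name']
# use character n-grams instead of words as terms
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(company_names)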
A few entries of one row of the resulting sparse TF-IDF matrix, in (row, column) weight format:
(0, 11)      0.844099068282
(0, 16196)   0.51177784466
(0, 15541)   0.159938115034
Calculating the cosine similarities comes down to multiplying the TF-IDF matrix with its own transpose. The function below does this multiplication with the sparse_dot_topn library, storing only the top ntop results per row and only values above lower_bound:

import numpy as np
from scipy.sparse import csr_matrix
import sparse_dot_topn.sparse_dot_topn as ct

def awesome_cossim_top(A, B, ntop, lower_bound=0):
    # force A and B into CSR format (no overhead if they already are)
    A = A.tocsr()
    B = B.tocsr()
    M, _ = A.shape
    _, N = B.shape
    idx_dtype = np.int32
    nnz_max = M*ntop
    # pre-allocate the arrays that will hold the result matrix
    indptr = np.zeros(M+1, dtype=idx_dtype)
    indices = np.zeros(nnz_max, dtype=idx_dtype)
    data = np.zeros(nnz_max, dtype=A.dtype)
    ct.sparse_dot_topn(
        M, N, np.asarray(A.indptr, dtype=idx_dtype),
        np.asarray(A.indices, dtype=idx_dtype),
        A.data,
        np.asarray(B.indptr, dtype=idx_dtype),
        np.asarray(B.indices, dtype=idx_dtype),
        B.data,
        ntop,
        lower_bound,
        indptr, indices, data)
    return csr_matrix((data, indices, indptr), shape=(M, N))
The following code runs the optimized cosine similarity function. It only stores the top 10 most
similar items, and only items with a similarity above 0.8:
import time
t1 = time.time()
matches = awesome_cossim_top(tf_idf_matrix, tf_idf_matrix.transpose(), 10, 0.8)
t = time.time()-t1
print("SELFTIMED:", t)
SELFTIMED: 2718.7523670196533
The following code unpacks the resulting sparse matrix. As it is a bit slow, an option to look at only
the first n values is added.
def get_matches_df(sparse_matrix, name_vector, top=100):
    non_zeros = sparse_matrix.nonzero()
    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]
    if top:
        nr_matches = top
    else:
        nr_matches = sparsecols.size
    # pair each name with the name it was matched to, plus the similarity score
    left_side = [name_vector[i] for i in sparserows[:nr_matches]]
    right_side = [name_vector[i] for i in sparsecols[:nr_matches]]
    similarity = sparse_matrix.data[:nr_matches]
    return pd.DataFrame({'left_side': left_side,
                         'right_side': right_side,
                         'similarity': similarity})
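A possible way to call this function (a sketch; company_names is the Series assumed in the vectorizer step above, and the parameter values are only for illustration):

matches_df = get_matches_df(matches, company_names, top=100000)
# filter out the trivial matches of every name with itself
matches_df = matches_df[matches_df['similarity'] < 0.99999]
matches_df.sample(10)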
Conclusion
As we saw by visual inspection, the matches created with this method are quite good: the matched
strings are very similar. The biggest advantage, however, is the speed. The method described above can be
scaled to much larger datasets by using a distributed computing environment such as Apache Spark.
This could be done by broadcasting one of the TF-IDF matrices to all workers and parallelizing the
second (in our case a copy of the TF-IDF matrix) into multiple sub-matrices. Each worker can then
multiply its part of the second matrix with the entire first matrix (using NumPy or the sparse_dot_topn
library). An example of this is described here.
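A very rough sketch of that idea in PySpark, assuming an existing SparkContext sc, the tf_idf_matrix built above, and that sparse_dot_topn is installed on every worker (the chunk size and all variable names here are illustrative only):

from scipy.sparse import vstack

# send one copy of the (transposed) TF-IDF matrix to every worker
b_transposed = sc.broadcast(tf_idf_matrix.transpose().tocsr())

# split the other matrix into row chunks and distribute them over the cluster
chunk_size = 10000
chunks = [tf_idf_matrix[i:i + chunk_size] for i in range(0, tf_idf_matrix.shape[0], chunk_size)]
rdd = sc.parallelize(chunks, numSlices=len(chunks))

# every worker multiplies its chunk with the broadcast matrix and keeps the top matches
partial_results = rdd.map(lambda chunk: awesome_cossim_top(chunk, b_transposed.value, 10, 0.8)).collect()
matches = vstack(partial_results)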