Performance Analysis - Pure Python
Number of effective sequences implemented in Pure Python
In the previous post I compared various languages and libraries in terms of their speed. This notebook contains the code used in that comparison, as well as some details about the choices made to improve the performance of the pure Python implementation.
# ! pip install pandas
I will cheat here and use pandas to help me read the file.
import pandas as pd
def get_data(path):
    # Treat ">" as the record terminator and "\n" as the field separator,
    # so each FASTA record is parsed into an (id, seq) pair.
    fasta_df = pd.read_csv(path, sep="\n", lineterminator=">", index_col=False, names=['id', 'seq'])
    return fasta_df.seq.to_list()
seqs = get_data('../data/picked_msa.fasta')
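For reference, the same file could be read without pandas as well. A minimal sketch (the function name is my own, and it assumes simple records in the same form the pandas trick above expects):

def get_data_pure(path):
    # Split the file on ">" record markers; the chunk before the
    # first ">" is empty, so skip it.
    with open(path) as f:
        records = f.read().split(">")[1:]
    # Each record is "id\n<sequence lines>"; drop the id line and
    # join the remaining lines into one sequence string.
    return [record.split("\n", 1)[1].replace("\n", "") for record in records]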
As a reminder, the pseudocode looks like this:
meff = 0
for seq1 in seqs:
    weight = 0
    for seq2 in seqs:
        if count_matches(seq1, seq2) > threshold:
            weight += 1
    meff += 1/weight
meff = meff/(len(seq1)^0.5)
It translates into Python relatively easily:
def get_nf_python(seqs, threshold=0.8):
    n_seqs, seq_len = len(seqs), len(seqs[0])
    meff = 0
    for seq1 in seqs:
        # c counts how many sequences fall within the identity threshold of seq1
        c = 0
        for seq2 in seqs:
            # count the positions at which the two sequences match
            identity = 0
            for p in range(seq_len):
                identity = identity + int(seq1[p] == seq2[p])
            is_more = int((identity/seq_len) > threshold)
            c = c + is_more
        meff = meff + 1/c
    return meff/(seq_len**0.5)
This is O(n^2 * l) complexity (where l is the sequence length), as we need to compare each of the n sequences against all n sequences (a pairwise operation) and check every position in each pair. This is not ideal, and such an algorithm scales poorly. If interested, you can read more about this here.
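To see the quadratic scaling empirically, one can time the function for increasing subset sizes. A quick sketch (the subset sizes are my own choices):

import time

# Doubling the number of sequences should roughly quadruple the runtime.
for n in (50, 100, 200):
    start = time.perf_counter()
    get_nf_python(seqs[:n])
    print(f"n={n}: {time.perf_counter() - start:.2f}s")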
%%timeit -n 3 -r 3
get_nf_python(seqs[:100])
Fortunately, the algorithm can be improved by exploiting the fact that the number of matches between `seq1` and `seq2` is the same as between `seq2` and `seq1` (the pairwise matrix is symmetric).
def get_nf_python_v2(seqs, threshold=0.8):
    n_seqs, seq_len = len(seqs), len(seqs[0])
    # the diagonal stays 1: every sequence matches itself
    is_same_cluster = [[1 for _ in range(n_seqs)] for _ in range(n_seqs)]
    meff = 0
    for i in range(n_seqs):
        # only compute the upper triangle and mirror the result
        for j in range(i+1, n_seqs):
            identity = 0
            for p in range(seq_len):
                identity = identity + int(seqs[j][p] == seqs[i][p])
            is_more = int((identity/seq_len) > threshold)
            is_same_cluster[i][j] = is_more
            is_same_cluster[j][i] = is_more
        # row i is now complete, so the cluster size can be summed
        c = sum(is_same_cluster[i])
        meff = meff + 1/c
    return meff/(seq_len**0.5)
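Before timing it, a quick sanity check that both versions agree on a small slice of the data:

assert abs(get_nf_python(seqs[:50]) - get_nf_python_v2(seqs[:50])) < 1e-9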
%%timeit -n 3 -r 3
get_nf_python_v2(seqs[:100])
Although, technically, this version is still O(n^2 * l), in practice the speed improvement is substantial (in this case, roughly 40%). It is also possible to break out of the innermost loop early; however, I did not see any performance gain from this (the cost of the extra `if` statement offsets the gain from early stopping).
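For reference, the early-exit variant of the innermost loop looked roughly like this (a sketch; the helper name is my own, and it stops counting as soon as the match count already exceeds the threshold):

def count_identity_early_exit(seq1, seq2, seq_len, threshold=0.8):
    # minimum number of matches needed to exceed the threshold
    needed = threshold * seq_len
    identity = 0
    for p in range(seq_len):
        identity = identity + int(seq1[p] == seq2[p])
        # stop as soon as the pair is guaranteed to be in the same cluster
        if identity > needed:
            return 1
    return 0

Plugging this in place of the inner identity loop gives the same result, but in CPython the per-iteration check costs about as much as the early stop saves.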
Note that I used only 1% of the data. Running these algorithms on the full data set would take too long and is hence not recommended.