Introduction

In the previous post I compared various languages and libraries in terms of their speed. This notebook contains the code used in the comparison, as well as some details about the choices made to improve the performance of the pure Python implementation.

Setup

 
# ! pip install pandas

Getting data

I will cheat here and use pandas to help me read the file.

import pandas as pd

def get_data(path):
    # Treat ">" as the record terminator and "\n" as the field separator,
    # so each FASTA record parses into an (id, seq) pair.
    # This assumes each sequence occupies a single line under its header.
    fasta_df = pd.read_csv(path, sep="\n", lineterminator=">", index_col=False, names=['id', 'seq'])
    return fasta_df.seq.to_list()

seqs = get_data('../data/picked_msa.fasta')
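
If you would rather avoid the pandas dependency, a pandas-free reader could look like the sketch below. This is my own illustration (get_data_pure is a hypothetical name, not part of the benchmark), and it makes the same assumption that each sequence sits on a single line under its ">" header.

def get_data_pure(path):
    # Split the file on ">" records and keep the sequence part of each.
    seqs = []
    with open(path) as f:
        for record in f.read().split('>'):
            if not record.strip():
                continue  # skip the empty chunk before the first ">"
            _id, seq = record.split('\n', 1)
            seqs.append(seq.strip())
    return seqs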

Naive Python implementation

As a reminder, the pseudocode looks like this:

meff = 0
for seq1 in seqs:
  weight = 0
  for seq2 in seqs:
    if count_matches(seq1, seq2) / len(seq1) > threshold:
      weight += 1
  meff += 1/weight

meff = meff/(len(seq1)^0.5)

It translates into Python fairly directly:

def get_nf_python(seqs, threshold=0.8):
    n_seqs, seq_len = len(seqs), len(seqs[0])
    meff = 0
    for seq1 in seqs:
        c = 0  # number of sequences within the identity threshold of seq1
        for seq2 in seqs:
            identity = 0
            for p in range(seq_len):                
                identity = identity + int(seq1[p] == seq2[p])
            is_more = int((identity/seq_len) > threshold)
            c = c + is_more
        meff = meff + 1/c  # c >= 1, since seq1 always matches itself
    return meff/(seq_len**0.5)

This is O(n^2 * l) complexity (where l is the sequence length), as each of the n sequences is compared against all n sequences (a pairwise operation), position by position. This is not ideal, and such an algorithm scales poorly. If interested, you can read more about this here.
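
To see the quadratic growth in n empirically, a quick, illustrative timing loop along the following lines can be used (the sample sizes are arbitrary):

import time

for n in (25, 50, 100):
    # Doubling n should roughly quadruple the runtime.
    start = time.perf_counter()
    get_nf_python(seqs[:n])
    print(f"n={n}: {time.perf_counter() - start:.2f} s")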

%%timeit -n 3 -r 3
get_nf_python(seqs[:100])
1.86 s ± 46.6 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
get_nf_python(seqs[:100])
0.18006706787628668

Fortunately, the algorithm can be improved by exploiting the fact that the number of matches is the same for the pair seq1, seq2 as for seq2, seq1 (the pairwise identity matrix is symmetric).

def get_nf_python_v2(seqs, threshold=0.8):
    n_seqs, seq_len = len(seqs), len(seqs[0])
    # The diagonal stays 1: every sequence matches itself.
    is_same_cluster = [[ 1 for _ in range(n_seqs)] for _ in range(n_seqs)]
    meff = 0
    for i in range(n_seqs):
        for j in range(i+1,n_seqs):  # only the upper triangle is computed
            identity = 0
            for p in range(seq_len):                
                identity = identity + int(seqs[j][p] == seqs[i][p])
            is_more = int((identity/seq_len) > threshold)
            is_same_cluster[i][j]=is_more
            is_same_cluster[j][i]=is_more  # mirror into the lower triangle
        # Row i is complete here: entries for j < i were filled in earlier iterations.
        c = sum(is_same_cluster[i])
        meff = meff + 1/c
    return meff/(seq_len**0.5)
%%timeit -n 3 -r 3
get_nf_python_v2(seqs[:100])
1.11 s ± 8.22 ms per loop (mean ± std. dev. of 3 runs, 3 loops each)
get_nf_python_v2(seqs[:100])
0.18006706787628668

Although, technically, this version is still O(n^2 * l), in practice the speed improvement is substantial (in this case, roughly 40%). It is also possible to break out of the innermost loop early; however, I did not see any performance gain (the cost of the extra if statement offset the gain from early stopping), as sketched below.
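
For reference, a minimal sketch of that early-exit variant could look as follows (the _v3 name and exact structure are mine, not from the original benchmark); the innermost scan stops as soon as the match count already exceeds the threshold:

def get_nf_python_v3(seqs, threshold=0.8):
    n_seqs, seq_len = len(seqs), len(seqs[0])
    min_matches = threshold * seq_len  # raw match count needed to pass the threshold
    is_same_cluster = [[ 1 for _ in range(n_seqs)] for _ in range(n_seqs)]
    meff = 0
    for i in range(n_seqs):
        for j in range(i+1, n_seqs):
            identity = 0
            for p in range(seq_len):
                identity = identity + int(seqs[j][p] == seqs[i][p])
                if identity > min_matches:
                    break  # already over the threshold; the rest of the scan is redundant
            is_more = int(identity > min_matches)
            is_same_cluster[i][j] = is_more
            is_same_cluster[j][i] = is_more
        c = sum(is_same_cluster[i])
        meff = meff + 1/c
    return meff/(seq_len**0.5)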

Note that I used only 1% of the data. Running these algorithms on the full dataset would take too long and is hence not recommended.