So the design of BioSequences makes it difficult to implement efficient and correct hashing.
We want efficient hashing, because hashing underlies operations like putting stuff in a `Set` or `Dict`, which users expect to be fast.
The issue is:
- Julia requires that `isequal(a, b)` implies `hash(a, x) == hash(b, x)`.
- `isequal` ought to allow objects of different types to be equal if they represent the same value, e.g. `isequal(0, 0.0)`. Breaking this leads to lots of confusion.
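As a quick illustration of both rules, here is how Base Julia already behaves for numbers of different types. This is only a sanity check on built-in types, not BioSequences code:

```julia
# Values of different types that represent the same number compare equal...
@assert isequal(0, 0.0)

# ...and therefore must produce the same hash.
@assert hash(0) == hash(0.0)

# Hash-based containers rely on this: the lookup below only succeeds
# because the Int and the Float64 hash identically.
s = Set([0])
@assert 0.0 in s
```

If the two hashes disagreed, the last lookup would fail even though the values are `isequal`.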
So, to follow these two rules:
- Different subtypes of `BioSequence` should be `isequal` if they have the same content, i.e. `isequal(dna"TAG", DNAKmer("TAG"))`.
- Which implies they should hash equally.
Now, how do we get two `BioSequence`s with arbitrary encodings to hash equivalently? As I see it, it means we can't hash the encoded data, because the encoded data may vary between subtypes. However, hashing the encoded data is presumably the only way to avoid decoding, and avoiding decoding is the only way to make hashing fast!
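To make the encoding problem concrete, here is a minimal sketch with two hypothetical packing helpers, `pack2` and `pack4`. They are loosely modeled on 2-bit and 4-bit nucleotide encodings and are assumptions for illustration, not the actual BioSequences internals:

```julia
# Hypothetical helpers for illustration; not part of BioSequences.jl.

# Pack a DNA string into a UInt64 using 2 bits per base (A=0, C=1, G=2, T=3).
function pack2(s::AbstractString)
    data = zero(UInt64)
    for (i, c) in enumerate(s)
        code = UInt64(findfirst(==(c), "ACGT") - 1)
        data |= code << (2 * (i - 1))
    end
    return data
end

# Pack the same string using 4 bits per base, one bit per nucleotide
# (A=0001, C=0010, G=0100, T=1000).
function pack4(s::AbstractString)
    data = zero(UInt64)
    for (i, c) in enumerate(s)
        code = UInt64(1) << (findfirst(==(c), "ACGT") - 1)
        data |= code << (4 * (i - 1))
    end
    return data
end

two = pack2("TAG")
four = pack4("TAG")

# Same sequence, different bit patterns.
@assert two != four
```

Any hash computed directly from `two` and `four` sees different input, so it cannot be expected to agree, even though both values encode the same sequence.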
Incidentally, the current implementation of `hash` for `LongSequence` is broken:

    julia> (a, b) = (LongDNA{4}("A"), LongDNA{2}("A"));

    julia> a in [b] # because they are equal
    true

    julia> a in Set([b]) # hashes wrong
    false

Here are some possible solutions:
- Just implement hashing by hashing every element. This will make hashing significantly slower (in tests, ~75 times slower), but it will be simple and correct.
- Make one encoding privileged, e.g. `LongSequence`'s. When hashing any other type, we re-code it to that encoding before hashing. This will keep `LongSequence` hashing fast, but will make everything else both slower and much more complex. If we do this, we also need to recode `LongSequence{NucleicAcidAlphabet{2}}`.
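The two options above can be sketched on toy types. Everything here (`ToySeq`, `TwoBitSeq`, `FourBitSeq`, `symbols`, `canonical`, `hash_canonical`) is hypothetical and stores one code per byte for brevity. It is not the BioSequences API, and it says nothing about real performance:

```julia
# Hypothetical toy types for illustration; not the BioSequences.jl API.
abstract type ToySeq end

struct TwoBitSeq <: ToySeq
    codes::Vector{UInt8}   # A=0, C=1, G=2, T=3
end

struct FourBitSeq <: ToySeq
    codes::Vector{UInt8}   # A=1, C=2, G=4, T=8 (one bit per base)
end

const BASES = "ACGT"
const CODE2 = Dict('A' => 0x00, 'C' => 0x01, 'G' => 0x02, 'T' => 0x03)

TwoBitSeq(s::AbstractString) = TwoBitSeq(UInt8[CODE2[c] for c in s])
FourBitSeq(s::AbstractString) = FourBitSeq(UInt8[0x01 << CODE2[c] for c in s])

# Decode each stored code back to a Char, independent of the encoding.
symbols(s::TwoBitSeq) = (BASES[c + 1] for c in s.codes)
symbols(s::FourBitSeq) = (BASES[trailing_zeros(c) + 1] for c in s.codes)

# Equality compares decoded content, so it works across encodings.
# isequal falls back to == for these types.
Base.:(==)(a::ToySeq, b::ToySeq) =
    length(a.codes) == length(b.codes) &&
    all(x == y for (x, y) in zip(symbols(a), symbols(b)))

# Option 1: hash every decoded element. Simple and correct, but it decodes.
function Base.hash(s::ToySeq, h::UInt)
    h = hash(:ToySeq, h)
    for sym in symbols(s)
        h = hash(sym, h)
    end
    return h
end

# Option 2: treat the 4-bit codes as the privileged encoding and recode
# everything else into it before hashing the raw codes.
canonical(s::FourBitSeq) = s.codes
canonical(s::TwoBitSeq) = UInt8[0x01 << c for c in s.codes]
hash_canonical(s::ToySeq, h::UInt=zero(UInt)) = hash(canonical(s), h)

a = TwoBitSeq("TAG")
b = FourBitSeq("TAG")

@assert a.codes != b.codes                      # encodings differ
@assert isequal(a, b)                           # content is equal
@assert hash(a) == hash(b)                      # option 1 agrees
@assert a in Set([b])                           # so Set lookup works
@assert hash_canonical(a) == hash_canonical(b)  # option 2 agrees too
```

In this sketch, option 1 pays the decoding cost for every type, while option 2 hashes the privileged type's codes directly and pays only for recoding the others, which mirrors the trade-off described in the list.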
There may be other solutions.