Design of hash(::BioSequence). What to do? #243

@jakobnissen

So the design of BioSequences makes it difficult to implement efficient and correct hashing.
We want efficient hashing, because hashing underlies operations like putting stuff in a Set or Dict, which users expect to be fast.

The issue is

  • Julia requires that isequal(a, b) implies hash(a, x) == hash(b, x) for any seed x. The converse is not required: unequal values may share a hash.
  • isequal ought to allow objects of different types to be equal if they represent the same value, e.g. isequal(0, 0.0). Breaking this leads to lots of confusion.

So, to follow these two rules

  • Different subtypes of BioSequence should be isequal if they have the same content, i.e. isequal(dna"TAG", DNAKmer("TAG"))
  • Which implies they should hash equally.
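
To make the two rules concrete, here is a minimal sketch using only Base number types, so it does not depend on BioSequences. The helper name `consistent` is hypothetical and only exists for this illustration; it checks the one direction Julia actually requires (equal values must hash equally).

```julia
# Hypothetical helper: true when the pair (a, b) does not violate the
# hashing contract for the given seed. Unequal values pass trivially.
consistent(a, b, seed::UInt) = !isequal(a, b) || hash(a, seed) == hash(b, seed)

# Int and Float64 are different types with different bit patterns,
# yet they compare isequal and therefore must hash the same.
@assert isequal(0, 0.0)
@assert hash(0) == hash(0.0)
@assert consistent(0, 0.0, UInt(42))

# Because the hashes agree, lookups across types work in hashed containers.
@assert 1 in Set([1.0])
@assert haskey(Dict(1.0 => "one"), 1)
```

This is the behaviour we would want between, say, a LongSequence and a kmer with the same content.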

Now, how do we get two BioSequences with arbitrary encoding to hash equivalently? As I see it, it means we can't hash the encoded data, because the encoded data may vary between subtypes. However, hashing the encoded data is presumably the only way to avoid decoding, and avoiding decoding is the only way to make hashing fast!

Incidentally, the current implementation of hash for LongSequence is broken:

julia> (a, b) = (LongDNA{4}("A"), LongDNA{2}("A"));

julia> a in [b] # because they are equal
true

julia> a in Set([b]) # hashes wrong
false
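
The same failure mode can be reproduced without BioSequences. The sketch below uses two hypothetical toy types (`Plain` and `Reversed`, made up for this example) that store the same content in different layouts, define `==` across types, but leave `hash` at its default. It is only meant to show the mechanism, not how LongSequence is actually implemented.

```julia
# Hypothetical toy types standing in for "same content, different encoding".
abstract type Toy end

struct Plain <: Toy
    text::String      # stores the content directly
end

struct Reversed <: Toy
    text::String      # stores the content reversed
end

content(x::Plain) = x.text
content(x::Reversed) = reverse(x.text)

# Equality compares decoded content, so it works across the two types.
Base.:(==)(a::Toy, b::Toy) = content(a) == content(b)

# No Base.hash method is defined, so the default is used. The default
# knows nothing about `content`, so equal values get unrelated hashes.
a, b = Plain("TAG"), Reversed("GAT")

@assert a == b
@assert a in [b]            # linear search only uses ==
@assert hash(a) != hash(b)  # contract violated
# `a in Set([b])` now depends on the hashes and will typically be false.
```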

Here are some possible solutions:

  1. Just implement hashing by hashing every element. This will make hashing significantly slower (in tests, ~75 times slower), but it will be simple and correct.
  2. Make one encoding privileged, e.g. LongSequence's. When hashing any other type, we re-code it to that encoding before hashing. This will keep LongSequence hashing fast, but will make everything else both slower and much more complex. If we do this, we also need to recode LongSequence{NucleicAcidAlphabet{2}}.
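
To illustrate the two options, here is a rough sketch on toy types. Everything in it (`ToySeq`, `IndexSeq`, `OneHotSeq`, `decode`, `canonical`, `hash_elementwise`, `hash_recoded`) is hypothetical and invented for this example; none of it is BioSequences API, and the one-symbol-per-byte storage is a simplification of real bit-packed encodings.

```julia
# Hypothetical toy types; these are NOT the types BioSequences.jl defines.
abstract type ToySeq end

const SYMBOLS = ('A', 'C', 'G', 'T')

# One symbol per byte, coded as an index: A=0, C=1, G=2, T=3.
struct IndexSeq <: ToySeq
    codes::Vector{UInt8}
end
IndexSeq(s::AbstractString) =
    IndexSeq(UInt8[findfirst(==(c), SYMBOLS) - 1 for c in s])

# One symbol per byte, coded one-hot: A=1, C=2, G=4, T=8.
struct OneHotSeq <: ToySeq
    codes::Vector{UInt8}
end
OneHotSeq(s::AbstractString) =
    OneHotSeq(UInt8[0x01 << (findfirst(==(c), SYMBOLS) - 1) for c in s])

# Decoding turns the stored codes back into symbols.
decode(s::IndexSeq) = (SYMBOLS[c + 1] for c in s.codes)
decode(s::OneHotSeq) = (SYMBOLS[trailing_zeros(c) + 1] for c in s.codes)

Base.length(s::ToySeq) = length(s.codes)
Base.:(==)(a::ToySeq, b::ToySeq) =
    length(a) == length(b) && all(x == y for (x, y) in zip(decode(a), decode(b)))

# Option 1: decode and hash every element. Simple, encoding-independent, slow.
function hash_elementwise(s::ToySeq, h::UInt)
    for sym in decode(s)
        h = hash(sym, h)
    end
    return h
end

# Option 2: treat the one-hot encoding as privileged. That type hashes its
# raw data directly; every other type must recode to it first.
canonical(s::OneHotSeq) = s.codes
canonical(s::IndexSeq) = UInt8[0x01 << c for c in s.codes]
hash_recoded(s::ToySeq, h::UInt) = hash(canonical(s), h)

# Pick option 1 as the actual hash for this sketch.
Base.hash(s::ToySeq, h::UInt) = hash_elementwise(s, h)

a, b = OneHotSeq("TAG"), IndexSeq("TAG")
@assert a == b && hash(a) == hash(b)
@assert a in Set([b])
```

In this toy, option 2 keeps `OneHotSeq` on the fast path while `IndexSeq` pays for an allocation and a recode on every hash, which mirrors the trade-off described above.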

There may be other solutions.
