Design of hash(::BioSequence). What to do?

The design of BioSequences makes it difficult to implement hashing that is both efficient and correct.
We want efficient hashing, because hashing underlies operations like putting stuff in a `Set` or `Dict`, which users expect to be fast.

The issue is:
* Julia requires that `isequal(a, b)` implies `hash(a, x) == hash(b, x)`.
* `isequal` ought to allow objects of different types to be equal if they represent the same value, e.g. `isequal(0, 0.0)`. Breaking this leads to lots of confusion.
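
To make the two rules concrete, here is a small check using only Base types. It is just an illustration of the contract with integers and floats, not anything specific to BioSequences:

```julia
# 0 and 0.0 have different types but represent the same value.
@assert isequal(0, 0.0)

# The hash contract: isequal values must produce equal hashes.
@assert hash(0) == hash(0.0)

# Because the hashes agree, hash-based lookups work across types.
@assert 0.0 in Set([0])
@assert haskey(Dict(0 => "zero"), 0.0)
```

If either half of the contract were broken, the `Set` and `Dict` lookups would silently miss.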

So, to follow these two rules:
* Different subtypes of `BioSequence` should be `isequal` if they have the same content, i.e. `isequal(dna"TAG", DNAKmer("TAG"))`
* This implies they should hash equally.

Now, how do we get two `BioSequence`s with arbitrary encoding to hash equivalently? As I see it, it means we can't hash the encoded data, because the encoded data may vary between subtypes. However, hashing the encoded data is presumably the only way to avoid decoding, and avoiding decoding is the only way to make hashing fast!
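
A toy illustration of why the raw encoded data can't be hashed directly. The two encodings below are made up for this sketch (a dense 2-bit code and a one-hot 4-bit code) and are not claimed to match BioSequences' actual bit layouts; `SYMBOLS`, `encode2`, `encode4` and `pack` are hypothetical names:

```julia
# Toy encodings for illustration only, not BioSequences internals.
const SYMBOLS = ['A', 'C', 'G', 'T']
encode2(c::Char) = UInt64(findfirst(==(c), SYMBOLS) - 1)         # A=0, C=1, G=2, T=3
encode4(c::Char) = UInt64(1) << (findfirst(==(c), SYMBOLS) - 1)  # A=1, C=2, G=4, T=8

# Pack a string into one machine word, using `bits` bits per symbol.
function pack(encode, s::AbstractString, bits::Int)
    word = UInt64(0)
    for (i, c) in enumerate(s)
        word |= encode(c) << (bits * (i - 1))
    end
    return word
end

word2 = pack(encode2, "TAG", 2)
word4 = pack(encode4, "TAG", 4)

# Same content, different bits: hashing the words directly cannot agree in general.
@assert word2 != word4
```

Any hash computed straight from `word2` and `word4` sees different input, so equal hashes would require either decoding or re-coding one of them first.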

Incidentally, the current implementation of `hash` for `LongSequence` is broken:
```
julia> (a, b) = (LongDNA{4}("A"), LongDNA{2}("A"));

julia> a in [b] # because they are equal
true

julia> a in Set([b]) # hashes wrong
false
```

Here are some possible solutions:
1. Just implement hashing by hashing every element. This will make hashing significantly slower (in tests, ~75 times slower), but it will be simple and correct.
2. Make one encoding privileged, e.g. `LongSequence`'s. When hashing any other type, we re-code it to that encoding before hashing. This will keep `LongSequence` hashing fast, but will make everything else both slower and much more complex. If we do this, we also need to recode `LongSequence{NucleicAcidAlphabet{2}}`.
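
Option 1 could look roughly like the sketch below. It uses toy types rather than the real `BioSequence` hierarchy: `ToySeq`, `TwoBitSeq`, `FourBitSeq`, `toyseq`, `symbol` and `symbols` are all hypothetical names, and the encodings are invented for the example. The point is only that hashing the decoded symbols gives equal hashes regardless of encoding:

```julia
# Toy stand-ins for BioSequence subtypes; all names here are hypothetical.
abstract type ToySeq end

struct TwoBitSeq <: ToySeq
    data::UInt64   # 2 bits per symbol
    len::Int
end

struct FourBitSeq <: ToySeq
    data::UInt64   # 4 bits per symbol, one-hot
    len::Int
end

const ALPHABET = ['A', 'C', 'G', 'T']

bits(::Type{TwoBitSeq}) = 2
bits(::Type{FourBitSeq}) = 4
encode(::Type{TwoBitSeq}, c::Char) = UInt64(findfirst(==(c), ALPHABET) - 1)
encode(::Type{FourBitSeq}, c::Char) = UInt64(1) << (findfirst(==(c), ALPHABET) - 1)
decode(::Type{TwoBitSeq}, x::UInt64) = ALPHABET[Int(x) + 1]
decode(::Type{FourBitSeq}, x::UInt64) = ALPHABET[trailing_zeros(x) + 1]

# Build a toy sequence of type T from a string (at most 16 symbols).
function toyseq(::Type{T}, s::AbstractString) where {T<:ToySeq}
    data = UInt64(0)
    for (i, c) in enumerate(s)
        data |= encode(T, c) << (bits(T) * (i - 1))
    end
    return T(data, length(s))
end

# Decode the i-th symbol of a toy sequence.
function symbol(s::T, i::Int) where {T<:ToySeq}
    b = bits(T)
    mask = (UInt64(1) << b) - 1
    return decode(T, (s.data >> (b * (i - 1))) & mask)
end

symbols(s::ToySeq) = (symbol(s, i) for i in 1:s.len)

# Equality compares decoded content, so it is independent of the encoding.
Base.:(==)(a::ToySeq, b::ToySeq) =
    a.len == b.len && all(x == y for (x, y) in zip(symbols(a), symbols(b)))

# Hash every decoded symbol: simple and correct, but pays for decoding.
function Base.hash(s::ToySeq, h::UInt)
    h = hash(s.len, h)
    for c in symbols(s)
        h = hash(c, h)
    end
    return h
end

a = toyseq(FourBitSeq, "TAG")
b = toyseq(TwoBitSeq, "TAG")
@assert a.data != b.data      # different encoded bits
@assert isequal(a, b)         # same content
@assert hash(a) == hash(b)    # so the hashes must agree
@assert a in Set([b])         # and Set lookup works across types
```

The per-symbol decode in `Base.hash` is exactly the cost that option 1 accepts. Option 2 would instead re-code everything into one chosen encoding and hash the resulting words.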

There may be other solutions.

