This is an outline of how I'd explain hash to a novice programmer. I've given versions of this explanation hundreds of times to thousands of students, beginners and experts alike.
Students tell me they find it compelling because it helps them connect w/ what's really going on by emphasizing both the technical and human/design elements.
I'm assuming the student is comfortable with:
- Basic language constructs and concepts like variables, conditionals, functions, etc.
- Using collection types with built-in types
- Defining and using custom classes/objects
- Using some magic methods with custom objects to override default behavior
The way I explain magic methods is that Python knows what built-in operations like ==, +, *, etc. mean for built-in types because a human programmed the Python interpreter to handle all those cases. + means something different for numbers vs. strings vs. lists, for example, but it means something different because a human told it.
However, with objects we define ourselves, Python has no way of knowing what + is supposed to mean unless we tell it. In this instance, we're the human who has to tell Python how to handle our case.
Magic methods are Python's way of letting us tell it how to use built-in operations with our custom objects. In other words, students are conditioned to continuously ask and reflect on the question: "How could Python know, how does it know, and who told it?"
Python lets you use custom objects as keys in a dictionary, but if we redefine __eq__ we see something surprising:
# Insert relevant exampleTo understand what's going on and how to fix it we need to understand Python's hash and __hash__.
If all you want is the "how to fix it", this is the rule. For dict to work as you expect for custom objects you define, you must ensure the following is always true:
- If
a == bthenhash(a) == hash(b)
In the same way that Python determines how == works with our custom object by calling our magic method __eq__, it determines how hash works with our custom object by calling our magic method __hash__.
(Note: I usually have a table of magic methods + semantic effects I've been building up, which I'd include here with __hash__ and hash added)
In short:
dictsomehow useshashto do its jobdictrequires thathash(a) == hash(b)whenevera == b- For our custom objects, changing
__eq__changes howa == bworks - For our custom objects, changing
__hash__changes howhash(a)works - If we change
__eq__, we become responsible for changing__hash__so thathashcontinues to work asdictexpects when we use our custom objects as keys
This might seem annoying and arbitrary, but features of Python you know and love mean it has to work this way. You'll know you'll understand what's going on when you have a hard time imagining it working any other way.
However, if the above explanation is satisfactory then it's not essential you read on. We think it's useful and enlightening, but it's not essential. Just remember that dict needs hash to play nice with == and we become responsible for ensuring that remains the case whenever we define a custom __eq__.
If the explanation unsatisfactory or you're curious, continue reading to learn more about...
- Why
dictneeds something likehashto do its job (in ANY language, not just Python) - How exactly
dictuseshash
Note Here's where I'd outline the reasoning for hash, which I modulate as appropriate for the audience.
To see why something like hash is necessary, let's think about how we might build dict ourselves if all we had were Python arrays/lists. What problem is hash a solution for?
-
It wouldn't be too hard to make something like
dictthat was slow, e.g., a giant array of(key, value)pairs that we always searched through linearly.Note: IME, if you ask an absolute novice to make something that works like
dictusing only arrays/lists, they almost all propose this. Most also immediately see that it's not great, so offer up an answer like "Well, you could do X, but, uhhh....not sure that's what you're looking for." -
Getting information out of an array with a numerical index is fast because we have special hardware (RAM) designed to make that fast. Is there some way we can use this? Yes!
Note: Again, emphasizing the human element: arrays are fast because HUMANS MADE THEM FAST.
In fact, early computer memory, e.g., the delay-line memory in the EDVAC did not have constant-time random access. Accessing values in an array by index was not "fast" on those computers.
I often bring in the history.
-
With a
dict, we want to retrieve information with a non-numerical key. -
One strategy would be to convert the key into a number and then use an array (which is fast). This is what we do.
-
hashis that conversion function -
It's possible for
hash(a)andhash(b)to assign the same index to different valuesaandb. This is called a "collision". -
Do you see why collisions would be a problem? Good! That means you're well on your way to understanding how
dict-like collections work. -
There are a variety of ways to solve the "collision problem". There's a whole wikipedia page on it: hash collisions
Like before, it's not essential you understand how we solve the collision problem, though you might be able to think of a few ways you could. Different languages and implementations use different methods. The exact method a language runtime uses might change from version to version, even.
The key thing is that there's no mystery here. There's only a sequence of problems and solutions, each solved one after the other by other people like you.