Machine learning hashes out a way to speed up huge databases

Researchers in the US and Germany used machine learning to build better hash functions and came up with a new method to boost the speed of online databases

Researchers from MIT and Harvard in the United States and the Technical University of Munich in Germany have demonstrated how machine-learned hash functions can significantly speed up searches in particularly large databases. The development could accelerate the analysis of DNA, amino acid sequences, and other biological information.

Their research - supported, in part, by Google, Intel, Microsoft, the National Science Foundation, the United States Air Force Research Laboratory, and the United States Air Force Artificial Intelligence Accelerator - will be presented at the International Conference on Very Large Databases.

Hashing is a fundamental operation in online databases such as library catalogues and e-commerce websites. Hash functions generate short, fixed-length codes that stand in for data inputs, making information faster to retrieve. However, collisions can occur when two pieces of data are hashed to the same value, causing slower searches and reduced performance.

To avoid collisions, perfect hash functions are used, but they require more time to compute and must be constructed for each dataset. As hashing is used in various applications, including database indexing, data compression, and cryptography, fast and efficient hash functions are crucial. To improve hash function performance, researchers from MIT and other institutions investigated using machine learning to build better hash functions.

Their findings showed that using learned models, which are created by running a machine-learning algorithm on a dataset, can result in half as many collisions in certain situations. Learned models were also more computationally efficient than perfect hash functions.

“What we found in this work is that in some situations, we can come up with a better tradeoff between the computation of the hash function and the collisions we will face,” says Ibrahim Sabek, a postdoc in the MIT Data Systems Group of the Computer Science and Artificial Intelligence Laboratory (CSAIL). “We can increase the computational time for the hash function a bit, but at the same time, we can reduce collisions very significantly in certain situations.”

As just one example, the technique could accelerate computational systems that scientists use to store and analyse DNA, amino acid sequences, or other biological information.

Sabek is co-lead author of the paper with electrical engineering and computer science (EECS) graduate student Kapil Vaidya. They are joined by co-authors Dominick Horn, a graduate student at the Technical University of Munich; Andreas Kipf, an MIT postdoc; Michael Mitzenmacher, professor of computer science at the Harvard John A. Paulson School of Engineering and Applied Sciences; and senior author Tim Kraska, associate professor of EECS at MIT and co-director of the Data Systems and AI Lab.

Perfect hash functions offer collision-free hashing

Traditional hash functions generate a random code corresponding to the slot where a data input or key will be stored. For example, if there are 10 keys to be placed into 10 slots, the function would generate a random integer between 1 and 10 for each input. However, collisions can occur when two keys end up in the same slot.
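To make the idea concrete, here is a minimal sketch (a hypothetical illustration, not the researchers' code) of a traditional hash function placing keys into a fixed number of slots and counting collisions:

```python
import hashlib

def traditional_hash(key: str, num_slots: int) -> int:
    # Map the key to a pseudo-random slot between 0 and num_slots - 1.
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % num_slots

keys = ["gene_a", "gene_b", "gene_c", "gene_d"]
num_slots = 10
slots = {}
collisions = 0
for key in keys:
    slot = traditional_hash(key, num_slots)
    if slot in slots:
        collisions += 1  # two keys landed in the same slot
    slots.setdefault(slot, []).append(key)

print(f"{collisions} collision(s) among {len(keys)} keys")
```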

Perfect hash functions provide a collision-free solution. By providing the function with additional knowledge, such as the number of slots available for data placement, it can perform additional computations to determine the appropriate slot for each key, thereby avoiding collisions. However, the added computations make the function more challenging to create and less efficient.
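As a toy illustration of that extra knowledge (real perfect-hash constructions are far more compact and avoid storing a full lookup table), a perfect hash over a known, fixed key set can simply map each key to its rank in sorted order:

```python
def build_perfect_hash(keys):
    # With the full key set known in advance, assigning each key its rank
    # in sorted order gives a minimal perfect hash: no two keys collide.
    rank = {key: i for i, key in enumerate(sorted(keys))}
    return lambda key: rank[key]

keys = ["gene_a", "gene_b", "gene_c", "gene_d"]
perfect_hash = build_perfect_hash(keys)
assert len({perfect_hash(k) for k in keys}) == len(keys)  # collision-free
```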

“We were wondering, if we know more about the data — that it will come from a particular distribution — can we use learned models to build a hash function that can actually reduce collisions?” says Vaidya.

A data distribution describes all possible values in a dataset and how often each one occurs. From this distribution, one can calculate the likelihood that a particular value is present in a data sample.

In their research, the team employed machine learning to estimate the shape of the data distribution, or how the data are distributed, by using a small sample from the dataset. The resulting learned model then predicts the location of a key in the dataset.
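A minimal sketch of the general idea, with an empirical CDF built from a small sample standing in for the learned model (an assumption for illustration, not the authors' implementation):

```python
import numpy as np

def learned_hash_factory(sample, num_slots):
    # Approximate the cumulative distribution function (CDF) from a small,
    # sorted sample of the data; a real learned model would fit this shape.
    sample = np.sort(np.asarray(sample))

    def learned_hash(key):
        # Estimated fraction of keys <= key, scaled to the slot range.
        cdf = np.searchsorted(sample, key, side="right") / len(sample)
        return min(int(cdf * num_slots), num_slots - 1)

    return learned_hash

keys = np.random.default_rng(0).normal(loc=100, scale=15, size=10_000)
h = learned_hash_factory(keys[:500], num_slots=10_000)
slots = [h(k) for k in keys]
print("distinct slots used:", len(set(slots)))
```

Because the estimated CDF spreads keys roughly in proportion to how common nearby values are, predictably distributed data lands in mostly distinct slots, which is why collisions drop in those cases.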

The researchers discovered that learned models are easier and faster to construct than perfect hash functions, resulting in fewer collisions than traditional hash functions when data is distributed predictably. However, if the gaps between data points vary too much, using learned models could result in more collisions.

“We may have a huge number of data inputs, and each one has a different gap between it and the next one, so learning that is quite difficult,” Sabek explains.

More sub-models lead to more accuracy

When data exhibited predictable distribution patterns, using learned models reduced the rate of colliding keys in a dataset from 30% to 15% compared to traditional hash functions. The learned models also achieved higher throughput than perfect hash functions. In the best cases, using learned models reduced runtime by almost 30%.

The researchers also discovered that the number of sub-models had the most significant impact on throughput when exploring the use of learned models for hashing. Each learned model comprises smaller linear models that approximate the data distribution. Increasing the number of sub-models enhances the accuracy of the learned model's approximation, but it also results in a longer processing time.
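A hedged sketch of how such sub-models might be composed, assuming each one is a simple linear fit to a slice of the CDF (the helper names are hypothetical, not taken from the paper):

```python
import numpy as np

def fit_submodels(sample, num_submodels):
    # Sort the sample, split it into ranges, and fit one linear model
    # (slope, intercept) per range, approximating the CDF piecewise.
    sample = np.sort(np.asarray(sample))
    positions = np.linspace(0.0, 1.0, len(sample))
    chunks = np.array_split(np.arange(len(sample)), num_submodels)
    models = []
    for idx in chunks:
        slope, intercept = np.polyfit(sample[idx], positions[idx], deg=1)
        models.append((sample[idx[0]], slope, intercept))
    return models

def predict_cdf(models, key):
    # Pick the sub-model whose range contains the key, then evaluate it.
    starts = [m[0] for m in models]
    i = max(0, np.searchsorted(starts, key, side="right") - 1)
    _, slope, intercept = models[i]
    return float(np.clip(slope * key + intercept, 0.0, 1.0))

rng = np.random.default_rng(0)
sample = rng.normal(loc=100, scale=15, size=2_000)
models = fit_submodels(sample, num_submodels=16)
slot = int(predict_cdf(models, 100.0) * 10_000)  # map to one of 10,000 slots
print("predicted slot:", slot)
```

More sub-models make the piecewise approximation finer, but each lookup then has to select among more pieces, which is the accuracy-versus-throughput tradeoff the researchers describe.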

“At a certain threshold of sub-models, you get enough information to build the approximation that you need for the hash function. But after that, it won’t lead to more improvement in collision reduction,” says Sabek.

Building off this analysis, the researchers want to use learned models to design hash functions for other data types. They also plan to explore learned hashing for databases where data can be inserted or deleted. When data are updated this way, the model needs to change accordingly, but changing the model while maintaining accuracy is a complex problem.

“We want to encourage the community to use machine learning inside more fundamental data structures and operations,” says Sabek. “Any kind of core data structure presents us with an opportunity to use machine learning to capture data properties and get better performance. There is still a lot we can explore.”
