(that you probably don't know)
A finite list of steps to solve a problem.
An algorithm must be seen to be believed.
— Donald Knuth
Functions that map arbitrarily long data blobs to fixed-length integer numbers.
A cryptographic hash function has three properties:
A small modification to the message causes a complete change in the hash.
Message: one ring to rule them all
SHA1: f0de86727cf32a37a8f6fdab54a62d1d910db235
Message: one ring to rule them al1
SHA1: d302c585c7e212822ed087e949216f9123657ebc
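A minimal Python sketch of this avalanche effect using the standard hashlib module (the exact digests depend on how the message is encoded to bytes):

import hashlib

# Two messages that differ in a single character.
for message in ("one ring to rule them all", "one ring to rule them al1"):
    digest = hashlib.sha1(message.encode("utf-8")).hexdigest()
    print(message, "->", digest)
# The two 160-bit digests share no obvious structure.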
With 160 bits, you have 2^160, or about 10^48 (1 with 48 zeros after it), possible SHA1 hashes. That number is just incomprehensibly huge. Even if you hired a trillion people to produce a trillion new unique blobs per second for a trillion years, you would still only have about 10^43 blobs.
If you hashed 2^80 random blobs, you might find a collision.
— From the book Version Control with Git
Cryptographic hash functions have lots of uses:
List of cryptographic hash functions at Wikipedia.
Git uses the SHA1 hash function to check whether a folder or file has changed.
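As a sketch of that use (blobs only; folders are hashed as tree objects), Git derives a file's identifier from the SHA1 of its content prefixed with a small header, so any change to the content produces a different identifier:

import hashlib

def git_blob_sha1(content: bytes) -> str:
    # Git hashes "blob <size>\0" followed by the file content.
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Should match `git hash-object` for the same bytes.
print(git_blob_sha1(b"hello\n"))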
There are attacks on hash functions, such as the one that allowed forging a pair of X.509 certificates with the same signature.
It is not an algorithm.
It is a system to verify if a protocol actor has spent some time performing an action.
Challenge / response:
Like in Telegram.
Solution verification:
Like in Bitcoin.
Some applications:
The proof of work is the answer to a question that must be...
For instance, one of the so-called proof-of-work functions such as...
Designed to prevent spam.
It relies on the SHA1 hash function.
When sending an e-mail, the sender:
When receiving an e-mail, the recipient:
Hashcash was introduced by Adam Back in 1997.
As computational power increases, the number of required zeroes does as well.
Bitcoin uses a Hashcash-style proof of work to perform the so-called bitcoin mining.
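A minimal Python sketch of a Hashcash-style proof of work (simplified stamp format; real Hashcash stamps also carry a version, a date and a random salt): the sender searches for a counter that makes the SHA1 of the stamp start with enough zero bits, and the recipient verifies it with a single hash.

import hashlib
from itertools import count

def leading_zero_bits(digest: bytes) -> int:
    # Number of zero bits at the start of the 160-bit digest.
    bits = bin(int.from_bytes(digest, "big"))[2:].zfill(len(digest) * 8)
    return len(bits) - len(bits.lstrip("0"))

def mint(resource: str, difficulty: int = 20) -> str:
    # Costly for the sender: try counters until the hash has enough zero bits.
    for counter in count():
        stamp = f"{resource}:{counter}"
        if leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= difficulty:
            return stamp

def verify(stamp: str, difficulty: int = 20) -> bool:
    # Cheap for the recipient: a single hash.
    return leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= difficulty

stamp = mint("alice@example.com")
print(stamp, verify(stamp))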
It is a technique to classify data into groups.
Each group has a representative, the centroid...
...and the points being represented.
It is broadly used in machine learning.
How many groups are there?
The number of groups, k, is provided by the user.
Formally, it computes the Voronoi diagram for k points.
The diagram is a partition into k regions, where each coloured region is the area closest to the point it contains.
We choose k random points (e.g. 4).
We assign to each centroid the closest points.
And recompute the centroids as the average point of the group.
We can iterate a predefined number of steps or until the algorithm converges.
The result...
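A minimal Python sketch of these steps, assuming 2-D points represented as (x, y) tuples and Euclidean distance:

import random
from math import dist  # Euclidean distance, Python 3.8+

def k_means(points, k, steps=100):
    # 1. Choose k random points as the initial centroids.
    centroids = random.sample(points, k)
    for _ in range(steps):
        # 2. Assign each point to its closest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            closest = min(range(k), key=lambda i: dist(p, centroids[i]))
            groups[closest].append(p)
        # 3. Recompute each centroid as the average point of its group.
        new_centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, groups

# Usage with random data:
points = [(random.random(), random.random()) for _ in range(200)]
centroids, groups = k_means(points, k=4)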
The algorithm was proposed by Stuart Lloyd of Bell Laboratories in 1957, but it was not published until 1982.
E. W. Forgy published almost the same method in 1965.
The previous animation belongs to Kickammender, a service for classifying Kickstarter projects.
pngquant is a program that reduces the size of PNG images through color quantization.
MapReduce is not an algorithm.
It is a programming model.
But there are highly parallelizable algorithms that can be adapted to MapReduce.
MapReduce consists of classifying values and aggregating them.
It is more or less equivalent to the following SQL code:
SELECT key, AGGREGATION(value)
FROM table
GROUP BY key
ORDER BY key
But with the difference that in MapReduce, we provide the behaviours for AGGREGATION and GROUP BY.
map :: data -> (key, value)
reduce :: (key, [values]) -> (key, aggregated-value)
Example: count the number of shapes of each color.
Prepare Map processors:
One processor per key suffices, but there could be more to provide redundancy.
Execute the map() function for the input data.
def map(id, shape):
    # Emit the shape's colour as the key, with a count of 1.
    return (shape.color, 1)
Prepare Reduce processors:
Again, one processor per key suffices.
Execute the reduce() function for the key-grouped data.
def reduce(color, values):
    return (color, sum(values))
Sort results by key:
Assuming Blue, Red and Green.
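A minimal in-process Python sketch of the whole pipeline, with a hypothetical list of (id, colour) records standing in for the shapes and the map()/reduce() pair renamed to avoid shadowing Python built-ins (no actual MapReduce framework involved):

from collections import defaultdict

# Hypothetical input: (shape_id, colour) records.
shapes = [(1, "blue"), (2, "red"), (3, "blue"), (4, "green"), (5, "red")]

def map_fn(shape_id, colour):
    return (colour, 1)

def reduce_fn(colour, values):
    return (colour, sum(values))

# Map + shuffle: run map_fn on every record and group the values by key.
groups = defaultdict(list)
for shape_id, colour in shapes:
    key, value = map_fn(shape_id, colour)
    groups[key].append(value)

# Reduce each group and sort the results by key.
print(sorted(reduce_fn(k, vs) for k, vs in groups.items()))
# [('blue', 2), ('green', 1), ('red', 2)]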
This way of counting is an algorithm.
An algorithm for MapReduce.
The map() function classifies the points into groups.
def map(id, point):
    global centroids
    # Distance from the point to every centroid.
    distances = \
        [(cid, point.distance(c)) for cid, c in enumerate(centroids)]
    # The closest centroid is the one at minimum distance.
    closest_centroid, _ = \
        min(distances, key=lambda i: i[1])
    return (closest_centroid, point)
The reduce() function calculates the new centroids.
def reduce(centroid_id, points):
    global centroids
    avg = lambda l: sum(l) / len(l)
    # The new centroid is the average point of the group.
    new_centroid = \
        Point(
            avg([p.x for p in points]),
            avg([p.y for p in points])
        )
    centroids[centroid_id] = new_centroid
    return (centroid_id, new_centroid)
Now repeat a finite number of steps...
Or until the centroids stop changing.
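A sketch of the driver loop gluing the two functions together, assuming hypothetical globals points (the dataset), centroids (as above) and MAX_STEPS, with a plain dictionary standing in for the framework's shuffle stage:

from collections import defaultdict

for _ in range(MAX_STEPS):
    old_centroids = list(centroids)
    # Map + shuffle: send every point to the group of its closest centroid.
    groups = defaultdict(list)
    for point_id, point in enumerate(points):
        centroid_id, p = map(point_id, point)  # the map() defined above
        groups[centroid_id].append(p)
    # Reduce: recompute one centroid per group (updates the global list).
    for centroid_id, group in groups.items():
        reduce(centroid_id, group)  # the reduce() defined above
    if centroids == old_centroids:  # assumes Point supports equality
        break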
Comparison with centroids is 100% parallelizable.
So the algorithm could split the data across N machines...
Targeting speed-ups of up to N in the grouping stage.
MapReduce was invented (and patented) by Google in 2004.
A free implementation is Apache Hadoop.
You can find the results of several k-Means MapReduce implementations on the Internet.
What is the best running time we can expect from a sorting algorithm?
You're probably thinking about Quicksort
In the average case: O(n log n)
But it can be better...
Consider the following list:
var list = [2, 5, 3, 0, 2, 3, 0, 3];
Let's count the occurrences of each value:
var count = [0, 0, 0, 0, 0, 0];
list.forEach(function (value) { count[value]++; });
// count: [2, 0, 2, 3, 0, 1]
Now let's accumulate the counts so that each entry holds the number of elements less than or equal to that value:
for (var i = 1; i < count.length; i++) {
  count[i] += count[i-1];
}
// count: [2, 2, 4, 7, 7, 8]
For each possible value, this accumulated count gives the index just past the last position that value will occupy in the sorted list.
// Remember list = [2, 5, 3, 0, 2, 3, 0, 3] and
// count: [2, 2, 4, 7, 7, 8]
// So value 2 will be at most at position 3 (4 - 1);
// the 5 value, at 7 (8 - 1);
// the 3 value, at 6 (7 - 1);
// ...
So we can walk through the original list, looking up where each value goes and updating the counting vector each time we place one:
var sorted = [];
// sorting backwards to keep the sorting algorithm stable
for (var i = list.length - 1; i >= 0; i--) {
  var value = list[i];
  var rightPosition = count[value] - 1;
  sorted[rightPosition] = value;
  count[value]--;
}
// sorted: [0, 0, 2, 2, 3, 3, 3, 5]
This algorithm iterates once over the original list and twice over the counting vector. If the length of the list is n and that of the counting vector is k, then...
...the algorithm operates in:
Θ (n + k)
But what if the list is the following?
var list = [1, 0, 5, 3, 9, 8, 2, 0, 3, 5, 1, 19890];
Now k dominates over n.
What can we do?
Consider the following list:
var list = [329, 457, 657, 839, 436, 720, 355];
We can sort the numbers by each of their d digits.
Let's start with units...
var list = [720, 355, 436, 457, 657, 329, 839];
We continue with the tens:
var list = [720, 329, 436, 839, 355, 457, 657];
And finally hundreds:
var list = [329, 355, 436, 457, 657, 720, 839];
It is easy to see the algorithm complexity is in:
Θ (d · sort-complexity)
If d is constant, the complexity is dominated by that of the sorting algorithm used for each digit.
And how to sort by each digit?
With counting sort!
So sort-complexity = Θ (n + k) with k = 10 constant, and so the overall complexity for radix sorting will be:
Θ (d · (n + k))
And observing that d is constant too, and that k can be very small compared with n itself, radix sort has a complexity of:
Θ (n)
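A self-contained Python sketch of the whole scheme (same idea as the JavaScript steps above, just in Python to match the other sketches), using a stable counting-sort pass per digit:

def counting_sort_by_digit(numbers, exp, base=10):
    # Stable counting sort on the digit selected by exp (1, 10, 100, ...).
    count = [0] * base
    for n in numbers:
        count[(n // exp) % base] += 1
    for i in range(1, base):          # accumulate the counts
        count[i] += count[i - 1]
    output = [0] * len(numbers)
    for n in reversed(numbers):       # backwards to keep the sort stable
        digit = (n // exp) % base
        count[digit] -= 1
        output[count[digit]] = n
    return output

def radix_sort(numbers):
    exp = 1
    while any(n // exp for n in numbers):  # one pass per digit, d in total
        numbers = counting_sort_by_digit(numbers, exp)
        exp *= 10
    return numbers

print(radix_sort([329, 457, 657, 839, 436, 720, 355]))
# [329, 355, 436, 457, 657, 720, 839]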
Both algorithms were invented by Harold H. Seward in 1954.
A more detailed explanation can be found in the book Introduction to Algorithms.
You can see a performance comparison where radix sort is 5 times faster than qsort for integers.