UUIDs


Sean P. DeNigris

UUIDs

Searching the archives, I found an interesting comment [1]:
"I think the UUIDGenerator in the image produces UUIDs which are good
enough for MC. " - Levente Uzonyi

I am pretty well confused by UUIDs in general (they seem magical) and Pharo's implementation. The use case I have in mind is a file library which imports files into a single folder, but changes their names to something guaranteed to be unique so that they don't overwrite each other. Would UUIDs work in that case? Would the image ones be "good enough"? Are primitive-generated UUIDs guaranteed to always be unique if, say, I move the image to another OS and continue generating them with another VM? Thanks!

[1] http://forum.world.st/UUID-and-Cog-tp2955687p2957172.html

Cheers,
Sean

Peter Uhnak

Re: UUIDs

The UUIDGenerator has an unnecessarily heavy-handed implementation; however, it does adhere to the RFC spec as far as I could tell.

Which also means that it's definitely usable across images and platforms.

If anything, the weak point would be the random number generator, not the UUID format itself. At least on Unix/Linux it uses /dev/urandom, so it should be pretty reliable.

So ask yourself what the probability is of your PRNGs generating the same 122 bits.
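
To put a rough number on that question, here is a small sketch in Python (not Pharo) of the usual birthday-bound approximation. It assumes the 122 random bits are uniformly distributed, which is only as true as the underlying random source; the function name is made up for illustration.

```python
import math

# A version 4 UUID has 128 bits, 6 of which are fixed by the version and variant fields.
RANDOM_BITS = 122


def collision_probability(n: int, bits: int = RANDOM_BITS) -> float:
    """Approximate chance that at least two of n uniformly random IDs collide."""
    space = 2 ** bits
    # Birthday bound: p is roughly 1 - exp(-n(n-1) / (2 * space)).
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-n * (n - 1) / (2 * space))


if __name__ == "__main__":
    for n in (10**6, 10**9, 10**12):
        print(f"{n:>17,} UUIDs -> p = {collision_probability(n):.3e}")
```

For a file library, even billions of imports leave the estimate far below the chance of the disk corrupting a file on its own.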

This thread might also interest you http://forum.world.st/Contributing-to-VoyageMongo-improving-insertion-updating-speed-td4838806.html

Peter


Ben Coman

Re: UUIDs

In reply to this post by Sean P. DeNigris

On Wed, Aug 12, 2015 at 2:09 AM, Sean P. DeNigris <[hidden email]> wrote:

> Searching the archives, I found an interesting comment [1]:
> "I think the UUIDGenerator in the image produces UUIDs which are good
> enough for MC. " - Levente Uzonyi
>
> I am pretty well confused by UUIDs in general (they seem magical) and
> Pharo's implementation. The use case I have in mind is a file library which
> imports files into a single folder, but changes their names to something
> guaranteed to be unique so that they don't overwrite each other. Would UUIDs
> work in that case? Would the image ones be "good enough"? Are
> primitive-generated UUIDs guaranteed to always be unique if, say, I move the
> image to another OS and continue generating them with another VM? Thanks!
>
> [1] http://forum.world.st/UUID-and-Cog-tp2955687p2957172.html
>

Unless your requirements *specifically* need identical files to be
maintained as duplicates, I would strongly consider using something
content-based like MD5 or SHA. It is guaranteed to remain the same
across OSes and gives good-enough uniqueness. It's also compatible
with external tools. Pharo seems to have an implementation.
https://en.wikipedia.org/wiki/Secure_Hash_Algorithm

Depending on the breadth of your audience, you may want to base it off
HashFunction and allow user configuration of algorithm. Selectivity
between security and performance can be useful.
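
A minimal sketch of the content-based naming idea, written in Python rather than Pharo for illustration. The function names, the chunk size, and the choice to keep the original file extension are all assumptions; the algorithm is a parameter, in the spirit of the configurable HashFunction suggestion.

```python
import hashlib
import shutil
from pathlib import Path


def content_name(path: Path, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of the file's bytes, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as stream:
        # Read fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def import_by_content(source: Path, library: Path, algorithm: str = "sha256") -> Path:
    """Copy source into library, named after its content hash plus original suffix."""
    library.mkdir(parents=True, exist_ok=True)
    target = library / (content_name(source, algorithm) + source.suffix)
    if not target.exists():  # identical content is already stored under this name
        shutil.copyfile(source, target)
    return target
```

Importing the same bytes twice yields the same name, so duplicates collapse into a single stored file.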

btw1, git uses SHA-1...
https://git-scm.com/book/en/v2/Git-Internals-Git-Objects

btw2, UUID actually uses MD5 and SHA, but on a smaller input than full
file contents. Cross platform may(?) have to contend with there being
several revisions of UUID (unless handling all past versions is
implicit in all implementations), particularly by external tools.
https://en.wikipedia.org/wiki/Universally_unique_identifier

btw3, I love turtles all the way down, but given that crypto
algorithms are CPU bound and Pharo will be single-CPU for some time,
it might be pragmatic to have the crypto primitives thread onto a
separate CPU, and maybe take advantage of hardware acceleration. Call
it SHAxExternal...
https://software.intel.com/en-us/articles/intel-sha-extensions

cheers -ben

Peter Uhnak

Re: UUIDs

> btw2, UUID actually uses MD5 and SHA, but on a smaller input than full
> file contents.

Pharo implements version 4, which uses purely random bits; not MD5/SHA/MAC.
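
For illustration, a version 4 UUID is just 16 random bytes with the version and variant fields overwritten, as RFC 4122 describes. The sketch below is Python, not Pharo's UUIDGenerator, and the helper name is made up.

```python
import os
import uuid


def uuid4_from_random_bytes(raw: bytes) -> uuid.UUID:
    """Build a version 4 UUID from 16 random bytes by fixing 6 of the 128 bits."""
    if len(raw) != 16:
        raise ValueError("need exactly 16 bytes")
    data = bytearray(raw)
    data[6] = (data[6] & 0x0F) | 0x40  # high nibble of byte 6 holds the version: 4
    data[8] = (data[8] & 0x3F) | 0x80  # top two bits of byte 8 hold the RFC 4122 variant
    return uuid.UUID(bytes=bytes(data))


if __name__ == "__main__":
    generated = uuid4_from_random_bytes(os.urandom(16))
    print(generated, "version", generated.version)
```

The remaining 122 bits come straight from the random source, which is why its quality is the only thing that matters for uniqueness.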

> btw3, I love turtles all the way down, but given that crypto
> algorithms are CPU bound and Pharo will be single-CPU for some time,
> it might be pragmatic to have the crypto primitives to thread onto a
> separate CPU, and maybe take advantage of hardware acceleration. Call
> it SHAxExternal...
> https://software.intel.com/en-us/articles/intel-sha-extensions

There's an advantage to using UUIDs: with larger files, hashing them might take a considerable amount of CPU time and disk I/O.

But having it content-based is also an advantage, because it can be created independently (and verifiably).
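
A sketch of the UUID side of that trade-off, in Python for illustration (the function name and the kept file extension are assumptions). The new name is chosen without hashing the content, so the only I/O is the copy itself.

```python
import shutil
import uuid
from pathlib import Path


def import_with_uuid(source: Path, library: Path) -> Path:
    """Copy source into library under a freshly generated random (version 4) UUID."""
    library.mkdir(parents=True, exist_ok=True)
    target = library / f"{uuid.uuid4()}{source.suffix}"
    # Mode "xb" refuses to overwrite, so even an unlikely name clash cannot
    # silently destroy an existing file; it raises FileExistsError instead.
    with source.open("rb") as src, target.open("xb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

Unlike a content hash, importing the same file twice produces two differently named copies.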

Peter

Sean P. DeNigris

Re: UUIDs

In reply to this post by Ben Coman

Ben Coman wrote

> Unless your requirements *specifically* need identical files to be
> maintained as duplicates, I would strongly consider using something
> content based like MD5 or SHA

Interesting... There should not be any duplicates. What's the advantage over UUID?

Cheers,
Sean

Ben Coman

Re: UUIDs

On Wed, Aug 12, 2015 at 6:38 PM, Sean P. DeNigris <[hidden email]> wrote:
> Ben Coman wrote
>> Unless your requirements *specifically* need identical files to be
>> maintained as duplicates, I would strongly consider using something
>> content based like MD5 or SHA
>
> Interesting... There should not be any duplicates. What's the advantage over
> UUID?
>

This... "move to another OS and continue generating [file-ids] with
another VM?"
A SHA hash intrinsically represents THE content (it is *always* the
same no matter who/where/how it's calculated), whereas a UUID is a
randomly generated label assigned to the content.

Also, depending on use-case, it may facilitate...
* easy verification of whether the file contents have changed.
* periodic checking of backups/restores for file corruption (including
by external tools, without reference to indexes maintained by your
Application Image).
* revision control: if you come across a file whose filename doesn't
match its content SHA hash, then you know its ancestor content by its
current file-id.
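
The verification points in the list above can be sketched as follows, in Python for illustration. It assumes files are stored as `<hex digest><extension>`; both function names are made up.

```python
import hashlib
from pathlib import Path


def matches_name(path: Path, algorithm: str = "sha256") -> bool:
    """True if the file's stem equals the hex digest of its current contents."""
    digest = hashlib.new(algorithm, path.read_bytes()).hexdigest()
    return path.stem == digest


def find_changed(library: Path, algorithm: str = "sha256") -> list:
    """List files whose contents no longer match their hash-based names."""
    return sorted(
        p for p in library.iterdir()
        if p.is_file() and not matches_name(p, algorithm)
    )
```

Because the check needs only the file itself, an external tool such as `sha256sum` could perform it without any index kept by the image.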

And for my own use case "some day"... I know I have many duplicate
files scattered amongst many ad hoc backups. For example, over ten
years there were several cycles of upgrading to a new PC where the
quick-safe path taken was to copy the old PC hard drive to a subfolder
on the new PC hard drive, but the old hard drive went into a box, now
with a dozen friends, plus duplication of many old small backup media
(floppy, ZIP-Media, tape) that it would help to consolidate onto
several of today's large media.
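
A rough sketch of that consolidation idea, in Python for illustration: group every file under a folder by the SHA-256 of its contents and report the groups with more than one member. The function name is made up, and files are read whole for brevity.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(root: Path) -> dict:
    """Map content digest -> sorted paths, for content that appears more than once."""
    groups = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    # Keep only digests shared by at least two files.
    return {d: sorted(paths) for d, paths in groups.items() if len(paths) > 1}
```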

cheers -ben