Searching the archives, I found an interesting comment [1]:

"I think the UUIDGenerator in the image produces UUIDs which are good enough for MC." - Levente Uzonyi

I am pretty well confused by UUIDs in general (they seem magical) and Pharo's implementation. The use case I have in mind is a file library which imports files into a single folder, but changes their names to something guaranteed to be unique so that they don't overwrite each other. Would UUIDs work in that case? Would the image ones be "good enough"? Are primitive-generated UUIDs guaranteed to always be unique if, say, I move the image to another OS and continue generating them with another VM?

Thanks!

[1] http://forum.world.st/UUID-and-Cog-tp2955687p2957172.html

Cheers,
Sean
The UUIDGenerator has an unnecessarily heavy-handed implementation; however, it does adhere to the RFC specs as far as I could tell, which also means that it's definitely usable across images and platforms.

If anything, the weak point would be the random number generator, not the UUID format. And at least on Unix/Linux it uses /dev/urandom, so it should be pretty reliable. So ask yourself what the probability is of your PRNGs generating the same 122 bits.

This thread might also interest you:
http://forum.world.st/Contributing-to-VoyageMongo-improving-insertion-updating-speed-td4838806.html

Peter

On Tue, Aug 11, 2015 at 8:09 PM, Sean P. DeNigris <[hidden email]> wrote:
> Searching the archives, I found an interesting comment [1]:
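Peter's question about two generators producing the same 122 bits can be put into rough numbers with the birthday bound. The sketch below is in Python purely for illustration (it is not Pharo code), the function name `collision_probability` is made up for this example, and it assumes the random bits are uniform and independent, which is exactly the weak point mentioned above.

```python
import math

# A version 4 UUID has 128 bits, 6 of which are fixed version/variant bits.
RANDOM_BITS = 122


def collision_probability(count: int, bits: int = RANDOM_BITS) -> float:
    """Birthday-bound approximation of the chance that at least two of
    `count` uniformly random `bits`-bit values are identical."""
    space = 2.0 ** bits
    # p ~ 1 - exp(-n(n-1) / (2N)); expm1 keeps precision when p is tiny.
    return -math.expm1(-count * (count - 1) / (2.0 * space))


if __name__ == "__main__":
    for n in (10**6, 10**9, 10**12):
        print(f"{n:>18,d} UUIDs -> p ~ {collision_probability(n):.3e}")
```

Under that assumption the estimate stays far below one in a trillion even for a trillion generated identifiers, so for a file library the quality of the random source matters much more than the UUID format itself.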
In reply to this post by Sean P. DeNigris
On Wed, Aug 12, 2015 at 2:09 AM, Sean P. DeNigris <[hidden email]> wrote:
> Searching the archives, I found an interesting comment [1]:
> "I think the UUIDGenerator in the image produces UUIDs which are good
> enough for MC. " - Levente Uzonyi
>
> I am pretty well confused by UUIDs in general (they seem magical) and
> Pharo's implementation. The use case I have in mind is a file library which
> imports files into a single folder, but changes their names to something
> guaranteed to be unique so that they don't overwrite each other. Would UUIDs
> work in that case? Would the image ones be "good enough"? Are
> primitive-generated UUIDs guaranteed to always be unique if, say, I move the
> image to another OS and continue generating them with another VM? Thanks!
>
> [1] http://forum.world.st/UUID-and-Cog-tp2955687p2957172.html

Unless your requirements *specifically* need identical files to be maintained as duplicates, I would strongly consider using something content-based like MD5 or SHA. It is guaranteed to remain the same between OSes and has good-enough uniqueness. It's also compatible with external tools. Pharo seems to have an implementation.
https://en.wikipedia.org/wiki/Secure_Hash_Algorithm

Depending on the breadth of your audience, you may want to base it off HashFunction and allow user configuration of the algorithm. Selectivity between security and performance can be useful.

btw1, git uses SHA-1...
https://git-scm.com/book/en/v2/Git-Internals-Git-Objects

btw2, UUID actually uses MD5 and SHA, but on a smaller input than full file contents. Cross-platform use may(?) have to contend with there being several revisions of UUID (unless handling all past versions is implicit in all implementations), particularly with external tools.
https://en.wikipedia.org/wiki/Universally_unique_identifier

btw3, I love turtles all the way down, but given that crypto algorithms are CPU-bound and Pharo will be single-CPU for some time, it might be pragmatic to have the crypto primitives run in a thread on a separate CPU, and maybe take advantage of hardware acceleration. Call it SHAxExternal...
https://software.intel.com/en-us/articles/intel-sha-extensions

cheers -ben
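As a rough illustration of the content-based naming Ben suggests, here is a minimal sketch. It is written in Python rather than Pharo, the helper name `content_name` is hypothetical, and the default of SHA-256 is just an assumption for the example; the `algorithm` parameter stands in for the user-configurable algorithm he mentions.

```python
import hashlib
from pathlib import Path


def content_name(path: Path, algorithm: str = "sha256",
                 chunk_size: int = 1 << 20) -> str:
    """Return a file name derived from the file's bytes, keeping the
    original extension (lower-cased)."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as stream:
        # Read in chunks so large files are never loaded into memory at once.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() + path.suffix.lower()
```

Two files with identical bytes get the same name on any OS, so importing a duplicate simply maps onto the file that is already in the library. The cost is that every imported file has to be read in full once.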
> btw2, UUID actually uses MD5 and SHA, but on a smaller input than full file contents.

Pharo implements version 4, which uses purely random bits; not MD5/SHA/MAC.

There's an advantage to using UUIDs: if you have larger files, hashing them might take a considerable amount of CPU time and disk I/O.

But having it content-based is also an advantage, because the name can be created independently (and verifiably).

Peter
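The two statements about what a UUID "uses" fit together once the RFC 4122 versions are separated: versions 3 and 5 are name-based and hash a namespace plus a name with MD5 and SHA-1 respectively, while version 4 is random. The sketch below shows the difference with Python's standard `uuid` module; it is only an illustration and says nothing about Pharo's UUIDGenerator API, and the helper names are invented for this example.

```python
import uuid


def random_id() -> uuid.UUID:
    """Version 4: random bits, a new value on every call."""
    return uuid.uuid4()


def name_based_id(name: str, sha1: bool = True) -> uuid.UUID:
    """Version 5 (SHA-1) or version 3 (MD5): derived from a namespace
    plus a name, so the same name always gives the same UUID."""
    make = uuid.uuid5 if sha1 else uuid.uuid3
    return make(uuid.NAMESPACE_URL, name)


if __name__ == "__main__":
    print(random_id().version)                          # 4
    print(name_based_id("file.txt").version)            # 5
    print(name_based_id("file.txt", sha1=False).version)  # 3
```

Note that the name-based versions hash a short name, not the file contents, so they do not by themselves give the content verification discussed in this thread.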
In reply to this post by Ben Coman
Interesting... There should not be any duplicates. What's the advantage over UUID?
Cheers,
Sean
On Wed, Aug 12, 2015 at 6:38 PM, Sean P. DeNigris <[hidden email]> wrote:
> Ben Coman wrote
>> Unless your requirements *specifically* need identical files to be
>> maintained as duplicates, I would strongly consider using something
>> content based like MD5 or SHA
>
> Interesting... There should not be any duplicates. What's the advantage over
> UUID?

This... "move to another OS and continue generating [file-ids] with another VM?"

A SHA hash intrinsically represents THE content (it is *always* the same no matter who/where/how it's calculated), whereas a UUID is a randomly generated label assigned to the content.

Also, depending on the use case, it may facilitate:
* easy verification of whether the file contents have changed.
* periodic checking of backups/restores for file corruption (including by external tools, without reference to indexes maintained by your application image).
* revision control: if you come across a file whose filename doesn't match its contents' SHA hash, then you know its ancestor content by its current file-id.

And for my own use case "some day"... I now have many duplicate files scattered amongst many ad hoc backups. For example, over ten years there were several cycles of upgrading to a new PC where the quick, safe path taken was to copy the old PC's hard drive to a subfolder on the new PC's hard drive, while the old hard drive went into a box, now with a dozen friends. Plus there is duplication across many old, small backup media (floppy, ZIP media, tape) that it would help to consolidate onto several of today's large media.

cheers -ben
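The verification and de-duplication uses Ben lists could look roughly like the following. This is a Python sketch for illustration only, assuming files are named by the SHA-256 hex digest of their contents plus an extension; the function names are hypothetical and nothing here reflects an existing Pharo or Nabble-thread implementation.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file's contents, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_intact(path: Path) -> bool:
    """True when the name (without extension) still matches the contents."""
    return path.stem == sha256_of(path)


def find_duplicates(root: Path) -> Dict[str, List[Path]]:
    """Group all files under `root` by content hash and keep only the
    groups that contain more than one file."""
    groups: Dict[str, List[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[sha256_of(path)].append(path)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}
```

`is_intact` covers the "has this file changed or been corrupted" check without any index kept by the application, and `find_duplicates` is the kind of pass that would help when consolidating scattered backups.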