Tabular Data Structures for Data Analysis - Text Data Grooming
Hi Oleks,
This is probably out of scope (check with your mentors), but I was curious how text processing and inter-table lookups might fit into you project?
Check out the Excel side-story at [1] (starting just above the picture of the red tree) "Everybody thought of Excel as a financial modeling application [...] for creating calculation models with formulas and stuff. ... [but] most people just used Excel to make lists". This is exactly my personal experience.
The greatest practical value I've had from Excel is massaging text lists. Some use cases:
A. getting a bunch of messy hand typed data that needs cleaning up to be consistent before deriving statistics from it. There is no sense programming the whole workflow since its a highly iterative process to work through issues as you discover them. Issues are on-shot, but you want to apply it to 1000 rows.
B. tracking the status of 1000s of drawings as they progress through a workflow. Tagging who is working on each drawing, status and using AutoFilter to print out "Drawings To Go" lists.
C. managing a photo shoot for a school ball, recording orders and generating filenames to copy originals to a USB stick to take to the photo lab, such that duplicate orders each have a single file with the filename printed on the back making it easier to sort into envelopes for each order. (This is from last week so I can provide a sample Excel file for it)
For a detailed example, sometimes I might need to compare the basenames of the files in two folders:
1. In both folders A & B, from the command line I'd do `ls -l > ls.txt`
2. I'd load both text files into Excel, one per tab, where each line appeared in a single column.
3. I split on spaces to convert single column into multiple columns [2]
4. I'd split off the filename extension. So I'd need to reverse the string to determine the location of the last period and store that in a temporary column. Then two more columns hold the formulas extracting the first and second parts of the split using LEFT() and RIGHT().
5. Then in next column I'd have a formula using VLOOKUP() [4] to see if a row's basename appears in any row of the other tab.
6. I'd then do conditional formatting highlighting the row green if the VLOOKUP() was found.
"Text Data Grooming" would be a popular horizonal application demonstrating Pharo's liveness (indeed Excel can be considered a "live" environment., saving its "image" to a xls file). I'd very much love to use Pharo for this instead of Excel.