Tabular Data Structures for Data Analysis - Text Data Grooming

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

Tabular Data Structures for Data Analysis - Text Data Grooming

Ben Coman
Hi Oleks,

This is probably out of scope (check with your mentors), but I was curious how text processing and inter-table lookups might fit into you project?

Check out the Excel side-story at [1] (starting just above the picture of the red tree)  "Everybody thought of Excel as a financial modeling application [...] for creating calculation models with formulas and stuff. ... [but] most people just used Excel to make lists".  This is exactly my personal experience.

The greatest practical value I've had from Excel is massaging text lists.  Some use cases:

A. getting a bunch of messy hand typed data that needs cleaning up to be consistent before deriving statistics from it.  There is no sense programming the whole workflow since its a highly iterative process to work through issues as you discover them.  Issues are on-shot, but you want to apply it to 1000 rows.

B. tracking the status of 1000s of drawings as they progress through a workflow.  Tagging who is working on each drawing, status and using AutoFilter to print out "Drawings To Go" lists.
 
C. managing a photo shoot for a school ball, recording orders and generating filenames to copy originals to a USB stick to take to the photo lab, such that duplicate orders each have a single file with the filename printed on the back making it easier to sort into envelopes for each order. (This is from last week so I can provide a sample Excel file for it) 


For a detailed example, sometimes I might need to compare the basenames of the files in two folders:

1. In both folders A & B, from the command line I'd do `ls -l > ls.txt`

2. I'd load both text files into Excel, one per tab, where each line appeared in a single column. 

3. I split on spaces to convert single column into multiple columns [2]

4. I'd split off the filename extension.  So I'd need to reverse the string to determine the location of the last period and store that in a temporary column.  Then two more columns hold the formulas extracting the first and second parts of the split using LEFT() and RIGHT(). 

5. Then in next column I'd have a formula using VLOOKUP() [4] to see if a row's basename appears in any row of the other tab.

6. I'd then do conditional formatting highlighting the row green if the VLOOKUP() was found.


"Text Data Grooming" would be a popular horizonal application demonstrating Pharo's liveness (indeed Excel can be considered a "live" environment., saving its "image" to a xls file).  I'd very much love to use Pharo for this instead of Excel.

cheers -ben

[3] https://superuser.com/questions/121618/any-excel-function-that-will-reverse-a-string