Smalltalk › Pharo › Pharo Smalltalk Users

Tabular Data Structures for Data Analysis - Text Data Grooming

_‹ Previous Topic Next Topic _›

Classic

List

Threaded

1 message

Ben Coman

Tabular Data Structures for Data Analysis - Text Data Grooming

Hi Oleks,

This is probably out of scope (check with your mentors), but I was curious how text processing and inter-table lookups might fit into you project?

Check out the Excel side-story at [1] (starting just above the picture of the red tree) "Everybody thought of Excel as a financial modeling application [...] for creating calculation models with formulas and stuff. ... [but] most people just used Excel to make lists". This is exactly my personal experience.

The greatest practical value I've had from Excel is massaging text lists. Some use cases:

A. getting a bunch of messy hand typed data that needs cleaning up to be consistent before deriving statistics from it. There is no sense programming the whole workflow since its a highly iterative process to work through issues as you discover them. Issues are on-shot, but you want to apply it to 1000 rows.

B. tracking the status of 1000s of drawings as they progress through a workflow. Tagging who is working on each drawing, status and using AutoFilter to print out "Drawings To Go" lists.

C. managing a photo shoot for a school ball, recording orders and generating filenames to copy originals to a USB stick to take to the photo lab, such that duplicate orders each have a single file with the filename printed on the back making it easier to sort into envelopes for each order. (This is from last week so I can provide a sample Excel file for it)

For a detailed example, sometimes I might need to compare the basenames of the files in two folders:

1. In both folders A & B, from the command line I'd do `ls -l > ls.txt`

2. I'd load both text files into Excel, one per tab, where each line appeared in a single column.

3. I split on spaces to convert single column into multiple columns [2]

4. I'd split off the filename extension. So I'd need to reverse the string to determine the location of the last period and store that in a temporary column. Then two more columns hold the formulas extracting the first and second parts of the split using LEFT() and RIGHT().

5. Then in next column I'd have a formula using VLOOKUP() [4] to see if a row's basename appears in any row of the other tab.

6. I'd then do conditional formatting highlighting the row green if the VLOOKUP() was found.

"Text Data Grooming" would be a popular horizonal application demonstrating Pharo's liveness (indeed Excel can be considered a "live" environment., saving its "image" to a xls file). I'd very much love to use Pharo for this instead of Excel.

cheers -ben

[1] https://www.joelonsoftware.com/2012/01/06/how-trello-is-different/

[2] https://support.office.com/en-us/article/Split-text-into-different-columns-with-the-Convert-Text-to-Columns-Wizard-30b14928-5550-41f5-97ca-7a3e9c363ed7

[3] https://superuser.com/questions/121618/any-excel-function-that-will-reverse-a-string

[4] https://support.office.com/en-us/article/VLOOKUP-function-0bbc8083-26fe-4963-8ab8-93a18ad188a1