Initial commit: illustrative R data pipeline
README.md (new file, 70 lines)

# powershell_example

This example demonstrates core programming principles that apply regardless of
language — Excel, PowerShell, or R:

- **One job per script** — each script does exactly one thing
- **Configuration over hardcoding** — constants like exchange rates live in `.env`, not buried in code
- **Immutable inputs** — raw data is never modified; the pipeline can always be rerun from scratch
- **Fail fast** — validation runs early and stops the pipeline with a clear message before bad data spreads
- **Separation of concerns** — scripts don't know or care what runs before or after them
- **Orchestration** — a single caller (`main.sh`) owns the sequence and can be scheduled via cron
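
The fail-fast and orchestration principles above can be sketched in a few lines of shell. This is a minimal illustration, not the repo's actual `main.sh`: `run_pipeline` and `RUNNER` are hypothetical names, and the sketch assumes each numbered step is run with `Rscript` and signals failure through a non-zero exit code.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a pipeline caller; not the repo's actual main.sh.
set -eu

# RUNNER is an assumed name: the command that executes one step.
RUNNER="${RUNNER:-Rscript}"

# run_pipeline DIR: run the numbered scripts in DIR in order,
# stopping at the first step that exits with a non-zero status.
run_pipeline() {
  local scripts_dir="$1"
  local step name
  # Two-digit prefixes make the glob's sort order match the run order.
  for step in "$scripts_dir"/[0-9][0-9]_*.R; do
    [ -e "$step" ] || continue              # the glob matched nothing
    name=$(basename "$step")
    # 00_paths.R is sourced by the other scripts, so it is not a step itself.
    case "$name" in 00_*) continue ;; esac
    echo "Running $name"
    if ! "$RUNNER" "$step"; then
      echo "FAILED at $name, stopping" >&2
      return 1
    fi
  done
  echo "Pipeline finished"
}
```

Under these assumptions, `main.sh` would end with a single call such as `run_pipeline "$(dirname "$0")/scripts"`. The individual scripts never reference each other; only the caller knows the order.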

## Project structure

```
powershell_example/
├── .env                          ← exchange rate and future config
├── main.sh                       ← pipeline caller, runs all steps in order
├── data/
│   ├── raw/                      ← original source, never modified
│   ├── interim/                  ← transformed working files (steps 03–06)
│   ├── processed/                ← calculated output (step 07)
│   └── formatted/                ← presentation-ready, rounded (step 08)
└── scripts/
    ├── 00_paths.R                ← paths + config, sourced by all scripts
    ├── 01_create_data.R          ← creates wide CSVs → raw/
    ├── 02_validate.R             ← checks column counts, stops on failure
    ├── 03_convert_currency.R     ← EUR to USD, stays wide → interim/
    ├── 04_pivot_income.R         ← wide to long → interim/
    ├── 05_convert_units.R        ← thousands to persons, pivot pop to long → interim/
    ├── 06_merge.R                ← join income + population → interim/
    ├── 07_calc.R                 ← income per person → processed/
    └── 08_format.R               ← round to 2 decimals → formatted/
```
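
One way a caller like `main.sh` could pick up the values in `.env` is to export them before any step runs, so the scripts read configuration from the environment instead of hardcoding it. This is a sketch under assumptions: it presumes `.env` holds plain `KEY=value` lines, `load_env` is a hypothetical helper, and `EUR_TO_USD` in the note below is an invented key name. The repo itself may load `.env` differently, for example inside `00_paths.R`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: export every KEY=value line of a .env file.
# Assumes the file contains only shell-compatible assignments and comments.
load_env() {
  local env_file="$1"
  if [ ! -f "$env_file" ]; then
    echo "Config file not found: $env_file" >&2
    return 1
  fi
  set -a            # export every variable assigned from here on
  . "$env_file"     # run the assignments in the current shell
  set +a            # stop auto-exporting
}
```

After `load_env ./.env`, a child process started by the caller inherits the values, so an R script could read a key with `Sys.getenv("EUR_TO_USD")`. The `./` prefix matters because the `.` builtin may search `PATH` for a bare file name.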

## A note on what to commit

This repo commits everything for illustration purposes. In a real project you
would typically exclude:

- **`.env`** — may contain API keys, credentials, or proprietary constants
- **`data/`** — raw and processed data files are often too large for git and
  may contain proprietary or personally identifiable information

Both would normally be listed in `.gitignore`.
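
As a sketch of how those two entries could be added, the helper below appends a pattern to `.gitignore` only when it is not already listed. `add_ignore` is a hypothetical name, not part of the repo.

```bash
#!/usr/bin/env bash
# Hypothetical helper: add PATTERN to an ignore file unless it is already there.
add_ignore() {
  local pattern="$1"
  local file="${2:-.gitignore}"
  touch "$file"
  # -x matches the whole line, -F treats the pattern as a literal string.
  if ! grep -qxF -e "$pattern" "$file"; then
    printf '%s\n' "$pattern" >> "$file"
  fi
}
```

Calling `add_ignore .env` and `add_ignore data/` from the repo root would produce the two entries. Files that Git already tracks stay tracked after being ignored; `git rm --cached` removes them from the index without deleting them from disk.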

## Usage

```bash
bash /data/projects/r/powershell_example/main.sh
```

## Scheduling with cron

Cron is the Linux/macOS equivalent of **Windows Task Scheduler** — it runs a
program automatically on a schedule with no human intervention.

To run the pipeline automatically every Monday at 8am, add this line to your
crontab (`crontab -e`). Because the entry calls `main.sh` directly rather than
through `bash`, the script must be executable (`chmod +x main.sh`):

```
0 8 * * 1 /data/projects/r/powershell_example/main.sh >> /tmp/pipeline.log 2>&1
```
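
The `>> /tmp/pipeline.log 2>&1` part of that entry appends both normal output and errors to one file. A small wrapper can make that log easier to read by marking where each run starts and how it ended. This is an illustrative sketch: `run_logged` is a hypothetical helper, not something the repo provides.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command, appending its output to LOG_FILE
# between a start marker and an end marker that records the exit code.
run_logged() {
  local log_file="$1"
  shift
  local status=0
  {
    echo "=== run started $(date '+%Y-%m-%d %H:%M:%S') ==="
    "$@" || status=$?          # keep the exit code instead of aborting
    echo "=== run ended with exit code $status ==="
  } >> "$log_file" 2>&1
  return "$status"
}
```

A wrapper script scheduled by cron could then call `run_logged /tmp/pipeline.log /data/projects/r/powershell_example/main.sh`, which keeps the redirection out of the crontab line and preserves the pipeline's exit code.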

**A note on corporate environments:** IT departments are often protective of
who can schedule automated jobs on shared servers — and for good reason. Silent
background processes can consume resources, touch shared databases, or trigger
emails without anyone knowing they exist. On your own machine, Task Scheduler
is fair game. On a company server, the right move is to document what the job
does, show IT, and ask them to schedule it officially. That conversation also
creates a paper trail, which matters in regulated industries.