#+title: Lesson 04 | Data Wrangling
#+OPTIONS: H:6
* Links
#+attr_html: :class links
- [[../toc.org][TOC | Missing Semester]]
- [[https://www.youtube.com/playlist?list=PLyzOVJj3bHQuloKGG59rS43e29ro7I57J][Playlist: Missing Semester]]
- [[https://missing.csail.mit.edu/2020/data-wrangling/][class notes]]
- Curr: https://youtu.be/sz_dsktIjt4?si=XopbHGTFXY-I6Bkh&t=2577
*** timestamps
:PROPERTIES:
:CUSTOM_ID: timestamp
:END:
#+attr_html: :class playlist
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=4s][00:00 - introduction]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=415s][06:55 - Stream Editor]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=456s][07:36 - Replacement Expressions]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=538s][08:58 - Regular Expression]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=560s][09:20 - Regular Expressions]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=620s][10:20 - Square Brackets]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=693s][11:33 - Add Modifiers]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=776s][12:56 - Alternations]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=1029s][17:09 - Anchoring the Regular Expression]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=1138s][18:58 - Capture Groups]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=1215s][20:15 - Regular Expression Debugger]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=1450s][24:10 - Regular Sessions]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=1561s][26:01 - Match an Email Address]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=1743s][29:03 - Sort]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=2040s][34:00 - Awk]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=2324s][38:44 - Berkeley Calculator]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=2437s][40:37 - Compute Statistics over Inputs]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=2482s][41:22 - Summary Statistics]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=2513s][41:53 - Plotting]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=2570s][42:50 - Two sort of special types]] *current*
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=2754s][45:54 - example where data wrangling is useful]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=2805s][46:45 - image captures to standard output]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=2846s][47:26 - operate on standard input]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=2880s][48:00 - display in an image display]]
* notes
** intro example
- ssh someserver 'somecommand' runs 'somecommand' on the server and sends its output back to our machine
- a whole pipeline can run on the server, so only the filtered result crosses the network instead of the full log
#+BEGIN_SRC bash
ssh someserver 'journalctl | grep sshd | grep "Disconnected from"' | less
#+END_SRC
- this runs journalctl on the server and keeps only the lines that contain 'sshd' and 'Disconnected from'
- only those lines are sent back to our machine, where we page through them with 'less'
** SED
- stream editor
- allows you to make changes to the contents of a stream
- a full programming language in its own right
- common task is to run replacement expressions on an input stream
*** example
#+BEGIN_SRC bash
sed 's/.*blahblah blah//'
#+END_SRC
- the search part of 's/SEARCH/REPLACE/' is a regular expression
- regular expressions are a way of describing the text to match
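A minimal sketch of a replacement expression, assuming a made-up log line (the host and user names are hypothetical, not from the lecture). Because '.*' is greedy, everything up to and including the last 'Disconnected from ' is replaced with nothing:

#+BEGIN_SRC bash
# Hypothetical sample line, shaped loosely like an sshd log entry.
line='host sshd: Disconnected from user alice'

# s/SEARCH/REPLACE/ with an empty REPLACE deletes whatever SEARCH matched.
trimmed=$(echo "$line" | sed 's/.*Disconnected from //')

echo "$trimmed"   # prints: user alice
#+END_SRC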
*** sed modifiers
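- (ab)* - matches zero or more repetitions of 'ab'; with an empty replacement those runs are removed
- -E - use extended (modern) regular expression syntax, so '(', ')' and '|' work without backslashes
- (ab|bc)* - matches zero or more repetitions of either 'ab' or 'bc'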
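The patterns above can be tried on short strings. This is a small sketch; the input strings are arbitrary test data, and the trailing 'g' makes sed replace every match on the line rather than only the first:

#+BEGIN_SRC bash
# Remove every run of 'ab'; the lone 'c' characters are left over.
only_ab=$(echo 'abcababc' | sed -E 's/(ab)*//g')
echo "$only_ab"      # prints: cc

# Same pattern without -E: the parentheses need backslashes.
escaped=$(echo 'abcababc' | sed 's/\(ab\)*//g')
echo "$escaped"      # prints: cc

# Alternation: remove runs made of 'ab' or 'bc'.
either=$(echo 'abcabbc' | sed -E 's/(ab|bc)*//g')
echo "$either"       # prints: c
#+END_SRC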
** regex debugger
- regex101.com
** sort
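- can sort by a chosen column with '-k' (for example '-k1,1' for the first whitespace-separated column)
- sorts in ascending order by default; '-r' reverses the order and '-n' compares numerically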
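A short sketch of sorting by column, assuming three made-up rows of the form 'count name'. The flags used here are '-k' (key column), '-n' (numeric comparison) and '-r' (reverse):

#+BEGIN_SRC bash
# Hypothetical two-column input: a number and a name.
rows=$(printf '3 pear\n1 fig\n2 apple\n')

# Ascending numeric sort on the first column.
by_number=$(printf '%s\n' "$rows" | sort -n -k1,1)

# Alphabetical sort on the second column.
by_name=$(printf '%s\n' "$rows" | sort -k2,2)

# Descending numeric sort on the first column.
by_number_desc=$(printf '%s\n' "$rows" | sort -rn -k1,1)

printf '%s\n' "$by_number"
#+END_SRC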
** awk
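- a programming language in its own right
- built for column-oriented text: each line is split into fields '$1', '$2', and so on
- a rule can carry a pattern, and its action runs only on the lines that match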
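A minimal sketch of both ideas, assuming made-up 'name count' rows. The first command prints one column; the second and third run their action only on lines where the pattern holds:

#+BEGIN_SRC bash
# Hypothetical input: a name and a count per line.
rows=$(printf 'alice 3\nbob 1\namy 2\n')

# Print only the second column of every line.
counts=$(printf '%s\n' "$rows" | awk '{ print $2 }')

# Pattern on a field: names that start with 'a'.
a_names=$(printf '%s\n' "$rows" | awk '$1 ~ /^a/ { print $1 }')

# Numeric condition: names whose count is greater than 1.
busy=$(printf '%s\n' "$rows" | awk '$2 > 1 { print $1 }')

printf '%s\n' "$a_names"
#+END_SRC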
** paste
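- joins lines of input together, side by side from several files or into one line
- '-s' :: join all lines of the input into a single line
- '-d' :: delimiter to put between the joined pieces (a tab by default)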
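A small sketch of '-s' and '-d' together, using arbitrary input lines. The trailing '-' tells paste to read standard input, which some implementations require explicitly:

#+BEGIN_SRC bash
# Join three lines into one, separated by '+'.
plus=$(printf '1\n2\n3\n' | paste -sd+ -)
echo "$plus"    # prints: 1+2+3

# Same idea with a comma as the delimiter.
comma=$(printf 'a\nb\nc\n' | paste -sd, -)
echo "$comma"   # prints: a,b,c
#+END_SRC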
** Berkeley calculator (bc)
- 'bc' is a calculator that reads expressions from stdin and prints the results
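A sketch of summing a column of numbers by building an expression with paste and handing it to bc. The numbers are arbitrary, and since bc is not installed on every system the sketch checks for it first:

#+BEGIN_SRC bash
if command -v bc >/dev/null 2>&1; then
  # paste turns the lines into '1+2+3'; bc evaluates that expression.
  total=$(printf '1\n2\n3\n' | paste -sd+ - | bc -l)
  echo "$total"   # prints: 6
else
  echo 'bc is not installed' >&2
fi
#+END_SRC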
** compute statistics
- R language is built for statistical analysis
** gnuplot
- plotter
- takes from stdin
** xargs
- takes lines of input from stdin and passes them as arguments to a command
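A minimal sketch with 'echo' standing in for a real command, so the effect is visible. By default xargs packs as many arguments as it can into one invocation; '-n 1' runs the command once per argument:

#+BEGIN_SRC bash
# Three input lines become three arguments to a single echo.
joined=$(printf 'a\nb\nc\n' | xargs echo)
echo "$joined"      # prints: a b c

# One invocation per input line; 'item' is just a fixed first argument.
per_line=$(printf 'a\nb\n' | xargs -n 1 echo item)
printf '%s\n' "$per_line"
#+END_SRC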