#+title: Lesson 04 | Data Wrangling
#+HTML_HEAD: <link rel="stylesheet" type="text/css" href="../_share/media/css/missing-semester.css" />
#+HTML_HEAD: <link rel="stylesheet" type="text/css" href="../_share/media/css/org-media-sass/collapsible.css" />
#+HTML_HEAD: <script src="../_share/media/js/collapsible.js"></script>
#+OPTIONS: H:6

* Links
#+attr_html: :class links
- [[../toc.org][TOC | Missing Semester]]
- [[https://www.youtube.com/playlist?list=PLyzOVJj3bHQuloKGG59rS43e29ro7I57J][Playlist: Missing Semester]]
- [[https://missing.csail.mit.edu/2020/data-wrangling/][class notes]]

- Curr: https://youtu.be/sz_dsktIjt4?si=XopbHGTFXY-I6Bkh&t=2577

*** timestamps
:PROPERTIES:
:CUSTOM_ID: timestamp
:END:

#+attr_html: :class playlist
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=4s][00:00 - introduction]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=415s][06:55 - Stream Editor]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=456s][07:36 - Replacement Expressions]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=538s][08:58 - Regular Expression]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=560s][09:20 - Regular Expressions]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=620s][10:20 - Square Brackets]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=693s][11:33 - Add Modifiers]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=776s][12:56 - Alternations]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=1029s][17:09 - Anchoring the Regular Expression]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=1138s][18:58 - Capture Groups]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=1215s][20:15 - Regular Expression Debugger]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=1450s][24:10 - Regular Sessions]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=1561s][26:01 - Match an Email Address]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=1743s][29:03 - Sort]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=2040s][34:00 - Awk]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=2324s][38:44 - Berkeley Calculator]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=2437s][40:37 - Compute Statistics over Inputs]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=2482s][41:22 - Summary Statistics]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=2513s][41:53 - Plotting]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=2570s][42:50 - Two sort of special types]] *current*
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=2754s][45:54 - example where data wrangling is useful]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=2805s][46:45 - image captures to standard output]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=2846s][47:26 - operate on standard input]]
+ [[https://www.youtube.com/watch?v=sz_dsktIjt4&t=2880s][48:00 - display in an image display]]

* notes
** intro example
- ssh someserver 'somecommand' runs that command on the server
- you can run a whole pipeline on the server and send back only the filtered output, instead of channeling all the raw data back

#+BEGIN_SRC bash
ssh someserver 'journalctl | grep sshd | grep "Disconnected from"' | less
#+END_SRC

- this runs journalctl on the server and keeps only the lines that contain both 'sshd' and 'Disconnected from'
- the matching lines are then sent back to our machine, where we pipe them through 'less'

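A local sketch of the same filter, so the grep chain can be tried without a server. The sample_log function and its log lines (host name, PIDs, users, IP addresses) are made up for illustration and only stand in for real journalctl output.

#+BEGIN_SRC bash
# Hypothetical sample lines standing in for `journalctl` output.
sample_log() {
  printf '%s\n' \
    'Jan 17 03:13:00 host sshd[2631]: Disconnected from invalid user admin 192.0.2.10 port 55920 [preauth]' \
    'Jan 17 03:13:05 host cron[812]: job started' \
    'Jan 17 03:14:10 host sshd[2640]: Accepted publickey for alice from 192.0.2.11 port 40022' \
    'Jan 17 03:15:42 host sshd[2655]: Disconnected from user alice 192.0.2.11 port 40022'
}

# Keep only sshd lines, then only the disconnect events among them.
sample_log | grep sshd | grep "Disconnected from"
#+END_SRC

Only the first and last sample lines survive both filters; the cron line fails the first grep and the 'Accepted publickey' line fails the second.
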
** SED
- stream editor
- allows you to make changes to the contents of a stream
- full programming language
- common task is to run replacement expressions on an input stream

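*** example
#+BEGIN_SRC bash
sed 's/.*blahblah blah//'
#+END_SRC

- 's/REGEX/REPLACEMENT/' replaces the first match of REGEX on each line; here everything up to and including 'blahblah blah' is replaced with nothing
- uses regular expressions
- way of matching text
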
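*** sed modifiers
- '(ab)*' :: matches zero or more repetitions of 'ab'
- '-E' :: use extended (modern) regular expression syntax, so '(', ')' and '|' work without backslashes
- '(ab|bc)*' :: matches zero or more repetitions of 'ab' or 'bc'
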
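The modifiers above can be tried on short strings. This is a minimal sketch that assumes a sed supporting '-E' (GNU and BSD sed both do); the log line in the last command is made up for illustration.

#+BEGIN_SRC bash
# -E enables extended regular expressions, so ( ) | need no backslashes.
# The trailing g applies the substitution to every match on the line.
echo 'abcaba'   | sed -E 's/(ab)*//g'      # prints: ca
echo 'abcababc' | sed -E 's/(ab|bc)*//g'   # prints: cc

# Capture group: \2 refers to the second parenthesised group, so only
# the user name is kept from this (hypothetical) log line.
echo 'sshd[2631]: Disconnected from invalid user admin 192.0.2.10 port 55920 [preauth]' |
  sed -E 's/.*Disconnected from (invalid |authenticating )?user (.*) [^ ]+ port [0-9]+( \[preauth\])?$/\2/'
# prints: admin
#+END_SRC
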
** regex debugger
- regex101.com

** sort
- can sort by column ('-k')
- sorts ascending by default ('-r' reverses the order)
- compares lines as text unless told otherwise ('-n' sorts numerically)

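A short sketch of column sorting. The counts function and its two-column data (a count and a user name) are made up for illustration.

#+BEGIN_SRC bash
# Hypothetical two-column data: a count and a user name.
counts() { printf '%s\n' '3 root' '10 guest' '2 admin'; }

counts | sort              # default: ascending, compared as text, so '10 guest' comes first
counts | sort -n -k1,1     # numeric sort on column 1: 2, 3, 10
counts | sort -k2,2        # sort on column 2 only: admin, guest, root
counts | sort -rn -k1,1    # -r reverses the order: 10, 3, 2
#+END_SRC

Writing the key as '-k1,1' rather than '-k1' limits the comparison to that single column instead of running from column 1 to the end of the line.
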
** awk
- programming language
- focused on columnar data; $1, $2, ... refer to the columns of the current line
- can match by pattern: an action runs only on the lines that match its pattern

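A minimal sketch of pattern matching on columns. The users function and its data are made up for illustration; column 1 is a count and column 2 is a user name.

#+BEGIN_SRC bash
# Hypothetical input: column 1 is a count, column 2 is a user name.
users() { printf '%s\n' '1 charlie' '3 root' '1 claire' '1 bob' '2 chase'; }

# pattern { action }: print column 2 where column 1 equals 1 and
# column 2 starts with 'c' and ends with 'e'.
users | awk '$1 == 1 && $2 ~ /^c.*e$/ { print $2 }'
# prints: charlie and claire, one per line

# BEGIN runs before the first line and END after the last one; the
# middle rule counts the rows whose first column is 1.
users | awk 'BEGIN { rows = 0 } $1 == 1 { rows += 1 } END { print rows }'
# prints: 3
#+END_SRC
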
** paste
- joins lines of input together
- '-s' :: join all lines into a single line
- '-d' :: delimiter to put between the joined lines (tab by default)

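A small sketch of the two flags on made-up input. The trailing '-' tells paste to read stdin explicitly, which some implementations require.

#+BEGIN_SRC bash
# -s joins all input lines into one line; -d sets the delimiter placed
# between them. The trailing - means "read from stdin".
printf '%s\n' 1 2 3 | paste -s -d, -    # prints: 1,2,3
printf '%s\n' 1 2 3 | paste -sd+ -      # prints: 1+2+3
#+END_SRC
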
** berkeley calculator
- 'bc', a calculator that reads expressions from stdin

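A minimal sketch, assuming 'bc' is installed. The second command builds the expression with paste and lets bc evaluate it, which is one way to sum a column of numbers.

#+BEGIN_SRC bash
# bc evaluates each expression it reads on stdin.
echo '2 * (3 + 4)' | bc                    # prints: 14

# Join the numbers with + to form the expression 1+2+3, then evaluate it.
printf '%s\n' 1 2 3 | paste -sd+ - | bc    # prints: 6
#+END_SRC
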
** compute statistics
- R is a language built for statistical analysis

** gnuplot
- plotter
- takes from stdin

** xargs
- takes items from stdin (split on whitespace by default) and passes them as arguments to a command
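
A short sketch on made-up input, using echo as the command so the resulting argument list is visible.

#+BEGIN_SRC bash
# The three input lines become three arguments of a single command,
# so this runs: echo a b c
printf '%s\n' a b c | xargs echo             # prints: a b c

# -n 1 passes one item per invocation, so echo runs three times.
printf '%s\n' a b c | xargs -n 1 echo item   # prints: item a, item b, item c (one per line)
#+END_SRC
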