Commit b496826

MB Diff with simple batch CLI interface
1 parent 8af4470 commit b496826

1 file changed (+16, -6 lines changed)

README.md

Lines changed: 16 additions & 6 deletions
@@ -1,4 +1,4 @@
-# Macrobase Diff minimal implementation (WORK IN PROGRESS)
+# Macrobase Diff minimal implementation
 This is a minimal implementation of an idea from [DIFF: A Relational Interface for Large-Scale Data Explanation, F. Abuzaid et al., 2018](https://cs.stanford.edu/~matei/papers/2019/vldb_macrobase_diff.pdf).
 
 In short: given a table of numerical and categorical data and a query dividing the table into two groups (outliers/inliers), return the attributes (categorical values) that are more common among the outliers (so-called explanations).
@@ -31,10 +31,20 @@ Outliers:
 0 99.8 B A
 8 109.0 B B
 Explanations
-8.0 {'cat_col1': 'B', 'cat_col2': 'B'}
-3.5 {'cat_col1': 'B', 'cat_col2': 'A'}
-3.5 {'cat_col2': 'B'}
-0.2857142857142857 {'cat_col2': 'A'}
+      score  cat_col1    cat_col2
+--  -------  ----------  ----------
+ 0      8    B           B
+ 1      3.5  -           B
+ 2      3.5  B           A
+Attribute combinations below thresholds
+    cat_col1
+--  ----------
+ 0  B
 ```
 
-Please mind that this is still very much work in progress ..
+## Further Work
+The original Macrobase Diff provides more contributions:
+- Streaming implementation
+- SQL-like REPL interface (to showcase how it could be implemented within an SQL client)
+- Plenty of optimizations
+All of the above are worthwhile for follow-up work in this project.
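
The README's one-line summary (return the attribute values that are more common among the outliers) can be made concrete with a small sketch. This is not the repository's code: it assumes the risk-ratio and support metrics described in the DIFF paper, and the function names (`risk_ratio`, `explain`), the thresholds, and the sample rows are all hypothetical. Only the column names are borrowed from the sample output shown in the diff.

```python
from itertools import combinations


def risk_ratio(a_out, a_in, b_out, b_in):
    """Outlier rate among matching rows divided by the rate among the rest.

    a_out/a_in: outliers/inliers that match the attribute combination.
    b_out/b_in: outliers/inliers that do not match it.
    """
    exposed = a_out + a_in
    unexposed = b_out + b_in
    if exposed == 0 or a_out == 0:
        return 0.0
    if b_out == 0:
        # Every outlier matches; this sketch reports that as infinity.
        return float("inf")
    # Same as (a_out / exposed) / (b_out / unexposed), cross-multiplied.
    return (a_out * unexposed) / (b_out * exposed)


def explain(rows, is_outlier, cat_cols,
            min_support=0.2, min_risk_ratio=2.0, max_order=2):
    """Return (score, attributes) pairs that pass both thresholds, best first."""
    outliers = [r for r in rows if is_outlier(r)]
    inliers = [r for r in rows if not is_outlier(r)]
    results = []
    for order in range(1, max_order + 1):
        for cols in combinations(cat_cols, order):
            # Only value combinations that occur among outliers are candidates.
            candidates = {tuple(r[c] for c in cols) for r in outliers}
            for values in sorted(candidates):
                def matches(row):
                    return all(row[c] == v for c, v in zip(cols, values))

                a_out = sum(1 for r in outliers if matches(r))
                a_in = sum(1 for r in inliers if matches(r))
                support = a_out / len(outliers)
                score = risk_ratio(a_out, a_in,
                                   len(outliers) - a_out, len(inliers) - a_in)
                if support >= min_support and score >= min_risk_ratio:
                    results.append((score, dict(zip(cols, values))))
    return sorted(results, key=lambda item: -item[0])


# Hypothetical sample data; the "query" marks rows with value > 90 as outliers.
rows = [
    {"value": 99.8, "cat_col1": "B", "cat_col2": "A"},
    {"value": 109.0, "cat_col1": "B", "cat_col2": "B"},
    {"value": 50.0, "cat_col1": "A", "cat_col2": "A"},
    {"value": 52.0, "cat_col1": "A", "cat_col2": "B"},
    {"value": 48.0, "cat_col1": "B", "cat_col2": "A"},
    {"value": 51.0, "cat_col1": "A", "cat_col2": "A"},
]

results = explain(rows, lambda r: r["value"] > 90, ["cat_col1", "cat_col2"])
for score, attributes in results:
    print(score, attributes)
```

The sketch enumerates every candidate combination exhaustively, which is fine for a handful of columns. The optimizations listed under Further Work are about avoiding exactly that kind of brute-force enumeration on large tables.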
