README.md: 16 additions & 6 deletions
@@ -1,4 +1,4 @@
-# Macrobase Diff minimal implementation (WORK IN PROGRESS)
+# Macrobase Diff minimal implementation
 This is a minimal implementation of an idea from [DIFF: A Relational Interface for Large-Scale Data Explanation, F. Abuzaid et al., 2018](https://cs.stanford.edu/~matei/papers/2019/vldb_macrobase_diff.pdf).
 
 In short: given a table of numerical and categorical data and a query dividing the table into two groups (outliers/inliers), return the attributes (categorical values) that are more common among the outliers (so-called explanations).
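As a rough illustration of the idea summarized above (attribute values that are more common among outliers than among inliers), the sketch below scores attribute combinations by risk ratio and keeps those above a threshold. This rests on assumptions: the README excerpt does not say which metric produces its scores, and risk ratio is only one of the metrics discussed in the DIFF paper. The names `risk_ratio`, `explain`, `min_ratio`, and the toy rows are hypothetical and are not taken from this repository.

```python
from itertools import combinations


def risk_ratio(rows, is_outlier, attrs):
    """Outlier rate among rows matching `attrs`, divided by the rate among the rest."""
    a_out = a_in = b_out = b_in = 0
    for row, outlier in zip(rows, is_outlier):
        if all(row[col] == val for col, val in attrs.items()):
            a_out += outlier
            a_in += not outlier
        else:
            b_out += outlier
            b_in += not outlier
    exposed, unexposed = a_out + a_in, b_out + b_in
    if exposed == 0 or b_out == 0:
        return None  # ratio is undefined; this sketch simply skips such cases
    # Same as (a_out / exposed) / (b_out / unexposed), with a single division.
    return (a_out * unexposed) / (exposed * b_out)


def explain(rows, is_outlier, columns, min_ratio=2.0):
    """Score every attribute combination seen among outliers (hypothetical helper)."""
    candidates = set()
    for row, outlier in zip(rows, is_outlier):
        if outlier:
            for k in range(1, len(columns) + 1):
                for cols in combinations(columns, k):
                    candidates.add(tuple((c, row[c]) for c in cols))
    scored = []
    for cand in candidates:
        score = risk_ratio(rows, is_outlier, dict(cand))
        if score is not None and score >= min_ratio:
            scored.append((score, dict(cand)))
    # Highest score first; shorter combinations first on ties.
    return sorted(scored, key=lambda s: (-s[0], len(s[1])))


# Toy data, invented for this example only.
rows = [
    {"cat_col1": "B", "cat_col2": "B"},
    {"cat_col1": "B", "cat_col2": "B"},
    {"cat_col1": "B", "cat_col2": "A"},
    {"cat_col1": "B", "cat_col2": "A"},
    {"cat_col1": "A", "cat_col2": "A"},
    {"cat_col1": "A", "cat_col2": "A"},
    {"cat_col1": "A", "cat_col2": "B"},
    {"cat_col1": "A", "cat_col2": "A"},
]
flags = [True, True, True, False, False, False, False, True]

for score, attrs in explain(rows, flags, ["cat_col1", "cat_col2"]):
    print(score, attrs)
```

The sketch enumerates every combination exhaustively, which is only workable for a handful of columns; it also omits the support threshold and the pruning that the paper relies on.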
@@ -31,10 +31,20 @@ Outliers:
 0 99.8 B A
 8 109.0 B B
 Explanations
-8.0 {'cat_col1': 'B', 'cat_col2': 'B'}
-3.5 {'cat_col1': 'B', 'cat_col2': 'A'}
-3.5 {'cat_col2': 'B'}
-0.2857142857142857 {'cat_col2': 'A'}
+      score  cat_col1    cat_col2
+--  -------  ----------  ----------
+ 0      8    B           B
+ 1      3.5  -           B
+ 2      3.5  B           A
+Attribute combinations below thresholds
+    cat_col1
+--  ----------
+ 0  B
 ```
 
-Please mind that this is still very much work in progress ..
+## Further Work
+The original Macrobase Diff provides more contributions:
+- Streaming implementation
+- SQL-like REPL interface (to showcase how it could be implemented within an SQL client)
+- Plenty of optimizations
+All of the above are worthwhile for follow-up work in this project.