The objective of this project was to implement the A-Priori algorithm to find the most frequent itemsets in the lists of conditions of a large set of patients, and then to derive associations between conditions by extracting rules of the forms (X) -> Y and (X, Y) -> Z. A second goal was to implement and apply locality-sensitive hashing (LSH) to identify similar news articles in a dataset.
This project was developed for the Mining Large Scale Datasets course at the University of Aveiro.
For each value of K (2 or 3), run the following command from inside the /src/ directory:
spark-submit conditions.py <K> ../data/conditions.csv
For a sample run, execute:
spark-submit conditions.py <K> ../data/conditions_truncated.csv
The results can be found inside the /results/ directory.
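To make the idea behind the conditions step concrete, here is a minimal sketch of A-Priori and of rule confidence in plain Python rather than Spark. It is not the code in conditions.py: the function names, the `min_support` threshold, and the toy baskets are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(baskets, k, min_support):
    """Return {itemset (sorted tuple): support count} for itemsets of size k.

    Hypothetical helper: `min_support` is an absolute count chosen for this
    sketch, not a value taken from the project.
    """
    baskets = [sorted(set(b)) for b in baskets]
    # Pass 1: count single items and keep the frequent ones.
    counts = Counter((item,) for basket in baskets for item in basket)
    frequent = {s: n for s, n in counts.items() if n >= min_support}
    for size in range(2, k + 1):
        counts = Counter()
        for basket in baskets:
            for candidate in combinations(basket, size):
                # Monotonicity: a candidate can only be frequent if every
                # (size - 1)-subset was frequent in the previous pass.
                if all(sub in frequent
                       for sub in combinations(candidate, size - 1)):
                    counts[candidate] += 1
        frequent = {s: n for s, n in counts.items() if n >= min_support}
    return frequent


def rule_confidence(baskets, antecedent, consequent):
    """Confidence of antecedent -> consequent: support(X u Y) / support(X)."""
    antecedent = set(antecedent)
    both = antecedent | {consequent}
    n_antecedent = sum(1 for b in baskets if antecedent <= set(b))
    n_both = sum(1 for b in baskets if both <= set(b))
    return n_both / n_antecedent if n_antecedent else 0.0


# Toy data invented for this example (one list of conditions per patient).
baskets = [
    ["asthma", "diabetes", "hypertension"],
    ["diabetes", "hypertension"],
    ["asthma", "hypertension"],
    ["diabetes", "hypertension", "obesity"],
]
pairs = frequent_itemsets(baskets, k=2, min_support=2)
confidence = rule_confidence(baskets, ["hypertension"], "diabetes")
```

With these four baskets, "obesity" appears only once, so no pair containing it is ever counted in the second pass. That pruning is what keeps A-Priori tractable on large inputs.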
Run the following command, inside the /src/ directory:
spark-submit lsh.py ../data/covid_news_truncated.json <R> <B>
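The following is a rough sketch of MinHash signatures combined with LSH banding, again in plain Python rather than Spark. It assumes that `<R>` is the number of rows per band and `<B>` the number of bands, which this README does not state. The shingle length, the hash family, and every function name are hypothetical and not taken from lsh.py.

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations

PRIME = 2_147_483_647  # 2**31 - 1, modulus for the illustrative hash family


def shingles(text, k=5):
    """Set of character k-shingles of the whitespace-normalised text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(shingle_set, hash_params):
    """One minimum per hash function (a * x + b) mod PRIME."""
    # crc32 is stable across runs, unlike Python's salted hash() for str.
    ids = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in ids) for a, b in hash_params]


def candidate_pairs(docs, rows, bands, seed=0):
    """Pairs of document ids whose signatures agree on at least one band."""
    rng = random.Random(seed)
    hash_params = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
                   for _ in range(rows * bands)]
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash_signature(shingles(text), hash_params)
        for band in range(bands):
            # Documents with identical values in this band share a bucket.
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def candidate_probability(similarity, rows, bands):
    """Chance that two documents with the given Jaccard similarity collide."""
    return 1 - (1 - similarity ** rows) ** bands


# Toy documents invented for this example.
docs = {
    "a": "New vaccine trial results announced today",
    "b": "New vaccine trial results announced today",
    "c": "Stock markets close higher after a volatile week",
}
pairs = candidate_pairs(docs, rows=5, bands=20)
```

Identical texts such as "a" and "b" always land in the same bucket for every band, so they are always reported as a candidate pair. For other pairs, larger `rows` makes the filter stricter and larger `bands` makes it more permissive, as `candidate_probability` shows.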
This project's grade was 16.7 out of 20.
- Eduardo Santos: eduardosantoshf
- Pedro Bastos: bastos-01