The project consists of implementing algorithms that realize steps of the entity resolution process.

Please work in groups of two (2).

[Part 1] Implement the following blocking methods: Token Blocking and Attribute Clustering Blocking. Please explain any assumptions you make (e.g., whether you apply any stemming method to your data, which similarity measure you use for computing similarities between sets of tokens, etc.). One similarity function, such as Jaccard, is enough for the needs of our project. For details on the blocking methods, please see the paper: A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces.
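
As a starting point, the two blocking methods could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not a reference solution: the tokenizer (lowercasing, splitting on non-alphanumeric characters, no stemming), the use of Jaccard over token sets for attribute similarity, the "glue" cluster for unlinked attributes, and all function names and sample records are hypothetical choices. Please check the details against the paper.

```python
import re
from collections import defaultdict

def tokenize(value):
    # Assumed tokenizer: lowercase, split on non-alphanumerics, no stemming.
    return set(re.findall(r"[a-z0-9]+", str(value).lower()))

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 0.0

def _keep_useful(blocks):
    # Keep only blocks that yield at least one cross-dataset comparison.
    return {k: b for k, b in blocks.items() if b[0] and b[1]}

def token_blocking(d1, d2):
    """d1, d2: dict entity_id -> {attribute: value}. Returns token -> (ids1, ids2)."""
    blocks = defaultdict(lambda: (set(), set()))
    for side, dataset in enumerate((d1, d2)):
        for eid, attrs in dataset.items():
            for value in attrs.values():
                for tok in tokenize(value):
                    blocks[tok][side].add(eid)
    return _keep_useful(blocks)

def attribute_clusters(d1, d2):
    """Map (side, attribute_name) -> cluster id, linking each attribute to its
    most similar attribute in the other dataset and merging linked attributes."""
    profiles = []
    for dataset in (d1, d2):
        prof = defaultdict(set)  # attribute name -> all tokens of its values
        for attrs in dataset.values():
            for name, value in attrs.items():
                prof[name] |= tokenize(value)
        profiles.append(prof)
    parent = {(s, n): (s, n) for s in (0, 1) for n in profiles[s]}

    def find(x):  # union-find root lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    linked = set()
    for side in (0, 1):
        src, dst = profiles[side], profiles[1 - side]
        for name, toks in src.items():
            if not dst:
                continue
            best = max(dst, key=lambda m: jaccard(toks, dst[m]))
            if jaccard(toks, dst[best]) > 0:
                a, b = (side, name), (1 - side, best)
                parent[find(a)] = find(b)
                linked.update((a, b))
    # Attributes without any link share one "glue" cluster (assumed naming).
    return {node: (find(node) if node in linked else "glue") for node in parent}

def attribute_clustering_blocking(d1, d2):
    """Token blocking restricted to attribute clusters: key = (cluster, token)."""
    cluster_of = attribute_clusters(d1, d2)
    blocks = defaultdict(lambda: (set(), set()))
    for side, dataset in enumerate((d1, d2)):
        for eid, attrs in dataset.items():
            for name, value in attrs.items():
                for tok in tokenize(value):
                    blocks[(cluster_of[(side, name)], tok)][side].add(eid)
    return _keep_useful(blocks)

# Hypothetical sample records, for illustration only.
d1 = {"a1": {"name": "John Smith", "city": "Paris"},
      "a2": {"name": "Paris Hilton", "city": "London"}}
d2 = {"b1": {"fullname": "john smith", "location": "paris"},
      "b2": {"fullname": "Mary Jones", "location": "london"}}
tb = token_blocking(d1, d2)
acb = attribute_clustering_blocking(d1, d2)
```

On these sample records, Token Blocking places a2 and b1 in the same "paris" block, although the token is a name in one description and a city in the other. Attribute Clustering Blocking separates the name-like attributes from the city-like ones, so that comparison no longer appears.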

[Part 2] Implement the meta-blocking method, using the Jaccard scheme and the common blocks scheme for edge weighting, and weight edge pruning and cardinality node pruning as pruning schemes. For details on the meta-blocking methods, please see the paper: Meta-Blocking: Taking Entity Resolution to the Next Level.
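
The meta-blocking steps named above could be sketched in Python roughly as follows. Again, this is a minimal illustration under assumptions rather than a definitive implementation: it assumes blocks are given as key -> (ids from dataset 1, ids from dataset 2), that entity ids are unique across the two datasets, that weight edge pruning keeps edges whose weight is at least the mean edge weight, and that the default k for cardinality node pruning is the average number of block assignments per entity minus one. All function names and the sample blocks are hypothetical; please verify the thresholds against the paper.

```python
from collections import defaultdict
from itertools import product

def build_graph(blocks):
    """Return (common, blocks_of): shared-block counts per cross-dataset pair
    and the set of block keys each entity appears in."""
    common = defaultdict(int)
    blocks_of = defaultdict(set)
    for key, (left, right) in blocks.items():
        for e in left | right:  # assumes ids are unique across datasets
            blocks_of[e].add(key)
        for pair in product(left, right):
            common[pair] += 1
    return common, blocks_of

def edge_weights(blocks, scheme="CBS"):
    common, blocks_of = build_graph(blocks)
    if scheme == "CBS":  # common blocks scheme: number of shared blocks
        return dict(common)
    if scheme == "JS":   # Jaccard scheme over the two entities' block sets
        return {(a, b): c / (len(blocks_of[a]) + len(blocks_of[b]) - c)
                for (a, b), c in common.items()}
    raise ValueError(f"unknown scheme: {scheme}")

def weight_edge_pruning(weights):
    """Keep edges whose weight reaches the mean edge weight (assumed threshold)."""
    if not weights:
        return {}
    mean = sum(weights.values()) / len(weights)
    return {e: w for e, w in weights.items() if w >= mean}

def default_k(blocks):
    """Assumed default: floor(block assignments per entity - 1), at least 1."""
    total = sum(len(left) + len(right) for left, right in blocks.values())
    entities = set()
    for left, right in blocks.values():
        entities |= left | right
    return max(1, total // len(entities) - 1)

def cardinality_node_pruning(weights, k):
    """For every node keep its k heaviest edges; an edge survives if at least
    one of its endpoints keeps it."""
    incident = defaultdict(list)
    for (a, b), w in weights.items():
        incident[a].append((w, (a, b)))
        incident[b].append((w, (a, b)))
    kept = set()
    for edges in incident.values():
        edges.sort(key=lambda t: t[0], reverse=True)
        kept.update(edge for _, edge in edges[:k])
    return {e: weights[e] for e in kept}

# Hypothetical sample blocks, for illustration only.
blocks = {
    "john": ({"a1"}, {"b1"}),
    "smith": ({"a1"}, {"b1"}),
    "paris": ({"a1", "a2"}, {"b1"}),
    "london": ({"a2"}, {"b2"}),
}
cbs = edge_weights(blocks, "CBS")
js = edge_weights(blocks, "JS")
wep = weight_edge_pruning(js)
cnp = cardinality_node_pruning(js, default_k(blocks))
```

On these sample blocks, weight edge pruning keeps only the pair (a1, b1), while cardinality node pruning with k = 1 also keeps (a2, b2), because that edge is the heaviest one for both of its endpoints.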

Additional information:

  • Any programming language for your project is acceptable.
  • All groups will be examined on their project code on March 7 and 8, 14.00-16.00 – Room: PINNI B2052 (more details will be announced later). For the examination, please bring either your laptop or the source code.
  • To test your code, please create 2 small clean datasets, each containing around 100 to 150 entity descriptions with 2 to 4 attribute-value pairs. Try to ensure that some tokens are shared among the values of the entity descriptions.
  • The project will be accompanied by a short report (about 5 pages long), describing the algorithms and their implementation. Please send your report to before March 10, 2019.
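
One possible way to produce the two small test datasets described in the list above is sketched below in Python. Everything here is a hypothetical example: the attribute names, the vocabularies, the function name make_datasets, and the choice to make the first `shared` descriptions of each dataset refer to the same real-world entity. Each dataset is clean in the sense that no person appears twice within it.

```python
import random

def make_datasets(n=120, shared=60, seed=0):
    """Return two clean datasets (dict id -> {attribute: value}) of n entity
    descriptions each; ids a0..a{shared-1} match b0..b{shared-1}."""
    rng = random.Random(seed)
    first = ["anna", "boris", "chen", "dana", "elias",
             "fatima", "george", "hana", "ivan", "julia"]
    last = ["papas", "smith", "garcia", "rossi", "kim", "novak", "silva",
            "meyer", "ito", "khan", "jones", "lopez", "weber", "costa",
            "popov", "sato", "ali", "berg", "diaz", "evans"]
    cities = ["athens", "berlin", "lisbon", "oslo", "vienna"]
    people = [(f, l) for f in first for l in last]  # 200 distinct persons
    rng.shuffle(people)
    need = 2 * n - shared
    if shared > n or need > len(people):
        raise ValueError("not enough distinct persons for the requested sizes")
    common, only1, only2 = people[:shared], people[shared:n], people[n:need]

    d1 = {}
    for i, (f, l) in enumerate(common + only1):
        d1[f"a{i}"] = {"name": f"{f} {l}", "city": rng.choice(cities)}

    d2 = {}
    for i, (f, l) in enumerate(common + only2):
        record = {"fullname": f"{l}, {f}", "location": rng.choice(cities)}
        if rng.random() < 0.5:  # optional third attribute-value pair
            record["year"] = str(rng.randint(1950, 2000))
        d2[f"b{i}"] = record
    return d1, d2

d1, d2 = make_datasets()
```

The two datasets deliberately use different attribute names and value formats, so that matching descriptions share tokens without being identical strings.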

