Entity resolution is a very common task in Big Data processing, in which different entity profiles, usually described under different schemas, are mapped to the same real-world object. Beyond the deduplication and cleaning problems that appear in traditional data integration settings, such as data warehouses, entity resolution is a prerequisite for many Web applications, where it poses several challenges due to the volume and variety of the data collections. In general, entity resolution is an inherently quadratic task: given an entity collection, each entity profile must be compared to all others.
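To make the quadratic cost concrete, the following minimal sketch enumerates every pair of profiles in a small collection. The toy profiles and the `all_pairs` helper are illustrative assumptions and are not taken from the course material.

```python
from itertools import combinations

def all_pairs(profiles):
    """Return every unordered pair of profile ids (the brute-force candidate set)."""
    return list(combinations(sorted(profiles), 2))

# Hypothetical toy collection: profile id -> attribute/value description.
# Note that the profiles deliberately use different schemas.
profiles = {
    "p1": {"name": "Alan Turing", "born": "1912"},
    "p2": {"fullName": "A. Turing", "birthYear": "1912"},
    "p3": {"name": "Grace Hopper"},
    "p4": {"label": "Hopper, Grace"},
}

pairs = all_pairs(profiles)
n = len(profiles)
# The number of comparisons is n * (n - 1) / 2, i.e. it grows quadratically in n.
print(len(pairs), n * (n - 1) // 2)
```

For a collection of one million profiles the same formula gives roughly 5 * 10^11 comparisons, which is why exhaustive comparison does not scale.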
In this course, we will focus on algorithmic approaches for entity resolution in the Web of data. We will study approaches that reduce the number of comparisons to be performed between data collections, such as blocking and meta-blocking, and approaches that minimize the number of missed matches through an iterative entity resolution process, which exploits intermediate results of blocking and matching to discover new candidate description pairs for resolution. Moreover, we will discuss work on progressive entity resolution, which attempts to discover as many matches as possible within a limited computing budget by estimating the matching likelihood of yet unresolved descriptions based on the matches found so far.
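As a rough illustration of how blocking and meta-blocking reduce comparisons, the sketch below applies schema-agnostic token blocking to a toy collection, weights each candidate pair by the number of blocks it shares, and prunes pairs below the average weight. The data, the function names, and the particular weighting and pruning choices are assumptions made for this example; they are one simple variant among many and not necessarily the ones covered in the course.

```python
from collections import defaultdict
from itertools import combinations
import re

def token_blocking(profiles):
    """Schema-agnostic token blocking: one block per token found in any value."""
    blocks = defaultdict(set)
    for pid, attrs in profiles.items():
        for value in attrs.values():
            for token in re.findall(r"\w+", value.lower()):
                blocks[token].add(pid)
    # A block with a single profile suggests no comparison, so drop it.
    return {t: ids for t, ids in blocks.items() if len(ids) > 1}

def candidate_weights(blocks):
    """Meta-blocking step 1: weight each pair by the number of blocks it shares."""
    weights = defaultdict(int)
    for ids in blocks.values():
        for pair in combinations(sorted(ids), 2):
            weights[pair] += 1
    return dict(weights)

def prune_below_average(weights):
    """Meta-blocking step 2: keep pairs whose weight is at least the mean weight."""
    if not weights:
        return {}
    mean = sum(weights.values()) / len(weights)
    return {pair: w for pair, w in weights.items() if w >= mean}

# Hypothetical toy collection with heterogeneous schemas.
profiles = {
    "p1": {"name": "Alan Turing", "born": "1912"},
    "p2": {"fullName": "A. Turing", "birthYear": "1912"},
    "p3": {"name": "Grace Hopper"},
    "p4": {"label": "Hopper, Grace"},
    "p5": {"name": "Alan Kay"},
}

blocks = token_blocking(profiles)
weights = candidate_weights(blocks)
pruned = prune_below_average(weights)

n = len(profiles)
print("brute force:", n * (n - 1) // 2)   # all pairs
print("after blocking:", len(weights))     # pairs sharing at least one token
print("after pruning:", sorted(pruned))    # pairs sharing many tokens
```

On this toy input, blocking keeps only pairs that share a token, and pruning then discards the pair that shares a single token. On real data such pruning can also discard true matches, which is the trade-off that iterative and progressive approaches try to address.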
After completing the course, the student is expected to:
- know the basic concepts and techniques for big data entity resolution, including blocking and meta-blocking techniques, and techniques for iterative and progressive entity resolution,
- be able to handle contemporary research issues and problems on big data entity resolution, and
- be able to solve real-world entity resolution problems.
Modes of Study
Lectures, assignments, project, student presentations in class.
Course Work and Assessment
Assignments (3) (50%)
- Assignments will include short presentations answering questions on research papers related to the course.
- Students will be divided into presenters and reviewers, and both groups will study the same topic. In each round, some students will give the presentation, and the rest will review it, e.g., by explaining their own viewpoint on the topic.
- The project will involve implementing algorithms that realize steps of the entity resolution process. Students will work in groups of two.
- The project will be accompanied by a short report (5-6 pages) describing the algorithms and their experimental evaluation.
- The project will be evaluated at the end of the period.