Table Augmentation in Data Lakes

A huge amount of data is produced every day in various formats and stored in large repositories called data lakes. These data are produced by enterprises and corporations, as well as by government agencies that share their data following the Open Data principles. Because they can contain useful information about our daily life, they are valuable. The value of such a mass of data can be exploited by building predictive models on the knowledge they contain, as well as in discovery scenarios, in which the relations among datasets are useful. The main problem of these repositories is the lack of a common schema: a user cannot rely on metadata and header information, since the data are generally dirty. In particular, the lack of a schema makes it difficult to find related tables, i.e., tables that can be joined. The join operator is very useful when a user has a table and wants to augment it by joining it with tables in the data lake, thereby adding relevant features for a data science task.

Our interest is in proposing an index for tables that allows for a fast search over joinable tables and ranks the candidates by the amount of information they add.
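
To make the idea of searching for joinable tables concrete, the following is a minimal sketch and not the index proposed here. It assumes a toy data lake held in memory as a dictionary of tables, uses a plain inverted index from cell values to columns, and ranks candidates by the fraction of query values a column contains, with ties broken by the number of extra columns the table would add. The names `build_index`, `joinable_candidates`, `min_containment`, and all of the sample data are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def normalize(value):
    """Normalize a cell value so that dirty variants ("Rome", " rome ") match."""
    return str(value).strip().lower()


def build_index(lake):
    """Map each normalized cell value to the (table, column) pairs containing it."""
    index = defaultdict(set)
    for table_name, columns in lake.items():
        for column_name, values in columns.items():
            for value in values:
                index[normalize(value)].add((table_name, column_name))
    return index


def joinable_candidates(index, lake, query_values, min_containment=0.5):
    """Rank lake columns by the fraction of distinct query values they contain.

    Ties are broken by how many non-key columns the candidate table would add.
    Both criteria are illustrative assumptions, not the ranking of this work.
    """
    query = {normalize(v) for v in query_values}
    if not query:
        return []
    hits = defaultdict(int)
    for value in query:
        for key in index.get(value, ()):
            hits[key] += 1
    ranked = []
    for (table_name, column_name), count in hits.items():
        containment = count / len(query)
        if containment >= min_containment:
            new_columns = len(lake[table_name]) - 1  # every column except the join key
            ranked.append((table_name, column_name, containment, new_columns))
    ranked.sort(key=lambda r: (-r[2], -r[3], r[0], r[1]))
    return ranked


# Hypothetical toy data lake: table name -> column name -> list of values.
lake = {
    "population": {
        "city": ["Rome", "Milan", "Turin"],
        "inhabitants": [2800000, 1400000, 850000],
        "region": ["Lazio", "Lombardy", "Piedmont"],
    },
    "airports": {"town": [" rome", "Paris"], "code": ["FCO", "CDG"]},
    "movies": {"title": ["Rome", "Heat"], "year": [2018, 1995]},
}
index = build_index(lake)

# Join column of the user's table that should be augmented.
query = ["Rome", "Milan", "Turin", "Naples"]
for candidate in joinable_candidates(index, lake, query):
    print(candidate)
```

With the default threshold only `population.city` is returned, since it contains three of the four query values. Lowering `min_containment` also admits weakly overlapping columns such as `movies.title`, which shows why value overlap alone is a noisy signal and why the candidates need to be ranked.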