THALIA can be used in two ways:
1. As a rich source of test data for integration problems exhibiting a
wide variety of syntactic and semantic heterogeneities, which we have
grouped into three categories:
attribute heterogeneities (e.g., synonyms, simple and complex mappings,
union types, and heterogeneities due to language of expression), missing
data (e.g., nulls, virtual columns, and semantic incompatibility), and a
variety of structural heterogeneities. See our introductory
publication for a more detailed description.
2. As a benchmark for objectively evaluating the capabilities of
integration technology, taking into account the correctness of the solution
as well as the amount of programmatic effort (i.e., the complexity of
external functions) needed to resolve any heterogeneities. The benchmark
currently comprises twelve queries, each requiring the resolution of
a particular type of heterogeneity.
The downloadable university course catalogs are represented as well-formed
and valid XML according to the extracted schema for each course catalog.
Extraction and translation from the original representation were done using
a source-specific wrapper, which preserves the structural and semantic
heterogeneities that exist among the different course catalogs. We have
used an enhanced version of the Telegraph
Screen Scraper (TESS) system developed at UC Berkeley to extract the
source data. The enhanced version,
DataExtractor (HTMLtoXML), can be obtained from
sourceforge.net
along with the examples used to extract data provided in THALIA.
The DataExtractor (HTMLtoXML) tool provides added functionality over the
TESS system and can store the extracted data in an XML file.
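
Before using a downloaded catalog, it can help to confirm that the file
parses as well-formed XML. The following is a minimal sketch in Python
using only the standard library; the sample documents and the
`is_well_formed` helper are illustrative assumptions, not part of THALIA
or DataExtractor. Checking validity against the extracted schema would
require a schema-aware validator, which this sketch does not attempt.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents; real THALIA catalogs use their own element names.
GOOD = "<catalog><Course><Title>Databases</Title></Course></catalog>"
BAD = "<catalog><Course><Title>Databases</Course></catalog>"  # mismatched tags


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(is_well_formed(GOOD))
    print(is_well_formed(BAD))
```
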
For each type of heterogeneity listed above, we have formulated a
benchmark query against two data sources from our testbed: a reference
schema, which is used to formulate the query, and a challenge schema,
which exhibits the type of heterogeneity that the integration system must
resolve. Each query requires a particular integration activity in order to
provide the correct result.
Note that in some cases (e.g., Benchmark Query 9), a query may also
illustrate additional types of heterogeneities that are showcased in other
queries.
Queries are written in XQuery version 1.0.
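
To make the reference/challenge setup concrete, here is a minimal sketch
of resolving a synonym heterogeneity. It is written in Python rather than
XQuery so that it runs with the standard library alone. The element names
(`Title`, `Instructor`, `CourseTitle`, `Lecturer`), the sample data, and
the `SYNONYMS` mapping are all hypothetical and are not taken from the
actual THALIA schemas or benchmark queries.

```python
import xml.etree.ElementTree as ET

# Hypothetical reference source: the query is formulated against these names.
REFERENCE_XML = """<catalog>
  <Course><Title>Database Systems</Title><Instructor>Mark</Instructor></Course>
  <Course><Title>Compilers</Title><Instructor>Smith</Instructor></Course>
</catalog>"""

# Hypothetical challenge source: same concepts, different element names.
CHALLENGE_XML = """<catalog>
  <Course><CourseTitle>Data Structures</CourseTitle><Lecturer>Mark</Lecturer></Course>
  <Course><CourseTitle>Networks</CourseTitle><Lecturer>Jones</Lecturer></Course>
</catalog>"""

# Assumed mapping from challenge element names to reference element names.
SYNONYMS = {"CourseTitle": "Title", "Lecturer": "Instructor"}


def normalize(root, synonyms):
    """Rename challenge elements in place so they match the reference schema."""
    for elem in root.iter():
        elem.tag = synonyms.get(elem.tag, elem.tag)
    return root


def courses_taught_by(root, name):
    """Return titles of courses whose Instructor element equals name."""
    return [
        course.findtext("Title")
        for course in root.findall("Course")
        if course.findtext("Instructor") == name
    ]


def run_query(name):
    """Evaluate the same query over both sources after resolving synonyms."""
    reference = ET.fromstring(REFERENCE_XML)
    challenge = normalize(ET.fromstring(CHALLENGE_XML), SYNONYMS)
    return courses_taught_by(reference, name) + courses_taught_by(challenge, name)


if __name__ == "__main__":
    print(run_query("Mark"))
```

Without the renaming step, the query would silently miss every course in
the challenge source, which is the kind of incorrect result the benchmark
is meant to expose.
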
Please note that integration systems that do not provide query processing
can still use the benchmark by providing an integrated schema over the two
data sources associated with each benchmark query.
Before running the benchmark, users can browse both the repository of
cached course catalogs in their original representation and our collection
of extracted XML documents. Users can also download the
DataExtractor (HTMLtoXML) wrapper tool from
sourceforge.net,
along with the examples used for THALIA.
In order to exchange information about the capabilities of existing
integration systems, users are encouraged to upload
the outcome of their benchmark evaluation.