Extracting structured information from Wikipedia articles to populate infoboxes

Lange, Dustin; Böhm, Christoph; Naumann, Felix

Roughly every third Wikipedia article contains an infobox - a table that displays important facts about the subject in attribute-value form. The schema of an infobox, i.e., the attributes that can be expressed for a concept, is defined by an infobox template. Often, authors do not specify all template attributes, resulting in incomplete infoboxes. With iPopulator, we introduce a system that automatically populates infoboxes of Wikipedia articles by extracting attribute values from the article's text. In contrast to prior work, iPopulator detects and exploits the structure of attribute values for independently extracting value parts. We have tested iPopulator on the entire set of infobox templates and provide a detailed analysis of its effectiveness. For instance, we achieve an average extraction precision of 91% for 1,727 distinct infobox template attributes.
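
The abstract's idea of detecting the structure of attribute values can be illustrated with a minimal sketch. This is not iPopulator's implementation: the token classes (`NUM`, `TEXT`, literal punctuation), the regular expression, and the function names `value_structure` and `dominant_structure` are all assumptions made for illustration. The sketch maps each existing value of an attribute to a coarse pattern and picks the most frequent one, so that each part of the pattern could then be extracted from the article text independently.

```python
from __future__ import annotations

import re
from collections import Counter

# Hypothetical tokenizer: numbers (with separators), runs of letters,
# or a single punctuation character. Whitespace is skipped.
TOKEN_RE = re.compile(r"\d[\d,.]*|[^\W\d_]+|[^\w\s]|_")


def value_structure(value: str) -> tuple[str, ...]:
    """Map an attribute value to a coarse structure, e.g. ('NUM', '(', 'NUM', ')')."""
    structure = []
    for token in TOKEN_RE.findall(value):
        if token[0].isdigit():
            structure.append("NUM")
        elif token.isalpha():
            structure.append("TEXT")
        else:
            structure.append(token)  # punctuation is kept literally
    return tuple(structure)


def dominant_structure(values: list[str]) -> tuple[str, ...]:
    """Return the most frequent structure among the known values of one attribute."""
    counts = Counter(value_structure(v) for v in values)
    return counts.most_common(1)[0][0]


# Example values are invented for illustration only.
print(dominant_structure(["12,500 (2003)", "300 (1999)", "about 40"]))
# prints ('NUM', '(', 'NUM', ')')
```

Under this assumed scheme, a value such as "12,500 (2003)" has two independent parts (a count and a year), and a separate extractor could be trained or written for each slot.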

Metadata
Author details:	Dustin Lange, Christoph Böhm, Felix Naumann
URN:	urn:nbn:de:kobv:517-opus-45714
ISBN:	978-3-86956-081-6
Publication series (Volume number):	Technische Berichte des Hasso-Plattner-Instituts für Digital Engineering an der Universität Potsdam (38)
Publisher:	Universitätsverlag Potsdam
Place of publishing:	Potsdam
Publication type:	Monograph/Edited Volume
Language:	English
Publication year:	2010
Publishing institution:	Universität Potsdam
Release date:	2010/11/17
Tag:	Information Extraction; Linked Data; Wikipedia
Number of pages:	27
RVK - Regensburg classification:	ST 230
Organizational units:	Affiliated Institutes / Hasso-Plattner-Institut für Digital Engineering gGmbH
DDC classification:	0 Computer science, information science, general works / 00 Computer science, knowledge, systems / 004 Data processing; computer science
Publishing method:	Universitätsverlag Potsdam
License (German):	No public license: under copyright protection
