Automatic vectorisation of historical maps
International workshop organised by the ICA Commission on Cartographic Heritage into the Digital
Budapest – 13 March 2020

Abstracts

Drolias Garyfallos Chrysovalantis, Tziokas Nikolaos
Building footprint extraction from historic maps utilizing automatic vectorization methods in open source GIS software
University of the Aegean, Greece
Keywords: automatic vectorization; historical maps; open source GIS; shape recognition
Summary: Historical maps are a great source of information about city landscapes and the way they change over time. These analog maps contain rich cartographic information, but not in a format that allows analysis and interpretation. A common technique for accessing this information is the manual digitization of the scanned map, but this is a time-consuming process. Automatic digitization/vectorization processes generally require expert knowledge to fine-tune the parameters of the applied shape-recognition techniques and are thus not readily usable by non-expert users. The methodology proposed in this paper offers a fast, automated conversion from a scanned image (raster format) to geospatial datasets (vector format), requiring minimal time for building footprint extraction. The extracted vector data reproduce the building footprints of the scanned map quite accurately, but further post-processing is needed for optimal results.
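The abstract does not name specific tools, so the following is a purely illustrative sketch of the raster-to-vector step in an open-source stack, using the GDAL/OGR Python bindings; the file names and the binarised input mask are hypothetical, not the authors' data or workflow.

```python
# Illustrative sketch only (the paper's own tool chain may differ): tracing a
# binarised building mask into polygon footprints with the GDAL/OGR Python bindings.
from osgeo import gdal, ogr, osr

src = gdal.Open("building_mask.tif")        # hypothetical raster: 1 = building, 0 = background
band = src.GetRasterBand(1)

srs = osr.SpatialReference()
srs.ImportFromWkt(src.GetProjection())

driver = ogr.GetDriverByName("GPKG")
dst = driver.CreateDataSource("footprints.gpkg")
layer = dst.CreateLayer("footprints", srs=srs, geom_type=ogr.wkbPolygon)
layer.CreateField(ogr.FieldDefn("value", ogr.OFTInteger))

# Polygonize traces connected regions of equal pixel value into polygons; the
# same band is passed as a mask so that background (0) pixels are skipped.
gdal.Polygonize(band, band, layer, 0)
dst = None   # close the datasource to flush features to disk
```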
Chenjing Jiao, Magnus Heitzler, Lorenz Hurni
Extracting Wetlands from Swiss Historic Maps with Convolutional Neural Networks
Institute of Cartography and Geoinformation, ETH Zürich, Switzerland
Keywords: Siegfried map, Fully Convolutional Neural Networks, wetland reconstruction
Summary: Historical maps contain abundant and valuable cartographic information, such as building footprints, hydrography and objects of historical relevance. Extracting such cartographic information from the maps benefits multiple applications and research fields. Specifically, extracting wetlands from historic maps helps researchers investigate the spatio-temporal dynamics of hydrology and ecology. However, conventional extraction approaches such as Color Image Segmentation (CIS) are not perfectly suitable for historic maps due to their often poor graphical and printing quality. In recent years, studies employing deep learning methods, especially Fully Convolutional Neural Networks (FCNN), have provided an efficient and effective way to extract features from raster maps. Yet FCNNs require a large amount of training data to perform sufficiently well, and producing training data manually might not be an option due to a lack of resources, which motivates the search for alternative sources of training data. In a recent real-world use case, wetlands for the whole of Switzerland for the year 1880 are to be extracted from the Swiss Siegfried Map Series (1872–1949) for a non-governmental organization (NGO) carrying out a study on landcover change. Since data for this year was not available, data from another study based on the year 1900 has been used for training [1]. Naturally, the two datasets come from somewhat different distributions: the symbolization may have changed slightly between the two periods, and the map sheets of the training dataset have been georeferenced in a different way from the sheets of 1880.

We developed an FCNN-based method to extract wetlands from the maps. Each map sheet is subdivided into a grid of tiles of 200 x 200 pixels to be processable by the FCNN; to avoid misclassifying border pixels, the extent of each input tile is expanded by 60 pixels on each side. The model used for the wetland segmentation is based on the U-Net architecture [2] and was modified according to [3]. The U-Net comprises two symmetric network paths, a contracting path and an up-sampling path. The contracting path, consisting of convolution and max-pooling layers, generates increasingly abstract representations of the input image; the up-sampling path infers the class of each pixel from its position and surroundings by gradually combining these features at different granularities. Once the segmentation results are generated, vectorization is performed using functions of the Geospatial Data Abstraction Library (GDAL) to convert the binary raster results into vector geometries. Generalization is implemented with the Douglas–Peucker algorithm, which traces the line segments of the original polyline and keeps only the start point, end point and the farthest point of the current line segment. The vectorized and generalized geometries can be manually corrected to yield high-quality wetland vector layers.

With this method, wetlands are extracted and vectorized from the 573 selected historic map sheets across Switzerland. Future work will focus not only on wetland extraction but also on other hydrological features in the historic maps.
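As an aside, the Douglas–Peucker step described above can be illustrated in a few lines of Python. This is a generic, textbook-style sketch rather than the authors' implementation, and the tolerance used in the example is arbitrary.

```python
# Generic illustration of the Douglas-Peucker generalisation described above
# (not the authors' code).
import math

def point_line_distance(p, a, b):
    """Perpendicular distance of point p from the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    if (ax, ay) == (bx, by):
        return math.hypot(px - ax, py - ay)
    num = abs((bx - ax) * (ay - py) - (ax - px) * (by - ay))
    return num / math.hypot(bx - ax, by - ay)

def douglas_peucker(points, tolerance):
    """Keep the start point, end point and the farthest point of each segment,
    recursing until every remaining point lies within `tolerance`."""
    if len(points) < 3:
        return points
    start, end = points[0], points[-1]
    dists = [point_line_distance(p, start, end) for p in points[1:-1]]
    idx = max(range(len(dists)), key=dists.__getitem__) + 1
    if dists[idx - 1] > tolerance:
        left = douglas_peucker(points[: idx + 1], tolerance)
        right = douglas_peucker(points[idx:], tolerance)
        return left[:-1] + right
    return [start, end]

# Example: a noisy polyline reduced to its essential vertices.
print(douglas_peucker([(0, 0), (1, 0.1), (2, -0.1), (3, 5), (4, 6), (5, 7)], 1.0))
```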

[1] Stuber, M. and Bürgi, M. (2018). Vom "eroberten Land" zum Renaturierungsprojekt. Geschichte der Feuchtgebiete in der Schweiz seit 1700. Bristol-Schriftenreihe: Vol. 59. Bern: Haupt Verlag.

[2] Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 234–241). Springer, Cham.

[3] Heitzler, M., & Hurni, L. (accepted). Cartographic Reconstruction of Historical Building Footprints from Historical Maps – A Study on the Swiss Siegfried Map. Transactions in GIS.

Anna Piechl
(Semi-)automatic vector extraction of administrative borders from historical raster maps
Austrian Academy of Sciences, Vienna
Summary: Historical raster maps can be a useful source of information and allow us to travel backwards in time. Among other things, the HistoGIS project aims to create geodata for historical political/administrative units, so that this data can later be queried. For this purpose, already existing vector data is used, as well as scanned historical raster maps from the 19th and early 20th century.

When it comes to the creation of new vector data, the use of historical maps is inevitable. But the creation and extraction of vector data from raster maps can be very challenging. Polygons and attributes can, of course, be created manually by heads-up digitizing. Since this process can be very time-intensive and tedious, the aim was to find an at least semi-automatic solution for the extraction and vectorization of the administrative borders from the old maps. Therefore, different software tools and processes were tested, both commercial and non-commercial, although the HistoGIS project focuses on using open-source software. The main source of the maps is the David Rumsey Map Collection, which offers a great variety of scanned historical maps in different image resolutions. During the tests we found that it is not possible to use the tools without pre-processing, which includes image enhancement as well as image segmentation. This helps, for example, to remove noise and improve picture quality and, in the case of image segmentation, to isolate the features of interest: the administrative borders. As most of the maps come from different sources, they differ in quality and colour. Hence it is important to take these differences into account when choosing parameter settings for different maps. These findings also show that, because of the variety of historical maps, there is no one-size-fits-all automated solution for the vectorization process. Although some of the tools show usable results, other issues arise when using the vectorized geodata, such as inaccuracy and uncertainty. These problems are caused by the small scales and strong generalization of the old maps, and by the intentions of the cartographer who made them.

Daniel Laumer, Hasret Gümgümcü, Magnus Heitzler, Lorenz Hurni
A semi-automatic Label Digitization Workflow for the Siegfried Map
Institute of Cartography and Geoinformation, ETH Zurich, Switzerland
Keywords: historical maps, vectorization, deep learning, convolutional neural network, label extraction
Summary: Digitizing historical maps automatically poses a multitude of challenges. This is particularly true for label extraction, since labels vary strongly in shape, size, orientation and type. In addition, characters may overlap with other features such as roads or hachures, which makes extraction even harder. To tackle this issue, we propose a novel semi-automatic workflow consisting of a sequence of deep learning and conventional text processing steps in conjunction with tailor-made correction software. To prove its efficiency, the workflow is being applied to the Siegfried Map Series (1870–1949), covering the whole of Switzerland at scales of 1:25,000 and 1:50,000.

The workflow consists of the following steps. First, we decide for each pixel whether it shows text or background. For this purpose, we use a convolutional neural network with the U-Net architecture, which was originally developed for biomedical image segmentation. The network is trained with four manually annotated map sheets as ground truth, and the trained model can then be used to predict the segmentation on any other map sheet. The results are clustered with DBSCAN to aggregate the individual pixels into letters and words. This way, each label can be localized and extracted without background. Since this is still a non-vectorized representation of the labels, we use the Google Vision API to interpret the text of each label and also search for matching entries in the Swiss Names database by Swisstopo for verification. As for most label extraction workflows, the last step consists of manually checking all labels and correcting possible mistakes. For this purpose, we modified the VGG Image Annotator to simplify the selection of the correct entry.
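As a rough illustration of the clustering step (not the authors' code; the eps and min_samples values are placeholders that would need tuning to the map resolution), the text mask produced by the segmentation network could be grouped into label candidates like this:

```python
# Hedged sketch of the pixel-to-label clustering step (illustrative only).
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_text_pixels(text_mask, eps=5, min_samples=20):
    """text_mask: 2D boolean array (True where the network predicts text).
    Returns a list of bounding boxes (row_min, col_min, row_max, col_max),
    one per detected label candidate."""
    coords = np.column_stack(np.nonzero(text_mask))      # (N, 2) pixel coordinates
    if len(coords) == 0:
        return []
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(coords)
    boxes = []
    for cluster_id in set(labels) - {-1}:                # -1 marks noise pixels
        pts = coords[labels == cluster_id]
        boxes.append((*pts.min(axis=0), *pts.max(axis=0)))
    return boxes

# Each bounding box can then be cropped from the map sheet and passed to an OCR
# service (such as the Google Vision API mentioned above) for interpretation.
```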

Our framework drastically reduces the time needed to digitize labels, by a factor of around five. The fully automatic part (segmentation, interpretation, matching) takes around 5–10 minutes per sheet, the manual processing around 1.5–2 hours. Compared to a fully manual digitizing process, time efficiency is not the only benefit: the chance of missing labels also decreases considerably, since a human operator cannot detect labels with the same consistency as the algorithm.

Most problems leading to additional manual work occur during clustering and text recognition with the Google Vision API. Since the model was trained on maps of a flat, German-speaking part of Switzerland, the algorithm performs worse in other regions. In Alpine regions, the rock hachures are often misinterpreted as labels, leading to many false positives. French labels are often composed of several words, which DBSCAN does not cluster into a single label. Possible further work could include retraining with more diverse ground truth or extending the U-Net model so that it can also recognize and learn textual information.

Alexandre Nobajas
Targeted crowdsourced vectorisation of historical cartography
School of Geography, Geology and the Environment, Keele University, United Kingdom
Keywords: crowdsourcing, gamification, systematic vectorisation, training set creation
Summary: Many institutions have systematically digitised their cartographic documents, so there are now millions of maps which have been digitised and can be accessed online. The next step many map libraries have undertaken has been to georeference their digitised maps, so they can now be used within modern digital cartographic datasets. However, many of these applications still use the raster version of the maps, which hampers the full potential of working with historical cartography, as it limits analytical capacity and impedes spatial analysis. Even though there have been timid steps to overcome this limitation, the discipline seems to be struggling to advance to the next step, the systematic vectorisation of historical maps. Vectorisation is the process of converting pixel-based images (raster) into node- or point-based representations (vector), which can then be queried or analysed by individual components. Vector-based maps have a series of advantages over raster maps: they allow scale changes without loss of detail, classification of map features by type, or spatial queries, to mention just a few. Therefore, by vectorising a historical map we provide it with an unprecedented level of usability and allow detailed enquiries into the information it contains. Although there have already been some successful experiences in vectorising historical maps, they have been either expensive or not systematic, so there is scope to build upon these experiences and achieve a widespread method of vectorisation which can be applied to map collections across the world.

There are different approaches to overcoming this technological and methodological hurdle, most of them linked to the use of automation techniques such as remote sensing land classification methods or, more recently, machine learning and artificial intelligence. However, a large amount of training data is necessary to train the automated systems and to benchmark their outputs, so large datasets still need to be generated. The use of crowdsourcing has proven to be a successful means of achieving the same goal, as experiences such as those developed by the New York Public Library demonstrate. Given the opportunity and the right tools, people will help to vectorise historic documents, although Nielsen's 90-9-1 rule will apply, meaning that other ways of increasing participation are necessary in order to speed up the vectorisation.

This paper proposes a framework in which, by using a variety of gamification methods and by targeting two different groups of people, retirees and school children, historic maps can be massively vectorised. If fully deployed, this system would not only produce a wealth of vectorised maps, but would also be useful in teaching school children important disciplines such as cartography, geography and history, while at the same time helping to keep older generations mentally active. Finally, an intergenerational dialogue that would enrich both participant groups would be possible thanks to their common participation in the vectorisation of historical documents.

Jonas Luft
Automatic Georeferencing of Historical Maps by Geocoding
CityScienceLab, Hafencity University, Hamburg
Summary: Libraries and researchers face a big challenge: sometimes thousands, even millions, of maps are conserved in archives to retard deterioration, but this makes it hard to obtain specific knowledge of their content, sometimes even of their existence. In recent years, efforts to digitise historical documents and maps have increased, in order to preserve them digitally, make them accessible, and allow researchers and the general public less restricted access to their heritage.

But maps are more than cultural artifacts: they are data, data about the past that can very well be important for science and decisions today. The problem is that the vast amount alone makes it hard to know what to look for. Maps are usually archived and catalogued with limited meta-information, sometimes obscuring their actual content. This makes it impossible to find specific information without expert knowledge of history and cartography.

Extracting the content, i.e. the data, is necessary to make the map content searchable and to use modern digital tools for automatic processing and analysis. For combining data from different maps and comparing them with modern geospatial data, georeferencing is paramount. This is usually done by hand and needs a lot of time and specialised training.

We explore if and how usable ground control points (GCPs) can be found automatically in historical maps with today's computer vision and document analysis methods. In this work-in-progress report, we use OCR on map labels and geocoding of persisting geographical feature designations for successful first experiments. This shows text recognition and vectorisation to be a promising research direction for large-scale automated georeferencing of historical maps.
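To make the idea concrete, a minimal sketch of the georeferencing step is given below. It assumes a handful of GCPs obtained by OCR plus geocoding (the pixel and geographic coordinates shown are invented for illustration) and fits a simple affine transform with NumPy; the actual experiments may use a different transformation model.

```python
# Sketch only (our illustration, not the author's code): estimate an affine
# transform from pixel coordinates to geographic coordinates using GCPs
# obtained by OCR-ing map labels and geocoding the persisting place names.
import numpy as np

# Hypothetical GCPs: ((pixel_x, pixel_y), (lon, lat)).
gcps = [
    ((1204,  873), (19.040, 47.498)),   # e.g. a label geocoded to Budapest
    ((2310, 1541), (19.360, 47.190)),   # hypothetical second match
    (( 431, 2005), (18.760, 46.980)),   # hypothetical third match
]

pixels = np.array([[px, py, 1.0] for (px, py), _ in gcps])
world  = np.array([lonlat for _, lonlat in gcps])

# Least-squares fit of an affine transform: [lon, lat] = [px, py, 1] @ A
A, residuals, rank, _ = np.linalg.lstsq(pixels, world, rcond=None)

def pixel_to_lonlat(px, py):
    return tuple(np.array([px, py, 1.0]) @ A)

print(pixel_to_lonlat(1204, 873))   # should closely reproduce the first GCP
```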

Mátyás Gede, Valentin Árvai, Gergely Vassányi, Zsófia Supka, Enikő Szabó, Anna Bordács, Csaba Gergely Varga, Krisztina Irás
Automatic vectorisation of old maps using QGIS – tools, possibilities and challenges
Department of Cartography and Geoinformatics, ELTE Eötvös Loránd University, Budapest
Keywords: map vectorisation, QGIS, Python
Summary: The authors experimented with the open source GIS software QGIS to explore its possibilities for automatic vectorisation. The target object was a 1:200,000 map sheet of the Third Austro-Hungarian Military Survey. Although QGIS provides tools for the main steps of the vectorisation process – such as colour segmentation of the raster image, skeletonisation, vector extraction and vector post-processing – using them on an old map raises several challenges.

This paper discusses the results of these experiments and introduces a few simple Python scripts that may help users in the vectorisation process.
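For orientation, the kind of QGIS Processing calls that such scripts chain together might look as follows. This is an illustrative sketch intended for the QGIS Python console, and the input file name and algorithm parameters are placeholders rather than the values used in the paper.

```python
# Illustrative QGIS Processing sketch (parameters are placeholders).
import processing

# 1. Trace the connected regions of a colour-segmented (classified) raster
#    into polygons.
polygons = processing.run("gdal:polygonize", {
    "INPUT": "segmented_sheet.tif",   # hypothetical classified raster
    "BAND": 1,
    "FIELD": "class",
    "EIGHT_CONNECTEDNESS": False,
    "OUTPUT": "TEMPORARY_OUTPUT",
})["OUTPUT"]

# 2. Post-process the extracted vectors, e.g. smooth the pixel staircase
#    with a simplification step.
simplified = processing.run("native:simplifygeometries", {
    "INPUT": polygons,
    "METHOD": 0,                      # distance-based (Douglas-Peucker)
    "TOLERANCE": 2.0,                 # in layer units; needs tuning per map
    "OUTPUT": "vectorised_sheet.gpkg",
})["OUTPUT"]
```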

Geoff Groom, Gregor Levin, Stig Svenningsen, Mads Linnet Perner
Historical Maps – Machine learning helps us over the map vectorisation crux
Department of Bioscience, Aarhus University, Denmark; Department of Environment Science, Aarhus University, Denmark; Special Collections, Royal Danish Library, Copenhagen, Denmark
Keywords: Denmark, topographic maps, land category, OBIA, CNN, map symbols
Summary: Modern geography is massively digital with respect to both map data production and map data analysis. When we consider historical maps, a key resource for historical geography studies, the situation is different. Many historical maps are available as hardcopy, some of which have been scanned to raster data, but relatively few are truly digital in the sense of machine-readable geo-data layers. The Danish "Høje Målebordsblade" (HMB) map set, comprising approximately 900 geometrically correct, topographic 1:20,000 sheets surveyed between 1870 and 1899 and covering the national territory (i.e. Denmark of 1864–1899), is a case in point. Having the HMB maps as vector geo-data has a high priority for Danish historical landscape, environmental and cultural studies.

We present progress made during 2019 in deriving vector geo-data for key land categories (water, meadow, forest, heath, sand dune) from the scanned HMB printed map sheets. The focus here is on the role played in that work by machine learning, specifically the deep learning tool of convolutional neural networks (CNN), used to map occurrences of specific map symbols associated with the target land categories. We demonstrate how machine learning is applied in conjunction with pixel- and object-based analyses, rather than in isolation. Thereby the strengths of machine learning are utilised, while the weaknesses of the applied machine learning are acknowledged and worked with. Symbols detected by machine learning guide the choice of appropriate values for pixel-based image thresholding and the control of false-positive mapped objects and false-negative errors.

The resulting map products for two study areas (450 and 300 km²) have overall false-positive and false-negative levels of around 10% for all target categories. The ability to utilise the cartographic symbols of the HMB maps enabled the production of higher-quality vector geo-data for the target categories than would otherwise have been possible. That these methods were developed and applied here in commercial software (Trimble eCognition©) reflects the value of a tried-and-tested, easy-to-use graphical user interface and a fast, versatile processing architecture for developing new, complex digital solutions. The components of the resulting workflow are, in principle, also usable via various free and open source software environments.
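As a loose Python/NumPy reinterpretation of that idea (the actual workflow runs in eCognition and differs in detail; the percentile rule below is an assumption), symbol detections could guide a threshold roughly as follows:

```python
# Very loose illustration: derive a pixel-value threshold range from the map
# locations where a CNN detected a target land-category symbol.
import numpy as np

def threshold_from_symbols(gray_map, symbol_points, percentile=90):
    """gray_map: 2D greyscale array of the scanned sheet.
    symbol_points: (row, col) locations of CNN-detected symbols of one
    target land category (e.g. meadow tufts).
    Returns a binary mask of pixels whose value falls within the range
    observed at the detected symbols."""
    samples = np.array([gray_map[r, c] for r, c in symbol_points])
    lo = np.percentile(samples, 100 - percentile)
    hi = np.percentile(samples, percentile)
    return (gray_map >= lo) & (gray_map <= hi)
```
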
Marguerite le Riche
Identifying building footprints in historic map data using OpenCV and PostGIS
Registers of Scotland
Summary: The House Age Project (pilot) is a Scottish public sector collaboration between the Registers of Scotland, Historic Environment Scotland and the National Library of Scotland. The aim of the project is to develop a methodology to (automatically) extract high-quality building footprints from a time series of scanned historic maps, and to reconstruct life-cycle information for individual buildings by identifying and tracking them between datasets.

In order to achieve these ambitious aims, a number of research problems are being investigated using combined GIS, computer vision and entry-level machine learning approaches. The tools used to date are open source and include Python (in particular the computer vision library OpenCV), Geospatial Data Abstraction Library (GDAL) algorithms, PostgreSQL with its spatial extension PostGIS, and QGIS.

The source data are georeferenced, 600 dpi scans of the Ordnance Survey's 25-inch-to-the-mile County Series map sheets for the historic Scottish county of Edinburghshire, provided by the National Library of Scotland. The maps were produced between the mid-1800s and mid-1900s at a scale of 1:2500 and include detailed information about the built environment at the time of each survey.

This paper describes a GIS-oriented approach, whereby the raster images are transformed into georeferenced PostGIS geometries (via OpenCV contours) early in the analytical process, after initial image pre-processing and segmentation have been performed with OpenCV. The identification of buildings, joining of sheets and post-processing are then undertaken using Structured Query Language (SQL) queries passed to the database through the Python psycopg2 module. This approach makes it easy to attach metadata to individual buildings and helps to overcome, or at least partially address, some of the challenges in extracting precise building footprints from the source data. For example, SQL allows great flexibility in creating new attributes for, and establishing spatial relationships between, selected geometries (created from OpenCV contours) in the same or adjoining map sheets; this makes it possible to obtain high-quality building polygons despite significant variations in print quality. In addition, implementations of the OGC Simple Features standard make it possible to automatically resolve some of the unwanted text and artefacts created during skeletonisation.
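A condensed sketch of this contour-to-PostGIS step is shown below; the table name, connection string, thresholding choice and use of pixel coordinates are hypothetical stand-ins, not the actual House Age Project pipeline.

```python
# Illustrative sketch: OpenCV contours inserted into PostGIS via psycopg2.
import cv2
import psycopg2

image = cv2.imread("os_25inch_sheet.tif", cv2.IMREAD_GRAYSCALE)   # hypothetical pre-processed scan
_, binary = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

conn = psycopg2.connect("dbname=houseage")    # hypothetical connection string
with conn, conn.cursor() as cur:
    for contour in contours:
        if len(contour) < 3:
            continue                           # a polygon ring needs at least 3 vertices
        pts = [tuple(p[0]) for p in contour]
        pts.append(pts[0])                     # close the ring
        wkt = "POLYGON((" + ", ".join(f"{x} {y}" for x, y in pts) + "))"
        # Pixel coordinates are used here; in practice the affine geotransform
        # of the georeferenced scan would be applied as well.
        cur.execute(
            "INSERT INTO candidate_buildings (geom) "
            "VALUES (ST_MakeValid(ST_GeomFromText(%s, 0)))",
            (wkt,),
        )
```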

The limitations of the approach are also described. For example, the geometries created from the OpenCV contours are highly complex, which can cause unexpected results.

Problems that have resisted a fully automated solution using the GIS approach (or an OpenCV approach) will be listed. These include effects caused by some of the original cartographic choices, e.g. rejoining buildings that are bisected by overprinted, joined-up text and reconstructing walls that are indicated by dashed lines, as well as issues caused by the age of the original paper map sheets, e.g. damage and warping at sheet edges.

To date, it has not been possible to develop a fully automated extraction process due to the problems described above, but a semi-automated process has yielded preliminary building footprints for the study area for the 3rd and 4th edition maps, with the 5th edition to follow.

Kenzo Milleville, Steven Verstockt, Nico Van de Weghe
Improving Toponym Recognition Accuracy of Historical Topographic Maps
IDLab, Ghent University - Imec, Ghent, Belgium; CartoGIS, Ghent University, Ghent, Belgium
Keywords: automatic vectorization; historical maps; open source GIS; shape recognition
Summary: Scanned historical topographic maps contain valuable geographical information; often, these maps are the only reliable source of data for a given period. Many scientific institutions have large collections of digitized historical maps, typically annotated only with a title (general location), a date, and a short description. This allows researchers to search for maps of certain locations, but it gives almost no information about what is depicted on the map itself. To extract useful information from these maps, they still need to be analyzed manually, which can be very tedious. Current commercial and open-source text recognition systems fail to perform consistently when applied to maps, especially in densely annotated regions. Therefore, this work presents an automatic map processing approach focusing mainly on detecting the toponyms mentioned on the map and on georeferencing the map. Commercial and open-source tools were used as building blocks, to provide scalability and accessibility. As lower-quality scans generally decrease the performance of text recognition tools, the impact of both scan and compression quality was studied. Moreover, because most maps were too large to process as a whole with state-of-the-art recognition tools, a tiling approach was used; the tile and overlap sizes affect recognition performance, so a study was conducted to determine the optimal parameters.
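The tiling idea can be sketched as follows; the tile and overlap sizes here are placeholders, not the optimised values determined in the study.

```python
# Simple sketch of the overlapping-tile approach (illustrative values only).
def iter_tiles(width, height, tile_size=1024, overlap=128):
    """Yield (x, y, w, h) windows covering an image of the given size, with
    neighbouring tiles overlapping so that a label cut by one tile boundary
    is seen whole in the adjacent tile."""
    step = tile_size - overlap
    for y in range(0, height, step):
        for x in range(0, width, step):
            yield x, y, min(tile_size, width - x), min(tile_size, height - y)

# Each window is cropped from the scan and sent to the text recognition engine;
# detections are mapped back to full-map coordinates by adding (x, y), and
# duplicates in the overlap zone are merged afterwards.
```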

First, the map boundaries were detected with computer vision techniques. Afterward, the coordinates surrounding the map were extracted using a commercial OCR system. After projecting the coordinates to the WGS84 coordinate system, the maps were georeferenced. Next, the map was split into overlapping tiles, and text recognition was performed. A small region of interest was determined for each detected text label, based on its relative position. This region limited the potential toponym matches given by publicly available gazetteers. Multiple gazetteers were combined to find additional candidates for each text label. Optimal toponym matches were selected with string similarity metrics. Furthermore, the relative positions of the detected text and the actual locations of the matched toponyms were used to filter out additional false positives. Finally, the approach was validated on a selection of 1:25000 topographic maps of Belgium from 1975-1992. By automatically georeferencing the map and recognizing the mentioned place names, the content and location of each map can now be queried.
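The toponym-matching step can likewise be illustrated with a small string-similarity sketch; the gazetteer access and the similarity cutoff are simplified placeholders, not the system described above.

```python
# Minimal sketch of matching an OCR reading against gazetteer candidates.
import difflib

def best_toponym_match(ocr_text, candidate_names, cutoff=0.8):
    """Pick the gazetteer entry whose name is most similar to the OCR output.
    `candidate_names` would be the place names returned by the gazetteers for
    the small region of interest around the detected label."""
    matches = difflib.get_close_matches(ocr_text, candidate_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Example: a slightly garbled OCR reading still resolves to the right toponym.
print(best_toponym_match("Oostkarnp", ["Oostkamp", "Oostende", "Kortrijk"]))  # -> "Oostkamp"
```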

Előd Biszak, Gábor Timár
First vector components in the MAPIRE: 3D building modelling and Virtual Reality view
Arcanum Database Ltd., Budapest, Hungary; Dept. of Geophysics and Space Science, ELTE Eötvös Loránd University, Budapest, Hungary
Summary: MAPIRE is an originally raster-based portal presenting scanned and georeferenced historical topographic, cadastral and town maps as overlays on modern cartographic datasets such as OpenStreetMap or HERE Maps. Its first 3D option was implemented via the SRTM elevation dataset, still on a raster basis, showing 3D panoramas from a chosen viewpoint using the historical map layer. Here we introduce the first vector component of MAPIRE, developed in the Cesium software environment. To improve the city and town maps, we applied a new 3D building modelling utility (Mapbox Vector Tiles). The details of the historical buildings – which vary across the different time layers – are reconstructed from archive plans, from the photographic heritage provided by Arcanum (the host of MAPIRE), and from historical postcards. Unrevealed building parts are defined according to the 'usual outline' of the given age. Using this technology, we are able to provide virtual 3D tours for the users of MAPIRE, first within the frame of the 'Budapest Time Machine' project. In addition, a Virtual Reality (VR) tour is also available in the service, still on a raster basis: as a pilot project, users can follow pre-recorded paths on the historical map using VR glasses.