Interview – ODF Sweden

Interview with Francis Freire from the PLAN-SUBSIM project

Felicia Ridderbjelke — Mon, 04 Apr 2022 11:17:19 +0000

Francis Freire works at the governmental agency Geological Survey of Sweden (SGU) and was interviewed by the ODF Sweden Team Members Yixin Zhang (Department of Applied IT at Gothenburg University, Responsible for WP 2 Continuous evaluation & innovation) and Felicia Ridderbjelke (Community Curator at ODF) for his contribution to the project PLAN-SUBSIM. The project is a national implementation of a PLatform for ANalysis of SUBSea IMages to develop methods for monitoring and analysing the status of the subsea habitats. The project will leverage existing methods, knowledge and infrastructure in the field of subsea image analysis and implement these for applications in marine resource management.

Francis is a marine geologist who surveys the Swedish coastal waters using hydroacoustic techniques and collecting sediment samples and high-resolution underwater images to produce full-coverage benthic habitat maps. Analysing these data is extremely time consuming and ODF will therefore use our machine-learning approach to speed up and scale out the analysis of the surveys.

Yixin Zhang: Can you tell us a little bit about yourself and the project?

Francis Freire: I’m a marine geologist by background and use marine geophysical methods to study seafloor geology. Together with people from my department, we survey the Swedish coastal waters to determine what type of seafloor can be found where. We do habitat mapping based on the data that we collect during the surveys, and that’s basically how we came in contact with ODF. The Geological Survey of Sweden is the government agency assigned to do the geophysical surveys in Swedish territorial waters, and we wanted to make the surveying more efficient and to characterize the seafloor geology.

When we go out into the field, we use different geophysical equipment and have our own boat. The trips can be around 2 weeks and we are about 10 people on the boat to map a particular place. We also do geophysical sampling, which means that we use acoustic multi-beam systems to determine and characterize the seafloor. We use other acoustic systems to investigate what’s underneath the seafloor, the subseafloor. We take samples of the sediments in the coastal waters and do a lot of chemical analysis to determine the level of toxins.

I am probably more interested in the habitat mapping part of the seafloor, the part where we record underwater videos and acoustic data. When we collect the data, we use machine learning to come up with habitat maps of the area that we survey. For example, we had a big project in Hoburgs bank, which is in the Baltic part of Swedish waters. It was a very comprehensive survey and we were there for almost two months.

Yixin Zhang: I once read that we actually know more about planet Mars compared to the seafloor. How little or much do we actually know about the seafloor?

Francis Freire: That is probably true. We try to map as much as we can and we already know quite a lot about Swedish waters. But overall, there are still a lot of areas that are not mapped. The biggest gap is probably not in Europe or in the US but in the bigger ocean areas, for example in the middle of the Atlantic.

Yixin Zhang: What are habitat mapping and geophysical surveying?

Francis Freire: Habitat mapping is when you try to identify all the habitats that can be found in the waters. There are around 40 different classified habitats depending on the coverage of the area. An example of this is the project called HELCOM in the Baltic Sea, where the key component is habitat mapping. There are so many ways to define the habitats, and I think it is our ethical mandate to follow this directive and create habitat maps.

Regarding geophysical surveying, that is when we go out into the field and collect geophysical data, mostly acoustic data. We send out an acoustic pulse which will bounce back from the seafloor to the boat and give us information about the seafloor. We also measure the amount of sound that comes back, which gives us an idea of the seafloor material. With this acoustic system, we send out 10 or 15 samples for every square metre. We then get very dense data, with information down to a resolution of a decimetre. So for every decimetre of the seafloor, we get an acoustic pulse. We also collect underwater videos and photos and sometimes also just collect “real samples” by just going out in the ocean and grabbing anything that is there to see what materials can be collected.
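The echo-sounding principle described here can be written as a one-line calculation: the pulse travels down to the seafloor and back, so the depth is half the distance it covers. The sketch below is only an illustration. The function name is hypothetical, and the default sound speed of 1500 m/s is an assumed round figure for seawater, not a value from the interview; real surveys measure the sound-speed profile on site.

```python
def depth_from_echo(two_way_time_s: float, sound_speed_m_s: float = 1500.0) -> float:
    """Return water depth in metres from the two-way travel time of a pulse.

    sound_speed_m_s is an assumed nominal value; it varies with
    temperature, salinity and pressure.
    """
    if two_way_time_s < 0:
        raise ValueError("travel time must be non-negative")
    # The pulse goes down and comes back, so only half the path is depth.
    return sound_speed_m_s * two_way_time_s / 2.0


# Example: an echo that returns after 40 milliseconds.
print(depth_from_echo(0.04))
```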

Finally, we then use machine learning to interpolate all this data to get a clearer picture of the habitat maps. In brief, the acoustic data give us information for the whole area, whereas the underwater videos, pictures, and samples only give us information for specific points. Then we interpret and combine all this data.
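The idea of combining full-coverage acoustic data with labels that exist only at sampled points can be sketched with a very simple classifier. The interview does not say which model SGU uses, so the nearest-centroid rule below is a stand-in chosen for brevity. The feature pair (backscatter in dB, depth in metres), the habitat names and all numbers are made up for illustration.

```python
from collections import defaultdict
from math import dist


def fit_centroids(samples):
    """Average the acoustic features of the ground-truth points per habitat.

    samples: list of ((backscatter_db, depth_m), habitat_label).
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (backscatter, depth), label in samples:
        acc = sums[label]
        acc[0] += backscatter
        acc[1] += depth
        acc[2] += 1
    return {label: (s[0] / s[2], s[1] / s[2]) for label, s in sums.items()}


def predict(centroids, features):
    """Assign the habitat whose centroid is closest to the given features.

    Note: features are used unscaled here; a real workflow would
    normalise them so that no single unit dominates the distance.
    """
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical labelled points (from photos and samples).
ground_truth = [
    ((-12.0, 20.0), "mussel reef"),
    ((-14.0, 22.0), "mussel reef"),
    ((-28.0, 45.0), "mud"),
    ((-30.0, 50.0), "mud"),
]
centroids = fit_centroids(ground_truth)

# Hypothetical acoustic cells with no photo coverage.
grid_cells = [(-13.5, 19.0), (-27.0, 48.0)]
habitat_map = [predict(centroids, cell) for cell in grid_cells]
print(habitat_map)
```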

A report that describes the habitat mapping process in the HELCOM project: High-resolution benthic habitat mapping of Hoburgs bank, Baltic Sea (2020), Gustav Kågesten, Finn Baumgartner, and Francis Freire. Available at:

A figure from the report (Kågesten, Baumgartner, and Freire, 2020) illustrates ocean surveying and the instruments.

Figures from the report (Kågesten, Baumgartner, and Freire, 2020) illustrate the photo mosaic of seafloor habitat.

Yixin Zhang: What motivated you to join the Subsim-project?

Francis Freire: I think the motivation for the whole team was that we wanted to facilitate the processing of our collected underwater pictures and videos. The idea behind the project is to use an algorithm and a fast computer to do the identifying work for us. To collect our data and feed it all into a computer would save us a lot of time.

Yixin Zhang: Considering all the data that you are gathering, processing and analyzing, how many hours do you actually invest in analyzing a five or ten-minute long video?

Francis Freire: For example, we have collected close to 600 sampling points around the Hoburgs bank. For each of the photos, we need to find a way to upload it to our software and then identify the percentage for everything that covers the seafloor. For example, how large is the percentage of mussels? It probably takes around 30 minutes to one hour to analyze one picture. Then you also have to do some cross-checking afterward, so another person verifies that the identification is right.

Figures from the report (Kågesten, Baumgartner, and Freire, 2020) illustrate camera and sensor set up, and underwater images mosaic.

Yixin Zhang: If I understand it right, you monitor 600 different observation points. How many pictures do you take per site?

Francis Freire: Depends on the area. The more diverse areas require more pictures. When the area is small, the resolution is high and the pictures become very clear. Before we start to photograph, we receive information from our multi-beam system or backscatter system that can give an idea about the depth.

Yixin Zhang: What challenges do you expect for similar projects in the future?

Francis Freire: For now, I think the biggest challenge is to be able to use artificial intelligence in our algorithm so we can make the identification even faster. If the algorithm can identify with good enough confidence, we can monitor more and bigger areas by feeding more pictures into our software and then processing them efficiently. This algorithm can then identify what percentage of the area was covered by, for example, algae, so that we can later do the habitat mapping for the sites.
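Once each point or pixel in a photo has been assigned a class, whether by a person or by an algorithm, the percentage cover mentioned here is a simple count. The sketch below assumes the labels are already available as a list of class names; the function name and the example labels are hypothetical.

```python
from collections import Counter


def percent_cover(labels):
    """Return the share of each class, in percent, among the labelled points.

    labels: iterable of class names, one per annotated point or pixel.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labels given")
    return {name: 100.0 * n / total for name, n in counts.items()}


# Made-up annotation of one photo: ten points, three classes.
photo_labels = ["mussels"] * 6 + ["algae"] * 3 + ["sand"] * 1
cover = percent_cover(photo_labels)
print(cover)
```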

Yixin Zhang: Has machine learning been used in the work you do?

Francis Freire: I do not think we have enough manpower to do that just yet. We have been looking for some partners who can help us with this for a long time now, so we were happy to get in contact with ODF.

Felicia Ridderbjelke: From all of the photos you have collected, is there anything in particular that has surprised you?

Francis Freire: One interesting finding is that many seafloor covers are not permanent. The seafloor changes depending on the season. However, it is quite repetitive and looks similar for every season. We have also found a high percentage of mussels in Baltic areas that differ from the species on the West Coast. Something that is not part of the project, but that interests me, is some of the shipwrecks that we have found.

Felicia Ridderbjelke: During which months have you done these trips?

Francis Freire: We started the project in 2016, and our survey season runs from April to October.

Yixin Zhang: What are the challenges with your ocean trips?

Francis Freire: Well, you work 24 hours a day and if you get the night shift you work from 8 pm to 5 am. There was this one time, I think I threw up like three times during one trip. Also, the problem with our ship is that sometimes it makes a lot of noise which makes it hard to sleep. Sometimes the boat is shaking too much which makes it hard to eat. But when you have collected the data and produced the product, you get really satisfied and happy.

The post Interview with Francis Freire from the PLAN-SUBSIM project appeared first on ODF Sweden.

Interview with ODF Sweden data scientist Jurie “Jannes” Germishuys

Torsten Linders — Wed, 04 Mar 2020 21:58:07 +0000

Jannes is from Combine Control Systems AB, and he was interviewed by ODF Sweden Team Members, Yixin Zhang from the Department of Applied IT, Gothenburg University, and Adrian Bumann from the Entrepreneurship and Strategy Department at Chalmers University of Technology.

Yixin together with Adrian are responsible for continuous evaluation within the ODF Sweden project and are conducting interviews as part of this process.

Yixin: Could you guide us through the process of the first ODF Sweden Innovation Cycle?

Jannes: Roughly six months ago, ODF Sweden started to work on our first innovation cycle focused on the use case: Predicting the presence of the invasive species Dikerogammarus villosus (aka the Killer Shrimp) in the Baltic Sea region.

The process can be roughly broken down into the following:

Data collection and feature selection: The data were downloaded from various open APIs including EMODnet and Marine Copernicus. The features, selected based on the input of marine experts, were temperature, salinity, depth, substrate and wave activity.

Data preparation and cleaning: Missing data were removed, and features were visualized. In this case, we noticed that the data were heavily skewed towards the absence class, which meant there was an extremely high class imbalance. To address this, we used an oversampling method that increased the instances of the “presence” class by creating synthetic cases based on the original presence cases.

Setup of training and test sets: The training set is what we use to train the model, whilst the test set is an independent dataset used for evaluation. An 80/20 split was used in this case.

Choosing the model: Primarily tree-based models were used: a single decision tree and random forest (shown above). The main difference between them can be explained with an analogy: if you ask a single person a question, there is a smaller chance of getting the right answer (assuming they’re not an expert) than if you average the opinion of a whole group (just like in the show “Who Wants to Be a Millionaire”).

Training the models: All the models chosen were trained with their standard configurations in scikit-learn and fast.ai Python libraries for easy replication.

Evaluating the models: The models were scored on their ability to correctly predict the locations where the killer shrimp would be present, which is termed recall.

Interpreting model output: Using our model, we are able to get a probability that a particular point belongs to our presence class and produce an interactive web app to showcase the outputs.

Continue until output is actionable: Throughout the entire process, we had to adapt our methods as new information became available and we learned more about the problem, which is almost always the case in machine learning problems.
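Three of the steps above (oversampling the presence class, the 80/20 split, and scoring by recall) can be sketched in plain Python. This is not the project's code: the interview says the models were trained with scikit-learn and fast.ai, and it does not name the oversampling method. The interpolation below is a simplified stand-in in the spirit of SMOTE, it assumes binary labels, and all data are made up. One design choice in this sketch that the interview does not state: the split is done first and only the training part is oversampled, so that no synthetic point derived from a test row ends up in training.

```python
import random


def train_test_split(rows, labels, test_fraction=0.2, seed=0):
    """Shuffle indices and hold out a fraction as an independent test set."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    n_test = round(test_fraction * len(idx))
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    def pick(ids):
        return [rows[i] for i in ids], [labels[i] for i in ids]

    return pick(train_idx), pick(test_idx)


def oversample_minority(rows, labels, minority=1, seed=0):
    """Add synthetic minority rows, each interpolated between two real ones,
    until the minority class is as large as the rest of the data."""
    rng = random.Random(seed)
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    if len(minority_rows) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    n_needed = len(rows) - 2 * len(minority_rows)
    out_rows, out_labels = list(rows), list(labels)
    for _ in range(max(0, n_needed)):
        a, b = rng.sample(minority_rows, 2)
        t = rng.random()  # position along the segment from a to b
        out_rows.append(tuple(x + t * (z - x) for x, z in zip(a, b)))
        out_labels.append(minority)
    return out_rows, out_labels


def recall(y_true, y_pred, positive=1):
    """Share of true presence points that were also predicted as presence."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if tp + fn else 0.0


# Made-up data: 50 points with two features, 13 labelled "presence" (1).
rows = [(float(i), float(i % 7)) for i in range(50)]
labels = [1 if i % 4 == 0 else 0 for i in range(50)]

(train_x, train_y), (test_x, test_y) = train_test_split(rows, labels)
train_x, train_y = oversample_minority(train_x, train_y)
print(len(test_x), train_y.count(1), train_y.count(0))

# Recall on a toy prediction: two of three presence points were found.
print(recall([1, 1, 0, 1], [1, 0, 0, 1]))
```

Recall is the natural score here because missing a location where the killer shrimp is present is the costly error; it says nothing about false alarms, so it is usually read alongside precision.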

Yixin: Why tree-based models?

Jannes: For several reasons:

Simplicity: No feature selection is needed (as they have been expertly chosen), no need to pre-process features (avoid unnecessary biases).

Interpretability: Black box methods seem great on paper but in practice they lack transparency when evaluating model output. Tree methods allow us to look into each decision and see what influenced its output.

Incremental models: We wanted to start off simple and show the shortcomings of simple decision trees to justify more complex model choices such as Random Forest.

Hint of experience: Tree-based models have been shown to work well for tabular datasets such as in our case.
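The step from a single decision tree to a random forest rests on the "ask the audience" intuition: many imperfect voters, taken together, are right more often than one. A small calculation makes this concrete. It assumes every voter is right with the same probability and that voters are independent, which is an idealisation; the trees in a real random forest are correlated, so the gain is smaller in practice. The function name is hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_voters: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent voters is right,
    when each voter is right with probability p_correct."""
    if n_voters % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    k_min = n_voters // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(k_min, n_voters + 1)
    )


# One voter, three voters, and a large "audience", each 60% reliable.
for n in (1, 3, 101):
    print(n, majority_vote_accuracy(n, 0.6))
```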

Yixin: Why did you add a deep neural network?

Jannes: We added a deep neural network to show the value of this method in extracting complex features.

Yixin: What were the challenges when working with ocean data?

Jannes: I think there are several challenges to consider.

First, I would say that ocean data can be quite intimidating. Working with geospatial information means not only looking at the data but looking at it in the right way.

One of our challenges was understanding coordinate reference systems (CRS), which determine where points are located on a map. Since the Earth is spherical, each CRS represents a projection onto a flat 2D surface for visualisation. We are all familiar with one such system, the latitudes and longitudes we see on Google Maps, also known as WGS84 (or EPSG:4326). But as it turns out, each data provider has its own favourite CRS and so re-projecting between these is often necessary when performing comparisons and calculations. Luckily, many Python packages such as GDAL and Rasterio help us to simplify this process.
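To show what re-projecting involves, here is a hand-written conversion from WGS84 longitude and latitude (EPSG:4326) to Web Mercator metres (EPSG:3857). Web Mercator is chosen only as a familiar target for illustration; the interview does not say which systems the project converted between, and in real work the libraries mentioned above would do this rather than a hand-rolled formula. The function name is hypothetical.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS84 semi-major axis, used by Web Mercator


def wgs84_to_web_mercator(lon_deg: float, lat_deg: float):
    """Convert EPSG:4326 lon/lat in degrees to EPSG:3857 x/y in metres."""
    if not -90.0 < lat_deg < 90.0:
        # The projection is undefined at the poles; web maps usually
        # clip at roughly +/-85.05 degrees.
        raise ValueError("latitude must be strictly between -90 and 90")
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4 + math.radians(lat_deg) / 2)
    )
    return x, y


# Approximate position of Gothenburg, for illustration only.
print(wgs84_to_web_mercator(11.97, 57.71))
```

The same point has very different numbers in the two systems, which is why datasets from different providers have to be brought into one CRS before distances or overlaps are computed.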

Another major challenge was interpreting inland data. Since we had no information available about inland water sources, we had to match these to the closest ocean, which proved to be difficult and inaccurate because we had to make assumptions such as “inland water is just as salty as sea water”. This led to large biases in our initial results and led us to revisit this assumption and ultimately abandon it when we obtained additional presence data in the Baltic Sea.

Yixin: What were the initial reactions when you presented the ML model to the team?

Jannes: I would say an equal mix of intrigue and confusion. Although the model results seem impressive, it is difficult to understand what these metrics mean until you have had an opportunity to work with the data and modelling yourself.

Yixin: Was it difficult to find the relevant data?

Jannes: The data exist and are plentiful on open data platforms. But the data lie on multiple siloed systems with no central access point or methodology, and the difficulty lies more in extracting the relevant data in the correct format.

EMODnet Central Portal, https://www.emodnet.eu/portals

Yixin: As a data science expert, what do you consider as limitations of the ML solution in this context of predicting invasive species?

Jannes: The output from any ML model is only as good as its assumptions and the data used. So, one limitation of this model is that we have insufficient data to make high confidence predictions. It is also limited as a predictor using only data points and not entire grids, which could be useful as areas in the grid close to each other usually have a strong relationship to one another.

Yixin: Is there anything you would like to share with data scientists who start working with ocean data?

Jannes: I would say that data scientists should be very critical of any methods they are “comfortable” with when shifting to geospatial ocean data. For example, if you simply sample data points from a large area in the ocean and then split your datasets into training and test sets, the distribution of the training and test data will be so similar that the test set effectively “leaks” into the training set, which leads us to be overconfident in our model predictions.
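One common way to reduce this kind of leakage is to split by location rather than by individual point: the area is divided into grid blocks, and every point in a block goes to the same side of the split. The sketch below is an illustration under assumptions, not the method used in the project. The block size of one degree, the function names and the sample points are all hypothetical.

```python
import math
import random


def block_key(point, block_size_deg):
    """Return the grid block (column, row) that a (lon, lat) point falls in."""
    lon, lat = point
    return (math.floor(lon / block_size_deg), math.floor(lat / block_size_deg))


def spatial_block_split(points, block_size_deg=1.0, test_fraction=0.2, seed=0):
    """Split point indices so that whole blocks go to either train or test."""
    blocks = {}
    for i, point in enumerate(points):
        blocks.setdefault(block_key(point, block_size_deg), []).append(i)
    keys = sorted(blocks)
    random.Random(seed).shuffle(keys)
    # Hold out at least one block; with a single block the train set is empty.
    n_test = max(1, round(test_fraction * len(keys)))
    test = [i for k in keys[:n_test] for i in blocks[k]]
    train = [i for k in keys[n_test:] for i in blocks[k]]
    return train, test


# Made-up sampling points spread over roughly 10-20 E, 55-60 N.
points = [(10 + 0.1 * i, 55 + 0.05 * i) for i in range(100)]
train_idx, test_idx = spatial_block_split(points)
print(len(train_idx), len(test_idx))
```

Because neighbouring points never straddle the split, the test score is a more honest estimate of how the model behaves in areas it has not seen.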

Map of Killer Shrimp distribution in Baltic Sea

Yixin: What relation to the ocean and ocean data did you have before this project? How was your experience as data scientist, working with ocean data?

Jannes: I had never worked with ocean data specifically, so this was all rather new to me. It was very rewarding since ocean data expanded my toolkit to deal with a broader range of datasets and tools for future use cases (especially geospatial data).

Web application output: http://odf-open-data.herokuapp.com

Yixin: Is there anything you would like to share with ocean data experts who start learning AI?

Jannes: Always question the output of the model and trust your gut because as a subject expert you have the experience to judge what is reasonable.

Yixin: Looking back, what part of the work process was most time consuming?

Jannes: Extracting data from the respective sources took up the bulk of the time, since there is no central place to get all the information we need.

Yixin: What could you suggest to ocean data providers about how the data could be better prepared for use, in terms of accessibility, format, or other aspects?

Jannes: One of our main goals within ODF Sweden is to encourage and enable FAIR data practices. This means that any and all data we use should be findable, accessible, interoperable and reusable. This includes open data sources through open APIs, open code sharing on GitHub and public notebooks on Kaggle. With this in mind, we would recommend that all data providers improve and align documentation standards. We also hope that datasets will become more searchable and that new datasets will be promoted to boost research efforts.

Yixin: Finally, the next step will be to publish parts of the ML model on Kaggle, an online machine-learning and problem-solving community. What results do you hope for?

Jannes: I hope to engage with a broad audience from diverse backgrounds who are interested in learning more about ocean data and data science or contributing their expertise and insights to build on and improve on our models. I also hope to showcase what we have done in ODF Sweden and to share our data and insights with a large and active online community.
