Interview with ODF Sweden data scientist Jurie “Jannes” Germishuys
Jannes is from Combine Control Systems AB, and he was interviewed by ODF Sweden Team Members, Yixin Zhang from the Department of Applied IT, Gothenburg University, and Adrian Bumann from the Entrepreneurship and Strategy Department at Chalmers University of Technology.
Yixin and Adrian are responsible for continuous evaluation within the ODF Sweden project and are conducting interviews as part of this process.
Yixin: Could you guide us through the process of the first ODF Sweden Innovation Cycle?
Jannes: Roughly six months ago, ODF Sweden started work on our first innovation cycle, focused on the use case of predicting the presence of the invasive species Dikerogammarus villosus (also known as the killer shrimp) in the Baltic Sea region.
The process can be roughly broken down into the following:
Data collection and feature selection: The data were downloaded from various open APIs, including EMODnet and Marine Copernicus. Based on the input of marine experts, the selected features were temperature, salinity, depth, substrate and wave activity.
Data preparation and cleaning: Missing data were removed, and features were visualized. In this case, we noticed that the data were heavily skewed towards the absence class, which meant there was extreme class imbalance. To address this, we used an oversampling method that increased the instances of the “presence” class by creating synthetic cases based on the original presence cases.
Setup of training and test sets: The training set is what we use to train the model, whilst the test set is an independent dataset used for evaluation. An 80/20 split was used in this case.
Choosing the model: We primarily used tree-based models: a single decision tree and a random forest (shown above). The main difference is that a random forest combines the answers of many trees. As an analogy, if you put a question to a single person who is not an expert, they are less likely to get it right than the averaged opinion of a whole group (just like in the show “Who Wants to Be a Millionaire?”).
Training the models: All the models chosen were trained with their standard configurations in scikit-learn and fast.ai Python libraries for easy replication.
Evaluating the models: The models were scored on their ability to correctly predict the locations where the killer shrimp would be present. This metric is termed recall: the share of actual presence locations that the model correctly identifies.
Interpreting model output: Using our model, we are able to get a probability that a particular point belongs to our presence class and produce an interactive web app to showcase the outputs.
Continue until output is actionable: Throughout the entire process, we had to adapt our methods as new information became available and we learned more about the problem, which is almost always the case in machine learning problems.
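As a rough illustration of three of the steps above (oversampling, the 80/20 split and recall), here is a minimal sketch in plain Python. It is not the project's code. The interview does not name the oversampling method, so the random-pair interpolation below is an assumed, simplified stand-in in the spirit of SMOTE, and all function names and toy feature values are hypothetical.

```python
import random


def oversample_presence(rows, labels, target_ratio=1.0, seed=0):
    """Add synthetic presence rows (label 1) until there are about
    target_ratio * (number of absence rows) presence rows.

    Assumption: each synthetic row is a random point on the line segment
    between two randomly chosen real presence rows.
    """
    rng = random.Random(seed)
    presence = [r for r, y in zip(rows, labels) if y == 1]
    n_absence = sum(1 for y in labels if y == 0)
    n_needed = max(0, int(target_ratio * n_absence) - len(presence))
    new_rows, new_labels = list(rows), list(labels)
    if not presence:
        return new_rows, new_labels  # nothing to interpolate from
    for _ in range(n_needed):
        a, b = rng.choice(presence), rng.choice(presence)
        t = rng.random()  # position along the segment from a to b
        new_rows.append([x + t * (z - x) for x, z in zip(a, b)])
        new_labels.append(1)
    return new_rows, new_labels


def split_80_20(rows, labels, seed=0):
    """Shuffle the row indices and keep 80% for training, 20% for testing."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    cut = int(0.8 * len(idx))
    train, test = idx[:cut], idx[cut:]
    return ([rows[i] for i in train], [labels[i] for i in train],
            [rows[i] for i in test], [labels[i] for i in test])


def recall(y_true, y_pred):
    """Share of actual presence cases (1) that were predicted as presence."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical toy data: [temperature, salinity], 8 absence and 2 presence rows.
rows = [[4.0 + i, 6.0] for i in range(8)] + [[15.0, 7.0], [17.0, 7.5]]
labels = [0] * 8 + [1] * 2

balanced_rows, balanced_labels = oversample_presence(rows, labels)
train_x, train_y, test_x, test_y = split_80_20(balanced_rows, balanced_labels)
print(len(balanced_rows), sum(balanced_labels), len(train_x), len(test_x))
```

The sketch follows the order in which the steps are listed above. The interview does not say whether oversampling was applied before or after the split. A common precaution is to oversample only the training rows, so that synthetic cases derived from a test row never end up in the training set.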
Yixin: Why tree-based models?
Jannes: For several reasons:
Simplicity: No further feature selection is needed (the features were chosen by experts), and there is no need to pre-process the features, which avoids introducing unnecessary biases.
Interpretability: Black box methods seem great on paper but in practice they lack transparency when evaluating model output. Tree methods allow us to look into each decision and see what influenced its output.
Incremental models: We wanted to start off simple and show the shortcomings of simple decision trees to justify more complex model choices such as Random Forest.
Prior experience: Tree-based models have been shown to work well for tabular datasets such as ours.
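The “ask a whole group” intuition behind moving from a single tree to a random forest can be made concrete with a small calculation. The sketch below assumes that every voter is right with the same probability and that voters err independently of each other. Trees in a real forest only approximate this, so the numbers are an idealised illustration rather than a statement about the project's models. The function name is hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_voters, p_correct):
    """Probability that more than half of n independent voters are right,
    when each voter is right with probability p_correct (binomial sum)."""
    k_min = n_voters // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(k_min, n_voters + 1)
    )


# One voter versus increasingly large groups, each voter right 60% of the time.
for n in (1, 3, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions the group's accuracy rises as voters are added, as long as each voter does better than chance. If each voter is right less than half the time, the effect reverses.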
Yixin: Why did you add a deep neural network?
Jannes: We added a deep neural network to show the value of this method in extracting complex features.
Yixin: What were the challenges when working with ocean data?
Jannes: I think there are several challenges to consider.
First, I would say that ocean data can be quite intimidating. Working with geospatial information means not only looking at the data but looking at it in the right way.
One of our challenges was understanding coordinate reference systems (CRS), which determine where points are located on a map. Since the Earth is roughly spherical, showing it on a map means projecting it onto a flat 2D surface, and a CRS defines how coordinates relate to real locations. We are all familiar with one such system, the latitudes and longitudes we see on Google Maps, also known as WGS84 (or EPSG:4326). But as it turns out, each data provider has its own favourite CRS, so re-projecting between them is often necessary when performing comparisons and calculations. Luckily, many Python packages such as GDAL and Rasterio help us to simplify this process.
Another major challenge was interpreting inland data. Since we had no information available about inland water sources, we had to match these points to the closest ocean. This proved to be difficult and inaccurate because it required assumptions such as “inland water is just as salty as sea water”. These assumptions led to large biases in our initial results, so we revisited them and ultimately abandoned this approach once we obtained additional presence data in the Baltic Sea.
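To show what re-projecting involves, here is a minimal sketch that converts WGS84 longitude and latitude (EPSG:4326) to spherical Web Mercator metres (EPSG:3857) and back, using only the standard library. The choice of Web Mercator as the target is an assumption made for illustration, since the interview does not say which systems the project converted between, and the function names are hypothetical. In practice a library such as GDAL or Rasterio would handle this, including systems with far more involved formulas.

```python
import math

# Sphere radius used by EPSG:3857 (equal to the WGS84 semi-major axis), in metres.
EARTH_RADIUS_M = 6378137.0


def wgs84_to_web_mercator(lon_deg, lat_deg):
    """Forward spherical Mercator: degrees in, metres out."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4 + math.radians(lat_deg) / 2)
    )
    return x, y


def web_mercator_to_wgs84(x, y):
    """Inverse spherical Mercator: metres in, degrees out."""
    lon_deg = math.degrees(x / EARTH_RADIUS_M)
    lat_deg = math.degrees(2 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2)
    return lon_deg, lat_deg


# A point in the Baltic Sea region (approximate, illustrative coordinates).
x, y = wgs84_to_web_mercator(18.07, 59.33)
print(x, y)
print(web_mercator_to_wgs84(x, y))
```

The same location has very different numbers in the two systems, which is why layers from different providers have to be brought into one CRS before they are compared.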
Yixin: What were the initial reactions when you presented the ML model to the team?
Jannes: I would say an equal mix of intrigue and confusion. Although the model results seem impressive, it is difficult to understand what these metrics mean until you have had an opportunity to work with the data and modelling yourself.
Yixin: Was it difficult to find the relevant data?
Jannes: The data exist and are plentiful on open data platforms. However, they sit in multiple siloed systems with no central access point or common methodology, so the difficulty lies less in finding the data than in extracting the relevant data in the correct format.
EMODnet Central Portal, https://www.emodnet.eu/portals
Yixin: As a data science expert, what do you consider as limitations of the ML solution in this context of predicting invasive species?
Jannes: The output from any ML model is only as good as its assumptions and the data used. So, one limitation of this model is that we have insufficient data to make high-confidence predictions. It is also limited because it predicts from individual data points rather than entire grids. Using grids could be useful, as areas close to each other in a grid usually have a strong relationship to one another.
Yixin: Is there anything you would like to share with data scientists who start working with ocean data?
Jannes: I would say that data scientists should be very critical of any methods they are “comfortable” with when shifting to geospatial ocean data. For example, if you simply sample data points from a large area in the ocean and then split your datasets into training and test sets, the distribution of the training and test data will be so similar that the test set effectively “leaks” into the training set, which leads us to be overconfident in our model predictions.
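One common way to reduce this kind of leakage is to hold out whole geographic blocks instead of individual points, so that test points are not immediate neighbours of training points. The sketch below is an assumption-laden illustration of that idea, not the approach ODF Sweden used (the interview does not describe one). The function name, the one-degree cell size and the toy coordinates are all hypothetical.

```python
import math
import random


def spatial_block_split(points, cell_deg=1.0, test_fraction=0.2, seed=0):
    """Split (lon, lat) points so that every grid cell is entirely in the
    training set or entirely in the test set.

    Returns (train_indices, test_indices) into the original list. With
    very few cells the training set can end up empty.
    """
    cells = {}
    for i, (lon, lat) in enumerate(points):
        key = (math.floor(lon / cell_deg), math.floor(lat / cell_deg))
        cells.setdefault(key, []).append(i)

    keys = sorted(cells)  # sorted first so the shuffle is reproducible
    random.Random(seed).shuffle(keys)
    n_test = max(1, round(test_fraction * len(keys)))  # at least one test cell

    test_idx = [i for k in keys[:n_test] for i in cells[k]]
    train_idx = [i for k in keys[n_test:] for i in cells[k]]
    return train_idx, test_idx


# Toy points: 10 one-degree cells along latitude 55.5, with 5 points in each.
points = [(10 + c + 0.1 * j, 55.5) for c in range(10) for j in range(5)]
train_idx, test_idx = spatial_block_split(points)
print(len(train_idx), len(test_idx))
```

Points near a cell border can still be close to points in a neighbouring cell, so block size matters. Larger blocks separate the two sets more strongly but leave fewer independent blocks to split.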
Map of Killer Shrimp distribution in Baltic Sea
Yixin: What relation to the ocean and ocean data did you have before this project? How was your experience as a data scientist working with ocean data?
Jannes: I had never worked with ocean data specifically, so this was all rather new to me. It was very rewarding since ocean data expanded my toolkit to deal with a broader range of datasets and tools for future use cases (especially geospatial data).
Web application output: http://odf-open-data.herokuapp.com
Yixin: Is there anything you would like to share with ocean data experts who start learning AI?
Jannes: Always question the output of the model and trust your gut because as a subject expert you have the experience to judge what is reasonable.
Yixin: Looking back, what part of the work process was most time consuming?
Jannes: Extracting data from the respective sources took up the bulk of the time, since there is no central place to get all the information we need.
Yixin: What could you suggest to ocean data providers about how the data could be better prepared for use, in terms of accessibility, format, or other aspects?
Jannes: One of our main goals within ODF Sweden is to encourage and enable FAIR data practices. This means that any and all data we use should be findable, accessible, interoperable and reusable. This includes open data sources through open APIs, open code sharing on GitHub and public notebooks on Kaggle. With this in mind, we would recommend that all data providers improve and align documentation standards. We also hope that datasets will become more searchable and that new datasets will be promoted to boost research efforts.
Yixin: Finally, the next step will be to publish parts of the ML model on Kaggle, an online machine learning and problem-solving community. What results do you hope for?
Jannes: I hope to engage with a broad audience from diverse backgrounds who are interested in learning more about ocean data and data science or contributing their expertise and insights to build on and improve on our models. I also hope to showcase what we have done in ODF Sweden and to share our data and insights with a large and active online community.