How to conduct a data science investigation

28 Aug 2019

The job of a Data Scientist, in general, is to take some (surprise, surprise) data from the real world and use a mathematical and computational method (what we call a Model) to extract useful knowledge from the data to help us make better decisions, or to solve a problem. Like other sciences, data science is conducted via an empirical research method or investigation. 

During my internship here at Adarga, I got to experience being a Data Scientist, conducting a data science investigation to solve a key problem for adarga_bench™.

The process I followed in this investigation - analogous to that of other Data Scientists - went something like this:

  1. Problem Decomposition

Most problems in Data Science (and Computer Science in general) cannot be tackled immediately as a whole - they are often ensembles of various, simpler sub-problems.

The first step is to decompose the task at hand into more easily manageable components. Beyond the obvious benefit of making a large problem easier to solve, breaking a task down offers useful insight into the structure its solution will need, and allows work to be divided more effectively among a team of Data Scientists. In addition, seemingly unrelated large problems often share sub-problems, so a solution to one of these may be re-usable in a future investigation.

  2. Research State-of-the-Art Models

While some Data Scientists focus on mathematics-heavy research to develop novel models, my investigation was oriented more towards finding out how I could adapt state-of-the-art data science research to my problem. This involved reading through a large variety of recent Data Science papers and publications found on Google Scholar, ResearchGate, arXiv, etc., and producing a summary of each promising one.

The aspects of the papers I looked at most closely were: relevance to the problem at hand, the complexity of the models, and their claimed performance. Writing about the research as I read it was very important for consolidating my own understanding of these models, and it gives an easier point of reference both for myself later in the investigation and for other Data Scientists at Adarga should it become relevant to their work.

  3. Get a Benchmark Dataset

When a new model is proposed in Data Science, it is very important to benchmark it against existing models to get an objective measure of how much better or worse it is. 

This involves using a dataset (a large, organised collection of data) to complete a specific task, measuring performance on that task, and then comparing these measurements between models. The best benchmarking datasets and tasks are open ones that are easily accessible to other Data Scientists (remember: you are comparing your work to others' in the community) and that contain high-quality data - there is no point measuring how good your model is if the data you benchmark it against is inadequate for the task at hand. Over the years, several tasks have become de facto standards in the community, so it's often easy to find a benchmark for your own investigation.
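One practical consequence of comparability is that every model should see exactly the same split of the benchmark data. As a minimal sketch of that idea (the records and split fraction here are a toy stand-in, not the dataset from my investigation), a seeded shuffle makes the split reproducible:

```python
import random

def train_test_split(records, test_fraction=0.2, seed=42):
    """Split a dataset reproducibly so every model sees the same benchmark."""
    rng = random.Random(seed)      # fixed seed -> identical split on every run
    shuffled = records[:]          # copy, so the original dataset is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# Toy stand-in for a real benchmark dataset: (features, label) pairs.
dataset = [((i, i % 3), i % 2) for i in range(100)]
train, test = train_test_split(dataset)
print(len(train), len(test))  # 80 20
```

Because the seed is fixed, re-running the split (or sharing the seed with a colleague) yields the same train and test sets, which is what makes scores comparable across models.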

  4. Get an Evaluation Framework

While it’s great to have a high-quality dataset and a task with many existing benchmarks to compare against, it is useless unless you also have some framework to consistently evaluate your models on them. 

To take proper and consistent measurements of performance, it is key to either implement your own Evaluation Framework or find an existing one that allows you to easily run the same tests on a variety of models and collect results robustly. From an engineering perspective, it is also good practice to use re-usable tools for such repetitive tasks: they ensure reproducibility and prevent us from re-inventing the wheel.
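At its core, such a framework can be a single scoring routine that works on any model exposing a common interface. This sketch illustrates the pattern; the `MajorityBaseline` model and the toy data are hypothetical, invented here for illustration:

```python
def evaluate(model, test_set):
    """Score any model exposing predict(features) on the same held-out data."""
    correct = sum(1 for features, label in test_set
                  if model.predict(features) == label)
    return correct / len(test_set)

class MajorityBaseline:
    """Trivial reference model: always predicts the most common training label."""
    def fit(self, train_set):
        labels = [label for _, label in train_set]
        self.guess = max(set(labels), key=labels.count)
        return self

    def predict(self, features):
        return self.guess

# Hypothetical toy data: (features, label) pairs.
train = [((i,), 0) for i in range(6)] + [((i,), 1) for i in range(6, 10)]
test = [((i,), i % 2) for i in range(10)]

baseline = MajorityBaseline().fit(train)
print(evaluate(baseline, test))
```

Any candidate model that implements the same `predict` interface can be dropped into `evaluate` unchanged, which is exactly what makes the measurements between models comparable.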

  5. Implement Models

Now we can finally get to the fun part - implementing our models! 

We take the mathematics laid out in the papers we researched and implement it in code - and there are countless ways to do so. You can build your model specifically to work with the dataset and task you chose, or make a generalised implementation that can be re-used and plugged into many contexts and problems (remember: different problems often share sub-problems).
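As an illustration of what a generalised implementation looks like, here is a deliberately simple nearest-centroid classifier (a stand-in example, not one of the models from my investigation): it makes no assumptions about the number of classes or the length of the feature vectors, so it could be reused across tasks that share this sub-problem.

```python
import math

class NearestCentroid:
    """Minimal classifier: label a point by its closest class centroid."""

    def fit(self, X, y):
        # Group feature vectors by label.
        by_label = {}
        for features, label in zip(X, y):
            by_label.setdefault(label, []).append(features)
        # One centroid per class: the per-dimension mean of its vectors.
        self.centroids = {
            label: tuple(sum(col) / len(rows) for col in zip(*rows))
            for label, rows in by_label.items()
        }
        return self

    def predict(self, features):
        # Closest centroid (Euclidean distance) wins.
        return min(self.centroids,
                   key=lambda lbl: math.dist(features, self.centroids[lbl]))

model = NearestCentroid().fit([(0, 0), (1, 1), (9, 9), (8, 8)], [0, 0, 1, 1])
print(model.predict((2, 2)), model.predict((7, 7)))  # 0 1
```

Nothing here is tied to a particular dataset or task, which is the trade-off the paragraph above describes: slightly more abstraction now in exchange for re-use later.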

  6. Evaluate and Tune Models

Using your evaluation framework, you can now see how well your models perform. You can use these results to benchmark them against existing models, or to improve the models themselves. In fact, in my own investigation I spent more time fine-tuning and optimising my models than I did initially implementing them. This involved making small, progressive tweaks to my original model and re-evaluating it to check whether each change was beneficial.

Tuning is particularly necessary when adapting a general-purpose model to different tasks, as the optimal settings are task-dependent.
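A tuning loop in miniature might look like the following sketch, where the tunable setting is a decision threshold and the validation scores are invented for illustration: try each candidate value, re-evaluate, and keep the best.

```python
def accuracy(threshold, scored_examples):
    """Fraction of examples where (score >= threshold) matches the true label."""
    correct = sum(1 for score, label in scored_examples
                  if (score >= threshold) == label)
    return correct / len(scored_examples)

# Hypothetical validation data: (model confidence score, true boolean label).
validation = [(0.1, False), (0.2, False), (0.4, False),
              (0.55, True), (0.7, True), (0.9, True), (0.35, True)]

# Small, progressive tweaks: evaluate each candidate setting, keep the best.
candidates = [i / 20 for i in range(21)]  # 0.0, 0.05, ..., 1.0
best = max(candidates, key=lambda t: accuracy(t, validation))
print(best, accuracy(best, validation))
```

Real tuning sweeps more settings than one threshold, but the shape is the same: the evaluation framework from step 4 supplies the score, and the loop only ever keeps a change that measurably improves it on held-out data.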

While this investigation process may seem reductive to more senior Data Scientists (especially those conducting novel research), for those getting into the field, such as myself, it provides an easy-to-follow scientific method that ensures investigations are conducted empirically and make use of good software engineering practices.
