Hello! My name is Malwina Jurkowska and I’m a Data Scientist at Edit. My job is putting data to use; helping a business understand how its performing, spotting useful patterns in data and trying to predict what the future might hold are all in my wheelhouse. I’ve been a Data Scientist for five years now and I’ve worked on all sorts of exciting projects such as film recommendations, customer clustering and modelling customer behaviours.
One of the most important and surprising parts of my job is explaining what I do, not just to curious family, but also to clients and people I work with. Terminology and jargon can be rather scary and so in this blog I will de-mystify some of the terms data professionals use and hopefully make your Halloween a little less scary!
- Data – it is everywhere! When thinking about data, many people imagine a spreadsheet with numbers – and this is indeed data! But there are so many more types of data than most people imagine. We can broadly categorise data into two main categories:
- Structured/Tabular data – the pumpkin spice! These are your classical tables full of information, where each data point has a defined structure and so you can reliably query it. It is estimated to only account for a small proportion of all available data in the world, but this format best lends itself best to analysis and Machine Learning.
- Unstructured data – this can be just about anything, from images, texts, video and voice clips. It is data that doesn’t have the same set structure between data points and we will often attempt to convert unstructured data into structured data, because the lion’s share of data science techniques only work with structured data.
- Datum/Observation/Point/Row – this is a single piece of data, or a row in a table. For example, if your dataset contains movies made in the 1980s, then Ghostbusters is an example of a data point.
- Variables/Features/Fields/Columns – this is anything that might describe a data point, or a column in a dataset. For example, if your dataset contains movies made in the 1980s a feature might be the movie title, year of release, director and duration.
- Data Analyst/Scientist/Engineer – although there are many more kinds of data professionals, these are the main types – bear in mind that every organisation is different and often a data professional will dip into all three roles at some point!
- Data Analysts are concerned with describing the data landscape and delivering insights. They will often have skills required to produce stunning visualisations and dashboards!
- Data Scientists are mainly concerned with Machine Learning and building models. They combine the descriptive nature of data analysis with modelling to deliver predictions and segmentation to aid decision-making
- Data Engineers deal with data infrastructure and productionising. They maintain databases, pipelines and other resources required to keep everything running easy as pumpkin pie!
- ML or Machine Learning – sounds scary, but Machine Learning is simply using algorithms and statistical methods to create models which are able to learn patterns and provide insights without needing explicit instructions. The type of ML we use depends on the problem we’re trying to solve.
- Supervised learning – This is Machine Learning applied to a problem which we have an answer to. For example, based on previous years’ Trick-or-Treating experience, we want to predict which houses will give the most sweets this year!
- Unsupervised learning – This is Machine Learning applied to more open-ended problems. The most common problem in Unsupervised Learning is clustering, or trying to create meaningful groups in a population. In our Trick-or-Treating example, we may want to group houses according to common characteristics and visit similar houses all at once!
- Reinforcement learning – This is Machine Learning applied to a problem where we get live feedback based on the actions chosen, with an aim to maximise some kind of reward. In our delicious Trick-or-Treating example, we might have a live algorithm learning according to how we approached a house and whether we got our sweet reward! Eventually, the algorithm would be able to advise us accurately what to do for 100% effective Trick-o-Treating!
- Model – a model in data science is something that generalises a dataset into a pattern that can be applied to future data or something that extracts meaning from the data. A model is fed variables and usually outputs something we want to know about each data point. For example, our clients often ask for a model that might predict whether a particular customer will “churn” i.e. stop buying their products, based on what they already know about the customer.
- Not to be confused with a Data Model! That is a representation of how tabular data links together!
- Deep Learning is an area of Machine Learning concerned with building Neural Networks, which are a particularly complex family of Machine Learning models. Neural Networks are algorithms inspired by the human brain, mimicking the way that neurons signal to one another.
Figure 1 A Neural Network
- Hyperparameters – these are essentially settings that we need to choose when building models. An example of a hyperparameter when building a model which puts people into similar-minded groups, is the number of groups I’d like the model to return. Data Scientists often talk about “Hyperparameter Tuning” which is the process of finding what values of hyperparameters yield the best performing model!
- EDA – this is Exploratory Data Analysis and it is the process of understanding data and it must always precede model building. Data Analysis usually leads to some sort of a direct output, like a dashboard or a presentation, but EDA simply feeds into making the model build as good as it can be.
- Feature Engineering – remember when I mentioned features or columns? Well, this is the process of making new ones, sometimes using already existing features and sometimes using other sources.
Thanks for reading and Happy ‘Spooktober’ to you all!