Anyone who wants to pursue a career in data science should study the following topics and acquire the following skills to position themselves well. This article provides a general overview.
Data Science – or Data & Analytics – is a broad field. Data Science is often referred to as the “sexiest job of the 21st century”, a claim I dare to doubt. In any case, those who prepare and evaluate data cannot do their work without data engineers, the people who make sure the data gets to where it needs to be in the first place. It is a far less glamorous job. But without data engineering up front, stable data pipelines and clean data that you always find where it belongs and when you need it, data science is not possible at all. This is completely underestimated by many and often forgotten.
In a scientific environment, things may look different.
Methods for dimension reduction
In short, dimensionality reduction is about transforming a data set so that it is less complex, i.e. has fewer dimensions. This should make the data both easier to understand and easier to analyze. The reduction must ensure that the data set continues to carry information that is as close to the original as possible: the analysis results should not be distorted by the dimension reduction, and the data must remain relevant for the question at hand. Dimension reduction makes it much easier to visualize correlations and differences between different groups.
A good dimension reduction separates the data that is unimportant for an analysis from the important data and removes it from the data set. This reduces complexity and makes the data easier for humans and algorithms to analyze. Good data visualization is also a form of dimension reduction: it allows us to focus on the variables that we know or suspect will contribute most to clarifying the issue under investigation.
From the point of view of statistics and data analysis, dimension reduction attempts to reduce the number of random variables in a data set. There are a number of different methods for this, listed here in no particular order and without any claim to completeness; a short code sketch of a principal component analysis follows the list.
- Simple data aggregations (sums, averages, etc.) for describing a data set and generating initial insights, a method not to be underestimated
- Missing value analyses
- Principal component analyses
- Factor analyses
- Cluster analyses
- Regression analyses
- Random forest analyses
- Decision trees
- Canonical correlation analyses
- Low variance analyses
- Multidimensional scaling
- Correspondence analyses
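As an illustration, here is a minimal sketch of a principal component analysis with scikit-learn (introduced later in the Python section). The data set and the number of components are made up for the example; in practice you would work with your own data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up example data: 100 observations with 5 partly correlated features
rng = np.random.default_rng(42)
base = rng.normal(size=(100, 2))
noise = rng.normal(scale=0.1, size=(100, 3))
X = np.hstack([base, base @ rng.normal(size=(2, 3)) + noise])

# Standardise the features, then reduce the 5 dimensions to 2 principal components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

# How much of the original variance do the 2 components still explain?
print(pca.explained_variance_ratio_)
```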
Data visualisation
As more and more data comes our way, it is all the more important to make these data volumes understandable and interpretable. One way of doing this is the dimension reduction already mentioned. Presenting data simply and clearly in order to gain fundamental insights, understand connections and recognise trends is an art form and a skill that takes years of training. In day-to-day collaboration, it leads to an improved exchange of information between different teams.
Data visualisation enables decision-makers at all levels to gather information quickly and in a targeted manner. Ideally, interactive diagrams enable them to carry out their own in-depth analyses. The goal is to make robust decisions based on business data. Properly prepared, data visualisations enable non-experts to understand and think through complex matters.
A well-crafted data visualisation translates data into a form that is easier to understand, removes irrelevant information and makes the important and useful information visible. Ideally, the data visualisation also tells a story that interests and engages viewers and gives them options to decide and react.
The most important business tools for data visualisation include Tableau, PowerBI and Qlik Sense. These tools offer a comprehensive portfolio for professionally analysing and visualising data in a corporate environment. Excel and PowerPoint are also among the very frequently used visualisation tools. Office licences are available in most companies and whatever is cost-effective and available is used. In Big Data environments, however, there is usually no getting around the business tools mentioned first.
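Alongside these business tools, simple charts can of course also be produced in code. Here is a minimal sketch with Matplotlib (covered in the Python section below); the campaign figures are invented for the example.

```python
import matplotlib.pyplot as plt

# Invented example figures: weekly sales for two hypothetical campaigns
weeks = ["W1", "W2", "W3", "W4"]
campaign_a = [120, 135, 160, 155]
campaign_b = [90, 110, 150, 170]

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(weeks, campaign_a, marker="o", label="Campaign A")
ax.plot(weeks, campaign_b, marker="o", label="Campaign B")
ax.set_xlabel("Week")
ax.set_ylabel("Sales (units)")
ax.set_title("Weekly sales by campaign (fictitious data)")
ax.legend()
plt.tight_layout()
plt.show()
```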
Methods for the classification of data
In times of ever increasing amounts of data, the classification of this data is becoming more and more important. Put simply, this is about methods, processes and tools that help organise data into different categories. This simplifies the later use of the data because it makes it easier to find and retrieve.
Usually, metadata that is attached to the original data via tagging processes is used for classification. In addition to making data easier to find, classification also contributes to data security in companies, for example by classifying data as public, restricted or private and applying different checking and protection methods accordingly.
In a corporate context, the aim is to automate data classification as far as possible. In addition to the use of regular expressions and machine learning algorithms, there are many commercial providers offering solutions in this area; a small rule-based sketch follows below.
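As an illustration, here is a minimal, purely rule-based sketch of such a classification with regular expressions. The categories and patterns are made up for the example; a real solution would combine rules like these with machine learning models and a proper metadata catalogue.

```python
import re

# Hypothetical classification rules: category -> regular expression
RULES = {
    "personal_data": re.compile(r"\b(customer|email|phone|address)\b", re.IGNORECASE),
    "financial": re.compile(r"\b(invoice|iban|revenue|payment)\b", re.IGNORECASE),
}

def classify(text: str) -> str:
    """Return the first matching category, or 'public' as a fallback."""
    for category, pattern in RULES.items():
        if pattern.search(text):
            return category
    return "public"

documents = [
    "Invoice 2023-004711, payment due in 14 days",
    "Customer email addresses for the newsletter",
    "Press release about the new product line",
]

for doc in documents:
    print(f"{classify(doc):15s} | {doc}")
```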
Regression modelling
Linear regression analysis is one of the most commonly used statistical analysis techniques in marketing. In simple terms, a regression examines how well the values of one variable can be predicted using one or more other variables. We look at the relationship between variables using a predictive function. The stronger the correlation between variables, the better that variable can be predicted using the other variables.
In simple linear regression, only one influencing variable, the predictor, is considered. In multiple linear regression, on the other hand, several influencing factors are included, and the analysis can become more accurate because more of the variance of the dependent variable, the criterion, can be explained.
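A minimal sketch of both variants with scikit-learn; the advertising figures are invented for the example and only serve to show that a second predictor can increase the explained variance (R²).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented example data: sales driven by ad spend and the number of ad contacts
rng = np.random.default_rng(0)
ad_spend = rng.uniform(10, 100, size=200)
ad_contacts = rng.integers(1, 10, size=200)
sales = 5 + 0.8 * ad_spend + 3.0 * ad_contacts + rng.normal(scale=5, size=200)

# Simple linear regression: one predictor
X_simple = ad_spend.reshape(-1, 1)
r2_simple = LinearRegression().fit(X_simple, sales).score(X_simple, sales)

# Multiple linear regression: two predictors usually explain more variance
X_multi = np.column_stack([ad_spend, ad_contacts])
r2_multi = LinearRegression().fit(X_multi, sales).score(X_multi, sales)

print(f"R² simple:   {r2_simple:.3f}")
print(f"R² multiple: {r2_multi:.3f}")
```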
Regression analyses are used to examine different criteria of marketing campaigns. For example, it can be analysed to what extent different age groups differ in their online shopping behaviour, how many advertising media contacts are necessary for an optimal advertising effect or to what extent additional sales are possible by increasing or shifting the media budget.
Discriminant analysis
Discriminant analysis is used whenever group differences need to be investigated. The main questions are usually whether the different groups differ significantly from each other and which group characteristics are suitable or unsuitable for these distinctions. Discriminant analyses start with the definition of the groups. The group definition may already be predetermined by the business problem to be solved (for example, the analysis of different car models). However, it may also be the case that the groups are predetermined by another statistical procedure, such as cluster analysis.
In marketing, discriminant analysis is used, for example, to define different groups of buyers (e.g. frequent, infrequent and non-buyers, or thrifty customers versus buyers with an affinity for luxury). The basic idea is to optimally separate these groups from each other by combining several independent variables. The goal is to make the groups as distinct as possible while the model also explains these differences well.
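A minimal sketch with the linear discriminant analysis from scikit-learn; the buyer groups and features are invented for the example.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Invented example data: two features per customer (orders per year, average basket value)
rng = np.random.default_rng(1)
frequent = rng.normal(loc=[24, 40], scale=[4, 10], size=(50, 2))
infrequent = rng.normal(loc=[4, 60], scale=[2, 15], size=(50, 2))
X = np.vstack([frequent, infrequent])
y = np.array(["frequent"] * 50 + ["infrequent"] * 50)

# Fit the discriminant model on the predefined groups
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# How well do the two features separate the groups, and where does a new customer fall?
print("Accuracy on the training data:", lda.score(X, y))
print("Predicted group for a new customer:", lda.predict([[20, 45]]))
```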
Cluster analyses
Cluster analysis is a structure-discovering method. This means that it is used to identify groups in data sets, based on the properties of the objects to be examined. The discriminant analysis explained in the previous paragraph assumes existing groups. Cluster analysis does not; the groups are only created as its result.
In cluster analysis, all the properties of the objects to be examined are used to divide them into groups. Distance or similarity measures are used to quantify how similar or different the objects are. The aim of the analysis is that the groups determined at the end are as different from each other as possible.
In marketing, cluster analysis is mainly used to segment customer groups. The segmentation criteria can be socio-demographic factors, but also psychological attitudes or specific buying behaviour. Forming selective groups generally enables a better understanding of customers and a targeted customer approach, as well as minimising scattering losses when designing media and marketing campaigns.
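A minimal sketch of a customer segmentation with k-means from scikit-learn; the features and the number of clusters are chosen arbitrarily for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented example data: age and average monthly spend for 300 customers
rng = np.random.default_rng(7)
age = rng.uniform(18, 70, size=300)
spend = rng.uniform(10, 500, size=300)
X = StandardScaler().fit_transform(np.column_stack([age, spend]))

# Divide the customers into 3 segments
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Size of each segment
print(np.bincount(labels))
```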
Data taxonomy development
Data taxonomies are one of the areas that are often overlooked in Data Science, although they are of central importance. Without clean and structured data, all subsequent analytical steps are of little value. Developing data taxonomies means classifying data into categories and subcategories. This enables a uniform view of one’s own data stock and makes it much easier to understand the relationships between different data points.
Ideally, a data taxonomy creates a uniform terminology across the different systems used for data processing. It forces clarity in the differentiation of the categories used. In most cases, this leads to a better understanding of your own data. Like many of the other methods described here, a taxonomy also serves to reduce the complexity when looking at data. It often makes it possible to make aggregated statements in the first place. In short, a taxonomy helps to categorise data so that it can be used as efficiently as possible. The goal is to ensure that all data in the organisation in question are aligned with each other.
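To make this more concrete, here is a minimal sketch of what a deliberately tiny, made-up data taxonomy and a simple tagging step could look like in code; real taxonomies are of course maintained in dedicated catalogue or governance tools.

```python
# Hypothetical mini-taxonomy: category -> subcategories -> keywords used for tagging
TAXONOMY = {
    "customer": {
        "contact": ["email", "phone", "address"],
        "behaviour": ["orders", "returns", "clicks"],
    },
    "finance": {
        "billing": ["invoice", "payment", "iban"],
    },
}

def tag(column_name: str) -> list:
    """Assign a column name to all matching (category, subcategory) pairs."""
    name = column_name.lower()
    matches = []
    for category, subcategories in TAXONOMY.items():
        for subcategory, keywords in subcategories.items():
            if any(keyword in name for keyword in keywords):
                matches.append((category, subcategory))
    return matches

print(tag("customer_email"))    # [('customer', 'contact')]
print(tag("invoice_amount"))    # [('finance', 'billing')]
```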
Know relevant data sources and data structures
In order to be able to carry out good and meaningful analyses, it is important to know your own data landscape well. On the one hand, you should be familiar with the data generated within your own company. As a market-oriented company, you should also have the data of your customers and prospects under control. Finally, it is important to know, and to be able to apply, the standard KPIs used in your industry for measurement and benchmarking.
The requirements and the complexity of the data will differ greatly here, depending on the industry you are in and the use cases you are working on. In a media agency, for example, I not only have to deal with the data of the media and marketers, but also with the vastly different data landscapes of my clients. In a consulting environment, the demands on these skills are therefore very high, as you constantly have to familiarise yourself with new data structures.
If, however, I only work in a specific industry, such as logistics or consumer goods, then the data to be examined is likely to be less complex overall, as the data and the use cases are easier to delineate here. This does not mean that the statistical methods used are less complex, quite the opposite.
The most important IT skills in Data Science
There are a variety of programming languages and applications in the Data Science & Analytics environment, but the following skills have emerged as particularly relevant.
Python
Python is an open source high-level language that offers a modern approach to object-oriented programming. It offers a wide range of mathematical, statistical and scientific functions. In addition, there are many free libraries that can be used for your own projects.
One of the main reasons for the widespread use of Python in business, science and research is its user-friendliness, coupled with a simple syntax. This allows people without a technical background to familiarise themselves with the language relatively quickly.
The most important Python libraries:
- NumPy: mathematical functions for processing vectors, matrices and, more generally, large multidimensional arrays.
- Pandas: one of the most popular libraries for data processing and data analysis. In particular, it contains data structures and operators for accessing numerical tables and time series.
- Matplotlib: allows mathematical representations of all kinds to be generated.
- SciPy: an open source library for scientific computing, visualisation and related tasks; it is closely related to NumPy.
- Scikit-learn: a widely used library for machine learning. It is based on SciPy and NumPy.
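A minimal sketch of how these libraries interact in everyday work; the small sales table is made up for the example.

```python
import numpy as np
import pandas as pd

# Made-up mini data set: orders per region
df = pd.DataFrame({
    "region": ["North", "South", "North", "West", "South"],
    "revenue": [120.0, 80.5, 99.9, 150.0, 60.0],
})

# Simple aggregation: total and average revenue per region
summary = df.groupby("region")["revenue"].agg(["sum", "mean"])
print(summary)

# NumPy works hand in hand with pandas, e.g. for a log transformation
df["log_revenue"] = np.log(df["revenue"])
print(df.head())
```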
R
R is a free programming language for statistical calculations and for creating graphics. Compared to Python, it is more specialised in mathematical and statistical functionalities. It is mainly used to perform statistical analyses and develop data visualisations. R’s statistical functions also make it easier to clean, import and analyse data. Many data science teams are bilingual and use both R and Python, for example Python for the general parts of a program and R for the mathematical/statistical components. For Python, but also for other high-level languages, there are a variety of interfaces to R to use the R components in your own programme code. With Shiny it is possible to develop web applications directly from R.
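As an illustration of such an interface, here is a minimal sketch using the rpy2 package, one common bridge between Python and R; it assumes that both R and rpy2 are installed locally.

```python
# Requires a local R installation plus the rpy2 package (pip install rpy2)
import rpy2.robjects as robjects

# Run a small piece of R code from Python: a linear model on R's built-in mtcars data
result = robjects.r("""
    model <- lm(mpg ~ wt, data = mtcars)
    summary(model)$r.squared
""")

print("R-squared computed in R:", result[0])
```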
MySQL
MySQL is a widely used and open source relational database system that uses the SQL (Structured Query Language) language. This technology is very important for several reasons. Firstly, it is very widespread, in particular it is often used with web servers to generate dynamic web pages. Furthermore, the basic technology is open source and free of charge, which has also led to its frequent use. Finally, SQL is also relatively easy to learn and enables us to store data in a structured way in a database and perform basic analyses even without in-depth computer science knowledge. To do more complex analyses, we then use the programming languages and libraries mentioned above, all of which can of course access MySQL databases directly.
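A minimal sketch of how such an analysis could look from Python, using SQLAlchemy and pandas; the connection string, database and table names are placeholders for the example.

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string: adjust user, password, host and database name
# (the mysql+pymysql dialect additionally requires the PyMySQL package)
engine = create_engine("mysql+pymysql://user:password@localhost/shop_db")

# Read the result of an SQL query straight into a pandas DataFrame
query = """
    SELECT customer_id, SUM(amount) AS total_revenue
    FROM orders
    GROUP BY customer_id
    ORDER BY total_revenue DESC
    LIMIT 10
"""
top_customers = pd.read_sql(query, engine)
print(top_customers)
```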
Excel
An overview of current data science tools would of course be incomplete without Microsoft Excel. The reasons for this are manifold:
- Excel offers a variety of powerful statistical functions.
- As far as the two-dimensional representation of data is concerned, Excel is largely unbeaten. No wonder, since Microsoft has had a quasi-monopoly here since the 1980s. Other office suites have since caught up, but in my opinion Microsoft continues to set the benchmark here.
- It is installed on most computers in the office sector and is thus a quasi-standard. Standards are not to be underestimated. Many people who would never dare to use a database use Excel and can process data in it. Excel is therefore an important data source in the corporate context.
- VBA is a very comprehensive scripting language that makes it possible to develop and run complex statistical applications in the Excel/Office environment.
- Python and Excel can work well together. With PyXLL there is an Excel add-in that makes it possible to use Excel as a user interface for Python applications. Conversely, there are many ways to work with Excel files from Python, for example with libraries such as pandas or openpyxl; a small sketch follows below.
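A minimal sketch of reading and writing Excel files with pandas; the file, sheet and column names are placeholders, and handling .xlsx files additionally requires the openpyxl package.

```python
import pandas as pd

# Placeholder file name and sheet: reading .xlsx files requires openpyxl
df = pd.read_excel("sales_2023.xlsx", sheet_name="Sheet1")

# A small analysis step: total revenue per month (assumes 'month' and 'revenue' columns)
monthly = df.groupby("month")["revenue"].sum().reset_index()

# Write the result back to a new Excel file
monthly.to_excel("monthly_revenue.xlsx", index=False)
print(monthly.head())
```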