
Detecting Anomalies in Large Data Sets

Data has become a common concern recently. Both companies and individuals have had to deal with information in multiple ways in order to improve operations or gain insight from them. IT has enabled unprecedented levels of data management for both parties. This blog, however, focuses on companies' management of data, more specifically in the auditing sector.

Fraud has taken place in markets throughout history. I like to think of Imtech as an example; for those who aren't familiar with the company, there's no need to worry, I'm sure you have an example of your own. Returning to the topic of data and fraud, it is becoming increasingly difficult to accurately identify potential fraud cases as the amount of available information grows. This has given rise to the use of computer algorithms to detect these cases (Pandit, Chau, Wang & Faloutsos, 2007).

An interesting way to tackle this challenge is to use mathematical laws of large numbers to detect anomalies within these data sets. One particularly interesting example is the application of Benford's Law to detect fraud in company documentation (Kraus & Valverde, 2014). In short, Benford's Law states that 30.1% of random, naturally occurring numbers start with a 1; 17.6% with a 2; 12.5% with a 3, and so on. Logically this makes sense given our counting structure. This can be expressed as,

P(d) = log10(1 + 1/d), where d is a digit in {1, 2, …, 9} and P(d) is the probability of a number starting with d.
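As a rough sketch of how this could be checked in practice (a minimal example based only on the formula above; real audit tooling would add statistical tests such as chi-squared):

```python
import math
from collections import Counter

def benford_expected(d):
    # P(d) = log10(1 + 1/d) for leading digit d in 1..9
    return math.log10(1 + 1 / d)

def leading_digit(x):
    # scale |x| into [1, 10) and take the integer part
    x = abs(x)
    while x < 1:
        x *= 10
    while x >= 10:
        x /= 10
    return int(x)

def observed_distribution(numbers):
    # observed frequency of each leading digit 1-9 in a set of nonzero numbers
    counts = Counter(leading_digit(x) for x in numbers if x != 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

# Benford predicts ~30.1% of leading digits are 1
print(round(benford_expected(1), 3))  # 0.301
```

Comparing `observed_distribution` of, say, invoice amounts against `benford_expected` would flag data sets whose leading digits deviate suspiciously from the law.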

Despite the fact that this method seems promising, Kraus and Valverde (2014) could not find any outstanding peculiarities in their data set of known fraud perpetrators. However, the law does serve as a starting point for a drill-down approach to discovering perpetrators. Which brings us to the more strategic question: will IT ever develop a way to outsmart fraud perpetrators in this context? Is an eternal drill-down chase ever going to take the lead?

What do you think? Will this ever be the case? Is there any way you thought this might work out?

I think it's pointless. Of course, like everything, IT methods have their degree of accuracy. However, I firmly believe there will never be a way to completely ensure an honest and transparent market. Not long ago I heard a man say, "Does anybody here know what EBITDA stands for? Exactly. Earnings Before I Tricked the Dumb Auditor." It's human nature, and it might take millennia before that changes ever so slightly.

It was nice to write a couple of blogs here. Till the next time!


Kraus, C., & Valverde, R. (2014). A data warehouse design for the detection of fraud in the supply chain by using the Benford's law. American Journal of Applied Sciences, 11(9), 1507-1518.

Pandit, S., Chau, D. H., Wang, S., & Faloutsos, C. (2007, May). Netprobe: a fast and scalable system for fraud detection in online auction networks. In Proceedings of the 16th international conference on World Wide Web (pp. 201-210). ACM.

There is a tsunami of data headed our way

Big Data is the term for data sets so large and complex that traditional data processing applications are not sufficient to process them. This is because we are saving data on a large scale beyond operational data, mainly due to the rise of the internet. At this moment, Big Data is still quite manageable, but there is an emerging technology that will make today's Big Data seem like Small Data.

I'm talking about the rise of the Internet of Things. Right now, this technology is at the height of the Peak of Inflated Expectations of the Gartner Hype Cycle[1]. The number of 'things' is expected to grow exponentially in the coming years, because these 'things' can be any object with built-in sensors: cars, phones, watches, doors, shoes, light switches, anything.


Cisco expects the number of connected devices to grow from 18 billion in 2015 to 50 billion in 2020[2]. Each of these 'things' will generate a massive amount of data, contributing to the already large influx. Right now, about 4.4 ZB (that is, 4.4 trillion gigabytes) of data is generated each year[3]. This is expected to grow to more than 40 ZB in 2020. Companies that save data simply for the sake of saving data are at risk of being washed away by the tsunami of data headed their way.
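Those two figures imply a steep compound growth rate. A quick back-of-the-envelope check (assuming a five-year horizon from 2015 to 2020; the exact base year of the 4.4 ZB figure is not stated):

```python
# Implied compound annual growth rate (CAGR) from 4.4 ZB/year to 40 ZB/year,
# assuming a five-year horizon (2015-2020)
start_zb, end_zb, years = 4.4, 40.0, 5
cagr = (end_zb / start_zb) ** (1 / years) - 1
print(f"{cagr:.1%}")  # roughly 55% growth per year
```

In other words, the volume of data generated would more than halve-again every two years under these assumptions.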

The NSA, for example, an institution we have all heard of, is already drowning in data. It has adopted a strategy of filtering out data that does not contain useful content, in order to diminish the amount of data it has to save by about 20%.[4] Instead of just taking all there is, analysts are now trying to figure out what is actually necessary. Of course, it is very hard to know in advance what information is useful and what is not. But companies will have to start thinking about this topic and about what to do with all that data headed their way.

According to Computerworld, the coming trend in Big Data and Business Intelligence will be to structure the collection of data and to know exactly what data is collected.[5] I think they are quite right: I have experienced first-hand that companies collect data for the sake of collecting, without knowing what is actually important. Only after you have done some analysis do you find out what is useful and what is not. But in the coming age of the Internet of Things, with so many gigabytes of data coming our way, this strategy is simply no longer feasible.

Tony Jordan – 400986






Modern Wars: How Information Technology Changed Warfare


On the 30th of September, Vladimir Vladimirovich Putin, President of Russia, announced that Russia had conducted its first air strikes in Syria, targeted at ISIS (or ISIL). In the days after, the United States of America and other countries began to question Russia's motives and its use of old-school bombing technology, which might harm civilians and inflame the civil war in Syria (CNN/Time, 2015). According to US officials, Russian bombing technology lags behind American weaponry in terms of accuracy. As such moves increase the tensions between East and West, and businesses use information technology to reach their goals, I started to research how information technology has changed warfare over time.

A B-2 stealth bomber refuels.


The main goals of warfare have not really changed, but the way wars evolve and are waged certainly has. Just a hundred and twenty years ago, armies marched to battle in their uniforms, lined up against one another, and mainly used weapons with a short effective range. Thus, people who killed one another were always in close proximity. Later on, longer-range weapons emerged, and the distance between soldiers grew larger and larger. Today, some countries have the capability to destroy towns without being physically at the site, or even within hundreds of miles of it. All due to the introduction of IT in modern warfare, which enables people to fight wars at the touch of a button. The instantaneous transfer of information through the Internet, and its availability around the world, increases the number of participants in war. Unarmed actors thousands of miles away can participate in a conflict from behind their computers, providing funding or (video/picture) information through the Internet or deep web.


Deep Learning: Teaching Machines To Act Human


Recently there has been an increase in news articles about AI: Artificial Intelligence. Some very smart people are concerned about the progress made in the field of advanced machine learning. Among them are serial entrepreneur Elon Musk, the famous researcher Stephen Hawking, and legendary philanthropist Bill Gates. All of them signed an open letter expressing their concern about the future of AI. One trigger for the letter was a video from Google-owned company Boston Dynamics showing a trial run of their humanoid robot 'Atlas' running through the woods, among other recent advances in advanced machine learning.

What is advanced machine learning?

The field of machine learning in computer science has been around for a while. The first attempts to teach computers to learn and to behave like humans were made during the Second World War. The recent movie about pioneer Alan Turing shows the origins of this research field. To this day, the Turing Test is still applied to evaluate whether a computer can be categorized as intelligent.

During the 1980s and early 1990s, further attempts were made to teach computers to behave like humans. Early solutions weren't practical due to the limited processing power of that time. A lot of time has passed since then.

So what exactly is machine learning? It is basically teaching a computer to make sense of data: to recognize patterns in input values and gain insights from the process. Simple machine learning can be a regression analysis, or a simple classification of data into different categories based on a single value pair. Advanced machine learning, of which deep learning is a part, applies multiple analysis layers when analyzing big data sets. The first layers of the algorithm look only at certain parts of the data and deliver their output to an analysis layer further up the hierarchy, which performs more abstract calculations on the input from the lower layer and in turn delivers values to an even more abstract layer. This structure loosely models the human brain, imitating its network of neurons with many (trillions of) synapses.
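The layered structure described above can be sketched as a tiny feed-forward network. This is a minimal toy illustration with random weights and no training step, not a realistic deep-learning setup:

```python
import math
import random

random.seed(0)

def layer(inputs, weights):
    # each neuron: weighted sum of its inputs passed through a non-linearity
    return [math.tanh(sum(w * x for w, x in zip(ws, inputs))) for ws in weights]

def random_weights(n_out, n_in):
    return [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]

# three stacked layers: each consumes the output of the one below,
# computing progressively more abstract features of the raw input
w1 = random_weights(4, 3)   # first layer looks at the raw data
w2 = random_weights(4, 4)   # middle layer combines first-layer outputs
w3 = random_weights(1, 4)   # top layer produces the final, most abstract value

x = [0.5, -0.2, 0.8]        # a single raw input record
out = layer(layer(layer(x, w1), w2), w3)
print(out)  # a single value between -1 and 1
```

In a real system the weights would be learned from data (e.g. via backpropagation) rather than drawn at random, and the layers would be far wider and deeper.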

Tapping into the huge potential

Today, many layers are applied to solve difficult data analysis problems, hence the name deep learning. With this methodology it is possible to teach a computer to analyze pictures, handwriting, speech, maps or even videos. In the future, many applications that seem 'magical' will be the result of some kind of deep learning. The applications are many: categorization of images, indexing of unlabelled data, analysis of maps, using big data from many sources to refine and improve prediction models, and so forth.

Facebook, Google, IBM and many start-ups already apply deep learning technologies to gain an edge in solving difficult problems. To this day there is no computer that can, by itself, program something that can program. But that day will come; it's just a matter of time.

Is it dangerous? Maybe. But it can also do much good if applied correctly.

If you’re interested in deep learning, here are some very interesting companies applying this cutting edge technology:

Have you heard about deep learning before? What do you think: Is it the future? Are you afraid of AI? I’m interested what you think so please leave a comment!


Big data only matters if you have a question

Everybody is talking about "big data". The possibilities of collecting, generating and storing data are endless. Because of its popularity, big data is also more and more becoming a buzzword. It has assumed such a variety of meanings that it has become an unclear term if not further specified. When somebody talks about big data, chances are slim that everybody in the room is thinking about the same thing.


Today, every company feels the need to start using big data. Everybody is talking about it, more and more companies are doing it, hence we need to use it as well! But they often forget a very important thing:

…data on its own is meaningless. The value of data is not the data itself – it’s what you do with the data.

First you need to know what data you need, and only then should you start collecting. Why would you put so much energy and effort into collecting data that you cannot use to deliver relevant business insights?

Many people try to start with the solution ("we need to use big data") rather than with the problem. In order to have a successful data strategy, you need to begin by defining the insights needed to identify the pathways towards growth; otherwise you will drown in all the data available. The focus should always lie on the questions, not on the solution. Big data can offer many answers, but it will always require a person to frame the question, identify the data that can provide an answer, and interpret the obtained results. These results can then be used to create a strategy that adds value to your business.


For example, if you wish to enter the weight management market, questions you will need answers to might be: 'How many people are overweight?', 'How many people are interested in losing weight?', 'What is the average income of these people?' and so on. Identifying what needs to be done in order to collect this data will then be a lot easier.

Everybody now has the opportunity to use data. Availability is not the issue anymore. However, answers to questions that don't matter won't bring you any further. If you focus on the relevant questions first, and tackle them with big data, then the power of data will be of great value to your company.

By: Melanie Pieters, 420914MP


Electronic Markets, Computing Power and the Quants: Volatility & High Frequency Trading


“Markets can be – and usually are – too active, and too volatile”

Joseph E. Stiglitz – Nobel prize-winning economist

As some of you may have noticed, the oil market is currently showing wilder fluctuations at a higher frequency than before: volatility has increased. This happened after the market enjoyed relative price stability during the last few years. Of course, this is partly due to U.S. shale oil production, quite high supply and lower demand in the aftermath of the financial crisis, and growing demand and supply uncertainties. However, another factor affecting volatility is the increased usage of trading indicators in combination with changes in trading practices: an increasing number of players in the financial markets use algorithmic and high-frequency trading (HFT) practices.

Like other derivatives-based markets, the crude oil market has a wide range of players, many of whom are not interested in buying physical oil. HFT traders are probably drawn towards oil futures due to the market's volatility: the greater the price swings, the greater their potential profit. HFT is not an entirely new practice, but as technology evolves it is increasingly present in today's electronic financial markets.

These players make extensive use of computing and information technology to develop complex trading algorithms, and are often referred to as the "quants". HFT firms try to gain an advantage over competitors who still rely mostly on human intelligence and reaction times. The essence of the game is to use your algobots to get the quickest market access, the fastest processing speeds, and the quickest calculations in order to capture profits that would otherwise have been earned by someone processing market data more slowly (Salmon, 2014). At essentially the speed of light, these systems are capable of reacting to market data, transmitting thousands of order messages per second, automatically cancelling and replacing orders based on shifting market conditions, and capturing price discrepancies with little human intervention (Clark & Ranjan, 2012). New trading strategies are formulated by capturing and recombining new information with large datasets and other forms of big data available to the market. The analysis performed to derive the assumed direction of the market uses a range of indicators, such as historical patterns, price behaviour, price corrections, peak-resistance and low-support levels, and (moving averages of) trends and counter-trends. By aggregating all this information, the databases and the changes in their averages are usually a pretty good predictor of potential profits for HFT companies.
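As a toy illustration of one such indicator, here is a simple moving-average crossover on made-up prices. Real HFT systems are vastly more complex and operate on tick-level data, so this only sketches the idea of an algorithmic signal:

```python
def moving_average(prices, window):
    # trailing simple moving average; None until enough data points exist
    return [None if i + 1 < window
            else sum(prices[i + 1 - window:i + 1]) / window
            for i in range(len(prices))]

def crossover_signals(prices, fast=3, slow=5):
    # +1 when the fast average crosses above the slow one (bullish),
    # -1 when it crosses below (bearish), 0 otherwise
    f, s = moving_average(prices, fast), moving_average(prices, slow)
    signals = []
    for i in range(1, len(prices)):
        if None in (f[i], s[i], f[i - 1], s[i - 1]):
            signals.append(0)
        elif f[i - 1] <= s[i - 1] and f[i] > s[i]:
            signals.append(+1)
        elif f[i - 1] >= s[i - 1] and f[i] < s[i]:
            signals.append(-1)
        else:
            signals.append(0)
    return signals

prices = [50, 51, 52, 51, 50, 49, 48, 49, 51, 53]
print(crossover_signals(prices))  # [0, 0, 0, 0, -1, 0, 0, 0, 1]
```

The sell signal appears as the downtrend starts and the buy signal as the recovery takes hold; an algobot would act on such signals within microseconds.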

This information-technology-enabled way of trading is cheaper for those executing it, but imposes great costs on workers and firms throughout the economy. Although quants provide a lot of liquidity, they can also alter markets by placing more emphasis on technique and by linking electronic markets with other markets (both informationally and financially). In most cases, short-term, non-overnight strategies are used. Thus, these traders are in the market for quick wins and use only technical analysis to predict market movements, instead of trading based on physical fundamentals, human intelligence or news inputs.

Recent oil price volatility increased


Although some studies have not found direct proof that HFT can cause volatility, others have concluded that HFT can in certain cases transmit disruptions almost simultaneously across markets, due to its high speed combined with the interconnectedness of markets (FT, 2011; Caivano, 2015). For example, Andrew Haldane, a top official at the Bank of England, said that HFT was creating systemic risks and that the electronic markets may need a 'redesign' in the future (Demos & Cohen, 2011). Further sophistication of "robot" trading at decreasing cost is expected to continue in the foreseeable future. This can pose a threat to the stability of financial markets due to amplified risks, undesired interactions, and unknown outcomes (FT, 2011). In addition, in a world with intensive HFT, the acquisition of information will be discouraged, as the value of information about stocks and the economy retrieved by human intelligence will be much lower: robots will have done all the work before a single human was able to process and act on the information (Salmon, 2014). For those interested in the issues of HFT in more detail, I recommend the article by Felix Salmon (2014).

However, it is important to mention that HFT, automated systems and other technicalities do not cause all the volatility. Markets have known swift price swings for centuries. In the oil industry, for example, geopolitical risk can cause price changes, as oil is an exhaustible commodity. As most people know, human emotions can distort markets, as can terrorist actions. Even incomplete information, such as tweets or Facebook posts, can nowadays cause shares to jump or plummet. As markets become faster, more information is shared, and systems can process and act on this information ever more quickly thanks to (information) technological advancements, which in turn increases volatility. Therefore, it is more important than ever that there are no flaws in market data streams: the electronic markets and their information systems need enough capacity to process, control, and display all the necessary information to market players in order to avoid information asymmetries.

In my opinion, HFT is strengthened by the current state of computing technology, and cost reductions in computing power now enable the execution of highly complex algorithms in a split second. As prices go down and speed goes up, these systems will become more and more attractive, as they outperform human intelligence. This can potentially become an issue in the future: volatility might increase, and it is this volatility that provides many opportunities for traders, but not the stability needed by producers and consumers, who are more long-term focused.

Therefore, in the future, action is necessary to restrict, or at least reduce, HFT. Examples might be big data collection by regulators to monitor risk and predict future flash crashes or volatility events. Another option could be the introduction of a "minimum resting period" for trading, so that traders have to hold on to their equity or trade for a pre-specified time before selling it on, reducing the frequency and thus the volatility. Widening spreads would also help, as it makes quick selling and buying more costly and thus HFT less attractive.

Meanwhile, the financial markets' watchdogs currently have difficulty regulating automated trading. Some HFT firms have enjoyed enormous profits from their trading strategies (Jump Trading, Tower Research Capital, DRW). During the last market turmoil in August of this year, for example, a couple of HFT firms earned a lot of money (Hope, 2015). Due to these successes, new players are entering the market and competition is growing. As speed is essential (even milliseconds matter), HFT firms try to place their servers physically near the exchanges (such as the NYSE) to increase their advantage. HFT firms are expected to stay in the market, ultimately resulting in more price volatility (Hope, 2015).

What do you think: how far should we let our technology intervene in the financial markets? Should we allow algobots and similar automated trading systems to influence our financial markets because they can perform the human job faster, fact-based and at a lower cost? Or should the financial markets always be based on human intelligence, which might ultimately be better for the economy as a whole and also provides a richer knowledge base about the real-world economy (as this information remains valuable, and numbers do not always tell the whole story)?

In case you are interested in this dilemma, I can also recommend reading Stiglitz’ speech at the Federal Reserve Bank of Atlanta in 2014.

Author: Glenn de Jong, 357570gj


How to lose your Internet weight!


Ever wonder why seemingly simple web pages take so long to load? Chances are it's not the content that is weighing you down, but third-party trackers. These parties track your movements across websites, analyzing your behavior and recording your browsing habits.

To put things into perspective, a Guardian article provides some figures. When loading the popular tech site The Verge, the actual content ran to only 8k, whereas the surrounding ads ran to 6 MB. A second study found that almost an order of magnitude more data is needed for the trackers than for the article itself. A more recent study expressed this 'weight' in a measure more familiar to the everyday user: seconds. With third-party scripts turned off, the homepage loaded within 2 seconds, down from 11 seconds.
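Taken at face value, the figures above imply that trackers dominate both page weight and load time. A quick check (reading the quoted '8k' as kilobytes, which is an assumption on my part):

```python
# Using the figures quoted above for 'The Verge'
content_kb = 8            # article content ('8k', assumed to mean ~8 KB)
trackers_kb = 6 * 1024    # surrounding ads/trackers (~6 MB)
tracker_share = trackers_kb / (content_kb + trackers_kb)

# and load time with vs. without third-party scripts
with_trackers_s, without_trackers_s = 11, 2
time_saved = 1 - without_trackers_s / with_trackers_s

print(f"{tracker_share:.1%} of bytes, {time_saved:.0%} faster without trackers")
```

Under that reading, virtually all of the transferred bytes, and most of the waiting, comes from the third parties rather than the article you wanted to read.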

We can therefore conclude that third-party trackers consume more data (and thus more time), not only when a page first loads but throughout the entire duration of our browsing. These are two luxuries few of us can afford on our mobile contracts or in our busy lives. An additional consequence for portable devices is reduced battery life.

But fear not, solutions are available, my personal favorite being Ghostery! This browser extension identifies which services are trying to track you and gives you the option to block them. When you first install the extension, the following categories are shown: Advertising, Analytics, Beacons, category_iso, Privacy and Widgets, giving you full control over where you would like to make exceptions.

I like this newly found control that I was previously unaware of. Since installing the extension I have observed a slight increase in speed. However, I do enjoy it when specific websites load a customized experience for my profile, for example with better purchase recommendations or more familiar preferred settings. This customization is reduced, if not lost, when disabling all trackers, so in the future I might consider allowing certain websites to use third-party trackers for my own benefit.

Remember: "if you're not paying for something, you're not the customer, but the product!"


-Is this something you would install?

-What are some additional trade-offs of these services?

-If adopted on a grand scale could these extensions create problems?


Amazon knows what you're going to buy BEFORE you even push "buy" – they know you too well!

Everybody is well aware of the data-driven culture at Amazon, and of how it utilizes Big Data in every part of the business to boost revenues. Now Amazon is taking another avenue of its business model powered by Big Data: its distribution channels and how it delivers products to customers. Amazon wants to ship products to customers even before they make a purchase, because it knows its customers very well through Big Data patterns. It uses previous orders, product searches, wish lists, shopping-cart contents, returns and even how long a user's cursor hovers over an item to decide what and when to ship.

This "anticipatory shipping" would dramatically reduce delivery time and probably increase customer satisfaction to the extent that customers would be even more willing to use online channels. Amazon thus continues the battle for customers with instant order fulfillment, which I assume everybody has experienced at IKEA. IKEA's business model and value proposition are reflected in its capability of instant order fulfillment, maybe not for all products, but especially for the fast movers. We love to get the products we order right away, so imagine ordering from home and getting your purchase the same day, or even within a couple of hours.

It is important to mention that Amazon has not implemented this yet, but has only filed a patent. Still, it truly reflects the capability of Amazon's data scientists to utilize Big Data to transform the business model and, in the end, use the supply chain as a strategic weapon. It shows the relentless implications of predicting customer behavior and demand.
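Amazon's actual model is proprietary and not described in the patent coverage above. Purely as a hypothetical sketch, the behavioural signals mentioned (cart contents, wish lists, searches, cursor hovers, order history) could feed a weighted score like this; all feature names, weights and the threshold are invented for illustration:

```python
# Hypothetical purchase-likelihood score combining the behavioural signals
# mentioned above. Every weight and the threshold are invented; Amazon's
# real anticipatory-shipping model is proprietary.
WEIGHTS = {
    "in_cart": 0.35,
    "on_wish_list": 0.20,
    "recent_searches": 0.15,
    "cursor_hovers": 0.10,
    "past_category_orders": 0.20,
}

def purchase_likelihood(signals):
    # signals: dict mapping feature name -> strength in [0, 1]
    return sum(WEIGHTS[k] * signals.get(k, 0.0) for k in WEIGHTS)

def should_preship(signals, threshold=0.6):
    # ship the item towards a regional hub once the score clears the bar
    return purchase_likelihood(signals) >= threshold

shopper = {"in_cart": 1.0, "on_wish_list": 1.0, "recent_searches": 0.8,
           "cursor_hovers": 0.5, "past_category_orders": 0.4}
print(round(purchase_likelihood(shopper), 2), should_preship(shopper))
```

A production system would learn such weights from historical purchase data rather than hand-pick them, and would trade off the cost of a wrong pre-shipment against the value of faster delivery.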

If we take predictive shipping to the next level and combine it with Amazon's vision of transporting products using unmanned flying vehicles, Amazon will dramatically change the order fulfillment process for both online and offline players. The drawbacks of this predictive shipping process would of course be costly returns and unnecessary impact on the environment. But as the algorithm is constantly fed with new data, its predictions will strengthen over time. So next time you're diving into Amazon's endless world of products, one of those products might already be on its way!


Big Data and Mobile Data Security: Two bagels and a Cup of tea

Every day you wake up with that same daily ritual: the alarm goes off, you get ready, you leave the house. Given your high environmental consciousness (or the lack of a driver's license), you set off on a train to your destination as part of your morning ritual. To help kill time and make the ride more enjoyable, you take your mobile phone out of your pocket, connect to the train's Wi-Fi and complete your journey like most of the people on the train.

What seemed to be a normal day may come with an unpleasant surprise. We often make use of public hotspots to save a few of those megabytes that consume our bill at month end. However, what is often neglected is the security of these connections.

Hannes Mühleisen is an Amsterdam citizen who happens to live on a boat. One regular afternoon he was setting up his internet connection when, as a train passed by, his laptop recognized a Wi-Fi network very familiar to many of us: "Wifi in de trein". Curious, Mühleisen decided to experiment by setting up equipment to 'listen in' on the devices of the train's travelers (Martijn, 2015). Would NS provide such an insecure connection to its customers? With two antennas and some open software, Hannes was set to test. You are probably wondering: what kind of information was he able to pick up?

  • 114,558 different MAC-addresses over 5 months
  • Unique numbers of devices, time and data
  • Device history of web-browsing and app usage
  • Types of devices the travelers were using (e.g. Apple, Samsung, etc.)

For additional fun, Mühleisen even created a model evaluating Wi-Fi usage based on the weather.


Mühleisen's example is just one of many big data security and privacy concerns. These extend beyond individuals' data security; they also affect society and organizations. Among the top 10 big data privacy risks are (Herold, 2015):

  1. Targeted marketing causing private information to become public.
  2. Because one piece of data must be linked to another to make sense, it becomes impossible to keep your data anonymous.
  3. Following from the previous point, data masking can easily be defeated, revealing personal information.
  4. Big data can be used to influence business decisions without taking into account the human lives involved.
  5. Big data does not undergo rigorous validation of user data, which can lead to inaccurate analytics.
  6. Big data can lead to 'automated' discrimination in job applications, employee promotions and more.
  7. There are only a few legal protections for the individuals involved.
  8. Big data is growing indefinitely and infinitely, making it ever easier to learn more about individuals.
  9. Big data analytics lets organizations narrow down the documents relevant to litigation, but raises accusations of not including all necessary documents.
  10. The sheer size of big data makes it difficult to ensure that patents and copyrights are indeed unique.


All these implications lead to major concerns about IT security investments, paranoia and conspiracy theories. How do we tackle all the ethical implications that come with big data? If one man with two cheap antennas can collect enough data to learn what you ate for breakfast, what can corporations do to steer behavior using top-of-the-line equipment? Whether you are an iOS or Android user, Big Brother is watching.

Lilian Shann, 342890ls


Martijn, M (2015). De wifi in de trein is volstrekt onveilig (en de NS doet er niets aan) [The Wi-Fi on the train is completely insecure (and NS does nothing about it)], [Online], Available at: [Accessed: 13 September 2015].

Herold, R (2015). 10 Big Data Analytics Privacy Problems, [Online], Available at: [Accessed: 13 September 2015].

Damato, T (2015). Infographic: What’s threatening your mobile apps?, [Online], Available at: [Accessed: 13 September 2015].

Big Data for Big Energy Savings

Articles on big data and its many implications for the world around us have been around for a while and are numerous. From improving healthcare and optimizing firm performance to the NSA foiling terrorist plots, big data's reach touches nearly every imaginable facet of the modern world. Netflix even based the plot and marketing of the hit series House of Cards on observations from its 33 million users, a topic various authors have discussed on this blog. These are all great examples of the seemingly unlimited applications of big data and the impact it can have on all of us.

However, what seems to be missing in all the buzz around Big Data is what it can do to help us face probably the biggest challenge of our generation: drastically decreasing our energy consumption.

Is this because of the reluctance of the consumer or are the large energy companies to blame?
Either way, the solution may very well lie in the use of Big Data.

If Netflix created the success of House of Cards largely by using the known habits, preferences and ratings of its users, why can't energy firms come up with personalized advice and energy-saving offers for their customers?

The Tata Consultancy Services 2012/2013 Big Data Study revealed that the energy sector is one of the least advanced sectors with regard to the use of Big Data. Yet the same study states that energy firms' Big Data spending in 2012 resulted in a 60.6% return, the second-highest score!


Table from The Tata Consultancy Services 2012/2013 Big Data Study

Energy companies could start right away by using the terabytes of data on past bills and participation in past energy-saving initiatives to send out personalized energy-saving advice. Furthermore, they could send personalized offers for energy-saving products and measures. This could greatly increase customer engagement and result in new revenue streams. And effectively face the greatest challenge of our generation.

How would you feel about receiving personalized advice and offers to reduce your energy consumption from your energy provider, based on your past energy consumption behaviour?


The Tata Consultancy Services 2012/2013 Big Data Study, 2013, Tata Consultancy Services Limited

Why Big Data Analytics Could Spell Big Energy Savings, 27 February 2013, Spotfire Blogging Team