Archive by Author | Dennis Oliver Huisman

Detecting Anomalies in Large Data Sets

Data has become a common concern in recent years. Both companies and individuals have had to handle information in many ways, whether to improve their operations or to gain insight from them. IT has enabled unprecedented levels of data management for both groups. This blog, however, focuses on how companies manage data, and more specifically on the auditing sector.

Fraud has occurred in markets throughout history. I like to think of Imtech as an example; for those who aren’t familiar with the company, there’s no need to worry, as I’m sure you have an example of your own. Returning to the topic of data and fraud, the growing volume of available information makes it increasingly difficult to identify potential fraud cases accurately. This has given rise to the use of computer algorithms to detect these cases (Pandit, Chau, Wang & Faloutsos, 2007).

An interesting way to tackle this challenge is to use mathematical laws for large numbers in order to detect anomalies within these data sets. One particularly interesting example is the application of Benford’s Law to detect cases of fraud in company documentation (Kraus & Valverde, 2014). In short, Benford’s Law states that 30.1% of random, naturally occurring numbers start with a 1; 17.6% with a 2; 12.5% with a 3, and so on. Logically this makes sense given our counting structure. This can be expressed as

$$P(d) = \log_{10}\left(1 + \frac{1}{d}\right)$$

where $d$ is a leading digit in {1, 2, ..., 9} and $P(d)$ is the probability of a number starting with $d$.
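To make the law a little more concrete, here is a minimal Python sketch of how a leading-digit check might look. It is only an illustration, not the method used by Kraus and Valverde (2014): the function names are my own, the two sample data sets are made up for demonstration (they are not accounting records), and the Pearson chi-square statistic is just one common way to measure how far observed digits stray from the expected shares.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit d: P(d) = log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(value):
    """Leading non-zero digit of a number, or None if the value is zero."""
    mantissa = f"{abs(value):.15e}"  # scientific notation, e.g. '3.14...e+02'
    digit = int(mantissa[0])
    return digit or None


def digit_counts(values):
    """Count how often each leading digit 1..9 occurs in the values."""
    counts = Counter(first_digit(v) for v in values)
    return {d: counts.get(d, 0) for d in range(1, 10)}


def chi_square_vs_benford(values):
    """Pearson chi-square statistic of observed leading digits against Benford."""
    counts = digit_counts(values)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("need at least one non-zero value")
    expected = benford_expected()
    return sum(
        (counts[d] - n * expected[d]) ** 2 / (n * expected[d])
        for d in range(1, 10)
    )


# Illustrative inputs only (assumptions for this sketch, not real data):
benford_like = [2 ** k for k in range(200)]  # powers of two span many magnitudes
uniform_like = list(range(100, 1000))        # every leading digit equally common

for name, data in [("powers of two", benford_like), ("100..999", uniform_like)]:
    print(name, round(chi_square_vs_benford(data), 2))
```

A small statistic means the leading digits sit close to Benford’s shares, while a large one flags a data set worth a closer look. With nine digit classes (8 degrees of freedom), values above roughly 15.5 would be unusual at the 5% level, but a high score is only a reason to dig deeper, not proof of fraud.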

Although this method seems promising, Kraus and Valverde (2014) could not find any outstanding peculiarities in their data set, which contained fraud perpetrators. However, the law does serve as a starting point for a drill-down approach to discovering perpetrators. This brings us to the more strategic question of whether IT will ever develop a way to outsmart fraud perpetrators in this context. Is an eternal drill-down chase ever going to take the lead?

What do you think? Will this ever be the case? Is there any way you think this might work out?

I think it’s pointless. Of course, like everything else, IT methods have their degree of accuracy. However, I firmly believe there will never be a way to completely ensure an honest and transparent market. Not long ago I heard a man say, “Does anybody here know what EBITDA stands for? Exactly. Earnings Before I Tricked the Dumb Auditor.” It’s human nature, and it might take millennia before that changes ever so slightly.

I’d like to say it was nice to write a couple of blogs here. Till next time!


Kraus, C., & Valverde, R. (2014). A data warehouse design for the detection of fraud in the supply chain by using the Benford’s law. American Journal of Applied Sciences, 11(9), 1507-1518.

Pandit, S., Chau, D. H., Wang, S., & Faloutsos, C. (2007, May). Netprobe: a fast and scalable system for fraud detection in online auction networks. In Proceedings of the 16th international conference on World Wide Web (pp. 201-210). ACM.

Anonymous Communication Networks and Their Potential Role in Business


The notions of online black markets, services and information found on the ‘Deep Web’ (or whatever other name it goes by: the collection of web addresses not indexed by search engines) are by now relatively well known. For those unfamiliar with the topic, a study in 2001 estimated the size of these non-indexed internet sites to be around 7.5 petabytes (Bergman, 2001). Estimating their size today has proven even more complicated, since the content stored in their databases has peculiar features that complicate access. For data mining, for example, this information can only be accessed through the query interfaces the sites support, which are based on input attributes and oblige user queries to specify values for those attributes (Liu, Wang & Agrawal, 2011). These intricacies have made it nearly impossible to determine accurately, and hard even to estimate, the size of this vast collection of information.

The ‘Deep Web’ has grown in importance as of late (Braga, Ceri, Daniel & Martinenghi, 2008; Cali & Martinenghi, 2008) and with it, the use of anonymity networks such as Tor. For readers unfamiliar with Tor, “The Tor network is a group of volunteer-operated servers that allows people to improve their privacy and security on the Internet. Tor’s users employ this network by connecting through a series of virtual tunnels rather than making a direct connection, thus allowing both organizations and individuals to share information over public networks without compromising their privacy.” In short, Tor sends your traffic through a randomly chosen chain of relays run by volunteers, such that no single relay knows the full route a user has taken from their own machine to the website they want to reach. This process is repeated, with a different route, every time the user visits a site. It goes without saying that the more users take part in this network, the “more randomized” it becomes. It does come with one weakness, however: node eavesdropping, or end-to-end correlation analysis. In short, the traffic at both ends of the relay chain can be inspected to find matching patterns and thus identify users.
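The relay idea can be illustrated with a toy Python sketch. To be clear about the assumptions: this is not Tor’s actual protocol, there is no encryption involved, and every name in it (the relay labels, `build_circuit`, `wrap`, `route`) is invented for the example. A three-relay route is assumed. Nested dictionaries stand in for the encrypted layers; in a real anonymity network each layer would be encrypted so that a relay could read only its own.

```python
import random


def build_circuit(relays, hops=3, rng=random):
    """Pick a random route of distinct relays (toy stand-in for route selection)."""
    return rng.sample(relays, hops)


def wrap(message, destination, circuit):
    """Wrap the message in one layer per relay, starting from the innermost."""
    packet = message
    next_hop = destination
    for relay in reversed(circuit):
        # Each layer tells exactly one relay where to forward the rest.
        packet = {"for": relay, "next_hop": next_hop, "payload": packet}
        next_hop = relay
    return packet


def route(packet, sender):
    """Peel the layers one by one and record what each relay can observe."""
    views = []
    previous = sender
    while isinstance(packet, dict):
        views.append({
            "relay": packet["for"],
            "sees_previous": previous,
            "sees_next": packet["next_hop"],
        })
        previous = packet["for"]
        packet = packet["payload"]
    return packet, views


# Hypothetical relay names and destination, purely for illustration.
relays = ["relay-A", "relay-B", "relay-C", "relay-D", "relay-E"]
circuit = build_circuit(relays)
packet = wrap("hello", "example.org", circuit)
message, views = route(packet, sender="user")

for view in views:
    print(view)
```

In this model only the first relay sees the user and only the last relay sees the destination, which is the property described above. It also hints at the weakness: whoever can watch both the first and the last hop has both ends of the chain in view.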


This is all fantastic news for reporters seeking to share controversial stories under oppressive governments, and more generally for oppressed people seeking to express controversial messages. However, this has also fostered crime through virtual networks, inducing organizations such as Interpol to obtain training in how to use, and “gain the upper-hand” with, these tools. The discussion about the level of anonymity in this network is vast and can be found throughout both indexed and non-indexed web addresses. It commonly covers the aforementioned end-to-end correlation analysis, points of encryption, whether to use TAILS (and where to use it from, and how often), self-destructing emails, personalized email encryption, black bitcoin wallets, and much, much more.

This brings the blog to its fundamental point of contemplation, which is much more relevant to its strategic perspective: how can this play out for business? Can IT-savvy securities traders in worldwide financial capitals make use of these tools for their own gain? Could this lead to a new financial meltdown where individual players’ lust for capital gains leads to the breakdown of a financial system? How could anonymous messages among acquainted parties play out in the M&A market, where secrecy is a fundamental pillar of daily business? How could this play out for entire industries?


There is no doubt that anonymous communication leads to freedom. In a world where knowledge is openly available to anyone with a computer and an internet connection, this freedom is presently available to whoever has the knowledge to harness it: to the oppressed as well as to powerful individuals who are trusted with responsibility and are closely monitored by third parties because of past mistakes that have had a negative impact on many people.

In my own opinion, these tools can prove to be radical for business. A truly anonymous communication network would bring the fullest definition of trust back into business. However, knowing human nature as it has shown itself throughout history, these tools might prove harmful in the long run, assuming they continue to be developed and become more common in society.

Author: Dennis Oliver Huisman
S.I.D.: 369919dh


Bergman, M. K. (2001). The Deep Web: Surfacing Hidden Value. The Journal of Electronic Publishing (1).

Braga, D., Ceri, S., Daniel, F., & Martinenghi, D. (2008). Optimization of multidomain queries on the web. Proceedings of the VLDB Endowment, 1(1), 562–673.

Cali, A., & Martinenghi, D. (2008). Querying data under access limitations. In Proceedings of the 24th IEEE International Conference on Data Engineering (pp. 50–59).

Liu, T., Wang, F., & Agrawal, G. (2011). Stratified sampling for data mining on the deep web. Frontiers of Computer Science, 179-196.