Over the past few decades, statisticians have developed some very sophisticated modeling techniques for finding patterns in data sets. People use these models in many different contexts: detecting credit card fraud, pricing retail products, and screening for cancer.
Personally, I think it's fun to try modern machine learning algorithms to find patterns in or build predictive models from data. However, in my experience, I've found that simple regression models or summary statistics often provide the most valuable information.
Today, I stumbled on a great analysis of which statistical methods academic bioinformatics researchers use most frequently. (Well, technically, the analysis shows what types of analysis were found in published papers from these researchers.) In an article titled "Use of statistical analysis in the biomedical literature" by Scotch et al., four researchers from Yale University's Center for Biomedical Informatics catalog the different analysis techniques used in the Journal of the American Medical Informatics Association and the International Journal of Medical Informatics over a seven-year period.
I plotted the results shown in table 2 using R. (I focused only on publications with statistics, and only on a summary of the results.)
In case you're curious, here's the code I used:
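library(lattice)  # barchart() comes from the lattice package

# Publication counts by type of analysis, from table 2 of the article
v <- c("with statistics"=593, "descriptive"=544, "elementary"=248,
       "multivariable"=69, "data mining"=60, "other"=141)

# Reverse the order so that "with statistics" is drawn at the top
barchart(v[6:1], col="black", xlab="Total Number of publications")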
What's interesting to me is that almost all publications that included some statistics included descriptive statistics (like counts, means, or other simple measures). Next most popular were elementary statistical tests (like t-tests and chi-squared tests). Far fewer papers included data mining techniques. This suggests that the simplest techniques are the most widely used, and perhaps the most useful.
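To put rough numbers on "almost all" and "far fewer," here is a minimal sketch that converts the counts from the plotting code into percentages of the publications that included statistics. The variable name share is my own, and I'm assuming the categories overlap (a paper can use more than one kind of analysis), which the counts suggest because they add up to more than 593.

```r
# Counts copied from the plotting code above (table 2 of the article)
v <- c("with statistics"=593, "descriptive"=544, "elementary"=248,
       "multivariable"=69, "data mining"=60, "other"=141)

# Percentage of statistics-bearing publications using each kind of analysis.
# Assumption: categories overlap, so the percentages need not sum to 100.
share <- round(100 * v[-1] / v[["with statistics"]], 1)
print(share)
```

By this calculation, roughly nine in ten of these publications used descriptive statistics, while only about one in ten used data mining.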
Do these results apply in fields outside bioinformatics? Is this analysis still current (the survey started with articles published ten years ago)? Did researchers choose these techniques because they wanted to use them, because they were the easiest to use, or because they were the best implemented in available software? It's hard to know, but this article does provide some interesting data on what tools scientists use for analysis.