The idle ramblings of a Jack of some trades, Master of none

I attended the latest useR! conference this week, which ran for three days (Aug 12-14) in Dortmund. We've been using R at work for about a year now, and there's much to learn about this incredible statistical language and its ancillary 'packages' - tools developed by academics and practitioners in a variety of disciplines, all available free online and supported by a vast cast of enthusiasts and gurus. So here I was, toting my little bag of goodies and ambling from room to room to listen to some very interesting presentations across the R user spectrum.

The conference was held at the Statistics department of the Technical University of Dortmund. Only two weeks earlier, torrential rains had flooded the large auditorium of the department. Stains were still visible, but a massive cleanup before our arrival meant that everything appeared as it should - ready for four hundred visitors.

Unlike Curving Normality, who blogged regularly and live from the various sessions he attended, I am writing this from back home. I did take notes during the talks on a little pad given to us courtesy of Google, but my handwriting these days is much worse than it used to be, and I doubt even a pharmacist could make much of it. Still, here goes: summaries of three of the more interesting talks I attended.

The first talk I attended was on Loss Functions, by a professor of actuarial studies, Vincent Goulet. Actuaries are interested in such things as ruin, the distribution of insurance claims, and the probabilistic properties of insurance payouts. To model these, Goulet introduced an R package named 'actuar', which provides several families of distributions, including censored ones. These, in particular, are interesting to insurance specialists. A client with a deductible is not going to make a claim if the damage costs less than the deductible to make good; in such a situation, the distribution of claims is left-censored. Likewise, insurance payouts are usually capped by the size of the cover, so payouts follow a right-censored distribution.

This is relevant to us in finance as well, especially if we want to model the effect of management fees on a portfolio. Irrespective of how well a manager performs, he always collects his management fee, so the returns to a client are always that much less than the manager's overall performance. The manager's revenue stream, on the other hand, has an effective floor - and that can be modelled by a left-censored distribution.
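To make the censoring effects concrete, here is a toy simulation (not the actuar API itself - just a sketch, with an arbitrary exponential loss model and made-up deductible and limit figures) of what an insurer actually observes:

```python
import random

def simulate_payouts(n, deductible, limit, mean_loss=5000.0, seed=42):
    """Simulate n insurance losses and return only the payouts the
    insurer observes. Losses below the deductible are never reported,
    so the observed claims are cut off on the left; payouts are capped
    at the policy limit, piling up probability mass at the cap - the
    right-censoring described in the talk."""
    rng = random.Random(seed)
    payouts = []
    for _ in range(n):
        loss = rng.expovariate(1.0 / mean_loss)  # illustrative loss model
        if loss <= deductible:
            continue  # claim never filed: below-deductible losses are unseen
        payouts.append(min(loss - deductible, limit))
    return payouts

payouts = simulate_payouts(10_000, deductible=500, limit=20_000)
# Every observed payout lies in (0, limit]; the cluster of values at
# exactly `limit` is the right-censored mass.
```

Fitting a distribution naively to `payouts` would of course be wrong; the point of censored families is to account for the unseen and capped observations explicitly.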

An example of the multifarious uses of statistical tools came from Miriam Marusiakova, of Charles University in Prague, who presented an R package, 'forensic', to help with DNA fingerprinting. It is well known that the DNA of any two human beings (other than identical twins) is distinct. Since it is infeasible to compare DNA strands in their entirety from the various possible sources, forensic scientists have isolated certain markers that can serve to distinguish individuals. Unfortunately, these markers on their own do not provide sufficient discrimination, so a statistical analysis is required to determine how likely it is that the DNA found at a location came from one person or from many.

Naturally, this is important! Let's say a certain amount of DNA was found at a crime scene. There is a victim V and a suspect S, and there are three possibilities: a portion of the DNA is known to be the victim's; there is only one type of DNA, suspected to be the offender's; or there are several sources of DNA. The prosecution's hypothesis is that some of the DNA came from S; the defence's hypothesis is that the remaining DNA came from persons unknown U. How do we determine which hypothesis is correct?

Miriam explained that external information needs to feed into the statistics. For instance, certain genetic markers occur with different frequencies in different populations. Failing to incorporate these frequencies into the statistics overstates the case against the defendant: the classical Hardy-Weinberg law, which assumes alleles combine independently, does not hold in structured populations.
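A hedged sketch of the point (the allele frequency and theta values below are illustrative, not from the talk): under Hardy-Weinberg independence the random-match probability for a homozygous marker with allele frequency p is simply p squared, while the standard Balding-Nichols 'theta' correction for population substructure gives a larger, more defendant-friendly figure.

```python
def match_prob_hw(p):
    """Random-match probability for a homozygous genotype under
    Hardy-Weinberg equilibrium (alleles treated as independent)."""
    return p * p

def match_prob_theta(p, theta):
    """The same probability with the Balding-Nichols theta correction,
    accounting for allele correlation within subpopulations."""
    return ((2 * theta + (1 - theta) * p) * (3 * theta + (1 - theta) * p)
            / ((1 + theta) * (1 + 2 * theta)))

p, theta = 0.05, 0.03            # illustrative allele frequency and theta
naive = match_prob_hw(p)          # p**2 = 0.0025
corrected = match_prob_theta(p, theta)
# The likelihood ratio reported to the court is roughly 1 / match
# probability, so the smaller naive figure overstates the evidence:
assert corrected > naive
```

With even a modest theta, the corrected match probability is several times the naive one - exactly the overstatement Miriam warned about.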

Another twist is that the offender and/or the victim may be related to the defendant! The usual assumption is that the offender and the victim are independent. Miriam's package enables the analysis of all the above possibilities.

As a concrete example, she showed the widely different conclusions that could be drawn from the O.J. Simpson murder trial of the early 1990s in California. If the various factors and match probabilities were not estimated correctly, one could as easily conclude that the DNA found at the crime scene was Simpson's as that it was not. The lesson: estimate accurately, and account for all possible variations.

The effervescent Janet Rosenbaum of Harvard University produced one of the most entertaining pieces of research I've ever come across. She dealt with the notorious abstinence (virginity) pledge in the USA, and examined whether the sexual behaviour of teenagers who took the pledge differed from that of those who didn't. (See here for some of her work, and this news report in the Washington Post.) I can do no better than quote from her very thorough conference abstract:

Objective: The US government spends over $200 million annually on abstinence-promotion programs, including virginity pledges, and measures abstinence program effectiveness as the proportion of participants who take a virginity pledge. Past research used non-robust regression methods. This paper examines whether adolescents who take virginity pledges are less sexually active than matched non-pledgers.

Previous researchers had compared the sexual behaviour of pledging teenagers against the general teenage population and concluded that the former were indeed less likely to have had sex, and had a lower incidence of sexually transmitted disease. For US conservatives this was brilliant news: it meant they could cut federal funding for contraception and women's sexual health, provide abstinence coaching instead, and use the number of pledgers as a metric of success.

But, of course, the comparison is not fair. The correct approach is to match pledging teens with non-pledging teens of similar backgrounds and ideologies. After all, the teens who take the pledge are not average US teens: many come from evangelical families, are deeply religious, and are often born-again. When this matching is done, the results are quite clear:

Five years post-pledge, 84% of pledgers denied having ever pledged. Pledgers and matched non-pledgers did not differ in premarital sex, STDs, anal, and oral sex. Pledgers had 0.1 fewer past year partners, but the same number of lifetime sexual partners and age of first sex. Pledgers were 10 percentage-points less likely than matched non-pledgers to use condoms in the last year, and also less likely to use birth control in the past year and at last sex.

The behaviour of pledging and non-pledging teens is statistically indistinguishable! Worse, five years after taking the pledge, 84% of those teens denied ever having pledged. Egregiously, many who had had sex before taking the pledge declared themselves virgins shortly thereafter. To add insult to injury, pledgers were often more ignorant of contraception when they did succumb and have sex, and were thus less likely to protect themselves from disease or pregnancy before marriage.
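The methodological point - compare like with like - can be sketched as nearest-neighbour matching on a background score. This is a stand-in for the proper matched-sampling methods Rosenbaum used, and the units below are entirely made up:

```python
def match_pairs(treated, controls, score):
    """Greedily pair each treated unit with the closest unused control
    on a background score (religiosity, family background, ...).
    A crude stand-in for proper propensity-score matching."""
    unused = list(controls)
    pairs = []
    for t in treated:
        best = min(unused, key=lambda c: abs(score(c) - score(t)))
        unused.remove(best)
        pairs.append((t, best))
    return pairs

# Hypothetical units: (background score, outcome)
pledgers = [(0.90, 0), (0.80, 1), (0.85, 0)]
non_pledgers = [(0.20, 1), (0.88, 0), (0.82, 1), (0.87, 0), (0.10, 1)]

pairs = match_pairs(pledgers, non_pledgers, score=lambda u: u[0])
# Outcomes are now compared only against controls with similar
# backgrounds, not against the whole control pool:
matched_controls = [c for _, c in pairs]
```

Comparing pledgers against all non-pledgers mixes in the low-score units and biases the comparison - the exact flaw in the earlier research.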

Rosenbaum concluded that federal funds would be better spent teaching effective birth control and STD prevention than on abstinence measures.

Other interesting presentations:

  1. Tomoaki Nakatani, ccgarch: An R package for modelling multivariate GARCH with conditional correlations.
  2. Rory Winston, Real-Time Market Data Interfaces in R. (How to connect to Reuters from R)
  3. Susana Barbosa, ArDec: Autoregressive-based time series decomposition in R.
  4. Ray Brownrigg, Tricks and Traps for Young Players.
  5. Wei-Han Liu, A Closer Examination of Extreme Value Theory Modelling in Value-at-Risk Estimation.
  6. R. Ferstl, J. Hayden, Hedging Interest-Rate Risk with the Dynamic Nelson-Siegel Model.


Rense Nieuwenhuis said...

Hi There,

thanks for referring to Curving Normality. Indeed, I was attempting to live-blog from the conference. I attended the Vincent Goulet presentation as well, and thought that it was very interesting. Also, I would have loved to attend the Rosenbaum one, but unfortunately something else was scheduled at the same time.

Thanks for reminding me of that session, it might be a real help in my 'real' job!

Fëanor said...

Hiya: thanks for stopping by. Glad you liked the Rosenbaum summary - I thought her work was well done. Best of luck for yours!
