[AG-TECH] Access Grid based seminar

Mon Oct 3 10:47:30 CDT 2005

PLEASE CONTACT jon.gibson at manchester.ac.uk for further details

ESNW/MRCCS Seminar Announcement

"A new algorithm for finding minimal sample uniques for use in statistical
disclosure assessment"

Friday, 7th October 2005, 2-3p.m. BST
Venue: Room 1.10 (ESNW Access Grid) Kilburn Building

http://www.esnw.ac.uk/n-events.shtml

---

Dr. Anna Manning
The Centre for Novel Computing
Department of Computer Science
University of Manchester

Improvements and innovations in computer processing power, disk storage
and networks have led to dramatic increases in the ability to accumulate
and analyze personal data. However, if personal data is made available,
even in an anonymized form, there is a risk of individuals being
identified using statistical disclosure through the matching of known
information with the anonymized data, resulting in material specific to
those individuals being revealed.

This work focuses on the identification of individual records with a high
risk of disclosure, a process otherwise known as Statistical Disclosure
Assessment. The records belonging to certain individuals have a
significant chance of being identified as their contents, or attributes,
are unique and therefore have the potential to be matched directly with
details (including names and addresses) from another dataset. An
illustration of a `risky' record of this type is a sixteen-year-old widow
in a population survey. The ability to comprehensively locate and grade
such records leads to more efficient Statistical Disclosure Control (SDC)
of released data.

A sequential algorithm, SUDA (Special Unique Detection Algorithm), has
been designed and implemented specifically for this problem and is
currently used by the Office for National Statistics (ONS) in London.
Only unique attribute sets without any unique subsets --- Minimal Sample
Uniques (MSUs) --- are considered in order to avoid the use of redundant
information and to keep the classification process as focused as possible.

Although SUDA has greatly increased the depth of risk assessment possible,
the demanding levels of execution time required to find all MSUs mean that
it is restricted to small datasets, particularly in terms of the number of
columns that they possess.

This talk presents SUDA2, a recursive algorithm for finding MSUs. SUDA2
uses a novel method for representing the search space for MSUs and new
observations about the properties of MSUs to prune and traverse this
space. Experimental comparisons with SUDA demonstrate that SUDA2 is not
only several orders of magnitude faster but is also capable of identifying
the boundaries of the search space, enabling datasets of larger numbers of
columns than before to be addressed.