PhD projects

A successful PhD project starts with finding the right combination of supervisor and topic.

If you intend to apply for a PhD programme, you are welcome and encouraged to formulate your own research proposal.
Nevertheless, this page suggests some research topics which may help you quickly identify a PhD project of interest under my supervision.
If you wish to discuss a particular topic (included or not in the following list), please send an email to "G.DiFatta at reading ac uk".

List of research topics:

Big Data Analytics and Mining

Big Data indicates very large and complex data sets that are difficult to process using traditional, sequential data processing applications. Data-intensive, parallel and distributed approaches are typically employed, such as the MapReduce programming paradigm (Apache Hadoop). However, one of the most difficult and interesting challenges is not the size of the data, but rather the insight and the impact that the analysis of the data can generate. From this perspective, providing effective and efficient algorithms and tools for Big Data Analytics and Mining is a fundamental aspect. The potential of Big Data lies in our ability to provide solutions to business problems, to create new business opportunities and to facilitate data-driven discovery in science. The project will investigate and test distributed formulations of data mining algorithms that are suitable for the MapReduce paradigm and for other distributed computing approaches.
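
To give a flavour of the programming model, here is a minimal, single-process Python sketch of the MapReduce style of computation, counting item frequencies across partitioned data. In a real deployment (e.g., Apache Hadoop) the map and reduce tasks would run in parallel across a cluster; the data and function names here are purely illustrative.

    from collections import defaultdict
    from itertools import chain

    def map_phase(partition):
        # Emit (key, 1) pairs for every item in a local data partition.
        return [(item, 1) for record in partition for item in record]

    def shuffle(pairs):
        # Group all values by key, as the MapReduce runtime would.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(groups):
        # Aggregate the values of each key into a global count.
        return {key: sum(values) for key, values in groups.items()}

    partitions = [[("a", "b"), ("b",)], [("a", "c"), ("b", "c")]]
    pairs = chain.from_iterable(map_phase(p) for p in partitions)
    print(reduce_phase(shuffle(pairs)))   # {'a': 2, 'b': 3, 'c': 2}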

Keywords: Big Data, Data Analytics and Mining, Parallel and Distributed Computing

Data Integration, Processing, Analysis, Exploration and Visualisation

Open-source, user-friendly Data Mining workflow management environments are increasingly adopted as platforms for data integration, processing, analysis, exploration and visualisation. The project will contribute to widening a repository of algorithms and to allowing their composition in an intuitive way. Such an environment can be extended and customised by means of the flexible meta-programming paradigm of Eclipse plug-ins.
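
Purely as an analogy for how such workflows compose analysis steps, the following Python sketch chains preparation, dimensionality reduction and mining nodes into one pipeline (it assumes scikit-learn is available; the data and parameters are illustrative):

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    workflow = Pipeline([
        ("scale", StandardScaler()),                   # data preparation node
        ("reduce", PCA(n_components=2)),               # dimensionality reduction node
        ("cluster", KMeans(n_clusters=3, n_init=10)),  # mining node
    ])
    X = np.random.rand(100, 10)
    labels = workflow.fit_predict(X)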

Keywords: Data Mining, Knowledge Discovery in Databases, Intelligent Data Analysis

Frequent Pattern Mining

The identification of regular patterns in large sets of data can be formulated as Association Rule Mining, Frequent Itemset Mining or Frequent Subgraph Mining, according to the particular application domain and problem. These formulations share a combinatorial complexity and can be solved with analogous algorithmic approaches. When patterns are naturally classified into two categories, one important application is the identification of the features that allow one class to be discriminated from the other. Highly scalable algorithms for the "Discriminative" Subgraph Mining problem can be applied, for example, to the identification of candidates in the drug discovery process.
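
As an illustration of the level-wise search that underlies this family of problems, here is a minimal Python sketch of Apriori-style frequent itemset mining (without the usual candidate pruning); the transactions and support threshold are illustrative:

    from itertools import combinations

    def frequent_itemsets(transactions, min_support):
        transactions = [frozenset(t) for t in transactions]
        # Level 1: candidate single items.
        items = {i for t in transactions for i in t}
        current = {frozenset([i]) for i in items}
        k, result = 1, {}
        while current:
            # Count the support of each candidate and keep the frequent ones.
            counts = {c: sum(c <= t for t in transactions) for c in current}
            frequent = {c: n for c, n in counts.items() if n >= min_support}
            result.update(frequent)
            # Generate (k+1)-candidates by joining frequent k-itemsets.
            keys = list(frequent)
            current = {a | b for a, b in combinations(keys, 2) if len(a | b) == k + 1}
            k += 1
        return result

    ts = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
    print(frequent_itemsets(ts, min_support=2))

The combinatorial complexity mentioned above is visible in the candidate generation step: the number of potential itemsets grows exponentially with the number of items, which is why scalable algorithms matter.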

Keywords: Data Mining, Subgraph Mining, Knowledge Discovery in Life Science Repositories

High Performance and Scalable Clustering

Clustering is a classical unsupervised machine learning problem: the identification of groups of similar objects within a set. One of the most popular and influential algorithms in Data Mining is k-Means. So far, the most efficient implementations of k-Means have been based on multi-dimensional trees (KD-Trees). BSP-kMeans is an even more efficient and scalable k-Means variant, which can be applied to very large data sets (millions of patterns) with high numbers of features and clusters.
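
For reference, the following NumPy sketch implements the standard (Lloyd's) k-Means iteration; the KD-Tree and BSP-based variants mentioned above accelerate exactly the costly assignment step. Data and parameters are illustrative.

    import numpy as np

    def kmeans(X, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), k, replace=False)]
        for _ in range(iters):
            # Assignment step: nearest centroid for every point (the costly part).
            d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # Update step: move each centroid to the mean of its points.
            new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centroids[j] for j in range(k)])
            if np.allclose(new, centroids):
                break
            centroids = new
        return centroids, labels

    X = np.random.default_rng(1).normal(size=(300, 5))
    centroids, labels = kmeans(X, k=4)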

Keywords: Data Mining, Clustering, Scalable Algorithms

Data Mining and Visualisation in High Dimensional Spaces

The mining and visualisation of data in high-dimensional feature spaces require the design of efficient algorithms. In high-dimensional data spaces distance functions lose their usefulness, and optimisation techniques, Bayesian statistics, machine learning and data mining algorithms become inefficient and ineffective. This problem is referred to as 'the curse of dimensionality' and is caused by the exponential increase in volume associated with adding extra dimensions to a mathematical space. In general, dimensionality reduction is a fundamental methodology for the success of the knowledge discovery process in many real-world applications.
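
A small numerical experiment makes the loss of distance contrast concrete: as the number of dimensions grows, the relative gap between the nearest and the farthest neighbour of a query point shrinks. The following Python sketch (with illustrative random data) demonstrates this.

    import numpy as np

    rng = np.random.default_rng(0)
    for dim in (2, 10, 100, 1000):
        X = rng.random((1000, dim))         # uniform points in the unit cube
        q = rng.random(dim)                 # a query point
        d = np.linalg.norm(X - q, axis=1)
        print(dim, (d.max() - d.min()) / d.min())  # relative contrast shrinks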

Keywords: Data Mining, Data Visualisation, Dimensionality Reduction

Epidemic Protocols for Fault-tolerant Extreme-scale Computing

Epidemic, or Gossip-based, protocols adopt a bio-inspired communication strategy based on the same mathematical model that describes the exponential and uncontrollable spread of infectious diseases. Epidemic protocols are suitable for large and extreme-scale, distributed and dynamic systems. They can be adopted to disseminate information (broadcasting) in a large-scale distributed environment using randomised communication. Their advantages over global communication schemes based on deterministic overlay networks are their inherent robustness and scalability. Epidemic protocols can also be adopted to solve the data aggregation problem in a fully decentralised manner. The project will focus on Epidemic protocols and on practical extreme-scale applications which can be built on them.
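
As an illustration of decentralised data aggregation, here is a minimal round-based Python simulation of gossip-based averaging: every node repeatedly averages its value with a randomly chosen peer, and all local values converge to the global mean without any central coordinator. The parameters are illustrative.

    import random

    def gossip_average(values, rounds=30, seed=0):
        rng = random.Random(seed)
        values = list(values)
        n = len(values)
        for _ in range(rounds):
            for i in range(n):                  # each node gossips once per round
                j = rng.randrange(n)            # pick a random peer
                avg = (values[i] + values[j]) / 2
                values[i] = values[j] = avg     # pairwise averaging preserves the sum
        return values

    vals = gossip_average(range(100))
    print(min(vals), max(vals))   # both close to the true mean, 49.5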

Keywords: Epidemic Protocols, Gossip-based Protocols, Extreme-scale Computing

Large-scale Distributed Data Mining

Emerging challenges in ubiquitous networks and computing include the ability to extract useful information from vast amounts of data which are intrinsically distributed. Research on Distributed Data Mining (DDM) has focused on the formulation of data mining algorithms for distributed computing environments, where each node processes its local data and contributes to computing a global solution. In many applications the solution is required to be available at every node. This is particularly important for applications in networked systems where each node is autonomous and active, as in peer-to-peer systems, mobile ad hoc networks, vehicular ad hoc networks, mobile social networks and wireless sensor networks. It is also desirable that the solutions at different nodes are identical or within a bounded approximation error.
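
The following Python sketch illustrates the DDM principle in its simplest form: each node summarises its local data with sufficient statistics (here, sum and count), the summaries are exchanged, and every node derives the identical global model (the mean) without ever shipping the raw data. The data and the all-to-all exchange are illustrative assumptions.

    import numpy as np

    local_data = [np.random.default_rng(s).normal(size=200) for s in range(5)]

    # Step 1: purely local processing at each node.
    summaries = [(x.sum(), x.size) for x in local_data]

    # Step 2: summaries are disseminated (e.g., by an epidemic broadcast);
    # every node now computes the same global solution from them.
    total, count = map(sum, zip(*summaries))
    global_mean = total / count
    print(global_mean)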

Keywords: Data Mining, Parallel and Distributed Computing

Bayesian Inference for modelling human decision making

Systems Engineering often involves computer modelling of the behaviour of proposed systems and their components. Where a component is human, fallibility can be modelled by a stochastic agent. Bayesian inference can be applied to a set of past decisions to learn a model of decision-making over quantifiable options. The model allows the assessment and prediction of skilled behaviour, such as human expertise in problem solving and decision making. Typical application domains include: student performance monitoring and assessment (intelligent tutoring systems and adaptive learning platforms), human operator training, such as laparoscopic (minimally invasive) surgery and air traffic control, and sports and games (e.g., rating chess players).
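
A minimal sketch of the idea: model the probability that an operator makes the correct choice with a Beta prior, which Bayes' rule updates after each observed success or failure. The data and the prior below are illustrative.

    def beta_update(alpha, beta, outcomes):
        # Beta(alpha, beta) is conjugate to the Bernoulli likelihood, so each
        # observation simply increments one of the two counts.
        for correct in outcomes:
            if correct:
                alpha += 1
            else:
                beta += 1
        return alpha, beta

    past_decisions = [1, 1, 0, 1, 1, 1, 0, 1]       # 1 = correct decision
    a, b = beta_update(1, 1, past_decisions)        # uniform Beta(1, 1) prior
    print("posterior mean skill:", a / (a + b))     # predicts the next decision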

Keywords: Statistical inference, Bayes' rule, decision makers models

Intelligent Data Analysis in Bioinformatics

(Quality Assessment of Protein Structure Models)
One of the most important research goals in bioinformatics is the prediction of the three-dimensional structure of a protein, the so-called tertiary structure, from its amino acid sequence. Protein structure prediction has made significant progress over the last decade thanks to advances in algorithms and the public availability of sequence and structure databases. When many alternative structure predictions are generated for a given sequence, it is important to perform a quality assessment of the prediction models. Estimating the accuracy, or quality, of a prediction model is crucial for its practical use in application domains such as biochemical experimental design, drug design and biotechnology, for example, in the design of novel enzymes. The aim of the proposed research project is the application of Intelligent Data Analysis to large repositories of amino acid sequences and protein tertiary structures to identify accurate quality assessment methods.
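
One common quality assessment idea is consensus scoring: a model that agrees structurally with many of the alternative predictions receives a high score. The Python sketch below conveys the idea only; the similarity function is a placeholder for a real structural measure (such as GDT-TS or TM-score) and the data is random.

    import numpy as np

    rng = np.random.default_rng(0)
    models = rng.random((20, 3))          # stand-in features of 20 predicted models

    def similarity(a, b):
        return 1.0 / (1.0 + np.linalg.norm(a - b))   # placeholder measure

    n = len(models)
    scores = [np.mean([similarity(models[i], models[j])
                       for j in range(n) if j != i]) for i in range(n)]
    print("most central model:", int(np.argmax(scores)))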

Keywords: Intelligent Data Analysis, Bioinformatics, Protein Tertiary Structure

Mobile and Cloud Computing for Global Data Mining

Emerging challenges in ubiquitous networks and computing include the ability to deploy large-scale applications anywhere, anytime. Next-generation applications will be based on the integration of lightweight mobile devices with on-demand storage and computing resources (Cloud Computing). Data Mining applications will play a key role in this scenario. Data captured by smartphones can be stored and processed in a Cloud-based expert system for intelligent analysis. The project will focus on the customisation of an open-source Cloud computing toolkit using Cloud computing standards and on its integration with open-source Data Mining workflow management systems.

Keywords: Data Mining, Cloud Computing, Android

Opinion Leader: fully decentralised online opinion polls

Surveys of public opinion are typically drawn from a very small sample of the entire population. They also rely on a centralised server or service (e.g., a polling agency). Obvious issues are associated with the centralised nature of this model. Does the (small) sample size provide sufficient guarantees for extrapolating general conclusions? Will a centralised service run under private administrative control be unbiased and objective? Would the results be available anytime and anywhere without the interference of policy makers and private interests? Decentralised mobile applications do not rely on a server or a service provider: they rely on a voluntary, collaborative peer-to-peer model. The project will implement "Opinion Leader", a fully decentralised online application for opinion polls. Anyone can start an opinion poll or become the next opinion leader by initiating a viral poll. No one can stop or interfere with the real-time global aggregation of opinions, which is performed by means of an epidemic communication protocol.
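
To suggest how such an aggregation could work, here is a minimal Python simulation of the push-sum epidemic protocol: every participant holds a (value, weight) pair, repeatedly sends half of each to a random peer, and the ratio value/weight converges at every node to the global 'yes' fraction. The parameters are illustrative.

    import random

    def push_sum_poll(votes, rounds=50, seed=0):
        rng = random.Random(seed)
        n = len(votes)
        x = [float(v) for v in votes]   # 1 = yes, 0 = no
        w = [1.0] * n
        for _ in range(rounds):
            inbox = [(0.0, 0.0)] * n
            # Each node keeps half of its mass and pushes half to a random peer.
            for i in range(n):
                j = rng.randrange(n)
                xi, wi = x[i] / 2, w[i] / 2
                x[i], w[i] = xi, wi
                dx, dw = inbox[j]
                inbox[j] = (dx + xi, dw + wi)
            for i in range(n):
                x[i] += inbox[i][0]
                w[i] += inbox[i][1]
        return [xi / wi for xi, wi in zip(x, w)]    # every node's local estimate

    votes = [1] * 60 + [0] * 40
    estimates = push_sum_poll(votes)
    print(min(estimates), max(estimates))   # both close to 0.6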

Keywords: Viral Computing, Opinion Polls, Data Mining, Android, Epidemic Protocols

Please do enquire about other topics in other areas, including MapReduce (e.g., Hadoop, Spark), Cloud Computing (excluding privacy and security), Exascale Computing (e.g., fault tolerance), Autonomic Networking, and multi-disciplinary data mining applications (in Bioinformatics, Chemoinformatics, Neuroinformatics and Climate Science).


PhD programme duration and supervision

A PhD programme is expected to last three years (four at most). During the first year you will receive close tutoring and guidance. In the first and second years you will be supported by weekly meetings in a step-by-step process that will allow you to:
  • identify the research topic of the project,
  • carry out a literature review,
  • understand the state of the art in the field,
  • define a hot problem, i.e. a relevant open issue still to be solved,
  • devise, implement and test a solution,
  • compare the proposed solution with the state of the art,
  • describe the work and the results in scientific terminology and format,
  • submit the results to international conferences and journals for a peer-review selection process,
  • present your work at international conferences.
In the third year you are expected to show independence and initiative by performing such activities more autonomously. At the end of the programme you will submit a final thesis and will defend your work in a viva examination.

Funding your PhD programme

PhD fees and any scholarships available from the Department are advertised here.

Other funding opportunities are advertised by the University of Reading.

PhD programme admission application

Whether you already have financial support or still need to find it, you can apply for admission to the PhD programme of the Department of Computer Science. Proof of financial support will be required before the programme can start.

For any query on the application procedure, please contact pgcompsci@reading.ac.uk.