TASK LIST NO. 10: Advanced Statistical Methods
Task 1 (Runs Test - Randomness of a Sample)
A sample of size \(n=20\) of a certain feature \(X\) was taken. The values, ordered according to the sequence of observation (in time), are as follows:
- Determine the median \(m_e\) of this sample.
- Create a sequence of symbols, writing \(a\) if \(x_i < m_e\) and \(b\) if \(x_i > m_e\) (elements equal to the median are omitted).
- Verify the hypothesis of the randomness of the sample (lack of trend/cyclicity) using the runs test at a significance level of \(\alpha=0.05\). (Data from Task 3.87, Krysicki Part II, p. 143).
Task 2
We have two sorting algorithms (A and B). 5 independent time measurements were performed for each of them. The results do not follow a normal distribution (outliers are present).
- Algorithm A: \(12, 18, 14, 15, 13\)
- Algorithm B: \(19, 21, 23, 20, 22\)
Verify the hypothesis that algorithm A is faster than B using the rank-sum test (Mann-Whitney-Wilcoxon).
Task 3
Data on failures depending on the hardware manufacturer were collected in a table (contingency table):
| Manufacturer \ Failure Type | Overheating | Disk Error | Memory Error |
|---|---|---|---|
| Manufacturer X | 20 | 10 | 15 |
| Manufacturer Y | 30 | 50 | 25 |
Check at the significance level \(\alpha=0.05\) whether the type of failure depends on the manufacturer.
Task 4
We are testing the performance of 3 different frameworks (X, Y, Z). Since the data are strongly asymmetric, instead of classical analysis of variance (ANOVA), we use the non-parametric Kruskal-Wallis test.
Performance benchmark results (points):
| Framework X | Framework Y | Framework Z |
|---|---|---|
| 45 | 52 | 68 |
| 48 | 55 | 70 |
| 42 | 58 | 75 |
| 50 | 60 | 72 |
| 46 | 54 | 78 |
| (Sample sizes: \(n_x=5, n_y=5, n_z=5\).) |
For the ranking data from the table, verify the hypothesis that all frameworks have the same median performance.
Task 5
We have two datasets on network traffic (before and after firewall implementation). We want to check if the entire distribution (not just the mean) has changed.
Packet delay measurements (in ms):
Before implementation (\(n=10\)):
(Ordered data: 12, 12, 13, 14, 15, 15, 16, 17, 18, 19)
After implementation (\(m=10\)):
(Ordered data: 18, 19, 20, 20, 21, 22, 23, 24, 25, 28)
Based on the empirical distribution functions of both samples, calculate the \(D_{n,m}\) statistic and verify the hypothesis of identical distributions (Kolmogorov-Smirnov test). Verify the hypothesis at significance level ( \alpha = 0.05 ).
Task 6
We investigate code compilation time (\(Y\)) depending on the number of files (\(X_1\)) and the number of lines of code in a file (\(X_2\)).
Data from 8 software projects:
| Project | Number of files (\(x_1\)) | Thousands of LOC (\(x_2\)) | Compilation time in sec. (\(y\)) |
|---|---|---|---|
| 1 | 2 | 5 | 12 |
| 2 | 4 | 10 | 25 |
| 3 | 3 | 8 | 19 |
| 4 | 6 | 15 | 40 |
| 5 | 8 | 20 | 55 |
| 6 | 2 | 4 | 10 |
| 7 | 5 | 12 | 32 |
| 8 | 7 | 18 | 48 |
For the given data, determine the equation of the regression plane:
Task 7
The number of transistors in processors grows exponentially: \(y = a \cdot e^{bx}\). Having historical data, reduce this problem to linear regression by taking logarithms (\(\ln y = \ln a + bx\)) and determine the growth parameters.
Historical data of processor development (simplified):
| Year (\(t\)) | Time variable \(x = t - 1970\) | Transistor count (\(y\)) |
|---|---|---|
| 1970 | 0 | 2 000 |
| 1975 | 5 | 8 000 |
| 1980 | 10 | 30 000 |
| 1985 | 15 | 120 000 |
| 1990 | 20 | 500 000 |
Task 8
Processor production generates a certain percentage of defects. Instead of taking a fixed sample of 100 units, we take units one by one. After each extraction, we decide: "good batch", "bad batch", or "continue sampling".
Construct a sequential test (Wald test) to verify the hypothesis \(p=0.01\) against \(p=0.10\). Assume Type I error risk \(\alpha = 0.05\) and Type II error risk \(\beta = 0.10\).
Task 9
For a simple sample \(x_1, ..., x_n\) from an exponential distribution (failure-free operation time) with density \(f(x) = \frac{1}{\lambda} \exp\left(-\frac{x}{\lambda}\right)\), where \(\lambda\) is the expected value, determine the estimator of the parameter \(\lambda\) using the maximum likelihood method (MLE).
Task 10
We have 3 servers. We want to check if they operate equally stably (if they have the same variance of response times) before comparing their average times. The sample data is as follows: * Server 1 (\(s_1^2 = 1.4\)): \(n_1 = 20\) * Server 2 (\(s_2^2 = 1.8\)): \(n_2 = 20\) * Server 3 (\(s_3^2 = 1.2\)): \(n_3 = 20\)
Verify the hypothesis \(H_0: \sigma_1^2 = \sigma_2^2 = \sigma_3^2\) at \(\alpha=0.05\) (e.g., using Bartlett's test).