Abstract Generated abstract
This paper studies random binary tables and the asymptotic behavior of dead-end tests, understood as minimal sets of columns that distinguish all rows. It derives estimates for the probability that a selected set of columns is a dead-end test and uses them to obtain asymptotic formulas for the expected number of such tests in several growth regimes for the numbers of rows and columns. For specified classes of tables, the number of dead-end tests is shown to concentrate around its mean, and corresponding asymptotic results are obtained for two column informativeness measures used in recognition algorithms.
Full Text
UDC 519.95
MATHEMATICS
V. A. SLEPIAN
ON THE NUMBER OF DEAD-END TESTS AND ON MEASURES OF THE INFORMATIVENESS OF A COLUMN FOR ALMOST ALL BINARY TABLES
(Presented by Academician S. L. Sobolev on 2 VII 1969)
- Let \(T_{ln}\) be a binary table having \(l\) rows and \(n\) columns, and let \(T_{lk}\) be a table composed of \(k\) \((k \le n)\) columns of the table \(T_{ln}\).
Definition 1. We shall call the table \(T_{lk}\) a test of the table \(T_{ln}\) if it consists of distinct rows.
Definition 2. A test will be called dead-end if, after the deletion of any column from it, it ceases to be a test. The notion of a test and of a dead-end test is given in paper \((^1)\).
With the table \(T_{ln}\) we associate the quantities \(N(T_{ln})\) and \(N^i(T_{ln})\)—the number of all dead-end tests and the number of dead-end tests into which the \(i\)-th column enters, respectively.
In test algorithms for pattern recognition \((^2)\), the information weight of a column is taken as the measure of informativeness of the column,
\[ I = N^i(T_{ln}) / N(T_{ln}). \tag{1} \]
For the so-called “voting” algorithms, Yu. I. Zhuravlev proposed, as a measure of informativeness, the quantity
\[ J = \sum_{i=1}^{l-1} \sum_{j=i+1}^{l} \left(C_{n-\rho_{ij}}^{k} - C_{n-1-\tilde{\rho}_{ij}}^{k}\right) \Bigg/ \sum_{i=1}^{l-1} \sum_{j=i+1}^{l} C_{n-\rho_{ij}}^{k}, \tag{2} \]
where \(\rho_{ij}\) is the Hamming distance between rows \(i\) and \(j\) in the table \(T_{ln}\), and \(\tilde{\rho}_{ij}\) is the same distance after the deletion of one of the columns.
- We shall now regard \(T_{ln}\) as a random table, namely a table each element of which takes the values 0 and 1 with probability \(1/2\). Then \(N(T_{ln})\), \(N^i(T_{ln})\), \(I\), and \(J\) are random variables.
In the present note, for the class of tables satisfying the relation
\[ n \ll 2^{l^\alpha} \quad (\alpha < 1/2) \tag{3} \]
the asymptotics of the mean (as \(n, l \to \infty\)) of the random variable \(N(T_{ln})\) is found, and for the class of tables satisfying the condition
\[ \log l / \alpha \le n \le l^\beta \tag{4} \]
\[ (\beta < 2,\ \alpha = 1/2 - \lambda_0 / \log l,\ \lambda_0 \to \infty,\ \lambda_0 / \log l \to 0), \]
it is proved that
\[ N/\overline{N} \xrightarrow{p} 1, \qquad \ln / 2 \log l \xrightarrow{p} 1 \qquad (l, n \to \infty). \]
For tables \(T_{ln}\) of arbitrary sizes, provided only that one of the relations
\[ k \ll \sqrt[3]{n}, \qquad l^2 2^{-k} \to \infty, \]
is fulfilled, the formula
\[ Jn/k \to 1 \qquad (l, n \to \infty) \]
is valid.
Here the symbol \(\xrightarrow{p}\) denotes, as usual, convergence in probability.
3. Let us estimate the probabilities of a dead-end test
Consider the table \(T_{lk}\). We give several definitions.
Definition 3. Two unequal rows of the table \(T_{lk}\) form a connection on the \(i\)-th column if, after deletion of the \(i\)-th column, they become equal.
We shall say that the table \(T_{lk}\) has a connection on the \(i\)-th column if at least one pair of rows of \(T_{lk}\) forms a connection on the \(i\)-th column; the table \(T_{lk}\) has a connection on \(m\) columns if it has a connection on each of the \(m\) columns.
Introduce the notation: \(T_{lk}^*, T_{lk}'\) are, respectively, a test and a dead-end test of the table \(T_{lk}\); \(R_l^0, R_l^1\) are, respectively, the set of tests and the set of tests having a connection on the first column; \(r_l^i\) is the set of tests having \(i\) and only \(i\) connections on the first column; \(\mu(A)\) is the cardinality of the set \(A\).
Let \(Q_0\) and \(Q\) be some sets whose elements are, respectively, the tables \(T_{lk}, T_{(l-1)k}\), and let \(\alpha_0\) and \(\alpha_1\) be integers. By the notation \(\alpha_0 Q_0 \to \alpha_1 Q_1\) we shall agree to understand the following: from the \(\alpha_0\) sets \(Q_0\), by deleting one row from each table, one can construct \(\alpha_1\) sets \(Q_1\). For the sets \(r_l^i, r_{l-1}^i, r_{l-1}^{i-1}\) it is not difficult to prove the validity of the relations
\[ 2 i r_l^i \to (l - 2i + 1) r_{l-1}^{i-1}, \tag{5} \]
\[ (l - 2i) r_l^i \to 2(2^{k-1} - l + i + 1) r_{l-1}^i . \tag{6} \]
The introduced notation allows us to write the equalities
\[ R_{l-1}^0 = \sum_{i=0}^{[(l-1)/2]} r_{l-1}^i, \qquad R_l^1 = \sum_{i=0}^{[l/2]} r_l^i . \tag{7} \]
On the basis of (5) and (6) one can always find an integer \(A_l > 0\) such that the representation
\[ A_l(\alpha_i + \beta_i) r_l^i \to m_i r_{l-1}^{i-1} \cup n_i r_{l-1}^i \qquad (i = 0, 1, \ldots, [l/2]), \]
holds, where \(\alpha_i, \beta_i \ge 0\); \(\alpha_i + \beta_i = 1\); \(m_i, n_i\) are positive integers. It is proved, moreover, that \(\alpha_i, \beta_i\) satisfy the equation
\[ \alpha_i \mu(r_l^i) + \beta_{i-1}\mu(r_l^{i-1}) = \frac{\mu(R_l^1)}{\mu(R_{l-1}^0)} \, \mu(r_{l-1}^{i-1}). \]
Hence and from formulas (7) it follows:
Lemma 1. There exist positive integers \(A_l, B_l\) such that
\[ A_l R_l^1 \to B_l R_{l-1}^0 . \]
A consequence of Lemma 1 is
Lemma 2. The probability of a dead-end test satisfies the inequality
\[ p(T_{lk}') \ge p(T_{lk}^*) \prod_{i=1}^{k} p_i, \qquad p_i = 1 - \frac{2^{l-i+1} C_{2^{k-1}}^{\,l-i+1}}{C_{2^k}^{\,l-i+1}} . \]
Let the table \(T_{lk}\) form a test. Fix \(m\) columns.
Definition 4. We shall call a garland a group of \(i+1\) rows that form connections among themselves on \(i\) columns belonging to the \(m\) specified columns.
Definition 5. We shall call a maximal garland on the \(j\)-th column a garland that forms connections on the maximal number of columns among the \(m\) fixed columns, including the \(j\)-th.
Denote by \(X_i^m\) a test \(T_{lk}^*\) having a connection on \(m\) fixed columns, among which the first is included, and by \(Y_{i\max}\) a test \(T_{lk}^*\) containing a maximal garland of \(i+1\) rows on the first column.
Then
\[ p\left(X_l^m/R_l^0\right)=\sum_{i=1}^{m}p\left(Y_i\max X_l^{m-i}/R_l^0\right)\quad (X_l^0\equiv 1). \tag{8} \]
From the assertion of Lemma 1 and the fact that the probability of a join on \(m\) columns does not increase when the number of rows of the table is decreased, the probabilities entering the right-hand side of (8) are estimated from above. As a result we have
\[ p\left(X_l^m/R_l^0\right)\leq p_1p\left(X_l^{m-1}/R_l^0\right)+ \sum_{i=2}^{m}2^{k-1}C_{m-1}^{i-1}i!(i-1)!(l2^{-k})^{i+1}p\left(X_l^{m-i}/R_l^0\right). \]
Further, by the method of mathematical induction, one proves
Lemma 3. The probability of a dead-end test satisfies the inequality
\[ p(T_{lk}')\leq p(T_{lk}^*)p_1^k(1+\delta)^k, \]
where
\[ \delta=\frac{l^3k}{2^{2k}p_1^2}\, \frac{1-(lk'/2^kp_1)^k}{1-lk'/2^kp_1}. \]
A consequence of Lemmas 2 and 3 is
Theorem 1. If \(k^2l^{-1}\to 0\), then
\[ p(T_{lk}')\sim p(T_{lk}^*)\left(1-\exp(-l^22^{-k-1})\right). \]
For those \(k\) for which the condition of Theorem 1 is not satisfied, we give the following upper estimate for the probability of a dead-end test:
\[ p(T'_{l(k+1)})\leq p(\overline{T}_{lk}^*)\,p(T'_{lk})/p(T_{lk}^*), \tag{9} \]
4. The mean number of dead-end tests can be represented by the sum
\[ \overline{N}= \sum_{k=[\log l]}^{\min(n,l-1)}\overline{N}_k, \qquad \overline{N}_k=C_n^k p(T'_{lk}). \tag{10} \]
where \(\overline{N}_k\) is the mean number of dead-end tests of length \(k\) (the length of a test is the number of columns in it). Using Theorem 1 and formulas (9) and (10), one proves
Theorem 2. 1) If \(n\leq l^2(\log l)^\gamma\) \((\gamma<3)\), then
\[ \overline{N}\sim \overline{N}_{k_0}+\overline{N}_{k_1}. \]
2) If \(n\leq 2^{l^\alpha}\) \((\alpha<1/2)\), then
\[ \overline{N}\sim \overline{N}_{k_0} \left( 1+ \sum_{i=1}^{[l^{\alpha_0}]-k_0}(\eta_{k_0}^{-1}a^{(i-1)/2})^i + \sum_{i=1}^{k_0-[\log l]}(\eta_{k_0-1}a^{(i-1)/2})^i \right), \]
where
\[ \overline{N}_k\sim \widetilde{N}_k = C_n^k p(T_{lk}^*)(1-x)^k; \quad x=\exp(-l^22^{-k-1}); \quad \eta_k=\widetilde{N}_k/\widetilde{N}_{k+1}; \]
\(k_0\) is the root of the equation \(\eta_k=1\); \(k_1=k_0-1\) or \(k_0+1\); \(\eta_{k_0}\geq 1\), \(\eta_{k_0-1}\leq 1\); \(0<a<a_0<1/2\); \(0\leq a\leq 1/4\).
In particular: 1) for \(2\log l\leq n\leq l^\beta\) \((\beta<2)\)
\[ k_0=[2\log l-\log(-\ln(n^{1/2}\log l-1))-2]; \]
2) for \(l^\varphi\leq n\leq 2^{l^\alpha}\) \((\varphi>2)\), \(a=1/4\),
\[ k_0=[\,1/2\log(n/l^2)-1/2\log\log(n/l^2)-1/2\,]. \]
Consider the set of tables \(Q_n=\{T_{ln}\}\), where \(2\log l\leq n\leq l^\beta\), \(l\in(l_1,(1+a)l_1)\), \(a>0\).
Theorem 3. In almost all tables of the set \(Q_n\)
\[ \overline{N}\sim \overline{N}_{k_0}\qquad (l\to\infty). \]
5. In the table \(T_{ln}\) we number the columns \((1,2,\ldots,n)\). Consider two tables composed of \(k\) columns of the table \(T_{ln}\). Let each of them contain \(m\) and only \(m\) columns with identical numbers. Denote by \(p(m)\), \(p'(m)\) the probability that both tables simultaneously are tests, dead-end tests respectively. For them the following is valid
Lemma 4. 1) \(p(m)\) is a nondecreasing function of \(m\).
2) \(p'(m) \leq p(m) \leq \exp\{-l2^{2-k}+\beta_m\}\), where
\[ \beta_m= \frac{l^2}{2^{2k-m+1}}+ \frac{l}{2^{m+1}}+ \frac{l^3}{2^{2m+1}} \left( \frac{1}{1-l2^{-m}}+ \frac{2\exp(l\cdot 2^{-m+1})}{1-q} \right) \quad (q<1). \]
3) \(p'(m) \leq p^* \sim p(T'_{lk})(1-x)^k(1-x^{2^m})^{k-m}\) \((k^2l^{-1}\to 0,\ l\to\infty)\).
Denote by \(DN_k\) the variance of the number of dead-end tests of length \(k\). Then, by \((3)\), the formula holds
\[ V_k=\frac{DN_k}{\overline{N}_k^{\,2}} = \sum_{m=\max(0,\,2k-n)}^{k} \frac{C_k^m C_{n-k}^{k-m}}{C_n^k} \frac{p'(m)}{p^2(T'_{lk})} -1. \tag{14} \]
Using Lemma 4, one can show that
Lemma 5. For the class of tables \((4)\), \(V_{k_0}\to 0\) \((l,n\to\infty)\).
In what follows we shall consider only those tables of class \((4)\) for which \(\overline{N}\sim \overline{N}_{k_0}\) \((l\to\infty)\). From Chebyshev’s inequality \((4)\) and Lemma 5 it follows that
Theorem 4. \(N/\overline{N}\xrightarrow{p}1\) \((l,n\to\infty)\).
For the random variable \(N^i(T_{ln})\) one can prove a theorem analogous to Theorem 4, and show that \(\overline{N}^{\,i}\sim 2\log l\,\overline{N}/n\) \((n,l\to\infty)\). Hence, and from Theorem 4, it follows that
Theorem 5. \(I_n/2\log l\xrightarrow{p}1\) \((n,l\to\infty)\).
6. Consider the table \(T_{ln}\), whose dimensions are arbitrary. For the random variable \(J\) the following holds:
Theorem 6. If \(k\ll \sqrt[3]{n}\) or \(l2^{2-k}\to\infty\), then \(J_n/k\xrightarrow{p}1\) \((l,n\to\infty)\).
Institute of Mathematics
Siberian Branch of the Academy of Sciences of the USSR
Novosibirsk
Received
2 VII 1969
References
- I. A. Chegis, S. V. Yablonskii, Trudy Mat. Inst. im. V. A. Steklova AN SSSR, 51, 270 (1958).
- A. N. Dmitriev, Yu. I. Zhuravlev, F. P. Krendelev, Diskretnyi analiz, no. 7, 3 (1966).
- V. A. Slepian, Diskretnyi analiz, no. 12, 50 (1968).
- B. V. Gnedenko, A Course in Probability Theory, Moscow, 1961.